
While Chris has done an excellent job explaining this concept, I’m having too much fun with my coin example to stop now.

Suppose I try my hand at casting my own coin. If my work is even half as good, my coin is going to end up being rather uneven and most likely biased.

Unlike the double headed coin problem, I know very little about the coin’s bias. One (not particularly good) choice for a prior is to assign equal probability to all possibilities.

In my last post, $\theta$ could only be one of two values and the prior reflected that. The height of a bar represented the probability mass for each value of $\theta$ and the sum of all heights was 1.

In this case $\theta$ can take any value between 0 and 1. It doesn’t make sense to talk about the probability of a specific value of $\theta$, because the probability at any single point is infinitesimal.

Here the prior is plotted as a probability density function. I can calculate the probability mass between any two points $a$ and $b$ by integrating $\mathrm{pdf}(\theta)$ from $a$ to $b$.

Since this is a probability density function, $\int_{0}^{1} \mathrm{pdf}(\theta)\cdot d\theta = 1$
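As a small worked example, assume the uniform prior described above, whose density is 1 everywhere on (0, 1). The endpoints 0.2 and 0.5 are picked purely for illustration. The probability mass between them is just the width of the range:

$$ \int_{0.2}^{0.5} 1 \cdot d\theta = 0.5 - 0.2 = 0.3 $$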

The binomial distribution gives the probability of seeing a given number of successes in a fixed number of independent binary events (like coin flips), each with the same probability of success.

If a binary event has a probability $\theta$ of succeeding, the probability that it succeeds *k* times out of *N* attempts is

$$\mathrm{P}(\textrm{k successes, N attempts} | \theta) = \binom{N}{k}\cdot\theta^{k}\cdot(1-\theta)^{N-k}$$

Imagine I flip my coin **10 times** and end up with only **2 heads**.

For an unbiased coin, this is fairly unlikely

$$\mathrm{P}(\textrm{2 successes, 10 attempts} | \theta=0.5) $$ $$ = \binom{10}{2}\cdot(0.5)^{2}\cdot(1 - 0.5)^{10 - 2} $$ $$ = 45\cdot(0.5)^{10} $$ $$ \approx 0.044 $$
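The binomial formula is easy to check numerically. Here is a minimal sketch using only the Python standard library; the function name `binomial_pmf` is my own choice for illustration and does not come from any particular library.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """Probability of exactly k successes in n independent attempts,
    each succeeding with probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# 2 heads out of 10 flips of a fair coin: 45 / 1024
p_fair = binomial_pmf(2, 10, 0.5)
print(round(p_fair, 4))  # 0.0439

# The same outcome is far more probable for a coin with theta = 0.2
p_biased = binomial_pmf(2, 10, 0.2)
print(p_biased > p_fair)  # True
```

The binomial coefficient matters here: without the factor of 45, the result would be off by more than an order of magnitude.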

Just like last time, with a larger number of flips the likelihood for most values of $\theta$ approaches 0. With additional flips the likelihood becomes more tightly concentrated around its peak at $\theta = \frac{k}{N}$.

I can compute the posterior using Bayes’ rule

$$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\int_{0}^{1} \mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)\cdot d\theta} $$

Usually I’d have to approximate a solution using numerical integration but there’s a simpler solution for this particular type of problem.
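For comparison, the numerical route can be sketched as a simple grid approximation: evaluate likelihood times prior on evenly spaced values of $\theta$, then divide by a midpoint-rule estimate of the integral in the denominator. This is only an illustration under my own assumptions (the function name `grid_posterior`, the grid size, and the midpoint rule are all my choices), not a recommendation of a particular integration scheme.

```python
from math import comb


def grid_posterior(k, n, prior, steps=1000):
    """Approximate the posterior density on a midpoint grid over (0, 1).

    prior is any function mapping theta to a prior density value.
    """
    width = 1.0 / steps
    thetas = [(i + 0.5) * width for i in range(steps)]
    # Numerator of Bayes' rule at each grid point: likelihood * prior
    unnormalised = [
        comb(n, k) * t**k * (1 - t) ** (n - k) * prior(t) for t in thetas
    ]
    # Denominator of Bayes' rule: midpoint-rule estimate of the integral
    evidence = sum(unnormalised) * width
    return thetas, [u / evidence for u in unnormalised]


# 2 heads out of 10 flips with a uniform prior (density 1 everywhere)
thetas, density = grid_posterior(k=2, n=10, prior=lambda t: 1.0)

area = sum(density) / len(density)          # integral of the posterior
peak = thetas[density.index(max(density))]  # grid point with highest density
print(round(area, 6), round(peak, 2))
```

With a uniform prior the posterior has the same shape as the likelihood, so its peak sits at the grid point nearest $\frac{k}{N} = 0.2$.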

If the **prior** and the **posterior** belong to the same function *family*, it can make computing the posterior much simpler. A prior with this property is called a *conjugate prior* for the likelihood.

For example, if my prior is Gaussian and my likelihood is Gaussian then my posterior will also be Gaussian, i.e. Gaussian functions are conjugate to themselves.

In this case, since my likelihood function is a binomial distribution, its conjugate prior is a beta distribution.

Specifically, if my prior is of the form $\mathrm{beta}(a, b)$, where *a* and *b* are the parameters of the distribution, and the likelihood function is a binomial distribution with *N* attempts and *k* successes, then the posterior would be a beta distribution with the parameters *a + k* and *b + N - k*.

Updating the posterior reduces to simple addition.

The uniform prior is the same as $\mathrm{beta}(1, 1)$

For **2 heads** out of **10 flips** and the **prior** $\mathrm{beta}(1, 1)$

$$ \mathrm{posterior}(\theta) $$ $$ = \mathrm{beta}(a + k, b + N - k) $$ $$ = \mathrm{beta}(1 + 2, 1 + 10 - 2) $$ $$ = \mathrm{beta}(3, 9) $$
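The update rule is short enough to write as a one-line function. In this sketch the name `update_beta` is hypothetical, and the split of the 10 flips into two batches of 5 with 1 head each is an assumption made only to show that updating in stages gives the same answer as updating all at once.

```python
def update_beta(a, b, k, n):
    """Posterior beta parameters after observing k successes in n attempts,
    starting from a beta(a, b) prior."""
    return a + k, b + n - k


# All 10 flips at once, starting from the uniform prior beta(1, 1)
batch = update_beta(1, 1, k=2, n=10)

# The same data fed in as two batches of 5 flips (assumed: 1 head in each)
first_half = update_beta(1, 1, k=1, n=5)
sequential = update_beta(*first_half, k=1, n=5)

print(batch)       # (3, 9)
print(sequential)  # (3, 9)
```

Because the update is plain addition, the posterior from one batch can serve directly as the prior for the next.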

Since *pdf* is a function for **probability density** and not **probability mass**, $\mathrm{pdf}(\theta)$ can be greater than 1 as long as the integral over (0, 1) is 1.
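To make this concrete, here is a small sketch that evaluates the density of $\mathrm{beta}(3, 9)$ at its peak and checks that the area under the curve is still 1. The helper `beta_pdf` is written from the standard beta density formula using `math.gamma`; the grid size for the integral is an arbitrary choice of mine.

```python
from math import gamma


def beta_pdf(theta, a, b):
    """Density of a beta(a, b) distribution at theta, for 0 < theta < 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


# Density at theta = 0.2, the peak of beta(3, 9)
peak = beta_pdf(0.2, 3, 9)
print(round(peak, 2))  # 3.32

# Midpoint-rule estimate of the integral over (0, 1)
steps = 10_000
area = sum(beta_pdf((i + 0.5) / steps, 3, 9) for i in range(steps)) / steps
print(round(area, 6))  # 1.0
```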

It doesn’t make sense to ask for the probability of a particular $\theta$, but it’s useful to know the probability of a range. Knowing that, say, $P(0.1\lt\theta\lt0.3) = 0.95$ is far better than where I started, especially as the range gets smaller.

This is a credible interval and there are several ways to pick one. I like to pick the highest density interval, the shortest range (or set of ranges) that contains a certain amount of probability mass (0.95 for example).

For a unimodal distribution, like the beta posterior here, the highest density region is a single interval.
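One way to find a highest density interval without any special libraries is a grid search: rank small slices of $\theta$ by density, keep adding the densest slices until they hold the target mass, and report the range they cover. This is a rough sketch under my own assumptions (the function names, the grid size, and the use of a grid at all are illustrative choices), and it relies on the distribution being unimodal so that the chosen slices form one contiguous range.

```python
from math import gamma


def beta_pdf(theta, a, b):
    """Density of a beta(a, b) distribution at theta, for 0 < theta < 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


def highest_density_interval(a, b, mass=0.95, steps=10_000):
    """Grid approximation of the highest density interval of beta(a, b).

    Assumes a unimodal density, so the densest slices are contiguous.
    """
    width = 1.0 / steps
    thetas = [(i + 0.5) * width for i in range(steps)]
    dens = [beta_pdf(t, a, b) for t in thetas]
    # Visit slices from highest density to lowest
    order = sorted(range(steps), key=lambda i: dens[i], reverse=True)
    total, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        total += dens[i] * width
        if total >= mass:
            break
    # Report the outer edges of the lowest and highest chosen slices
    return thetas[min(chosen)] - width / 2, thetas[max(chosen)] + width / 2


low, high = highest_density_interval(3, 9)
print(round(low, 2), round(high, 2))
```

Because the posterior is skewed towards 0, this interval sits further left than one built by cutting equal mass from each tail.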

There will always be uncertainty and not making a decision is often not an option. In this situation, I have a few choices. I could pick the mean, which is the expected value of $\theta$, or the mode, which is the most likely value of the distribution (the peak).

| a | b | mode | mean | variance |
|---|---|------|------|----------|
| 1 | 2 | 0 | 0.33 | 0.06 |
| 2 | 5 | 0.2 | 0.29 | 0.03 |
| 3 | 9 | 0.2 | 0.25 | 0.01 |

Either way, with additional flips the variance drops and both measures of central tendency more accurately predict $\theta$.
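The mode, mean and variance of a beta distribution have closed-form expressions, so these summary values can be recomputed directly from *a* and *b*. A minimal sketch follows; the function name `beta_summary` is my own, and the mode formula used here assumes $a \ge 1$, $b \ge 1$ and $a + b \gt 2$, which holds for every row above but not for the uniform prior $\mathrm{beta}(1, 1)$.

```python
def beta_summary(a, b):
    """Return (mode, mean, variance) of a beta(a, b) distribution.

    Assumes a >= 1, b >= 1 and a + b > 2 so that the mode formula applies.
    """
    mode = (a - 1) / (a + b - 2)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mode, mean, variance


for a, b in [(1, 2), (2, 5), (3, 9)]:
    mode, mean, variance = beta_summary(a, b)
    print(a, b, round(mode, 2), round(mean, 2), round(variance, 4))
```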

But what if my $\theta$ changes over time, or I have events with multiple possible values and can’t use beta distributions or even numerical integration?

More on that next time.