The Subtle Art of Knowing What You Don’t Know — Part 3: A Closer Look at the Likelihood–Prior Relationship
In my first article, we talked about some non-parametric density estimation methods and found out that approximating the density at every point can be computationally expensive.
In my second article, we tackled this problem by using parametric density estimation methods, which — in exchange for additional assumptions about the underlying distribution — offer better generalization and more efficient computation.
In this article, we will finally learn how to estimate the prior distribution given a likelihood estimation, and wrap everything up.
Let’s say we are given a likelihood estimation. A prior distribution is conjugate to a likelihood distribution if both the prior and the posterior follow the same type of distribution, say some family F. The belief can then be updated simply by adjusting the parameters of F. The parameters of F might be some statistics summarizing the history of belief updates. The input of F is the parameter of the likelihood function that we want to test.
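In symbols, the idea can be sketched as follows (the notation here is my own shorthand: θ is the likelihood parameter, D the observed data, and α the parameters of the family F):

$$P(\theta \mid D) \;\propto\; \underbrace{P(D \mid \theta)}_{\text{likelihood}} \cdot \underbrace{F(\theta \mid \alpha)}_{\text{prior}}, \qquad P(\theta \mid D) = F(\theta \mid \alpha')$$

Conjugacy guarantees that the posterior parameters α′ can be computed directly from the old parameters α and the data D, without evaluating any normalization integral.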
Binomial and Beta Distribution
In order to digest all this, let’s continue the Bernoulli experiment from the previous article, but this time we want to play multiple (n) games and count the number of red occurrences. So, the likelihood follows a Binomial distribution with the following density function:
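$$P(X = k \mid n, p) \;=\; \underbrace{\binom{n}{k}}_{\text{number of sequences}} \; \underbrace{p^{k} \, (1 - p)^{n - k}}_{\text{probability of one sequence}}$$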
The likelihood of one particular sequence with k red occurrences is given by the term $p^{k}(1 - p)^{n - k}$, and the binomial coefficient in front tells how many such sequences exist.
Assume that a likelihood following a Binomial distribution accepts a prior / posterior following a Beta distribution. What properties would such a distribution need?
- Intuitively, this distribution needs two parameters a and b, tracking the number of past red and black occurrences respectively. So, if the prior follows Beta(p | a, b), then the posterior after k red and n – k black outcomes must follow Beta(p | a + k, b + (n – k)).
- This distribution must also take the success probability p as input, since p is the only variable in the likelihood function being tested.
As it turns out, the Beta distribution defined below satisfies this condition:
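$$\text{Beta}(p \mid a, b) \;=\; \frac{p^{\,a-1} \, (1 - p)^{\,b-1}}{B(a, b)}, \qquad B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$$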
Notice the correspondence between the Beta and the Binomial distribution? Γ(a) can be regarded as an analytic continuation of (a – 1)!, and 1 / B(a, b) acts as a pseudo-binomial coefficient that normalizes the term into a proper probability distribution. Also, you might wonder: why do we take a – 1 and b – 1 rather than a and b?
Because the prior is in fact a way to represent the following information: given a – 1 past red and b – 1 past black occurrences, what do we believe about the next red / black outcome?
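To make the update rule concrete, here is a minimal sketch in Python (the counts a, b, n, k below are hypothetical, chosen purely for illustration). It checks numerically that multiplying a Binomial likelihood with a Beta(a, b) prior and normalizing indeed yields Beta(a + k, b + (n – k)):

```python
import numpy as np
from scipy.stats import beta, binom

# Hypothetical numbers, just for illustration.
a, b = 2.0, 2.0   # prior Beta(a, b): a-1 = 1 red and b-1 = 1 black seen so far
n, k = 10, 7      # new evidence: k red outcomes in n games

# Conjugate update: the posterior is again a Beta distribution.
a_post, b_post = a + k, b + (n - k)

# Numerical sanity check: posterior is proportional to likelihood * prior.
p = np.linspace(1e-6, 1 - 1e-6, 1001)
unnormalized = binom.pmf(k, n, p) * beta.pdf(p, a, b)
dp = p[1] - p[0]
numeric_posterior = unnormalized / (unnormalized.sum() * dp)

analytic_posterior = beta.pdf(p, a_post, b_post)
print(np.allclose(numeric_posterior, analytic_posterior, atol=1e-3))  # True
```

Note that the grid-based normalization is only needed for the check; the whole point of conjugacy is that the analytic update a + k, b + (n – k) makes it unnecessary.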
On another note, a Beta prior can be used with a Bernoulli likelihood as well, since the Bernoulli distribution is a special case of the Binomial distribution with n = 1. So, one and the same prior can be updated given different types of likelihoods.
Geometric Distribution
The Binomial distribution answers how likely it is that k success events happen, given a number of trials n and a success probability p. Can we somehow measure the distance between two success events, given that a variable has a fixed success probability p as in the Bernoulli distribution?
Yes, this distribution is called the Geometric distribution, and it tells how likely the next success event happens exactly on the n-th trial, given a success probability p. The formula is pretty self-explanatory:
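$$P(N = n \mid p) \;=\; (1 - p)^{\,n - 1} \, p$$

The first n – 1 trials fail with probability 1 – p each, and the n-th trial finally succeeds with probability p.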