Distributions beyond the “Normal”
Welcome to Mathematics for Bayesian Networks. So far, we’ve talked in detail about Bayesian inference and introduced a handful of simple examples that hopefully got you comfortable with the theory side of things. Real-world problems involve far more complexity: often you’ll have a small dataset with limited information, and the only way to move forward is by making informed assumptions.
THE INTUITION
Let’s return to the beer price guessing game mentioned in Part 1 of this series. (Link: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-1-52bdf24829cc). The game’s goal was to guess the price of a random beer can. The Bayesian way of solving it involved first making a graph of all possible beer prices sold in the area. In our example, the graph looked something like this,
This graph is based on the assumption that we have sufficient data on beer pricing — which is rarely the case. While dealing with real-world problems,
– For continuous distributions like this, we’ll have to deal with integrals, which implies we’ll need to find a function to describe this data well
– We’ll often have a tiny dataset, just enough to make some assumptions based on the context
This is where distributions come in handy. A distribution is a function that assigns probabilities to the range of possible values a variable can take.
For Bayesian inference we’ll have to select two distributions
– The likelihood function
– The prior distribution
Choose your distributions well because it’ll be the most important decision in this process.
ADDITIONAL TERMINOLOGY
We’ll come across these two sets of terms that might get confusing, so just clearing them up before we proceed,
1) Event vs trial: Performing the experiment is known as a trial and the outcomes of the experiment are known as events
2) Probability Mass Function (PMF) vs Probability Density Function (PDF): They play the same role; a PMF describes discrete random variables, while a PDF describes continuous random variables
Here’s a list of some common distribution functions. We’ll first talk about functions for likelihoods followed by functions for priors with a few repetitions here and there.
DISTRIBUTIONS FOR LIKELIHOODS
Bernoulli
Used when we have
– Discrete data
– A single trial
– And only two possible trial outcomes, e.g., success/failure
Example: Flipping of a single coin
We can use a random variable X to assign numerical values to the coin toss: X = 0 for heads and X = 1 for tails. If we use A to denote the probability of landing heads,
p(head) = A
p(tail) = 1 − A
The single parameter A fully describes the Bernoulli distribution.
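Here’s a minimal sketch of this in code, using scipy.stats (the value of A and the library choice are ours, not fixed by the example):

```python
from scipy.stats import bernoulli

A = 0.5  # assumed probability of landing heads (X = 0)

# scipy's convention is P(X = 1) = p, and we coded tails as X = 1,
# so we pass 1 - A to make heads (X = 0) have probability A.
rv = bernoulli(1 - A)
print(rv.pmf(0), rv.pmf(1))  # A and 1 - A

# Simulate 10 tosses of the coin (0 = heads, 1 = tails)
print(rv.rvs(size=10))
```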
Binomial
Used when we have
– Discrete data
– Two possible trial outcomes (like Bernoulli)
– A fixed number of independent trials, each with the same probability of success, e.g., 10 tosses of a fair coin
– The overall outcome is the aggregate number of successes
Example: Estimate the number of votes for a particular political party using exit poll data
We use a random variable X to assign numerical values to each vote: X = 1 if the vote is for the party and X = 0 if it is against. Assuming we have n voters, we assign another random variable Z, 0 ≤ Z ≤ n, that represents the aggregate outcome of the overall trial: Z = X1 + X2 + … + Xn, the total number of votes for the party.
If we want to count the possible combinations of outcomes, e.g., the different ways in which r out of n voters could have voted for the party, we’ll have to get into binomial theory. The number of such combinations is nCr, where n represents the number of voters and r represents the number who voted for the party.
If,
p(for)=A
p(against)=1-A
Using the information above, the likelihood of observing r votes for the party would be,
p(Z = r | A) = nCr · A^r · (1 − A)^(n − r)
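As a quick sanity check, here’s a minimal sketch of evaluating this likelihood with scipy.stats (the counts and the value of A are hypothetical):

```python
from scipy.stats import binom

n = 1000  # number of voters polled (hypothetical)
r = 430   # number who voted for the party (hypothetical)
A = 0.45  # assumed probability that any one voter supports the party

# Likelihood of observing r votes for: nCr * A^r * (1 - A)^(n - r)
print(binom.pmf(r, n, A))
```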
Poisson
Used when we have
– Count of discrete, independent events that occur at a given rate
– A fixed amount of space/time in which the events can occur
– Non-negative integer data (like counts)
Example: Estimating the number of car accidents in a city. The PMF is,
p(n | v) = v^n · e^(−v) / n!
where n is the number of events observed and v is the rate, i.e., the expected number of events in the given interval. To see where v comes from, let N be the number of trials and A be the (small) probability of success per trial; then the expected number of successes over N trials is v = N·A.
The derivation of this is actually an extension of the binomial distribution. We’ll derive this some other time!
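Here’s a minimal sketch of the PMF in code, using scipy.stats (the rate v is a hypothetical value):

```python
from scipy.stats import poisson

v = 2.3  # assumed rate: average number of accidents per day

# Probability of observing exactly n accidents in a day: v^n * exp(-v) / n!
for n in range(6):
    print(n, poisson.pmf(n, v))
```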
Negative Binomial/Pascal
Used when we have,
– Count of discrete (like Poisson), non-independent (unlike Poisson!) events, e.g., contagious diseases
– A fixed amount of space/time in which the events can occur
The Negative Binomial can model a data-generating process where the variance exceeds the mean, i.e., the data is more spread out than a Poisson would allow (this is called overdispersion). Intuitively, a larger variance indicates more variability, which indicates more uncertainty.
Example: Predicting flu cases across a large number of equally sized cities. The spread of flu will not have the same probability in every city; it will vary according to social interaction levels.
The expression for PMF includes another distribution that we’ll see later, so let’s skip it for now.
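We can still see the overdispersion numerically. Here’s a minimal sketch using scipy.stats, with its (n, p) parameterisation and illustrative values:

```python
from scipy.stats import nbinom, poisson

# Negative binomial: variance exceeds the mean
n, p = 5, 0.3
print(nbinom.mean(n, p), nbinom.var(n, p))  # ~11.67 vs ~38.89

# A Poisson with the same mean is forced to have variance == mean
print(poisson.mean(11.67), poisson.var(11.67))
```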
Beta Binomial
Used when we have,
– A fixed count of multiple, discrete, non-independent trials with two possible outcomes per trial: success or failure
– Overall data measured as the aggregate number of successes
– Probability of success that varies across trials
Example: mosquito lifetime estimation, where we capture and mark mosquitoes and then release them back into the wild. We recapture mosquitoes over a period of time and note down the count of recaptured mosquitoes; the decay in that count helps estimate the lifecycle. A Binomial distribution would work if the recaptures were independent, but recapture is affected by weather conditions and hence not independent. The Beta Binomial uses additional parameters and can capture the impact of such external factors better than the Binomial distribution.
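Here’s a minimal sketch contrasting the two using scipy.stats (the counts and the beta shape parameters a, b are hypothetical):

```python
from scipy.stats import binom, betabinom

n = 100          # mosquitoes released (hypothetical)
a, b = 2.0, 8.0  # beta shape parameters giving a mean recapture probability of 0.2

# Same mean recapture count, but the beta-binomial has extra variance,
# reflecting variation (e.g. weather) in the recapture probability across days.
print(binom.mean(n, 0.2), binom.var(n, 0.2))        # 20.0, 16.0
print(betabinom.mean(n, a, b), betabinom.var(n, a, b))  # 20.0, 160.0
```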
Normal/Gaussian
Used when we have,
– Continuous data
– Unbounded outcomes resulting from a large number of contributing factors
Example: Modelling the distribution of body temperatures in humans. The contributing factors can be environment, activity, age, weight, metabolism, etc. There’s a range, but the value can be anything within that range.
The Normal distribution is described by its mean µ and standard deviation σ. Changing the mean shifts the curve left or right along the x axis, changing the standard deviation impacts the width of the curve.
The PDF is described as,
p(x | µ, σ) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
This is a very common and very popular distribution given how much complexity it can pack with such simplicity!
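A minimal sketch using scipy.stats (the body-temperature values for µ and σ are hypothetical):

```python
from scipy.stats import norm

mu, sigma = 36.8, 0.4  # hypothetical body-temperature mean and sd in °C

rv = norm(loc=mu, scale=sigma)
print(rv.pdf(mu))                   # density peaks at the mean
print(rv.cdf(37.5) - rv.cdf(36.0))  # probability of a temperature in [36.0, 37.5]
```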
Student-t
Pretty much the same conditions as the Normal distribution, but with heavier tails that make it more accommodating towards outliers.
Exponential
This is another useful distribution. It is used when we have
– Continuous non-negative data
– To compute the space or time between events
Example: Calculate time between new outbreaks of Bird Flu
The Exponential distribution depends on one rate parameter A, which characterizes the mean number of events per unit interval; the mean waiting time between events is 1/A. In the bird-flu case, for example, a larger A means outbreaks occur more frequently.
The PDF is defined as,
p(x | A) = A·e^(−Ax)
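A minimal sketch using scipy.stats; note that scipy parameterises the exponential by scale = 1/A rather than by the rate A itself (the value of A is hypothetical):

```python
from scipy.stats import expon

A = 0.5                  # assumed outbreak rate per month
rv = expon(scale=1 / A)  # scipy uses scale = 1/A

print(rv.mean())         # mean time between outbreaks: 1/A = 2.0 months
print(rv.pdf(1.0))       # equals A * exp(-A * 1.0)
```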
Gamma
Used when we have,
– Continuous non-negative data
– A requirement for a more complex model than the Exponential
Gamma can be used to estimate the time taken for n independent events to occur
Example: Estimate the time taken for the nth build to fail in a factory setup
This is another useful distribution but it requires an entire article dedicated to it. For now, we’ll skip any more details and take it up as and when we encounter it again!
DISTRIBUTIONS FOR PRIORS
Uniform distributions
Used when we have
– Continuous parameters
– Parameters ranging between a and b (commonly a = 0 and b = 1)
For a parameter A, the PDF of the uniform distribution is given by,
p(A) = 1 / (b − a) for a ≤ A ≤ b, and 0 otherwise
Beta
Used when we have
– Continuous parameters ranging between 0 and 1
– To include a wider range of priors than the uniform distribution
Example: Beta prior can be used to model obesity rates in the UK.
Beta is a more flexible version of the uniform distribution for parameters that range between 0 and 1. The beta distribution also has the property that it is conjugate to some useful likelihoods, which we’ll talk about in an application-based article.
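A minimal sketch of that flexibility using scipy.stats (the shape-parameter pairs are illustrative):

```python
from scipy.stats import beta

# Different (a, b) pairs give very different prior shapes over [0, 1]
for a, b in [(1, 1), (2, 5), (5, 2)]:
    rv = beta(a, b)
    print(a, b, rv.mean())
# (1, 1) recovers the uniform prior on [0, 1]; (2, 5) favours low rates,
# and (5, 2) favours high rates: a wider range of priors than uniform alone.
```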
Logit-Normal
We use it for
– Continuous parameters bounded between 0 and 1
– When we have a requirement for including a wide range of priors
The idea behind the Logit-Normal is to allow an unconstrained variable to vary according to a normal distribution, then transform it to lie between 0 and 1 using the logistic (inverse-logit) transform. Since this involves a normal distribution, we’ll use mean µ and standard deviation σ as descriptors.
The PDF for the Logit-Normal is given as,
p(x | µ, σ) = (1 / (σ√(2π))) · (1 / (x(1 − x))) · e^(−(logit(x) − µ)² / (2σ²)), where logit(x) = ln(x / (1 − x))
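A minimal sketch of the sampling idea with numpy (the values of µ and σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # illustrative values

z = rng.normal(mu, sigma, size=5)  # unconstrained normal draws
x = 1 / (1 + np.exp(-z))           # logistic transform: every x lies in (0, 1)
print(x)
```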
Dirichlet
We use it for
– Continuous parameters that sum to 1
– When we have a requirement for including a wide range of priors
The Dirichlet allows for a large range of prior distributions over the individual categories and is used for complex modelling tasks.
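A minimal sketch with numpy showing the sum-to-1 property (the concentration parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [2.0, 3.0, 5.0]  # one concentration parameter per category (illustrative)

sample = rng.dirichlet(alpha)
print(sample, sample.sum())  # components lie in (0, 1) and sum to 1.0
```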
Normal, Student-t and Gamma
The details of these are already mentioned under the subheading for likelihood so check them out.
Cauchy
It is used when,
– We have continuous and unconstrained parameters
Consider it a heavier-tailed relative of the Student-t and Normal distributions, one that allows a much larger range of parameter values. This is again a complex topic that needs more groundwork before we can cover it properly, so we’ll skip any further discussion for now.
HOW TO SELECT A DISTRIBUTION
While selecting the appropriate distributions, there are two points to consider
– The selected distribution should satisfy the assumptions we make about the system — e.g., dependent, independent, number of trials, details about the outcomes etc.
– The selected distribution should also be in line with the parameters we define (range of parameters, number of parameters, etc.)
CONCLUSION
I hope this article gives a sneak peek at the large number of distributions that are available for modelling. While it wasn’t possible to go in-depth into most of them, we’ll be picking them up as we work on different applications.
With this, I’d like to conclude the generic mathematical topics under this series. We’ll move on to the next set of topics which will focus on applying all the concepts we’ve covered so far.
REFERENCES
1) Probability and Statistics by DeGroot: https://github.com/muditbac/Reading/blob/master/math/Morris%20H%20DeGroot_%20Mark%20J%20Schervish-Probability%20and%20statistics-Pearson%20Education%20%20(2012).pdf
2) Intuition behind Poisson distribution: https://math.stackexchange.com/questions/836569/what-is-the-intuition-behind-the-poisson-distributions-function
3) Pascal distribution: https://mathworld.wolfram.com/NegativeBinomialDistribution.html
4) Pascal distribution: https://www.cuemath.com/algebra/negative-binomial-distribution/
5) PDF vs PMF: https://byjus.com/maths/probability-mass-function/
6) Normal vs Student-t: https://tjkyner.medium.com/the-normal-distribution-vs-students-t-distribution-322aa12ffd15
7) Log normal vs log logistic: https://home.iitk.ac.in/~kundu/paper146.pdf
8) Introduction to probability theory and its applications by William Feller: https://bitcoinwords.github.io/assets/papers/an-introduction-to-probability-theory-and-its-applications.pdf
9) A student’s guide to Bayesian statistics by Ben Lambert: https://ben-lambert.com/a-students-guide-to-bayesian-statistics/
10) Frequentist vs Bayesian approach in Linear regression: https://fse.studenttheses.ub.rug.nl/25314/1/bMATH_2021_KolkmanODJ.pdf
11) Mathematics for Bayesian Networks — Part 1, Introducing the Basic Terminologies in Probability and Statistics: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-1-52bdf24829cc
12) Mathematics for Bayesian Networks — Part 2, Introduction to Bayes Theorem: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-2-775f11c4d864
13) Mathematics for Bayesian Networks — Part 3, Advanced Concepts and Examples: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-3-eb78a08aa5ae