A primer on two concepts that form the substrate of regression analysis
Few incidents in history exemplify how thoroughly conditional probability is woven into human thought as remarkably as the events of September 26, 1983.
Just after midnight on September 26, a Soviet Early Warning Satellite flagged a possible ballistic missile launch from the United States directed toward the Soviet Union. Lt. Col. Stanislav Petrov, the duty officer on shift at a secret EWS control center outside Moscow, received the warning on his screens. He had only minutes to decide whether to flag the signal as legitimate and alert his superior.
The EWS was specifically designed to detect ballistic missile launches. If Petrov had told his boss that the signal was real, the Soviet leadership would have been well within the bounds of reason to interpret it as the start of a nuclear strike on the USSR. To complicate matters, the Cold War had reached a terrifying crescendo in 1983, boosting the probability in the minds of the Soviets that the signal from the EWS was, in fact, the real thing.
Accounts differ on exactly when Petrov informed his superior about the alarm and what information was exchanged between the two men, but two things are certain: Petrov chose to disbelieve what the EWS alarm was implying — namely, that the United States had launched a nuclear missile against the Soviet Union — and Petrov’s superiors deferred to his judgement.
With thousands of nuclear-tipped missiles aimed at each other, neither superpower would have risked retaliatory annihilation by launching only one or a few missiles at the other. The bizarre calculus of nuclear war meant that if either side had to start it, they must do so by launching a tidy portion of their arsenal all at once. No major nuclear power would be so stupid as to start a nuclear war with just a few missiles. Petrov was aware of this doctrine.
Given that the EWS detected a solitary launch, in Petrov’s mind the probability of its being real was vanishingly small despite the acutely heightened tensions of his era.
So Petrov waited. Crucial minutes passed. Soviet ground radars failed to detect any incoming missiles, making it almost certain that it was a false alarm. Soon, the EWS fired four more alarms. Adopting the same line of logic, Petrov chose to flag all of them as not genuine. In reality, all the alarms turned out to be false.
If Stanislav Petrov had believed the alarms were real, you might not be reading this article today, as I would not have written it.
The 1983 nuclear close call is an unquestionably extreme example of how human beings “compute” probabilities in the face of uncertainty, even without realizing it. Faced with additional evidence, we update our interpretation — our beliefs — about what we’ve observed, and at some point, we act or choose not to act based on those beliefs. This system of “conditioning our beliefs on evidence” plays out in our brain and in our gut every single day, in every walk of life — from a surgeon’s decision to risk operating on a terminally ill cancer patient, to your decision to risk stepping out without an umbrella on a rainy day.
The complex probabilistic machinery that our biological tissue so expertly runs is based upon a surprisingly compact lattice of mathematical concepts. A key piece of this lattice is conditional probability.
In the rest of this article, I’ll cover conditional probability in detail. Specifically, I’ll cover the following:
- The definition of conditional probability
- How to calculate conditional probabilities in a real-life setting
- How to visualize conditional probability
- An introduction to Bayes’ theorem and how conditional probability fits into it
- How conditional probability underpins the design of every single regression model
Let’s begin with the definition of conditional probability.
Conditional probability is the probability of event A occurring given that events B, C, D, etc., have already occurred. It is denoted as P(A | B, C, D).
The notation P(A|B, C, D) is often pronounced as probability of A given B, C, D. Some authors also represent P(A|B, C, D) as P(A|B; C; D).
We assume that events B, C, D jointly influence the probability of A. In other words, event A does not occur independently of events B, C, and D. If event A is independent of events B, C, D, then P(A|B, C, D) equals the unconditional probability of A, namely P(A).
A subtle point to stress here is that when A is conditioned upon multiple other events, the probability of A is influenced by the joint probability of those events. For example, if event A is conditioned upon events B, C, and D, it’s the probability of the event (B ∩ C ∩ D) that A is conditioned on.
Thus, P(A | B,C,D) is the same as saying P(A |B ∩ C ∩ D).
We’ll delve into the exact relation of joint probability with conditional probability in the section on Bayes’ theorem. Meanwhile, the thing to remember is that the joint probability P(A ∩ B ∩ C ∩ D) is different from the conditional probability P(A | B ∩ C ∩ D).
Let’s see how to calculate conditional probability.
Every summer, millions of New Yorkers flock to the sun-splashed waters of the city’s 40 or so beaches. With the visitors come the germs, which happily mix and multiply in the warm seawater of the summer months. There are at least a dozen different species and subspecies of bacteria that can contaminate seawater and, if ingested, can cause a considerable amount of, to put it delicately, involuntary expectoration on the part of the beachgoer.
Given this risk to public health, from April through October of each year, the New York City Department of Health and Mental Hygiene (DOHMH) closely monitors the concentration of enterococci bacteria — a key indicator of seawater contamination — in water samples taken from NYC’s many beaches. DOHMH publishes the data it gathers on the NYC OpenData portal.
The following are the contents of the data file pulled down from the portal on 1st July 2024.
The data set contains 27425 samples collected from 40 beaches in the NYC area over a period of nearly 20 years from 2004 to 2024.
Each row in the data set contains the following pieces of information:
- A DOHMH assigned sample ID,
- The date on which the sample was taken,
- The beach at which it was taken,
- The location on the beach (left, center, right) where it was taken,
- The enterococci concentration in MPN (Most Probable Number) per 100 ml of sea water sample, and
- Units (or notes if any) associated with the sample.
Before we learn how to calculate conditional probabilities, let’s see how to calculate the unconditional (prior) probability of an event. We’ll do that by asking the following simple question:
What is the probability that the enterococci concentration in a randomly selected sample from this dataset exceeds 100 MPN per 100 ml of seawater?
Let’s define the problem using the language of statistics.
We’ll define a random variable X to represent the enterococci concentration in a randomly selected sample from the dataset.
Next, we’ll define an event A such that A occurs if, in a randomly selected sample, X exceeds 100 MPN/100 ml.
We wish to find the unconditional probability of event A, namely P(A).
We’ll calculate P(A) as follows:
From the data set of 27425 samples, if you count the samples in which the enterococci concentration exceeds 100, you’ll find this count to be 2414. Thus, P(A) is simply this count divided by the total number of samples, namely 27425.
P(A) = 2414 / 27425 ≈ 0.08802
Now suppose a crucial piece of information flows in to you: The sample was collected on a Monday.
In light of this additional information, can you revise your estimate of the probability that the enterococci concentration in the sample exceeds 100?
In other words, what is the probability of the enterococci concentration in a randomly selected sample exceeding 100, given that the sample was collected on a Monday?
To answer this question, we’ll define a random variable Y to represent the day of the week on which the random sample was collected. The range of Y is [Monday, Tuesday,…,Sunday].
Let B be the event that Y is a Monday.
Recall that A represents the event that X > 100 MPN/100 ml.
Now, we seek the conditional probability P(A | B).
In the dataset, 10670 samples were collected on a Monday. Of these 10670 samples, 700 have an enterococci concentration exceeding 100. To calculate P(A | B), we divide 700 by 10670. Here, the numerator counts the samples in which the event A ∩ B (“A and B”) occurs, while the denominator counts the samples in which the event B occurs.
P(A | B) = 700 / 10670 ≈ 0.06560
We see that while the unconditional probability of the enterococci concentration in a sample exceeding 100 is 0.08802 (8.8%), this probability drops to 0.06560 (6.56%) once we factor in the new evidence, namely that the sample was collected on a Monday.
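Here is a minimal Python sketch of the two kinds of calculation: the unconditional probability counts over all rows, while the conditional probability first restricts the rows to those satisfying the condition and then counts. The eight rows and the helper names (probability, conditional_probability) are invented purely for illustration; they are not taken from the DOHMH data file, so the resulting numbers differ from the ones above.

```python
# Hypothetical mini-dataset: (day_of_week, enterococci MPN/100 ml).
# These rows are invented for illustration; they are NOT DOHMH data.
samples = [
    ("Monday", 12), ("Monday", 140), ("Monday", 8), ("Monday", 30),
    ("Tuesday", 250), ("Tuesday", 16), ("Wednesday", 110), ("Friday", 4),
]


def probability(rows, event):
    """Fraction of rows for which the predicate `event` is true."""
    return sum(1 for row in rows if event(row)) / len(rows)


def conditional_probability(rows, event, given):
    """P(event | given): keep only rows satisfying `given`, then count."""
    restricted = [row for row in rows if given(row)]
    if not restricted:
        raise ValueError("the conditioning event never occurs in the data")
    return probability(restricted, event)


def exceeds_100(row):   # event A
    return row[1] > 100


def is_monday(row):     # event B
    return row[0] == "Monday"


p_a = probability(samples, exceeds_100)                                  # 3 of 8 rows
p_a_given_b = conditional_probability(samples, exceeds_100, is_monday)   # 1 of 4 Monday rows
print(p_a, p_a_given_b)  # 0.375 0.25
```

With the real data file, the same two functions would be applied to the full list of rows instead of the toy list.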
Conditional probability has the nice interpretative quality that the probability of an event can be revised as new pieces of evidence are gathered. This aligns well with our experience of dealing with uncertainty.
Here’s a way to visualize unconditional and conditional probability. Each blue dot in the chart below represents a unique water sample. The chart shows the distribution of enterococci concentrations by the day of the week on which the samples were collected.
The green box contains the entire data set. To calculate P(A), we take the ratio of the number of samples in the green box in which the concentration exceeds 100 MPN/100 ml to the total number of samples in the green box.
The orange box contains only those samples that were collected on a Monday.
To calculate P(A | B), we take the ratio of the number of samples in the orange box in which the concentration exceeds 100 MPN/100 ml to the total number of samples in the orange box.
Now, let’s make things a bit more interesting. We’ll introduce a third random variable Z. Let Z represent the month in which a random sample is collected. A distribution of enterococci concentrations by month looks like this:
Suppose you wish to calculate the probability that the enterococci concentration in a randomly selected sample exceeds 100 MPN/100 ml, conditioned upon two events:
- the sample was collected on a Monday, and
- the sample was collected in July.
As before, let A be the event that the enterococci concentration in the sample exceeds 100.
Let B be the event that the sample was collected on a Monday.
Let C be the event that the sample was collected in July (Month 7).
You are now seeking the conditional probability:
P (A | (B ∩ C)), or simply P (A | B, C).
Let’s use the following 3-D plot to aid our understanding of this situation.
The above plot shows the distribution of enterococci concentration plotted against the day of the week and month of the year. As before, each blue dot represents a unique water sample.
The light-yellow plane slices through the subset of samples collected on Mondays i.e. on day of week=0. There happen to be 10670 samples lying along this plane.
The light-red plane slices through the subset of the samples collected in the month of July i.e., month = 7. There are 6122 samples lying along this plane.
The red dotted line marks the intersection of the two planes. There are 2503 samples (marked by the yellow oval) lying along this line of intersection. These 2503 samples were collected on July Mondays.
Among these 2503 samples are 125 in which the enterococci concentration exceeds 100 MPN/100 ml. The ratio of 125 to 2503 is the conditional probability P(A | B, C). The numerator represents the event A ∩ B ∩ C, while the denominator represents the event B ∩ C.
P(A | B, C) = 125 / 2503 ≈ 0.04994
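As a quick check of the arithmetic, the following Python sketch recomputes the three probabilities encountered so far from the counts quoted in this article. Only the variable names are illustrative.

```python
# Counts quoted in the article (NYC beach water dataset).
n_total = 27425                  # all samples
n_a = 2414                       # concentration > 100 (event A)
n_b, n_a_and_b = 10670, 700      # Monday samples, and those that also exceed 100
n_bc, n_a_and_bc = 2503, 125     # July-Monday samples, and those that also exceed 100

p_a = n_a / n_total                  # P(A)
p_a_given_b = n_a_and_b / n_b        # P(A | B)
p_a_given_bc = n_a_and_bc / n_bc     # P(A | B, C)

for label, p in [("P(A)", p_a), ("P(A | B)", p_a_given_b), ("P(A | B, C)", p_a_given_bc)]:
    print(f"{label:<12} = {p:.5f}")
```

Each additional piece of evidence shrinks the set of samples we count over: from all 27425 samples, to the 10670 Monday samples, to the 2503 July-Monday samples.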
We can easily extend the concept of conditional probability to additional events D, E, F, and so on, although visualizing the additional dimensions lies some distance beyond what is humanly possible.
Now here’s a salient point: As new evidence is factored in, the conditional probability doesn’t always systematically decrease (or systematically increase). Instead, it can (and often does) jump up and down in no apparent pattern, and the sequence of intermediate values depends on the order in which the events are factored into the calculation.
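To make this up-and-down behavior concrete, here is a small sketch on made-up counts (they are not DOHMH data, and the cell layout is my own assumption). It factors in “Monday” and “July” in both possible orders. In this toy example the intermediate values differ with the order, while the probability conditioned on both pieces of evidence comes out the same either way.

```python
from fractions import Fraction

# Hypothetical counts, invented for illustration (NOT DOHMH data).
# Key: (is_monday, is_july) -> (samples in cell, samples in cell with X > 100)
cells = {
    (True, True): (20, 4),
    (True, False): (30, 3),
    (False, True): (20, 10),
    (False, False): (30, 3),
}


def p_exceeds(keep):
    """P(A | cells selected by the predicate `keep`), as an exact fraction."""
    n = sum(total for key, (total, _) in cells.items() if keep(key))
    n_a = sum(hits for key, (_, hits) in cells.items() if keep(key))
    return Fraction(n_a, n)


p_a = p_exceeds(lambda key: True)                        # no evidence yet
p_a_given_b = p_exceeds(lambda key: key[0])              # Monday only
p_a_given_c = p_exceeds(lambda key: key[1])              # July only
p_a_given_bc = p_exceeds(lambda key: key[0] and key[1])  # Monday and July

print("Monday first:", p_a, "->", p_a_given_b, "->", p_a_given_bc)  # 1/5 -> 7/50 -> 1/5
print("July first:  ", p_a, "->", p_a_given_c, "->", p_a_given_bc)  # 1/5 -> 7/20 -> 1/5
```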
Let’s approach the job of calculating P(A | B) from a slightly different angle, specifically, from a set-theoretic angle.
Let’s denote the entire dataset of 27425 samples as the set S.
Recall that A is the event that the enterococci concentration in a randomly selected sample from S exceeds 100.
From S, if you pull out all samples in which the enterococci concentration exceeds 100, you’ll get a set of size 2414. Let’s denote this set as S_A. As an aside, note that event A occurs for every single sample in S_A.
Recall that B is the event that the sample falls on a Monday. From S, if you pull out all samples collected on a Monday, you’ll get a set of size 10670. Let’s denote this set as S_B.
The intersection of sets S_A and S_B, denoted as S_A ∩ S_B, is a set of 700 samples in which the enterococci concentration exceeds 100 and the sample was collected on a Monday. The following Venn diagram illustrates this situation.
The ratio of the size of S_A ∩ S_B to the size of S is the probability that a randomly selected sample has an enterococci concentration exceeding 100 and was collected on a Monday. This ratio is also known as the joint probability of A and B, denoted P(A ∩ B). Do not mistake the joint probability of A and B for the conditional probability of A given B.
Using set notation, we can calculate the joint probability P(A ∩ B) as follows:
P(A ∩ B) = |S_A ∩ S_B| / |S| = 700 / 27425 ≈ 0.02552
Now consider a different probability: the probability that a sample selected at random from S was collected on a Monday. This is the probability of event B. In the overall dataset of 27425 samples, there are 10670 samples that were collected on a Monday. We can express P(B) in set notation, as follows:
P(B) = |S_B| / |S| = 10670 / 27425 ≈ 0.38906
What I am leading up to with these probabilities is a technique to express P(A | B) using P(B) and the joint probability P(A ∩ B). This technique was first demonstrated by an 18th century English Presbyterian minister named Thomas Bayes (1701–1761) and soon thereafter, in a very different sort of way, by the brilliant French mathematician Pierre-Simon Laplace (1749–1827).
In their endeavors on probability, Bayes (and Laplace) addressed a problem that had vexed mathematicians for several centuries — the problem of inverse probability. Simply put, they sought the solution to the following problem:
Knowing P(A | B), can you calculate P(B | A) as a function of P(A | B)?
While developing a technique for calculating inverse probability, Bayes indirectly proved a theorem that became known as Bayes’ Theorem or Bayes’ rule.
Bayes’ theorem not only allows us to calculate inverse probability, it also enables us to link three fundamental probabilities into a single expression:
- The unconditional (prior) probability P(B) of event B
- The conditional probability P(A | B)
- The joint probability P(A ∩ B) of events A and B
When expressed in modern notation it looks like this:
P(A | B) = P(A ∩ B) / P(B)
Plugging in the values of P(A ∩ B) and P(B), we can calculate P(A | B) as follows:
P(A | B) = (700 / 27425) / (10670 / 27425) = 700 / 10670 ≈ 0.06560
The value 0.06560 for P(A | B) is of course the same as what we arrived at by another method earlier in the article.
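The agreement between the two methods is not a coincidence: the dataset size cancels out of the ratio. A short sketch using Python’s exact Fraction type and the article’s counts shows this (the variable names are illustrative):

```python
from fractions import Fraction

n_total, n_b, n_a_and_b = 27425, 10670, 700   # counts quoted in the article

p_a_and_b = Fraction(n_a_and_b, n_total)      # joint probability P(A ∩ B)
p_b = Fraction(n_b, n_total)                  # prior probability P(B)

p_a_given_b = p_a_and_b / p_b                 # P(A | B) = P(A ∩ B) / P(B)

# |S| cancels, leaving the direct count ratio 700 / 10670.
assert p_a_given_b == Fraction(n_a_and_b, n_b)

print(f"P(A ∩ B) = {float(p_a_and_b):.5f}")
print(f"P(B)     = {float(p_b):.5f}")
print(f"P(A | B) = {float(p_a_given_b):.5f}")
```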
Bayes’ Theorem itself is stated as follows:
P(B | A) = P(A | B) · P(B) / P(A)
In the above equation, the conditional probability P(B | A) is expressed in terms of:
- Its inverse P(A | B), and
- The priors P(B) and P(A).
It’s in this form that Bayes’ theorem achieves phenomenal levels of applicability.
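As an illustration of the inverse calculation, the sketch below applies Bayes’ theorem to the water samples data to get P(B | A): the probability that a sample was collected on a Monday, given that its enterococci concentration exceeds 100. It uses only counts quoted in this article; the variable names are illustrative.

```python
n_total = 27425      # all samples
n_a = 2414           # concentration > 100 (event A)
n_b = 10670          # collected on a Monday (event B)
n_a_and_b = 700      # both A and B

p_a = n_a / n_total              # prior P(A)
p_b = n_b / n_total              # prior P(B)
p_a_given_b = n_a_and_b / n_b    # P(A | B), computed earlier

# Bayes' theorem: P(B | A) = P(A | B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(f"P(B | A) = {p_b_given_a:.5f}")

# Sanity check against the direct count ratio |S_A ∩ S_B| / |S_A|.
assert abs(p_b_given_a - n_a_and_b / n_a) < 1e-12
```

In this dataset we could have counted P(B | A) directly. The theorem earns its keep when only the inverse probability and the priors are available.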
In many situations, P(B | A) cannot be easily estimated but its inverse, P(A | B), can be. The unconditional priors P(A) and P(B) can also be estimated via one of two commonly used techniques:
- By direct measurement: We applied this technique to the water samples data. We simply counted the number of samples satisfying, respectively, the events A and B, and each time we divided the respective count by the size of the dataset.
- By invoking the principle of insufficient reason which says that in the absence of additional information suggesting otherwise, we can happily assume that a random variable is uniformly distributed over its range. This merry assumption makes the probability of each value in its range equal to one over the size of its range. Thus, P(X) = 1/N for all values of X. Interestingly, in the early 1800s, Laplace used exactly this principle while deriving his formulae for inverse probability.
The point is, Bayes’ theorem gives you the conditional probability you seek but cannot easily estimate directly, in terms of its inverse probability which you can easily estimate and a couple of priors.
This seemingly simple procedure for calculating conditional probability has turned Bayes’ theorem into a priceless piece of computational machinery.
Bayes’ theorem is used in everything from estimating student performance on standardized tests to hunting for exoplanets, from diagnosing disease to detecting cyberattacks, from assessing the risk of bank failures to predicting outcomes of sporting events. Across law enforcement, medicine, finance, engineering, computing, psychology, environment, astronomy, sports, entertainment, and education, there is scarcely any field left in which Bayes’ method for calculating conditional probabilities hasn’t been used.
A closer look at joint probability
Let’s return to the joint probability of A and B.
We saw how to calculate P(A ∩ B) using sets as follows:
P(A ∩ B) = |S_A ∩ S_B| / |S|
It’s important to note that whether or not A and B are independent of each other, P(A ∩ B) is always the ratio of |S_A ∩ S_B| to |S|.
The numerator in the above ratio is calculated in one of the following two ways depending on whether A is independent of B:
- If A and B are not independent of each other, |S_A ∩ S_B| is literally the count of samples that lie in both sets. If you know this count (which we did in the water quality data), calculating P(A ∩ B) is straightforward using the set based formula shown above.
- If A and B are independent events, then we have the following identities:
P(A | B) = P(A) and P(A ∩ B) = P(A) · P(B)
Thus, when A and B are independent events, |S_A ∩ S_B| is calculated as follows:
|S_A ∩ S_B| = P(A) · P(B) · |S| = |S_A| · |S_B| / |S|
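One way to get a feel for the difference between the two cases is to compare the observed overlap with the overlap that independence would predict. The sketch below does this with the article’s counts for A (concentration above 100) and B (collected on a Monday). It is a descriptive comparison only, not a statistical test of independence.

```python
n_total, n_a, n_b, n_a_and_b = 27425, 2414, 10670, 700   # counts quoted in the article

# Joint probability from the observed intersection: |S_A ∩ S_B| / |S|
p_joint_observed = n_a_and_b / n_total

# Joint probability that independence would imply: P(A) * P(B)
p_joint_if_independent = (n_a / n_total) * (n_b / n_total)

# Intersection size that independence would predict: |S_A| * |S_B| / |S|
expected_overlap = n_a * n_b / n_total

print(f"observed P(A ∩ B)    = {p_joint_observed:.5f}")
print(f"P(A) * P(B)          = {p_joint_if_independent:.5f}")
print(f"overlap if independent: {expected_overlap:.1f}, observed: {n_a_and_b}")
```

Independence would predict roughly 939 Monday samples above 100, against the 700 actually observed, which suggests that A and B are not independent in this dataset.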
Extension to multiple events
The principle of conditional probability can be extended to any number of events. In the general case, the probability of an event E_s conditioned upon the occurrence of ‘m’ other events E_1 through E_m can be written as follows:
P(E_s | E_1 ∩ E_2 ∩…∩ E_m) = P(E_s ∩ E_1 ∩ E_2 ∩…∩ E_m) / P(E_1 ∩ E_2 ∩…∩ E_m)    (1)
We make the following observations about equation (1):
- The joint probability P(E_1 ∩ E_2 ∩…∩ E_m) in the denominator is assumed to be non-zero. If this probability is zero, implying that events E_1 through E_m cannot occur jointly, the denominator of equation (1) becomes zero, rendering the conditional probability of E_s meaningless. In such a situation, it is useful to express the probability of E_s as independent of events E_1 through E_m, or to find some other set of events on which E_s depends and which can jointly occur.
- In equation (1), the conditional probability P(E_s | E_1 ∩ E_2 ∩…∩ E_m) equals the joint probability P(E_s ∩ E_1 ∩ E_2 ∩…∩ E_m) only when the denominator is exactly 1.0, i.e., when the event E_1 ∩ E_2 ∩…∩ E_m is certain to occur (setting aside the trivial case in which both probabilities are zero).
- In all other situations, the denominator of equation (1) is smaller than 1, thereby implying that the conditional probability P(E_s | E_1 ∩ E_2 ∩…∩ E_m) is greater than the joint probability P(E_s ∩ E_1 ∩ E_2 ∩…∩ E_m).
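These observations can be sketched as a small Python helper. The function name and its two arguments are my own choices for illustration: it takes the numerator and the denominator of equation (1), refuses a zero denominator, and returns their ratio. The example call uses the July-Monday counts from earlier in the article.

```python
def conditional_probability(p_joint_with_target, p_joint_conditions):
    """P(E_s | E_1 ∩ ... ∩ E_m) from the two joint probabilities in equation (1).

    p_joint_with_target: P(E_s ∩ E_1 ∩ ... ∩ E_m), the numerator
    p_joint_conditions:  P(E_1 ∩ ... ∩ E_m), the denominator
    """
    if p_joint_conditions == 0:
        raise ValueError("the conditioning events cannot occur jointly; "
                         "the conditional probability is undefined")
    if p_joint_with_target > p_joint_conditions:
        raise ValueError("the numerator cannot exceed the denominator")
    return p_joint_with_target / p_joint_conditions


# Article's counts: 125 of the 2503 July-Monday samples exceed 100.
n_total = 27425
p_abc = 125 / n_total    # joint probability P(A ∩ B ∩ C)
p_bc = 2503 / n_total    # joint probability P(B ∩ C)

p_a_given_bc = conditional_probability(p_abc, p_bc)
print(f"joint = {p_abc:.5f}, conditional = {p_a_given_bc:.5f}")
```

Because P(B ∩ C) is well below 1 here, the conditional probability comes out larger than the joint probability.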
Now here’s something interesting: In equation (1), if you rename the event E_s as ‘y’, and rename events E_1 through E_m as x_1 through x_m respectively, equation (1) suddenly acquires a whole new interpretation.
And that’s the topic of the next section.
There is a triad of concepts upon which every single regression model rests:
- Conditional probability,
- Conditional expectation, and
- Conditional variance
Even within this illustrious trinity, conditional probability commands a preeminent position for two reasons:
- The very choice of the regression model used to estimate the response variable y is guided by the probability distribution of y.
- Given a probability distribution for y, the probability of observing a particular value of y is conditioned on specific values of the regression variables, also known as the explanatory variables. In other words, the probability of y takes the familiar conditional form: P(y | x_1, x_2,…,x_m).
I’ll illustrate this using two very commonly used, albeit very dissimilar, regression models: the Poisson model, and the linear model.
The role of conditional probability in the Poisson regression model
Consider the task of estimating the daily counts of bicyclists on New York City’s Brooklyn Bridge. This data actually exists: for 7 months during 2017, the NYC Department of Transportation counted the number of bicyclists riding on all East River bridges. The data for the Brooklyn bridge looked like this:
Data such as these, which contain strictly whole-numbered values, can often be effectively modeled using a Poisson process and the Poisson probability distribution. Thus, to estimate the daily count of bicyclists, you would:
- Define a random variable y to represent this daily count, and
- Assume that y is Poisson-distributed.
Consequently, the probability of observing a particular value of y, say y_i, will be given by the following Probability Mass Function of the Poisson probability distribution:
P(y = y_i) = e^(−λ) · λ^(y_i) / y_i!
In the above PMF, λ is both the mean and the variance of the Poisson probability distribution.
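A minimal Python sketch of this PMF follows. The function name poisson_pmf is my own, and computing in log space is a design choice I am assuming here: for large counts, evaluating λ^(y_i) and y_i! directly in floating point can overflow, whereas their logarithms stay manageable.

```python
import math


def poisson_pmf(k, lam):
    """P(y = k) for a Poisson distribution with rate lam > 0.

    Evaluates exp(-lam) * lam**k / k! in log space to avoid overflow
    when k and lam are large.
    """
    if k < 0 or lam <= 0:
        raise ValueError("need k >= 0 and lam > 0")
    log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return math.exp(log_p)


print(poisson_pmf(0, 2.0))        # equals exp(-2), since lam**0 / 0! = 1
print(poisson_pmf(2500, 2500.0))  # a large count, still computed without overflow

# The probabilities sum to 1, and the mean equals lam.
total = sum(poisson_pmf(k, 3.0) for k in range(100))
mean = sum(k * poisson_pmf(k, 3.0) for k in range(100))
print(round(total, 6), round(mean, 6))  # 1.0 3.0
```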
Now suppose you theorize that the daily count of bicyclists can be estimated by observing the values of three random variables:
- Minimum temperature on a given day (MinT)
- Maximum Temperature on a given day (MaxT)
- Precipitation on a given day (Precip)
To estimate y as a function of the above three regression variables, you’d want to express the rate parameter λ of the Poisson probability distribution in terms of the three regression variables as follows:
λ = exp(β_0 + β_1 · MinT + β_2 · MaxT + β_3 · Precip)
The exponentiation keeps the rate positive. Expressed this way, λ is now a random vector (hence bolded), as it is a function of three random variables.
The above expression for λ and the equation for the Poisson PMF of y together constitute the Poisson regression model for estimating y.
Now suppose on a randomly chosen day i, the three random variables take the following values:
MinT = MinT_i,
MaxT = MaxT_i, and
Precip = Precip_i.
Thus, on day i, the Poisson rate λ = λ_i and you can calculate it as follows:
λ_i = exp(β_0 + β_1 · MinT_i + β_2 · MaxT_i + β_3 · Precip_i)
Recall that P(y) is a function of y and λ. Thus, P(y=y_i) is a function of y_i and λ_i. But λ_i is itself a function of MinT_i, MaxT_i, and Precip_i, which implies that P(y=y_i) is also a function of MinT_i, MaxT_i, and Precip_i. The following panel illustrates this relationship.
Thus, P(y) is really the conditional probability of y given the explanatory variables of the regression model. This conditioning behavior is seen in most regression models.
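The chain of dependence described above can be sketched in a few lines of Python. Everything numeric here is hypothetical: the coefficient values, the weather readings, and the count of 4300 are invented for illustration, whereas in practice the coefficients would be estimated from the bicyclist data. The function names are also my own.

```python
import math

# Hypothetical coefficients beta_0..beta_3, chosen only for illustration.
beta = {"intercept": 7.0, "MinT": 0.01, "MaxT": 0.01, "Precip": -0.05}


def poisson_rate(min_t, max_t, precip):
    """lambda_i = exp(beta_0 + beta_1*MinT_i + beta_2*MaxT_i + beta_3*Precip_i)."""
    linear = (beta["intercept"] + beta["MinT"] * min_t
              + beta["MaxT"] * max_t + beta["Precip"] * precip)
    return math.exp(linear)  # exponentiation keeps the rate positive


def prob_count_given_weather(y, min_t, max_t, precip):
    """P(y = y_i | MinT_i, MaxT_i, Precip_i) under the Poisson model."""
    lam = poisson_rate(min_t, max_t, precip)
    # Poisson PMF evaluated in log space to avoid overflow for large counts.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


# The same count on two made-up days that differ only in precipitation:
dry_day = prob_count_given_weather(4300, min_t=60, max_t=80, precip=0.0)
wet_day = prob_count_given_weather(4300, min_t=60, max_t=80, precip=1.0)
print(dry_day, wet_day)  # two different probabilities for the same count
```

The probability of seeing the same count changes as soon as the weather changes, which is exactly what it means for P(y) to be conditioned on the explanatory variables.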
The role of conditional probability in the linear regression model
Another common example of a regression model is the linear model, where the response variable y is normally distributed. The normal probability distribution is characterized by two parameters: a mean μ and a variance 𝜎². In a linear model with homoskedastic y, we assume that the variance of y is constant across all possible combinations of the explanatory variables of the model. We also assume that the mean μ is a linear combination of the explanatory variables of the linear model. Effectively, the normal probability distribution of y, being a function of μ and 𝜎², turns into a conditional probability distribution of y, conditioned on the explanatory variables of the regression model.
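Here is a comparable sketch for the linear model, assuming a single explanatory variable x for brevity. The coefficient values and σ are hypothetical and the function names are my own. The density of y is a normal density whose mean moves with x while its spread stays fixed.

```python
import math

# Hypothetical linear model with one explanatory variable:
#   mean of y given x:  mu(x) = beta_0 + beta_1 * x
#   variance of y:      sigma**2, the same for every x (homoskedasticity)
beta_0, beta_1, sigma = 2.0, 0.5, 1.5   # illustrative values, not estimates


def normal_pdf(y, mu, sigma):
    """Density of the normal distribution N(mu, sigma**2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def density_y_given_x(y, x):
    """f(y | x): normal density whose mean depends linearly on x."""
    return normal_pdf(y, beta_0 + beta_1 * x, sigma)


# The same y is plausible under one x and implausible under another.
at_mean = density_y_given_x(7.0, x=10.0)   # y sits exactly at the conditional mean
off_mean = density_y_given_x(7.0, x=0.0)   # y lies far above the conditional mean
print(at_mean, off_mean)
```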
Let’s state this result in general terms.
In a regression model, the probability distribution of the response variable is conditioned on the explanatory variables of the model.
Some authors consider the probability distribution of y to be conditioned on not only the explanatory variables but also the regression coefficients of the model. This is technically correct. In the Poisson regression model we designed for estimating the bicyclist counts, if you look at the PMF of y, you’ll see that it’s a function of β_0, β_1, β_2, and β_3 in addition to the three explanatory variables MinT, MaxT, and Precip.
Shaped as an equation, we may express this behavior in generic terms as follows:
In the regression model y = f(X, β), the conditional probability distribution of y can be expressed as P(y | X, β) = g(X, β) where f and g are some functions of X and β.
In the Poisson regression model, the function g(X, β) is the Poisson probability distribution in which the rate parameter of the distribution is an exponentiated linear combination of all explanatory variables. In the linear regression model, g(X, β) is the Normal distribution’s Probability Density Function where the mean of the distribution is a linear combination of all explanatory variables.
Let’s summarize what we’ve learnt:
- Conditional probability is the probability of an event A occurring, given that events B, C, D, etc. have already occurred. It’s denoted as P(A | B, C, D).
- When event A is conditioned upon multiple other events, its probability is influenced by the joint probability of those events.
- Bayes’ theorem allows us to compute inverse probabilities. Given P(A | B), P(A) and P(B), Bayes’ theorem lets us calculate P(B | A).
- Bayes’ theorem also links three fundamental probabilities into a single expression: 1) The unconditional (prior) probability P(B), 2) The conditional probability P(A ∣ B), and 3) The joint probability P(A∩B) of events A and B.
- In a regression model, the probability distribution of the response variable is conditioned on the explanatory variables of the model and the model’s coefficients.