Probability and Statistics (1): Probability Spaces — Why We Need Axioms (But Won't Overdo It)

Building probability from the ground up: sample spaces, Kolmogorov's axioms, conditional probability, Bayes' theorem, and the birthday problem — with proofs and Python simulations.

Every time you check the weather forecast, run an A/B test, or train a neural network, you are standing on a foundation laid in 1933 by a Russian mathematician named Andrey Kolmogorov. Before him, probability was a grab bag of tricks for gamblers and actuaries. After him, it became a branch of mathematics as rigorous as calculus or algebra.

The good news: you don’t need to become a measure theorist to understand modern probability. The axioms are simple. What takes work is building the right intuitions around them — and learning to recognize when those intuitions fail.

This article lays the groundwork. We’ll define precisely what a “probability” is, derive the tools that make it useful (conditional probability, Bayes’ theorem, independence), and end with a classic problem that surprises nearly everyone who encounters it for the first time.


Sample Spaces, Events, and Sigma-Algebras#

[Figure: Sample space visualization]

The Sample Space#

A sample space $\Omega$ is the set of all possible outcomes of a random experiment.

| Experiment | Sample space $\Omega$ |
|---|---|
| Flip a coin | $\{H, T\}$ |
| Roll a die | $\{1, 2, 3, 4, 5, 6\}$ |
| Measure a voltage | $\mathbb{R}$ (or $[0, \infty)$) |
| Count website hits in an hour | $\{0, 1, 2, \ldots\}$ |

The sample space must be exhaustive (every possible outcome is included) and its outcomes must be mutually exclusive (each run of the experiment produces exactly one point of $\Omega$).

Events#

An event is a subset of $\Omega$ . If we roll a die, the event “roll an even number” is $A = \{2, 4, 6\}$ . The event “roll something” is $\Omega$ itself (the certain event), and the event “roll a 7 on a standard die” is $\emptyset$ (the impossible event).

Sigma-Algebras (Gently)#

For finite sample spaces, we can assign probabilities to every subset of $\Omega$ . But when $\Omega$ is uncountable (like $\mathbb{R}$ ), pathological subsets exist that cannot be assigned a probability consistently. A sigma-algebra $\mathcal{F}$ is a carefully chosen collection of subsets of $\Omega$ — the ones we are “allowed” to measure.

A sigma-algebra $\mathcal{F}$ must satisfy three properties:

  1. $\Omega \in \mathcal{F}$
  2. If $A \in \mathcal{F}$ , then $A^c \in \mathcal{F}$ (closed under complements)
  3. If $A_1, A_2, \ldots \in \mathcal{F}$ , then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable unions)

For this series, we will work with the power set (all subsets) when $\Omega$ is finite, and with the Borel sigma-algebra (generated by open intervals) when $\Omega = \mathbb{R}$ . You can safely treat “event” and “subset” as synonyms for everything that follows.
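For a finite sample space the three sigma-algebra properties can be checked mechanically. The following is a minimal sketch, not part of the original article: the helper names `power_set` and `is_sigma_algebra` and the two-coin-flip sample space are illustrative assumptions. On a finite $\Omega$, closure under pairwise unions is enough to give closure under countable unions, which is what the check relies on.

```python
from itertools import chain, combinations

def power_set(omega):
    """Return every subset of omega as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(family, omega):
    """Check the three sigma-algebra properties on a FINITE sample space."""
    omega = frozenset(omega)
    family = set(family)
    # Property 1: the whole sample space is an event.
    if omega not in family:
        return False
    # Property 2: closed under complements.
    if any((omega - a) not in family for a in family):
        return False
    # Property 3: closed under unions (pairwise suffices when omega is finite).
    return all((a | b) in family for a in family for b in family)

# Hypothetical example: two coin flips.
coin_twice = {"HH", "HT", "TH", "TT"}
F_full = power_set(coin_twice)
F_trivial = {frozenset(), frozenset(coin_twice)}
# Contains {"HH"} but not its complement, so it should fail property 2.
F_broken = {frozenset(), frozenset({"HH"}), frozenset(coin_twice)}

print(len(F_full))                               # 16
print(is_sigma_algebra(F_full, coin_twice))      # True
print(is_sigma_algebra(F_trivial, coin_twice))   # True
print(is_sigma_algebra(F_broken, coin_twice))    # False
```

The power set and the trivial family $\{\emptyset, \Omega\}$ are the largest and smallest sigma-algebras on any $\Omega$; everything else sits in between.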

Kolmogorov’s Three Axioms#

A probability measure $P$ on $(\Omega, \mathcal{F})$ is a function $P: \mathcal{F} \to \mathbb{R}$ satisfying:

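Axiom 1 (Non-negativity). For every event $A \in \mathcal{F}$,

$$P(A) \geq 0.$$

Axiom 2 (Normalization).

$$P(\Omega) = 1.$$

Axiom 3 (Countable additivity). If $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint ($A_i \cap A_j = \emptyset$ for $i \neq j$), then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$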

That’s it. Three axioms. Everything else — Bayes’ theorem, the law of large numbers, the central limit theorem — is a logical consequence.

Immediate Consequences#

From these axioms alone, we can derive:

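Complement Rule. The events $A$ and $A^c$ are disjoint and $A \cup A^c = \Omega$, so by additivity and normalization

$$P(A) + P(A^c) = P(\Omega) = 1 \implies P(A^c) = 1 - P(A).$$

Impossible Event. Taking $A = \Omega$ in the complement rule gives $P(\emptyset) = 1 - P(\Omega) = 0$.

Inclusion-Exclusion (two events). For any events $A$ and $B$,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$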

Proof. Write $A \cup B = A \cup (B \setminus A)$ , where $A$ and $B \setminus A$ are disjoint. Then $P(A \cup B) = P(A) + P(B \setminus A)$ . Now $B = (A \cap B) \cup (B \setminus A)$ , disjoint, so $P(B) = P(A \cap B) + P(B \setminus A)$ , giving $P(B \setminus A) = P(B) - P(A \cap B)$ . Substituting yields the result. $\blacksquare$

Monotonicity. If $A \subseteq B$ , then $P(A) \leq P(B)$ .

Proof. $B = A \cup (B \setminus A)$ , disjoint. So $P(B) = P(A) + P(B \setminus A) \geq P(A)$ by Axiom 1. $\blacksquare$
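These consequences are easy to sanity-check on a small finite space. The sketch below assumes a fair six-sided die with the uniform measure $P(A) = |A|/6$; the events `A` and `B` and the helper `prob` are illustrative choices, and exact fractions are used so the identities hold without rounding error.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # fair six-sided die

def prob(event):
    """Uniform probability measure: |event| / |OMEGA|."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})   # "roll an even number"
B = frozenset({4, 5, 6})   # "roll at least 4"

# Complement rule
assert prob(OMEGA - A) == 1 - prob(A)
# Impossible event
assert prob(frozenset()) == 0
# Inclusion-exclusion for two events
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Monotonicity: A ∩ B is a subset of A
assert prob(A & B) <= prob(A)

print(prob(A | B))  # 2/3
```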

Conditional Probability#


Knowing that some event $B$ has occurred changes our beliefs about other events. This is captured by conditional probability.

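Definition (conditional probability). For events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$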

Intuitively, we “zoom in” to the world where $B$ has happened: $B$ becomes our new sample space, and we re-normalize.

The Multiplication Rule#

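Rearranging the definition of conditional probability gives the multiplication rule:

$$P(A \cap B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A).$$

Applying it repeatedly gives the chain rule for $n$ events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$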

Example: Drawing Cards#

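Draw two cards without replacement from a standard 52-card deck. By the multiplication rule, the probability that both are aces is

$$P(\text{Ace}_1 \cap \text{Ace}_2) = P(\text{Ace}_1) \cdot P(\text{Ace}_2 \mid \text{Ace}_1) = \frac{4}{52} \cdot \frac{3}{51} = \frac{12}{2652} = \frac{1}{221} \approx 0.00452.$$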
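The same number can be recovered by brute force, which is a useful habit for checking multiplication-rule calculations. This sketch enumerates every ordered two-card draw from a standard deck; the deck encoding (rank and suit tuples) is an arbitrary illustrative choice.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Ordered draws without replacement: 52 * 51 equally likely outcomes.
draws = list(permutations(deck, 2))
both_aces = sum(
    1 for first, second in draws if first[0] == "A" and second[0] == "A"
)
p_both_aces = Fraction(both_aces, len(draws))

print(both_aces, len(draws), p_both_aces)  # 12 2652 1/221
```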

The Law of Total Probability#


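Theorem. Let $B_1, \ldots, B_n$ be a partition of $\Omega$ (pairwise disjoint events whose union is $\Omega$) with $P(B_i) > 0$ for each $i$. Then for any event $A$,

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i).$$

Proof. The sets $A \cap B_i$ are pairwise disjoint and their union is $A$, so by additivity and the multiplication rule

$$P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i) P(B_i). \quad \blacksquare$$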

This is enormously useful. When you can’t compute $P(A)$ directly, break the world into cases and handle each one separately.

Bayes’ Theorem#

Theorem (Bayes). If $P(A) > 0$ and $P(B) > 0$ , then

$$P(B \mid A) = \frac{P(A \mid B) \, P(B)}{P(A)}.$$

Proof. From the multiplication rule, $P(A \mid B) P(B) = P(A \cap B) = P(B \mid A) P(A)$ . Divide both sides by $P(A)$ . $\blacksquare$

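Extended form. If $B_1, \ldots, B_n$ partition $\Omega$, expanding the denominator with the law of total probability gives

$$P(B_k \mid A) = \frac{P(A \mid B_k) \, P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i) \, P(B_i)}.$$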

Medical Testing Example#

A disease affects 1 in 1000 people. A test detects the disease with 99% sensitivity (true positive rate) and 95% specificity (true negative rate). You test positive. What is the probability you actually have the disease?

Define:

  • $D$ : has the disease, $P(D) = 0.001$
  • $D^c$ : healthy, $P(D^c) = 0.999$
  • $T^+$ : tests positive
  • $P(T^+ \mid D) = 0.99$ (sensitivity)
  • $P(T^+ \mid D^c) = 0.05$ (1 - specificity = false positive rate)

By the law of total probability, the overall chance of a positive test is

$$P(T^+) = P(T^+ \mid D)P(D) + P(T^+ \mid D^c)P(D^c) = 0.99 \times 0.001 + 0.05 \times 0.999 = 0.00099 + 0.04995 = 0.05094.$$

Bayes’ theorem then gives

$$P(D \mid T^+) = \frac{P(T^+ \mid D) P(D)}{P(T^+)} = \frac{0.99 \times 0.001}{0.05094} \approx 0.0194.$$

Less than 2%. Even with a “99% accurate” test, a positive result in a rare disease is overwhelmingly likely to be a false positive. This is the base rate fallacy — ignoring the prior probability $P(D)$ leads to wildly wrong conclusions.

The key insight: when the disease is rare, false positives swamp true positives. Among 1,000 people tested, about 1 is sick and 999 are healthy, so we expect roughly $0.05 \times 999 \approx 50$ false positives against only $0.99 \times 1 \approx 1$ true positive.
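The calculation is short enough to wrap in a function, which makes it easy to see how strongly the answer depends on the base rate. This is a minimal sketch; the function name `posterior` and its parameter names are illustrative, and the extra priors (1% and 10%) are hypothetical values added for comparison.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    false_positive_rate = 1.0 - specificity
    # Law of total probability: P(T+) over the sick and healthy cases.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.001, sensitivity=0.99, specificity=0.95)
print(f"{p:.4f}")  # 0.0194

# Same test, different (hypothetical) base rates.
for prior in (0.001, 0.01, 0.1):
    print(prior, round(posterior(prior, 0.99, 0.95), 4))
```

With the test held fixed, moving the prior from 0.1% to 10% takes the posterior from about 2% to well over half, which is the base rate effect in one loop.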

Sequential Testing: What If You Test Positive Twice?#

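Suppose you take the test a second time and it is positive again. Assume the two results are conditionally independent given your true disease status (that is, repeat tests share no systematic error). Bayes’ theorem gives

$$P(D \mid T_1^+, T_2^+) = \frac{P(T_1^+, T_2^+ \mid D) P(D)}{P(T_1^+, T_2^+)}.$$

Under conditional independence the likelihoods factor:

$$P(T_1^+, T_2^+ \mid D) = 0.99^2 = 0.9801$$

$$P(T_1^+, T_2^+ \mid D^c) = 0.05^2 = 0.0025$$

so that

$$P(T_1^+, T_2^+) = 0.9801 \times 0.001 + 0.0025 \times 0.999 = 0.000980 + 0.002498 = 0.003478$$

$$P(D \mid T_1^+, T_2^+) = \frac{0.000980}{0.003478} \approx 0.282.$$

Two positive tests raise the probability from 1.9% to 28.2%. A third positive test would push it to about 89%. This illustrates the power of sequential updating: each positive result multiplies the odds of disease by the likelihood ratio $0.99 / 0.05 = 19.8$, so evidence that is weak in isolation becomes compelling when combined.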

Equivalently, you can use the posterior from the first test ($P(D|T_1^+) \approx 0.0194$ ) as the prior for the second test — this is sequential Bayesian updating, which we’ll formalize in Article 8.
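Sequential updating can be sketched as a loop in which each posterior becomes the next prior. The code assumes, as in the calculation above, that repeated tests are conditionally independent given disease status; the function name `update` and its defaults are illustrative.

```python
def update(prior, sensitivity=0.99, false_positive_rate=0.05):
    """One Bayesian update after a positive test result."""
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1.0 - prior))

belief = 0.001          # prior: disease prevalence
history = []
for _ in range(3):      # three positive tests in a row
    belief = update(belief)
    history.append(belief)

print([round(b, 4) for b in history])  # [0.0194, 0.2818, 0.886]

# Odds form: each positive test multiplies the odds by the likelihood ratio.
likelihood_ratio = 0.99 / 0.05
odds_after_three = (0.001 / 0.999) * likelihood_ratio ** 3
prob_after_three = odds_after_three / (1.0 + odds_after_three)
```

The loop and the odds form agree, which is the sense in which updating one test at a time is equivalent to conditioning on all results at once.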

Independence#

Definition. Events $A$ and $B$ are independent if

$$P(A \cap B) = P(A) \, P(B).$$

Equivalently, when $P(B) > 0$, $P(A \mid B) = P(A)$: knowing that $B$ occurred tells you nothing about $A$.

Pairwise vs Mutual Independence#

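Events $A$, $B$, $C$ are pairwise independent if every pair is independent:

$$P(A \cap B) = P(A)P(B), \quad P(A \cap C) = P(A)P(C), \quad P(B \cap C) = P(B)P(C).$$

They are mutually independent if, in addition, the triple intersection factors:

$$P(A \cap B \cap C) = P(A) P(B) P(C).$$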

Pairwise independence does not imply mutual independence.

Counterexample. Roll two fair dice. Let $A$ = “first die is odd,” $B$ = “second die is odd,” $C$ = “sum is odd.” Then $P(A) = P(B) = P(C) = 1/2$. The three events are pairwise independent (one pair is checked below). But $A \cap B \cap C = \emptyset$ (two odd numbers never give an odd sum), so $P(A \cap B \cap C) = 0 \neq 1/8 = P(A)P(B)P(C)$.

Checking pairwise independence for this example. Consider $A$ and $C$ : the first die is odd in 18 of 36 outcomes. The sum is odd in 18 of 36 outcomes (whenever one die is odd and the other even). The event $A \cap C$ = “first die odd AND sum odd” = “first die odd AND second die even” has $3 \times 3 = 9$ outcomes. So $P(A \cap C) = 9/36 = 1/4 = P(A)P(C)$ . Similarly for the other pairs. $\checkmark$
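All three pairs, and the failing triple, can be verified by enumerating the 36 equally likely outcomes. A minimal sketch, with the predicate names `first_odd`, `second_odd`, and `sum_odd` chosen for illustration:

```python
from fractions import Fraction
from itertools import combinations, product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(*predicates):
    """Probability that every given predicate holds on a roll."""
    hits = sum(1 for o in outcomes if all(p(o) for p in predicates))
    return Fraction(hits, len(outcomes))

def first_odd(o):
    return o[0] % 2 == 1

def second_odd(o):
    return o[1] % 2 == 1

def sum_odd(o):
    return (o[0] + o[1]) % 2 == 1

events = [first_odd, second_odd, sum_odd]

# Every pair factors: P(X and Y) = P(X) P(Y) = 1/4.
pairwise_ok = all(
    prob(x, y) == prob(x) * prob(y) for x, y in combinations(events, 2)
)
# The triple does not: the intersection is empty.
triple = prob(*events)
product_of_marginals = prob(first_odd) * prob(second_odd) * prob(sum_odd)

print(pairwise_ok, triple, product_of_marginals)  # True 0 1/8
```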

For $n$ events, mutual independence requires $2^n - n - 1$ conditions (all possible intersections of 2, 3, …, $n$ events). Pairwise independence only checks $\binom{n}{2}$ of these. For large $n$ , the gap between pairwise and mutual independence grows exponentially.

Inclusion-Exclusion: The General Case#

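Theorem (inclusion-exclusion). For any events $A_1, \ldots, A_n$,

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n).$$

Proof sketch. Induct on $n$. The case $n = 2$ was proved above. For the inductive step, apply the two-event formula to $\bigcup_{i=1}^{n-1} A_i$ and $A_n$:

$$P\left(\bigcup_{i=1}^n A_i\right) = P\left(\bigcup_{i=1}^{n-1} A_i\right) + P(A_n) - P\left(\left(\bigcup_{i=1}^{n-1} A_i\right) \cap A_n\right).$$

The last term equals $P\left(\bigcup_{i=1}^{n-1} (A_i \cap A_n)\right)$, a union of $n - 1$ events, so the inductive hypothesis expands both it and the first term with alternating signs. Collecting terms yields the general formula. $\blacksquare$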

The general formula has $2^n - 1$ terms, one for each non-empty subcollection of the events, so it becomes expensive quickly. It remains the standard tool for computing “at least one” probabilities exactly. When an upper bound is enough, dropping everything but the first sum gives the union bound (Boole’s inequality):

$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i).$$

This simple bound is loose but extremely useful — it underlies the Bonferroni correction in multiple testing (Article 7) and many results in learning theory.
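To see the exact formula and the union bound side by side, the sketch below uses a hypothetical experiment that is not in the original article: draw an integer uniformly from 1 to 30 and let the three events be “divisible by 2,” “divisible by 3,” and “divisible by 5.” The helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 31))  # hypothetical: uniform draw from 1..30

def prob(event):
    return Fraction(len(event), len(OMEGA))

events = [frozenset(x for x in OMEGA if x % d == 0) for d in (2, 3, 5)]

def inclusion_exclusion(events):
    """Sum over all non-empty subcollections with alternating signs."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for group in combinations(events, r):
            total += sign * prob(frozenset.intersection(*group))
    return total

direct = prob(frozenset.union(*events))      # count the union directly
via_formula = inclusion_exclusion(events)
union_bound = sum(prob(e) for e in events)

print(direct, via_formula, union_bound)  # 11/15 11/15 31/30
```

Here the union bound exceeds 1, a reminder that it is only informative when the individual probabilities are small.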

Conditional Probability as a New Probability Measure#

There’s a subtle but important fact: for a fixed event $B$ with $P(B) > 0$ , the function $Q(A) = P(A|B)$ is itself a probability measure on $(\Omega, \mathcal{F})$ . That is:

  1. $Q(A) \geq 0$ for all $A$ (non-negativity)
  2. $Q(\Omega) = P(\Omega | B) = P(\Omega \cap B)/P(B) = P(B)/P(B) = 1$ (normalization)
  3. If $A_1, A_2, \ldots$ are disjoint, then $Q(\bigcup A_i) = \sum Q(A_i)$ (countable additivity)

Proof of (3): $P(\bigcup A_i | B) = P((\bigcup A_i) \cap B)/P(B) = P(\bigcup(A_i \cap B))/P(B) = \sum P(A_i \cap B)/P(B) = \sum P(A_i|B)$ . $\blacksquare$

This means every theorem we prove about probability measures automatically applies to conditional probability too. Conditional expectations, conditional variances, conditional independence — they all follow the same axioms.

Counting: Permutations and Combinations#

When all outcomes in a finite $\Omega$ are equally likely, $P(A) = |A|/|\Omega|$ , and probability reduces to counting.

Permutations#

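The number of ordered arrangements of $k$ objects chosen from $n$ distinct objects is

$$P(n, k) = \frac{n!}{(n-k)!}.$$

(Here $P(n, k)$ denotes a count, not a probability.)

Example. In how many ways can the first three places be filled in a race with 8 competitors? $P(8, 3) = 8!/5! = 8 \times 7 \times 6 = 336$.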

Combinations#

The number of unordered selections of $k$ objects from $n$ distinct objects is

$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{P(n,k)}{k!}.$$

The division by $k!$ removes the ordering — each unordered group of $k$ objects corresponds to $k!$ ordered arrangements.

Key properties:

  • $\binom{n}{k} = \binom{n}{n-k}$ (symmetry)
  • $\binom{n}{0} = \binom{n}{n} = 1$
  • $\sum_{k=0}^{n} \binom{n}{k} = 2^n$ (total number of subsets)
  • Pascal’s rule: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$

Proof of Pascal’s rule. Consider $n$ objects with one “special” object. To choose $k$ objects: either include the special one (and choose $k-1$ from the remaining $n-1$ , giving $\binom{n-1}{k-1}$ ) or exclude it (and choose $k$ from the remaining $n-1$ , giving $\binom{n-1}{k}$ ). $\blacksquare$
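Python’s standard library exposes both counts directly as `math.perm` and `math.comb` (available from Python 3.8), so the identities above can be spot-checked in a few lines. The range of $n$ tested is an arbitrary choice for illustration.

```python
import math

# Ordered vs unordered selections of 3 from 8.
print(math.perm(8, 3))  # 336
print(math.comb(8, 3))  # 56

# Dividing by k! removes the ordering.
assert math.comb(8, 3) == math.perm(8, 3) // math.factorial(3)

for n in range(1, 11):
    for k in range(n + 1):
        # Symmetry: choosing k to keep is choosing n - k to leave out.
        assert math.comb(n, k) == math.comb(n, n - k)
    for k in range(1, n + 1):
        # Pascal's rule: include or exclude one "special" object.
        assert math.comb(n, k) == math.comb(n - 1, k - 1) + math.comb(n - 1, k)
```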

The Binomial Theorem#

For any numbers $a$ and $b$ and any integer $n \geq 0$,

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$$

Setting $a = b = 1$ gives $\sum_{k=0}^{n} \binom{n}{k} = 2^n$, the total number of subsets. Setting $a = -1, b = 1$ gives $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$ for $n \geq 1$, so a non-empty set has as many even-sized subsets as odd-sized ones.
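Both special cases can be checked numerically, and the even/odd subset claim can be confirmed by actually listing subsets. A minimal sketch; `binomial_expansion` is an illustrative helper name and $n = 6$ is an arbitrary test size.

```python
import math
from itertools import combinations

def binomial_expansion(a, b, n):
    """Right-hand side of the binomial theorem."""
    return sum(math.comb(n, k) * a**k * b ** (n - k) for k in range(n + 1))

assert binomial_expansion(2, 3, 5) == (2 + 3) ** 5

n = 6
assert binomial_expansion(1, 1, n) == 2**n    # row sum
assert binomial_expansion(-1, 1, n) == 0      # alternating sum

# Count even-sized and odd-sized subsets of an n-element set directly.
items = range(n)
even = sum(1 for r in range(0, n + 1, 2) for _ in combinations(items, r))
odd = sum(1 for r in range(1, n + 1, 2) for _ in combinations(items, r))
print(even, odd)  # 32 32
```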

Multinomial Coefficients#

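The number of ways to split $n$ distinct objects into $r$ labeled groups of sizes $n_1, \ldots, n_r$ (with $n_1 + \cdots + n_r = n$) is the multinomial coefficient

$$\binom{n}{n_1, n_2, \ldots, n_r} = \frac{n!}{n_1! \, n_2! \cdots n_r!}.$$

Example. An 11-letter word whose letters occur 1, 4, 4, and 2 times (MISSISSIPPI is one such word) has

$$\frac{11!}{1! \cdot 4! \cdot 4! \cdot 2!} = \frac{39916800}{1 \cdot 24 \cdot 24 \cdot 2} = 34650$$

distinct arrangements.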
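A multinomial coefficient is a one-liner once the group sizes are known. The sketch below uses the word MISSISSIPPI as an illustrative input because its letter counts (1, 4, 4, 2) match the worked value above; the function name `multinomial` is an assumption.

```python
import math
from collections import Counter

def multinomial(counts):
    """n! / (n_1! n_2! ... n_r!) for group sizes `counts`."""
    counts = list(counts)
    total = math.factorial(sum(counts))
    for c in counts:
        total //= math.factorial(c)  # exact: every partial quotient is an integer
    return total

letter_counts = Counter("MISSISSIPPI")   # M:1, I:4, S:4, P:2
arrangements = multinomial(letter_counts.values())
print(arrangements)  # 34650

# With two groups the multinomial coefficient reduces to a binomial one.
assert multinomial([3, 5]) == math.comb(8, 3)
```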

Stars and Bars#

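The number of ways to distribute $k$ identical items among $n$ distinct bins, with empty bins allowed, is

$$\binom{k + n - 1}{n - 1} = \binom{k + n - 1}{k}.$$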

Example. Distribute 10 identical cookies among 4 children: $\binom{13}{3} = 286$ ways.

Proof. Represent the items as $k$ stars and the bin separators as $n-1$ bars. A valid arrangement is any sequence of $k$ stars and $n-1$ bars, and the number of such sequences is $\binom{k+n-1}{n-1}$ . $\blacksquare$
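For small cases the stars-and-bars count can be confirmed by brute force: list every way of assigning a non-negative count to each bin and keep those that sum to $k$. A minimal sketch using the cookie example; `count_distributions` is an illustrative name, and the enumeration is only practical for small $k$ and $n$.

```python
import math
from itertools import product

def count_distributions(k, n):
    """Brute-force count of (x_1, ..., x_n) with x_i >= 0 and sum k."""
    return sum(
        1 for alloc in product(range(k + 1), repeat=n) if sum(alloc) == k
    )

k, n = 10, 4  # 10 identical cookies, 4 children
brute = count_distributions(k, n)        # scans 11**4 = 14641 candidates
formula = math.comb(k + n - 1, n - 1)

print(brute, formula)  # 286 286
```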

The Birthday Problem#

Problem. In a room of $n$ people, what is the probability that at least two share a birthday? (Assume 365 equally likely birthdays, ignore leap years.)

[Figure: Birthday problem curve]

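Solution. Let $A$ be the event that at least two people share a birthday. The complement $A^c$ (all $n$ birthdays distinct) is easier to compute: the first person can have any birthday, the second must avoid 1 day, the third must avoid 2 days, and so on:

$$P(A^c) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365 - n + 1}{365} = \prod_{k=0}^{n-1} \frac{365 - k}{365}.$$

Therefore

$$P(A) = 1 - \prod_{k=0}^{n-1} \left(1 - \frac{k}{365}\right).$$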

The Approximation#

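For small $x$, $\ln(1 - x) \approx -x$. Applying this to each factor of the product (valid while $n$ is much smaller than 365) gives

$$\ln P(A^c) = \sum_{k=0}^{n-1} \ln\left(1 - \frac{k}{365}\right) \approx -\sum_{k=0}^{n-1} \frac{k}{365} = -\frac{n(n-1)}{2 \cdot 365}.$$

Exponentiating,

$$P(A^c) \approx e^{-n(n-1)/730}.$$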

Setting $P(A) = 0.5$ : $e^{-n(n-1)/730} = 0.5$ , giving $n(n-1) = 730 \ln 2 \approx 506$ , so $n \approx 23$ .

With just 23 people, there’s a 50% chance of a birthday match. Most people guess something much higher (like 183) because they confuse “someone shares MY birthday” with “some pair shares A birthday.” The number of pairs grows quadratically: $\binom{23}{2} = 253$ pairs, each with a small chance of matching, but 253 chances add up fast.

The Generalized Birthday Problem#

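Replace 365 with $d$ equally likely values. The same approximation gives $P(A^c) \approx e^{-n(n-1)/(2d)} \approx e^{-n^2/(2d)}$, so a collision becomes more likely than not at about

$$n \approx \sqrt{2d \ln 2} \approx 1.177 \sqrt{d}.$$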

This has important applications beyond party tricks:

  • Hash collisions: For a hash function with $d = 2^{128}$ possible outputs, collisions become likely after about $2^{64}$ inputs. This is why 128-bit hashes are considered collision-safe for billions of objects but not for workloads approaching $2^{64}$.
  • DNA profiling: With $d$ possible genotypes, how many people must you test before two match by chance?
  • Random sampling: How many random samples from a population of size $d$ before you get a repeat?

The square-root scaling ($n \sim \sqrt{d}$) is the key surprise; it arises because the number of pairs grows quadratically in $n$. People’s intuition is linear ($n \sim d/2$), which is off by a huge factor.
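The $1.177\sqrt{d}$ rule of thumb can be compared with the exact 50% threshold for any $d$. In this sketch the function names and the sample values of $d$ other than 365 are illustrative assumptions; the exact threshold is found by extending the no-match product one person at a time.

```python
import math

def exact_threshold(d, target=0.5):
    """Smallest n with P(at least one collision) >= target, d equally likely values."""
    p_no_match = 1.0
    n = 0
    while 1.0 - p_no_match < target:
        p_no_match *= (d - n) / d  # person n + 1 must avoid the n values already taken
        n += 1
    return n

def approx_threshold(d):
    """Rule of thumb from the exponential approximation."""
    return math.sqrt(2 * d * math.log(2))

for d in (365, 10**6, 2**32):
    print(d, exact_threshold(d), round(approx_threshold(d), 1))
```

For $d = 365$ the exact threshold is 23 and the approximation gives about 22.5; the relative gap shrinks as $d$ grows.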

Python Simulation#

```python
import numpy as np
import matplotlib.pyplot as plt

# Seeded generator so that reruns reproduce the same simulated points.
rng = np.random.default_rng(0)

def birthday_exact(n, days=365):
    """Exact probability of at least one shared birthday among n people."""
    p_no_match = 1.0
    for k in range(n):
        # Person k + 1 must avoid the k birthdays already taken.
        p_no_match *= (days - k) / days
    return 1 - p_no_match

def birthday_simulation(n, days=365, trials=100_000):
    """Monte Carlo estimate of birthday collision probability."""
    collisions = 0
    for _ in range(trials):
        birthdays = rng.integers(0, days, size=n)
        # A set drops duplicates, so fewer than n distinct values means a match.
        if len(set(birthdays.tolist())) < n:
            collisions += 1
    return collisions / trials

# Exact curve and exponential approximation for n = 1..80.
ns = np.arange(1, 81)
exact = [birthday_exact(n) for n in ns]
approx = [1 - np.exp(-n * (n - 1) / 730) for n in ns]

# Simulate only a handful of group sizes; each one costs `trials` draws.
sim_ns = [10, 20, 23, 30, 40, 50, 60, 70]
sim_probs = [birthday_simulation(n) for n in sim_ns]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(ns, exact, 'b-', linewidth=2, label='Exact')
ax.plot(ns, approx, 'r--', linewidth=1.5, label='Approximation')
ax.scatter(sim_ns, sim_probs, color='green', zorder=5, label='Simulation (100k trials)')
# Guide lines marking the 50% threshold at n = 23.
ax.axhline(y=0.5, color='gray', linestyle=':', alpha=0.7)
ax.axvline(x=23, color='gray', linestyle=':', alpha=0.7)
ax.set_xlabel('Number of people', fontsize=13)
ax.set_ylabel('P(at least one match)', fontsize=13)
ax.set_title('The Birthday Problem', fontsize=15)
ax.legend(fontsize=12)
ax.set_xlim(1, 80)
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('birthday_problem.png', dpi=150)
plt.show()
```

Running this simulation confirms the exact calculation: at $n = 23$ , the probability is approximately 0.507, and at $n = 50$ it exceeds 0.97. The green dots from the Monte Carlo simulation cluster tightly around the blue exact curve, illustrating that randomness is predictable in aggregate even when individual outcomes are not.

| $n$ | $P(\text{match})$ |
|---|---|
| 10 | 0.117 |
| 20 | 0.411 |
| 23 | 0.507 |
| 30 | 0.706 |
| 40 | 0.891 |
| 50 | 0.970 |
| 60 | 0.994 |
| 70 | 0.999 |

The Monty Hall Problem#

No article on probability foundations is complete without this classic.

Setup. You’re on a game show. There are three doors. Behind one is a car; behind the other two are goats. You pick a door (say Door 1). The host, who knows what’s behind the doors, opens another door (say Door 3) to reveal a goat. Should you switch to Door 2, or stick with Door 1?

Intuitive (wrong) answer: It doesn’t matter — there are two doors left, so it’s 50/50.

Correct answer: Switch. Switching wins with probability 2/3.

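Proof. Suppose you picked Door 1. Let $C_i$ be the event that the car is behind Door $i$, so $P(C_1) = P(C_2) = P(C_3) = 1/3$, and let $H_3$ be the event that the host opens Door 3. The host’s rules determine the likelihoods:

$$P(H_3 | C_1) = 1/2 \quad \text{(host chooses randomly between 2 and 3)}$$

$$P(H_3 | C_2) = 1 \quad \text{(host must open 3, since 2 has the car)}$$

$$P(H_3 | C_3) = 0 \quad \text{(host never reveals the car)}$$

By Bayes’ theorem,

$$P(C_2 | H_3) = \frac{P(H_3 | C_2) P(C_2)}{P(H_3)} = \frac{1 \cdot 1/3}{P(H_3)},$$

and by the law of total probability,

$$P(H_3) = P(H_3|C_1)P(C_1) + P(H_3|C_2)P(C_2) + P(H_3|C_3)P(C_3) = \frac{1}{2}\cdot\frac{1}{3} + 1\cdot\frac{1}{3} + 0 = \frac{1}{2}.$$

Hence

$$P(C_2 | H_3) = \frac{1/3}{1/2} = \frac{2}{3}. \quad \blacksquare$$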

The host’s action gives you information. Before the host opens a door, Door 1 has a 1/3 chance of being correct. The host revealing a goat behind Door 3 doesn’t change what’s behind Door 1 — it’s still 1/3. But the 2/3 probability that was split between Doors 2 and 3 is now concentrated entirely on Door 2.

Why this problem matters beyond game shows: The Monty Hall problem illustrates a general principle: conditioning on information changes probabilities in ways that depend on the mechanism generating the information. The host’s choice was constrained (never reveal the car), and this constraint is what makes switching advantageous. In machine learning, analogous situations arise in selection bias, survivorship bias, and collider bias in causal inference.

```python
import numpy as np

def monty_hall_sim(n_trials=100_000, seed=42):
    """Simulate the Monty Hall problem."""
    # Local seeded generator: reproducible without touching NumPy's global state.
    rng = np.random.default_rng(seed)
    car_positions = rng.integers(0, 3, n_trials)
    initial_choices = rng.integers(0, 3, n_trials)

    # Staying wins exactly when the first pick was the car.
    stay_wins = np.sum(car_positions == initial_choices)
    # The host always removes a goat, so switching wins whenever staying loses.
    switch_wins = n_trials - stay_wins

    print(f"Stay wins:   {stay_wins/n_trials:.4f} (theory: 0.3333)")
    print(f"Switch wins: {switch_wins/n_trials:.4f} (theory: 0.6667)")

monty_hall_sim()
```
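The simulation never models the host’s choice explicitly. As a complement, the Bayes calculation for $P(C_2 \mid H_3)$ can be reproduced exactly by enumerating the joint distribution of car position and host action. This is a sketch under the stated rules (you pick Door 1; the host opens a goat door other than your pick, choosing uniformly when two are available); the names `host_distribution` and `joint` are illustrative.

```python
from fractions import Fraction

DOORS = (1, 2, 3)
PICK = 1  # the contestant's initial choice

def host_distribution(car):
    """Doors the host may open (not the pick, not the car), chosen uniformly."""
    options = [d for d in DOORS if d != PICK and d != car]
    return {d: Fraction(1, len(options)) for d in options}

# Joint distribution P(C_car and H_host), with the car uniform over the doors.
joint = {}
for car in DOORS:
    for host, p in host_distribution(car).items():
        joint[(car, host)] = Fraction(1, 3) * p

# Condition on the host opening Door 3.
p_h3 = sum(p for (car, host), p in joint.items() if host == 3)
p_c1_given_h3 = joint[(1, 3)] / p_h3
p_c2_given_h3 = joint[(2, 3)] / p_h3

print(p_h3, p_c1_given_h3, p_c2_given_h3)  # 1/2 1/3 2/3
```

Changing `host_distribution` (for example, to a host who opens a door at random and sometimes reveals the car) changes the posterior, which is the point about the information-generating mechanism made above.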

Probability as a Measure of Belief#

We’ve treated probability axiomatically — as a function satisfying three rules. But what does a probability mean? There are three main interpretations:

Classical (Laplace): Probability is the ratio of favorable outcomes to total equally likely outcomes. This works for coins and dice but fails when outcomes aren’t equally likely (what’s the “probability” of rain tomorrow?).

Frequentist: Probability is the long-run relative frequency of an event in repeated trials. $P(A) = \lim_{n \to \infty} \frac{\text{count of } A \text{ in } n \text{ trials}}{n}$ . This is precise but requires that the experiment be (at least conceptually) repeatable.

Bayesian (subjective): Probability quantifies an agent’s degree of belief about an uncertain proposition. It need not correspond to any physical frequency. Two rational agents with different prior information can assign different probabilities to the same event and both be “correct.”

All three interpretations use the same axioms. The difference is philosophical, but it has practical consequences: it determines whether you use frequentist statistics (confidence intervals, p-values) or Bayesian statistics (priors, posteriors, credible intervals). We’ll return to this debate in Article 8.

Summary#

Here is a reference table of the key formulas developed in this article:

| Rule | Formula | When to Use |
|---|---|---|
| Complement | $P(A^c) = 1 - P(A)$ | “At least one” problems |
| Inclusion-exclusion | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Union of events |
| Union bound | $P(\bigcup A_i) \leq \sum P(A_i)$ | Quick upper bounds |
| Conditional | $P(A\mid B) = P(A \cap B)/P(B)$ | Updating with information |
| Multiplication | $P(A \cap B) = P(A\mid B)P(B)$ | Joint event probability |
| Total probability | $P(A) = \sum P(A\mid B_i)P(B_i)$ | Decompose by cases |
| Bayes’ theorem | $P(B\mid A) = \frac{P(A\mid B)P(B)}{P(A)}$ | Reverse conditioning |
| Independence | $P(A \cap B) = P(A)P(B)$ | Simplify products |
| Combinations | $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ | Equally likely outcomes |

Common Pitfalls#

Before we move on, here are mistakes that even experienced practitioners make:

1. The Gambler’s Fallacy. “The roulette wheel landed on red 5 times in a row, so black is due.” If spins are independent, the probability of black on the next spin is exactly the same as it always is. Past outcomes don’t affect future ones.

2. Confusing $P(A|B)$ with $P(B|A)$ . The probability of having the disease given a positive test ($P(D|T^+)$ ) is not the same as the probability of testing positive given the disease ($P(T^+|D)$ ). This is the prosecutor’s fallacy in legal settings, and it has led to wrongful convictions.

3. Assuming independence without justification. If events are not independent, $P(A \cap B) \neq P(A)P(B)$ . Assuming independence when it doesn’t hold can drastically underestimate joint probabilities (the 2008 financial crisis was partly caused by models that assumed mortgage defaults were independent).

4. Base rate neglect. Ignoring prior probabilities when interpreting evidence. As the medical testing example showed, even a highly accurate test produces mostly false positives when the base rate is low.

5. The birthday problem intuition failure. Expecting linear scaling ($n \sim d/2$) when the collision threshold actually scales as the square root ($n \sim \sqrt{d}$), because the number of pairs grows quadratically. This happens in hash collisions, DNA matching, and any problem involving pairwise comparisons.

6. Confusing “unlikely” with “impossible.” A probability of 0.01 means 1 in 100. Run the experiment 100 times, and you expect it to happen once. In a world with billions of people, million-to-one events happen thousands of times daily. Improbable things are not miraculous; they are inevitable at scale.

7. Simpson’s paradox. A trend that appears in several groups can reverse when the groups are combined. For example, Treatment A might have a higher success rate than Treatment B in both men and women, yet Treatment B has a higher overall success rate — if men and women are unevenly distributed between treatments. The resolution is to condition on the confounding variable (gender), which brings us back to conditional probability and Bayes’ theorem. Simpson’s paradox demonstrates why causal reasoning requires more than just probability — you need to know the causal structure to decide whether to condition.
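Simpson’s paradox is easiest to believe after seeing numbers. The counts below are hypothetical, chosen only to exhibit the reversal described in item 7: Treatment A is given mostly to the group with the lower success rate, so it wins within each group yet loses overall.

```python
from fractions import Fraction

# Hypothetical (successes, patients) counts, not real data.
data = {
    "A": {"women": (18, 20), "men": (120, 200)},
    "B": {"women": (160, 200), "men": (10, 20)},
}

def rate(successes, patients):
    return Fraction(successes, patients)

def overall(treatment):
    """Success rate after pooling both groups."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return rate(successes, patients)

for group in ("women", "men"):
    a = rate(*data["A"][group])
    b = rate(*data["B"][group])
    print(group, float(a), float(b))   # A is higher in both groups

print("overall", round(float(overall("A")), 3), round(float(overall("B")), 3))
# overall 0.627 0.773  (B is higher once the groups are pooled)
```

The pooled rate is a weighted average of the group rates, and the two treatments use very different weights; that imbalance is what produces the reversal.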

What’s Next#

We’ve built the formal scaffolding: sample spaces, axioms, conditional probability, and independence. But so far, we’ve only talked about events — yes/no questions about outcomes. The real power of probability comes when we attach numbers to outcomes, turning events into random variables. That’s where distributions live, and where the mathematics starts to connect directly to data science and machine learning.

In the next article, we define random variables, probability mass functions, probability density functions, and survey the distributions that show up everywhere in practice: Bernoulli, Binomial, Poisson, Gaussian, Exponential, and more. We’ll see how each distribution arises naturally from specific modeling assumptions, and build a reference table you’ll use for the rest of the series.

In this series

Probability and Statistics 8 parts

  1. 01 Probability and Statistics (1): Probability Spaces — Why We Need Axioms (But Won't Overdo It) you are here
  2. 02 Probability and Statistics (2): Random Variables and the Distributions That Matter
  3. 03 Probability and Statistics (3): Expectation, Variance, and the Moment-Generating Trick
  4. 04 Probability and Statistics (4): Joint Distributions, Marginalization, and Independence
  5. 05 Probability and Statistics (5): Law of Large Numbers and the Central Limit Theorem
  6. 06 Probability and Statistics (6): Estimation — MLE, MAP, and the Bias-Variance Story
  7. 07 Probability and Statistics (7): Hypothesis Testing — p-Values, Confidence Intervals, and All Their Pitfalls
  8. 08 Probability and Statistics (8): Bayesian Statistics — Priors, Posteriors, and Why Frequentists Argue
