Probability Theory
For now this note, unlike my others, is only a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference itself might be a better idea.
References
- Introduction to Probability, 2nd Ed. by Joseph K. Blitzstein and Jessica Hwang.
Notation
- 0 \in \N and \N^+ = \N \setminus \{0\}.
1. Probability and Conditional Probability
Thm. Identities
\binom{n}{k} = \binom{n}{n - k}
n\binom{n-1}{k-1} = k\binom{n}{k}
Wikipedia: Vandermonde’s identity
\binom{m+n}{k} = \sum_{j=0}^{k} \binom{m}{j} \binom{n}{k - j}
Wikipedia: Generalized Vandermonde’s identity
\binom{m_1 + \cdots + m_p}{k} = \sum_{j_1 + \dots + j_p = k} \binom{m_1}{j_1} \binom{m_2}{j_2} \cdots \binom{m_p}{j_p}
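The identities above are easy to sanity-check numerically with Python's exact `math.comb`; the values of n, k, m below are arbitrary choices for illustration.

```python
from math import comb

n, k, m = 10, 4, 7

# Symmetry: C(n, k) = C(n, n - k)
assert comb(n, k) == comb(n, n - k)

# n * C(n-1, k-1) = k * C(n, k)
assert n * comb(n - 1, k - 1) == k * comb(n, k)

# Vandermonde: C(m+n, k) = sum over j of C(m, j) * C(n, k - j)
assert comb(m + n, k) == sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))
```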
Def. General Definition of Probability
Todo: Better definition
A probability space consists of a sample space S and a probability function P which takes an event A \subseteq S as input and returns P(A) \in [0, 1] as output. The function P must satisfy the following axioms:
- P(\varnothing) = 0 and P(S)=1.
- If A_1, A_2, \dots are disjoint (mutually exclusive) events, then P\left(\bigcup_{j} A_j\right) = \sum_{j} P(A_j)
Thm. Properties of Probability
For any events A and B:
- P(A^c) = 1 - P(A).
- A \subseteq B \implies P(A) \leq P(B).
- P(A \cup B) = P(A) + P(B) - P(A \cap B)
Thm. Inclusion-Exclusion
Wikipedia: Inclusion-Exclusion Principle
\begin{array}{llll} P \left( \bigcup_{i=1}^{n} A_i \right) =& +& \sum_{i} & P(A_i) \\ &-& \sum_{i<j} & P(A_i \cap A_j) \\ &+& \sum_{i<j<k} & P(A_i \cap A_j \cap A_k) \\ & \vdots & & \\ && (-1)^{n+1} & P(A_1 \cap \cdots \cap A_n) \end{array}
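A small sanity check of inclusion-exclusion, using one roll of a fair die with three overlapping events (the events are an arbitrary illustrative choice, not from the reference): the alternating sum over all intersections must equal the directly computed probability of the union.

```python
from fractions import Fraction
from itertools import combinations

S = set(range(1, 7))         # sample space: one roll of a fair die
events = [{2, 4, 6},         # even
          {4, 5, 6},         # greater than 3
          {2, 3, 5}]         # prime

def P(A):
    return Fraction(len(A), len(S))

# Alternate signs over intersections of every size 1..n.
total = Fraction(0)
for r in range(1, len(events) + 1):
    for group in combinations(events, r):
        total += (-1) ** (r + 1) * P(set.intersection(*group))

direct = P(set.union(*events))
assert total == direct       # both equal 5/6 (only the outcome 1 is excluded)
```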
Def. Conditional Probability
P(A | B) \triangleq \frac{P(A \cap B)}{P(B)}
Thm. Conditional Probability
P(A \cap B) = P(B) P(A|B) = P(A) P(B|A)
P(A_1, \dots, A_n) = P(A_1) P(A_2|A_1) P(A_3|A_2, A_1) \cdots P(A_n | A_{n-1}, \dots, A_1)
P(A_1, A_2, A_3) = P(A_1) P(A_2|A_1) P(A_3|A_2, A_1) = P(A_2) P(A_3|A_2) P(A_1 | A_2, A_3)
Thm. Bayes’ Theorem
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
where P(A) is the prior and P(A|B) is the posterior.
Def. Odds
\text{odds}(A) \triangleq \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}
Thm. Odds Bayes
\begin{array}{ccccc} \dfrac{P(A|B)}{P(A^c | B)} &=& \dfrac{P(B|A)}{P(B|A^c)} & \cdot & \dfrac{P(A)}{P(A^c)} \\ \\ \small \text{Posterior} & & \small \text{Likelihood} & & \small \text{Prior} \\ \small \text{Odds} & & \small \text{Ratio} & & \small \text{Odds} \\ \end{array}
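The odds form can be checked against Bayes' theorem directly. The numbers below (prior 1/100, likelihoods 9/10 and 1/10) are hypothetical, chosen only to make the arithmetic exact.

```python
from fractions import Fraction

P_A = Fraction(1, 100)            # prior P(A)
P_B_given_A = Fraction(9, 10)     # likelihood under A
P_B_given_Ac = Fraction(1, 10)    # likelihood under A^c

prior_odds = P_A / (1 - P_A)
likelihood_ratio = P_B_given_A / P_B_given_Ac
posterior_odds = likelihood_ratio * prior_odds

# Convert posterior odds back to a probability ...
posterior = posterior_odds / (1 + posterior_odds)

# ... and compare with Bayes' theorem, using LOTP for P(B).
P_B = P_B_given_A * P_A + P_B_given_Ac * (1 - P_A)
assert posterior == P_B_given_A * P_A / P_B
```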
Thm. Law of Total Probability (LOTP)
Let A_1, ..., A_n partition the sample space S, then
P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B|A_i) P(A_i)
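A minimal LOTP computation, with a hypothetical three-part partition (think of three urns chosen with probabilities 1/2, 1/3, 1/6, and B = "draw a white ball"; all numbers are illustrative).

```python
from fractions import Fraction

P_A = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]          # P(A_i)
P_B_given_A = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]  # P(B | A_i)

assert sum(P_A) == 1   # A_1, A_2, A_3 partition S

# LOTP: P(B) = sum over i of P(B | A_i) P(A_i)
P_B = sum(pb * pa for pb, pa in zip(P_B_given_A, P_A))
```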
Ex. Sensitivity and Specificity
Wikipedia: Sensitivity and Specificity
Let D denote the event that the person has the disease and let T denote the event that the test is positive. P(T|D) is called the sensitivity or true positive rate. P(T^c|D^c) is called the specificity or true negative rate.
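Sensitivity and specificity combine with Bayes and LOTP to give P(D|T). The prevalence, sensitivity, and specificity below are hypothetical numbers chosen to show the classic effect: for a rare disease, even a good test leaves the posterior surprisingly small.

```python
from fractions import Fraction

P_D = Fraction(1, 100)      # prevalence P(D)
sens = Fraction(95, 100)    # sensitivity P(T | D)
spec = Fraction(90, 100)    # specificity P(T^c | D^c)

# LOTP: P(T) = P(T|D) P(D) + P(T|D^c) P(D^c), where P(T|D^c) = 1 - specificity.
P_T = sens * P_D + (1 - spec) * (1 - P_D)

# Bayes: posterior probability of disease given a positive test.
P_D_given_T = sens * P_D / P_T
print(P_D_given_T)   # 19/217, roughly 0.088
```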
Remark. Conditional Probabilities Are Probabilities
- 0 \leq P(A|E) \leq 1.
- P(\varnothing | E) = 0 and P(S|E) = 1.
- If A_1, A_2, ... are disjoint events then P(\bigcup_j A_j |E) = \sum_{j} P(A_j | E).
- P(A^c | E) = 1 - P(A | E).
- (Inclusion-Exclusion) P(A \cup B | E) = P(A|E) + P(B|E) - P(A \cap B | E).
So, conditional probability is itself a probability. Conversely, any unconditional probability can be viewed as a conditional probability, conditioned on the whole sample space S.
Thm. Bayes with Extra Condition
Provided P(A \cap E) > 0 and P(B \cap E) > 0 we have
P(A|B, E) = \frac{P(B|A, E) P(A|E)}{P(B|E)}
Thm. LOTP with Extra Condition
Let A_1, ..., A_n partition S and P(A_i \cap E) > 0 for all i, then
P(B|E) = \sum_{i=1}^{n} P(B|A_i, E)P(A_i | E)
Def. Independence
Two events A and B are independent if P(A \cap B) = P(A)P(B).
Note that independence is completely different from disjointness. Disjoint events A and B can be independent only if P(A)=0 or P(B)=0.
Thm. ~
TFAE if P(A)>0 and P(B)>0:
- A and B are independent.
- P(A|B)=P(A).
- P(B|A)=P(B).
So, knowing A gives us no information about B. This may not be the case with disjointness.
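The three equivalent conditions can be verified on a concrete pair of independent events; the die events below (A = "even", B = "at most 4") are an illustrative choice.

```python
from fractions import Fraction

S = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}          # even
B = {1, 2, 3, 4}       # at most 4

def P(E):
    return Fraction(len(E), len(S))

# The three equivalent conditions (P(A), P(B) > 0 here):
assert P(A & B) == P(A) * P(B)   # independence
assert P(A & B) / P(B) == P(A)   # P(A|B) = P(A)
assert P(A & B) / P(A) == P(B)   # P(B|A) = P(B)
```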
Thm. ~
If A and B are independent events then so are
- A and B^c,
- A^c and B,
- A^c and B^c.
Def. 3-independence
Events A, B and C are independent if
$$ \begin{array}{ll} P(A \cap B) &= P(A) P(B) \\ P(A \cap C) &= P(A) P(C) \\ P(B \cap C) &= P(B) P(C) \\ P(A \cap B \cap C) &= P(A) P(B) P(C) \end{array}$$
Beware that pairwise independence does not imply independence!
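The classic counterexample: flip two fair coins, and let A = "first is heads", B = "second is heads", C = "both flips agree". Each pair is independent, but the triple is not.

```python
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=2))          # two fair coin flips
A = {s for s in S if s[0] == "H"}          # first flip heads
B = {s for s in S if s[1] == "H"}          # second flip heads
C = {s for s in S if s[0] == s[1]}         # both flips equal

def P(E):
    return Fraction(len(E), len(S))

# Pairwise independent:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ... but not independent: knowing A and B determines C.
assert P(A & B & C) != P(A) * P(B) * P(C)   # 1/4 vs 1/8
```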
Def. n-independence
Events A_1, ..., A_n are independent if they are:
- pairwise independent,
- triplewise independent,
- quadruplewise independent,
- …
- P(A \cap ... \cap A_n) = P(A_1) ... P(A_n).
Def. Conditional Independence
Events A and B are called conditionally independent given event E if
P(A \cap B |E) = P(A|E) \> P(B|E)
Remark. ~
Independence does NOT imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they may not be conditionally independent given E^c.
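A hypothetical setup showing conditional independence without unconditional independence: pick one of two coins (fair, or biased with P(heads) = 9/10) with probability 1/2 each, then flip it twice; E = "the fair coin was picked". Given the coin, the flips are independent, but unconditionally the first flip is evidence about which coin was picked.

```python
from fractions import Fraction

p_fair, p_biased = Fraction(1, 2), Fraction(9, 10)   # P(heads) per coin
P_E = Fraction(1, 2)                                 # P(fair coin picked)

# Conditionally independent: P(H1, H2 | E) = P(H1|E) P(H2|E).
P_H1H2_given_E = p_fair * p_fair

# Unconditionally (averaging over the coin via LOTP) they are NOT independent:
P_H1 = P_E * p_fair + (1 - P_E) * p_biased           # = P(H2) by symmetry
P_H1H2 = P_E * p_fair**2 + (1 - P_E) * p_biased**2
assert P_H1H2 != P_H1 * P_H1                         # 53/100 vs 49/100
```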
2. (Discrete) Random Variables
Def. Random Variable
A random variable X (r.v.) is a function from the sample space S to the set of real numbers \R.
So, a random variable is a function from the elements (not events) of S to a real number. Since X is a function, X(s) \in \R is defined for all s \in S.
Def. Discrete Random Variable
TODO: Revise this
A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.
If X is a discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.
Def. Probability Mass Function
The probability mass function (PMF) of a discrete r.v. X is the function p_X given by
p_X (x) = P(X=x)
Note that this is positive if x is in the support of X and 0 otherwise.
Remark. Notation
We use X = x to denote the event \{s \in S \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.
Thm. Valid PMFs
TODO: Rewrite
Let X be a discrete random variable with countable support x_1, x_2, ... (where each x_i is distinct for notational simplicity). The PMF p_X of X must satisfy the following:
- p_X(x) > 0 if x = x_j for some j, and p_X(x) = 0 otherwise.
- \sum_{j} p_X(x_j) = 1
3. Discrete Distributions
Def. Bernoulli Distribution
Wikipedia: Bernoulli Distribution
A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.
We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as “is distributed as”.
The parameter p is often called the success probability of the \text{Bern}(p) distribution.
Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution with p = P(X = 1).
Def. Indicator Random Variable
The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.
Note that I_A \thicksim \text{Bern}(p) with p = P(A).
Def. Binomial Distribution
Todo: Rewrite
Wikipedia: Binomial Distribution
Suppose n independent Bernoulli trials are performed, each with the same success probability p. Let the random variable X be the number of successes.
The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1] denoted by X \thicksim \text{Bin}(n, p).
Thm. Binomial PMF
If X \thicksim \text{Bin}(n, p), then the PMF of X is
P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}
for k \in \N with k \leq n. If k > n, then P(X=k)=0.
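The PMF can be implemented directly from the formula; with exact rationals, summing it over the support must give exactly 1 (the parameter values below are arbitrary).

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """PMF of Bin(n, p); zero outside 0 <= k <= n."""
    if not 0 <= k <= n:
        return Fraction(0)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, Fraction(3, 10)

# A valid PMF is nonnegative and sums to 1 over the support.
assert sum(binom_pmf(k, n, p) for k in range(n + 1)) == 1
assert binom_pmf(n + 1, n, p) == 0
```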
Thm. ~
Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).
Thm. ~
Let X \thicksim \text{Bin}(n, \frac{1}{2}) with n even. Then the distribution of X is symmetric about \frac{n}{2}, i.e.
P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)
for all j \geq 0.
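The symmetry is easy to confirm for a concrete even n (n = 8 below is an arbitrary choice):

```python
from fractions import Fraction
from math import comb

n, p = 8, Fraction(1, 2)

def pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(X = n/2 + j) = P(X = n/2 - j) for all j >= 0.
for j in range(n // 2 + 1):
    assert pmf(n // 2 + j) == pmf(n // 2 - j)
```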
Thm. Hypergeometric Distribution
Todo: Further define
Wikipedia: Hypergeometric Distribution
If X \thicksim \text{HGeom}(w, b, n), then the PMF of X is
P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}
for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.
Thm. ~
The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.
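This identity (the "capture-recapture" symmetry) can be verified numerically for concrete parameters; w, b, n below are arbitrary.

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    """PMF of HGeom(w, b, n); zero outside the valid range."""
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 7, 5, 6

# HGeom(w, b, n) and HGeom(n, w+b-n, w) have identical PMFs.
for k in range(min(w, n) + 1):
    assert hgeom_pmf(k, w, b, n) == hgeom_pmf(k, n, w + b - n, w)
```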
Def. Discrete Uniform Distribution
Wikipedia: Discrete Uniform Distribution
Let C be a finite, nonempty set of numbers. The PMF of X \thicksim \text{DUnif}(C) is
P(X=x) = \dfrac{1}{|C|}
for x \in C and 0 otherwise.
Def. Cumulative Distribution Function
The cumulative distribution function (CDF) of a random variable X (not necessarily discrete) is the function F_X where
F_X(x) =P(X \leq x)
Thm. Valid CDFs
For any CDF F_X, or simply F, we have
- x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
- F(a) = \lim_{x \to a^+} F(x)
- \lim_{x \to - \infty} F(x) = 0
- \lim_{x \to \infty} F(x) = 1
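For a discrete r.v. these properties are easy to check concretely: the CDF of a fair die roll is a step function, increasing from 0 to 1 and flat between the support points (hence right-continuous at the jumps).

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # one roll of a fair die

def F(x):
    """CDF: F(x) = P(X <= x)."""
    return sum(p for v, p in pmf.items() if v <= x)

assert F(0) == 0 and F(6) == 1                   # limits are reached here
assert F(2.5) == F(2)                            # flat between support points
assert all(F(a) <= F(b) for a, b in [(0, 1), (1, 3.5), (3.5, 6), (6, 100)])
```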
Def. Function of a Random Variable
For a random variable X in the sample space S and a function h: \R \to \R the random variable h(X) maps s \in S to h(X(s)).