Probability Theory
For now this note, unlike my others, is only a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference itself might be a better idea.
References
- All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
- Introduction to Probability, 2nd Ed. by Joseph K. Blitzstein and Jessica Hwang
Notation
- 0 \in \N and \N^+ = \N \setminus \{0\}.
1. Probability
Def. Probability
A probability space consists of a sample space \Omega and a probability function (also called a probability distribution or probability measure) P that maps each event A \subseteq \Omega to a number P(A) \in [0, 1]. The function P must satisfy the following axioms:
- P(\varnothing) = 0 and P(\Omega)=1.
- If A_1, A_2 ... are disjoint events (mutually exclusive), then P\left(\bigcup_{j} A_j\right) = \sum_{j} P(A_j)
Elements of \Omega are called sample outcomes, realizations, or elements.
So an event is simply a subset of the sample space, and the probability measure satisfies a few simple yet very powerful axioms.
Thm. Basic Probability Properties
For any events A and B:
- P(A^c) = 1 - P(A).
- A \subseteq B \implies P(A) \leq P(B).
- P(A \cup B) = P(A) + P(B) - P(A \cap B)
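These properties are easy to verify on a toy finite probability space. Below is a minimal sketch in Python (my own, not from the references); the three-outcome space and its weights are arbitrary choices for illustration.

```python
from fractions import Fraction

# An arbitrary (non-uniform) distribution over a 3-element sample space.
p_outcome = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
omega = set(p_outcome)

def prob(event):
    # P(A) is the sum of the outcome probabilities over A.
    return sum(p_outcome[s] for s in event)

A, B = {"a"}, {"a", "b"}
assert prob(set()) == 0 and prob(omega) == 1            # axioms
assert prob(omega - A) == 1 - prob(A)                   # complement rule
assert A <= B and prob(A) <= prob(B)                    # monotonicity
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)   # two-event inclusion-exclusion
```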
Def. Monotone Increase/Decrease
A sequence of sets A_1, A_2, ... is said to be monotone increasing if
A_1 \subseteq A_2 \subseteq \cdots
and monotone decreasing if
A_1 \supseteq A_2 \supseteq \cdots
In the former case we define the limit
\lim_{n \to \infty} A_n = \bigcup_{i = 1}^{\infty} A_i
and for the latter case we define
\lim_{n \to \infty} A_n = \bigcap_{i = 1}^{\infty} A_i
Either case is denoted with A_n \to A.
Thm. Continuity of Probabilities
If A_n \to A, then P(A_n) \to P(A) as n \to \infty.
Def. Uniform Probability Distribution
If the sample space \Omega is finite and if each outcome is equally likely, then
P(A) = \dfrac{|A|}{|\Omega|}
and P is called the uniform probability distribution.
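For example, with two fair dice the uniform measure reduces probability to counting. A quick Python check (my own example):

```python
from fractions import Fraction

omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}  # two fair dice

def prob(event):
    # Uniform distribution: P(A) = |A| / |Omega|.
    return Fraction(len(event), len(omega))

doubles = {s for s in omega if s[0] == s[1]}
print(prob(doubles))   # 1/6
```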
Thm. Inclusion-Exclusion
\begin{aligned} P\left(\bigcup_{i=1}^{n} A_i\right) =& \sum_{i} P(A_i) \\ &- \sum_{i < j} P(A_i \cap A_j) \\ &+ \sum_{i < j < k} P(A_i \cap A_j \cap A_k) \\ &- \cdots + (-1)^{n+1} P(A_1 \cap ... \cap A_n) \end{aligned}
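The alternating sum can be checked mechanically by iterating over all non-empty index subsets. A small sketch (the choice of events, multiples of 2, 3 and 5 in \{1, ..., 100\}, is mine):

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations

omega = set(range(1, 101))

def prob(event):
    return Fraction(len(event), len(omega))

events = [{k for k in omega if k % m == 0} for m in (2, 3, 5)]

# Alternating sum over all non-empty sub-collections of events.
total = Fraction(0)
for r in range(1, len(events) + 1):
    for sub in combinations(events, r):
        total += (-1) ** (r + 1) * prob(reduce(set.intersection, sub))

assert total == prob(set().union(*events))   # equals P(A_1 ∪ A_2 ∪ A_3)
```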
Def. Conditional Probability
Let A and B be events with P(B) > 0; then we define the conditional probability of A given B as
P(A \mid B) := \frac{P(A \cap B)}{P(B)}
Thm. Conditional Probability
P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)
More generally (the chain rule),
P(A_1, ... \>, A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2, A_1) \cdots P(A_n \mid A_{n-1}, ... \>, A_1)
and the factorization can be done in any order; for example, with n = 3,
P(A_1, A_2, A_3) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2, A_1) = P(A_2) P(A_3 \mid A_2) P(A_1 \mid A_2, A_3)
Thm. Bayes’ Theorem
Let A and B be events with P(B) > 0; then we have
P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}
where P(A) is called the prior and P(A \mid B) is called the posterior probability of A.
Thm. Law of Total Probability (LOTP)
Let A_1, ..., A_n partition the sample space \Omega, with P(A_i) > 0 so the conditional probabilities are defined; then
P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i)
Thm. Generalized Bayes’ Theorem
Let A_1, ..., A_n be a partition of the sample space \Omega such that each A_i and B have positive probability; then
P(A_i \mid B) = \dfrac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^n P(B \mid A_j) P(A_j)}
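A standard illustration combining Bayes' theorem with LOTP is diagnostic testing; the numbers below are made up for the sketch:

```python
# Assumed numbers: prevalence 1%, sensitivity 95%, false-positive rate 2%.
p_d = 0.01                # P(D), the prior
p_pos_given_d = 0.95      # P(+ | D)
p_pos_given_dc = 0.02     # P(+ | D^c)

# LOTP with the partition {D, D^c}: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c).
p_pos = p_pos_given_d * p_d + p_pos_given_dc * (1 - p_d)

# Bayes' theorem: the posterior P(D | +).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))   # ≈ 0.3242: a positive test is far from conclusive
```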
Def. Odds
\text{odds}(A) := \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}
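The two maps are inverse to each other, as a quick numeric check confirms:

```python
def odds(p):
    # odds(A) = P(A) / (1 - P(A))
    return p / (1 - p)

def prob_from_odds(o):
    # P(A) = odds(A) / (1 + odds(A))
    return o / (1 + o)

p = 0.75
print(odds(p))                                   # 3.0, i.e. odds of 3 to 1
assert abs(prob_from_odds(odds(p)) - p) < 1e-12  # round-trips back to p
```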
Remark. Conditional Probabilities Are Probabilities
- 0 \leq P(A \mid E) \leq 1.
- P(\varnothing \mid E) = 0 and P(\Omega \mid E) = 1.
- If A_1, A_2, ... are disjoint events then P(\bigcup_j A_j \mid E) = \sum_{j} P(A_j \mid E).
- P(A^c \mid E) = 1 - P(A \mid E).
- (Inclusion-Exclusion) P(A \cup B \mid E) = P(A \mid E) + P(B \mid E) - P(A \cap B \mid E).
So conditional probability is itself a probability. Conversely, any unconditional probability can be seen as a conditional probability, since P(A) = P(A \mid \Omega).
Thm. Bayes with Extra Condition
Provided P(A \cap E) > 0 and P(B \cap E) > 0 we have
P(A \mid B, E) = \frac{P(B \mid A, E) P(A \mid E)}{P(B \mid E)}
Thm. LOTP with Extra Condition
Let A_1, ..., A_n partition \Omega with P(A_i \cap E) > 0 for all i; then
P(B \mid E) = \sum_{i=1}^{n} P(B \mid A_i, E)P(A_i \mid E)
Def. Independence
Two events A and B are called independent if (and only if)
P(A \cap B) = P(A)P(B)
Note that independence is completely different from disjointness. Disjoint events A and B can be independent only if P(A)=0 or P(B)=0.
Therefore, just recall that “disjoint events with positive probability are not independent”.
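To see the distinction concretely: two fair coin flips give events that are independent yet overlap (my own toy check):

```python
from fractions import Fraction

omega = {(c1, c2) for c1 in "HT" for c2 in "HT"}   # two fair coin flips

def prob(event):
    return Fraction(len(event), len(omega))

A = {s for s in omega if s[0] == "H"}   # first flip is heads
B = {s for s in omega if s[1] == "H"}   # second flip is heads

assert prob(A & B) == prob(A) * prob(B)   # independent...
assert A & B                              # ...but not disjoint
```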
Thm. TFAE
The following are equivalent if P(A)>0 and P(B)>0:
- A and B are independent.
- P(A \mid B)=P(A).
- P(B \mid A)=P(B).
So if A and B are independent, knowing A gives us no information about B. Disjointness is the opposite: if A and B are disjoint, knowing that A occurred tells us that B did not.
Thm. Independence of Complements
If A and B are independent events then so are
- A and B^c,
- A^c and B,
- A^c and B^c.
Def. 3-independence
Events A, B and C are independent if
\begin{aligned} P(A \cap B) &= P(A) P(B) \\ P(A \cap C) &= P(A) P(C) \\ P(B \cap C) &= P(B) P(C) \\ P(A \cap B \cap C) &= P(A) P(B) P(C) \end{aligned}
Beware that pairwise independence does not imply independence!
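The classic counterexample uses two fair coin flips with C = "the flips agree": every pair is independent, but the triple condition fails. A quick enumeration:

```python
from fractions import Fraction

omega = {(c1, c2) for c1 in "HT" for c2 in "HT"}   # two fair coin flips

def prob(event):
    return Fraction(len(event), len(omega))

A = {s for s in omega if s[0] == "H"}    # first flip is heads
B = {s for s in omega if s[1] == "H"}    # second flip is heads
C = {s for s in omega if s[0] == s[1]}   # the two flips agree

# All three pairs are independent...
assert prob(A & B) == prob(A) * prob(B)
assert prob(A & C) == prob(A) * prob(C)
assert prob(B & C) == prob(B) * prob(C)
# ...but knowing A and B determines C, so the triple condition fails.
assert prob(A & B & C) != prob(A) * prob(B) * prob(C)
```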
Def. n-independence
Events A_1, ..., A_n are independent if they are:
- pairwise independent,
- triplewise independent,
- quadruplewise independent,
- …
- n-wise independent, i.e., P(A_1 \cap ... \cap A_n) = P(A_1) ... P(A_n).
Def. Conditional Independence
Events A and B are called conditionally independent given an event E with P(E) > 0 if
P(A \cap B \mid E) = P(A \mid E) \> P(B \mid E)
Independence does not imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they need not be conditionally independent given E^c.
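For instance, with two fair dice, A = "first die is 6" and B = "second die is 6" are independent, but not conditionally independent given E = "the sum is 7", since E forces at most one of them to occur. A small enumeration check:

```python
from fractions import Fraction

omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}   # two fair dice

def prob(event):
    return Fraction(len(event), len(omega))

def cond(event, given):
    # P(event | given) = P(event ∩ given) / P(given)
    return prob(event & given) / prob(given)

A = {s for s in omega if s[0] == 6}          # first die shows 6
B = {s for s in omega if s[1] == 6}          # second die shows 6
E = {s for s in omega if s[0] + s[1] == 7}   # the sum is 7

assert prob(A & B) == prob(A) * prob(B)            # independent...
assert cond(A & B, E) != cond(A, E) * cond(B, E)   # ...but not given E
```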
Remarks
TODO: Sigma Algebra, Field and Borel (p. 26)
2. Random Variables
Def. Random Variable (R.V.)
A random variable X is just a map (function)
X: \Omega \to \R
which maps the elements (called outcomes) of \Omega to the real line \R.
Technically it must be a measurable function.
Def. Cumulative Distribution Function (CDF)
Given a random variable X, the cumulative distribution function F_X is defined as
\def\arraystretch{1.25} \begin{array}{rcl} F_X: \enspace \R &\to& [0, 1] \subseteq \R \\ x &\mapsto& P(X \leq x) \end{array}
Def. Discrete Random Variable
A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.
If X is a discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.
Def. Probability Mass Function
The probability mass function (PMF) of a discrete r.v. X is the function p_X given by
p_X (x) = P(X=x)
Note that this is positive if x is in the support of X and 0 otherwise.
Remark. Notation
We use X = x to denote the event \{s \in \Omega \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.
Thm. Valid PMFs
Let X be a discrete random variable with support x_1, x_2, ... (where the x_j are distinct). The PMF p_X of X must satisfy the following:
- p_X(x) > 0 if x = x_j for some j, and p_X(x) = 0 otherwise.
- \sum_{j} p_X(x_j) = 1
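As a concrete example, here is the PMF of the sum of two fair dice, built by counting outcomes and checked against both conditions (my own sketch):

```python
from collections import Counter
from fractions import Fraction

# X = sum of two fair dice; count how many of the 36 outcomes give each sum.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

assert all(p > 0 for p in pmf.values())   # positive exactly on the support {2, ..., 12}
assert sum(pmf.values()) == 1             # the probabilities sum to 1
print(pmf[7])                             # 1/6, the most likely value
```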
3. Discrete Distributions
Def. Bernoulli Distribution
Wikipedia: Bernoulli Distribution
A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.
We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as “is distributed as”.
The parameter p is often called the success probability of the \text{Bern}(p) distribution.
Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution with p = P(X=1).
Def. Indicator Random Variable
The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.
Note that I_A \thicksim \text{Bern}(p) with p = P(A).
Def. Binomial Distribution
Wikipedia: Binomial Distribution
Suppose n independent Bernoulli trials are performed, each with the same success probability p, and let the random variable X be the number of successes.
The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1], denoted by X \thicksim \text{Bin}(n, p).
Thm. Binomial PMF
If X \thicksim \text{Bin}(n, p), then the PMF of X is
P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}
for k \in \N with k \leq n; if k > n, then P(X=k)=0.
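A direct Python transcription using math.comb, plus a check that the PMF sums to 1 (the parameter values are arbitrary):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Bin(n, p); zero outside {0, ..., n}.
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12
print(binom_pmf(3, n, p))   # ≈ 0.2668
```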
Thm. ~
Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).
Thm. ~
Let X \thicksim \text{Bin}(n, \frac{1}{2}) with n even. Then the distribution of X is symmetric about \frac{n}{2}, i.e.,
P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)
for all j \geq 0.
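The symmetry is quick to verify numerically (n = 10 is an arbitrary even choice):

```python
from fractions import Fraction
from math import comb

def binom_half_pmf(k, n):
    # P(X = k) for X ~ Bin(n, 1/2), computed exactly.
    return Fraction(comb(n, k), 2**n)

n = 10   # even, so the PMF is symmetric about n/2 = 5
for j in range(n // 2 + 1):
    assert binom_half_pmf(n // 2 + j, n) == binom_half_pmf(n // 2 - j, n)
```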
Thm. Hypergeometric Distribution
Wikipedia: Hypergeometric Distribution
Consider an urn with w white and b black balls. Draw n balls at random without replacement, so that all \binom{w+b}{n} samples of size n are equally likely, and let X be the number of white balls drawn. Then X has the Hypergeometric distribution with parameters w, b and n, written X \thicksim \text{HGeom}(w, b, n).
If X \thicksim \text{HGeom}(w, b, n), then the PMF of X is
P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}
for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.
Thm. ~
The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.
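Both the PMF and the swap identity above can be checked exactly with math.comb (the values w = 5, b = 7, n = 4 are arbitrary):

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    # P(X = k) for X ~ HGeom(w, b, n); comb returns 0 when k exceeds w,
    # so out-of-range k automatically gets probability 0.
    if k < 0 or n - k < 0:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 5, 7, 4
assert sum(hgeom_pmf(k, w, b, n) for k in range(n + 1)) == 1
# HGeom(w, b, n) and HGeom(n, w + b - n, w) assign the same probabilities:
assert all(hgeom_pmf(k, w, b, n) == hgeom_pmf(k, n, w + b - n, w)
           for k in range(n + 1))
```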
Def. Discrete Uniform Distribution
Wikipedia: Discrete Uniform Distribution
Let C be a finite, nonempty set. The PMF of X \thicksim \text{DUnif}(C) is
P(X=x) = \dfrac{1}{|C|}
for x \in C and 0 otherwise.
Def. Cumulative Distribution Function
The cumulative distribution function of a random variable X (not necessarily discrete) is the function F_X where
F_X(x) =P(X \leq x)
Thm. Valid CDFs
For any CDF F_X, or simply F, we have
- x_1 \leq x_2 \implies F(x_1) \leq F(x_2) (F is non-decreasing)
- F(a) = \lim_{x \to a^+} F(x) (F is right-continuous)
- \lim_{x \to - \infty} F(x) = 0
- \lim_{x \to \infty} F(x) = 1
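These properties are visible in the step-function CDF of a fair die; a small sketch (the spot checks are mine):

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}   # fair six-sided die

def F(x):
    # F(x) = P(X <= x): a non-decreasing, right-continuous step function.
    return sum(p for k, p in pmf.items() if k <= x)

assert F(0) == 0 and F(6) == 1             # tends to 0 on the left, 1 on the right
assert F(2) == F(2.5) == Fraction(1, 3)    # flat between the jumps
assert F(3) - F(2.999) == Fraction(1, 6)   # jump of size P(X = 3) at x = 3
```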
Def. Function of a Random Variable
For a random variable X on the sample space \Omega and a function h: \R \to \R, the random variable h(X) maps s \in \Omega to h(X(s)).
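In code, h(X) is literally the composition h ∘ X; a tiny illustration (the particular h is an arbitrary choice):

```python
omega = range(1, 7)               # outcomes of a die roll

def X(s):
    return s                      # X: the number rolled

def h(x):
    return (x - 3.5) ** 2         # squared deviation from 3.5

def hX(s):
    return h(X(s))                # the random variable h(X): s -> h(X(s))

print({s: hX(s) for s in omega})  # {1: 6.25, 2: 2.25, ..., 6: 6.25}
```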