Probability Theory
For now this note, unlike my others, is only a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference itself might be a better idea.
References
- Introduction to Probability, 2nd Ed. by Joseph K. Blitzstein and Jessica Hwang.
Notation
- 0 \in \N and \N^+ = \N \setminus \{0\}.
1. Probability and Conditional Probability
Thm. Identities
\binom{n}{k} = \binom{n}{n - k}
n\binom{n-1}{k-1} = k\binom{n}{k}
Wikipedia: Vandermonde’s identity
\binom{m+n}{k} = \sum_{j=0}^{k} \binom{m}{j} \binom{n}{k - j}
Wikipedia: Generalized Vandermonde’s identity
\binom{m_1 + \cdots + m_p}{k} = \sum_{j_1 + \dots + j_p = k} \binom{m_1}{j_1} \binom{m_2}{j_2} \cdots \binom{m_p}{j_p}
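The identities above are easy to sanity-check numerically with Python's exact `math.comb`; the values of n, k, m below are arbitrary choices for illustration.

```python
from math import comb

n, k, m = 10, 4, 7

# Symmetry: C(n, k) = C(n, n - k)
assert comb(n, k) == comb(n, n - k)

# n * C(n-1, k-1) = k * C(n, k)
assert n * comb(n - 1, k - 1) == k * comb(n, k)

# Vandermonde: C(m+n, k) = sum over j of C(m, j) * C(n, k - j)
assert comb(m + n, k) == sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))
```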
Def. General Definition of Probability
Todo: Better definition
A probability space consists of a sample space S and a probability function P which takes an event A \subseteq S as input and returns P(A) \in [0, 1] as output. The function P must satisfy the following axioms:
- P(\varnothing) = 0 and P(S)=1.
- If A_1, A_2, \dots are disjoint (mutually exclusive) events, then P\left(\bigcup_{j} A_j\right) = \sum_{j} P(A_j)
Thm. Properties of Probability
For any events A and B:
- P(A^c) = 1 - P(A).
- A \subseteq B \implies P(A) \leq P(B).
- P(A \cup B) = P(A) + P(B) - P(A \cap B)
Thm. Inclusion-Exclusion
Wikipedia: Inclusion-Exclusion Principle
\begin{array}{llll} P \left( \bigcup_{i=1}^{n} A_i \right) =& +& \sum_{i} & P(A_i) \\ &-& \sum_{i<j} & P(A_i \cap A_j) \\ &+& \sum_{i<j<k} & P(A_i \cap A_j \cap A_k) \\ & \vdots & & \\ && (-1)^{n+1} & P(A_1 \cap \cdots \cap A_n) \end{array}
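A small sanity check of inclusion-exclusion, using one roll of a fair die with three overlapping events (the events are an arbitrary illustrative choice, not from the reference): the alternating sum over all intersections must equal the directly computed probability of the union.

```python
from fractions import Fraction
from itertools import combinations

S = set(range(1, 7))         # sample space: one roll of a fair die
events = [{2, 4, 6},         # even
          {4, 5, 6},         # greater than 3
          {2, 3, 5}]         # prime

def P(A):
    return Fraction(len(A), len(S))

# Alternate signs over intersections of every size 1..n.
total = Fraction(0)
for r in range(1, len(events) + 1):
    for group in combinations(events, r):
        total += (-1) ** (r + 1) * P(set.intersection(*group))

direct = P(set.union(*events))
assert total == direct       # both equal 5/6 (only the outcome 1 is excluded)
```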
Def. Conditional Probability
P(A | B) \triangleq \frac{P(A \cap B)}{P(B)}
Thm. Conditional Probability
P(A \cap B) = P(B) P(A|B) = P(A) P(B|A)
P(A_1, \dots, A_n) = P(A_1) P(A_2|A_1) P(A_3|A_2, A_1) \cdots P(A_n | A_{n-1}, \dots, A_1)
P(A_1, A_2, A_3) = P(A_1) P(A_2|A_1) P(A_3|A_2, A_1) = P(A_2) P(A_3|A_2) P(A_1 | A_2, A_3)
Thm. Bayes’ Theorem
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
where P(A) is the prior and P(A|B) is the posterior.
Def. Odds
\text{odds}(A) \triangleq \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}
Thm. Odds Bayes
\begin{array}{ccccc} \dfrac{P(A|B)}{P(A^c | B)} &=& \dfrac{P(B|A)}{P(B|A^c)} & \cdot & \dfrac{P(A)}{P(A^c)} \\ \\ \small \text{Posterior} & & \small \text{Likelihood} & & \small \text{Prior} \\ \small \text{Odds} & & \small \text{Ratio} & & \small \text{Odds} \\ \end{array}
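The odds form can be checked against Bayes' theorem directly. The numbers below (prior 1/100, likelihoods 9/10 and 1/10) are hypothetical, chosen only to make the arithmetic exact.

```python
from fractions import Fraction

P_A = Fraction(1, 100)            # prior P(A)
P_B_given_A = Fraction(9, 10)     # likelihood under A
P_B_given_Ac = Fraction(1, 10)    # likelihood under A^c

prior_odds = P_A / (1 - P_A)
likelihood_ratio = P_B_given_A / P_B_given_Ac
posterior_odds = likelihood_ratio * prior_odds

# Convert posterior odds back to a probability ...
posterior = posterior_odds / (1 + posterior_odds)

# ... and compare with Bayes' theorem, using LOTP for P(B).
P_B = P_B_given_A * P_A + P_B_given_Ac * (1 - P_A)
assert posterior == P_B_given_A * P_A / P_B
```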
Thm. Law of Total Probability (LOTP)
Let A_1, ..., A_n partition the sample space S, then
P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B|A_i) P(A_i)
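A minimal LOTP computation, with a hypothetical three-part partition (think of three urns chosen with probabilities 1/2, 1/3, 1/6, and B = "draw a white ball"; all numbers are illustrative).

```python
from fractions import Fraction

P_A = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]          # P(A_i)
P_B_given_A = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]  # P(B | A_i)

assert sum(P_A) == 1   # A_1, A_2, A_3 partition S

# LOTP: P(B) = sum over i of P(B | A_i) P(A_i)
P_B = sum(pb * pa for pb, pa in zip(P_B_given_A, P_A))
```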
Ex. Sensitivity and Specificity
Wikipedia: Sensitivity and Specificity
Let D denote the event that the person has the disease and let T denote the event that the test is positive. P(T|D) is called the sensitivity or true positive rate. P(T^c|D^c) is called the specificity or true negative rate.
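Sensitivity and specificity combine with Bayes and LOTP to give P(D|T). The prevalence, sensitivity, and specificity below are hypothetical numbers chosen to show the classic effect: for a rare disease, even a good test leaves the posterior surprisingly small.

```python
from fractions import Fraction

P_D = Fraction(1, 100)      # prevalence P(D)
sens = Fraction(95, 100)    # sensitivity P(T | D)
spec = Fraction(90, 100)    # specificity P(T^c | D^c)

# LOTP: P(T) = P(T|D) P(D) + P(T|D^c) P(D^c), where P(T|D^c) = 1 - specificity.
P_T = sens * P_D + (1 - spec) * (1 - P_D)

# Bayes: posterior probability of disease given a positive test.
P_D_given_T = sens * P_D / P_T
print(P_D_given_T)   # 19/217, roughly 0.088
```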
Remark. Conditional Probabilities Are Probabilities
- 0 \leq P(A|E) \leq 1.
- P(\varnothing | E) = 0 and P(S|E) = 1.
- If A_1, A_2, ... are disjoint events then P(\bigcup_j A_j |E) = \sum_{j} P(A_j | E).
- P(A^c | E) = 1 - P(A | E).
- (Inclusion-Exclusion) P(A \cup B | E) = P(A|E) + P(B|E) - P(A \cap B | E).
So, conditional probability is itself a probability. Conversely, any unconditional probability can be viewed as a conditional probability, conditioned on the whole sample space S.
Thm. Bayes with Extra Condition
Provided P(A \cap E) > 0 and P(B \cap E) > 0 we have
P(A|B, E) = \frac{P(B|A, E) P(A|E)}{P(B|E)}
Thm. LOTP with Extra Condition
Let A_1, ..., A_n partition S and P(A_i \cap E) > 0 for all i, then
P(B|E) = \sum_{i=1}^{n} P(B|A_i, E)P(A_i | E)
Def. Independence
Two events A and B are independent if P(A \cap B) = P(A)P(B).
Note that independence is completely different from disjointness. Disjoint events A and B can be independent only if P(A)=0 or P(B)=0.
Thm. ~
TFAE if P(A)>0 and P(B)>0:
- A and B are independent.
- P(A|B)=P(A).
- P(B|A)=P(B).
So, knowing A gives us no information about B. This may not be the case with disjointness.
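The three equivalent conditions can be verified on a concrete pair of independent events; the die events below (A = "even", B = "at most 4") are an illustrative choice.

```python
from fractions import Fraction

S = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}          # even
B = {1, 2, 3, 4}       # at most 4

def P(E):
    return Fraction(len(E), len(S))

# The three equivalent conditions (P(A), P(B) > 0 here):
assert P(A & B) == P(A) * P(B)   # independence
assert P(A & B) / P(B) == P(A)   # P(A|B) = P(A)
assert P(A & B) / P(A) == P(B)   # P(B|A) = P(B)
```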
Thm. ~
If A and B are independent events then so are
- A and B^c,
- A^c and B,
- A^c and B^c.
Def. 3-independence
Events A, B and C are independent if
$$ \begin{array}{ll} P(A \cap B) &= P(A) P(B) \\ P(A \cap C) &= P(A) P(C) \\ P(B \cap C) &= P(B) P(C) \\ P(A \cap B \cap C) &= P(A) P(B) P(C) \end{array}$$
Beware that pairwise independence does not imply independence!
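The classic counterexample: flip two fair coins, and let A = "first is heads", B = "second is heads", C = "both flips agree". Each pair is independent, but the triple is not.

```python
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=2))          # two fair coin flips
A = {s for s in S if s[0] == "H"}          # first flip heads
B = {s for s in S if s[1] == "H"}          # second flip heads
C = {s for s in S if s[0] == s[1]}         # both flips equal

def P(E):
    return Fraction(len(E), len(S))

# Pairwise independent:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ... but not independent: knowing A and B determines C.
assert P(A & B & C) != P(A) * P(B) * P(C)   # 1/4 vs 1/8
```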
Def. n-independence
Events A_1, ..., A_n are independent if they are:
- pairwise independent,
- triplewise independent,
- quadruplewise independent,
- …
- P(A \cap ... \cap A_n) = P(A_1) ... P(A_n).
Def. Conditional Independence
Events A and B are called conditionally independent given event E if
P(A \cap B |E) = P(A|E) \> P(B|E)
Remark. ~
Independence does NOT imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they may not be conditionally independent given E^c.
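A hypothetical setup showing conditional independence without unconditional independence: pick one of two coins (fair, or biased with P(heads) = 9/10) with probability 1/2 each, then flip it twice; E = "the fair coin was picked". Given the coin, the flips are independent, but unconditionally the first flip is evidence about which coin was picked.

```python
from fractions import Fraction

p_fair, p_biased = Fraction(1, 2), Fraction(9, 10)   # P(heads) per coin
P_E = Fraction(1, 2)                                 # P(fair coin picked)

# Conditionally independent: P(H1, H2 | E) = P(H1|E) P(H2|E).
P_H1H2_given_E = p_fair * p_fair

# Unconditionally (averaging over the coin via LOTP) they are NOT independent:
P_H1 = P_E * p_fair + (1 - P_E) * p_biased           # = P(H2) by symmetry
P_H1H2 = P_E * p_fair**2 + (1 - P_E) * p_biased**2
assert P_H1H2 != P_H1 * P_H1                         # 53/100 vs 49/100
```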
2. (Discrete) Random Variables
Def. Random Variable
A random variable X (r.v.) is a function from the sample space S to the set of real numbers \R.
So, a random variable is a function from the elements (not events) of S to a real number. Since X is a function, X(s) \in \R is defined for all s \in S.
Def. Discrete Random Variable
TODO: Revise this
A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.
If X is a discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.
Def. Probability Mass Function
The probability mass function (PMF) of a discrete r.v. X is the function p_X given by
p_X (x) = P(X=x)
Note that this is positive if x is in the support of X and 0 otherwise.
Remark. Notation
We use X = x to denote the event \{s \in S \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.
Thm. Valid PMFs
TODO: Rewrite
Let X be a discrete random variable with countable support x_1, x_2, ... (where each x_i is distinct for notational simplicity). The PMF p_X of X must satisfy the following:
- p_X(x) > 0 if x = x_j for some j, and p_X(x) = 0 otherwise.
- \sum_{j} p_X(x_j) = 1
3. Discrete Distributions
Def. Bernoulli Distribution
Wikipedia: Bernoulli Distribution
A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.
We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as “is distributed as”.
The parameter p is often called the success probability of the \text{Bern}(p) distribution.
Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution with p = P(X = 1).
Def. Indicator Random Variable
The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.
Note that I_A \thicksim \text{Bern}(p) with p = P(A).
Def. Binomial Distribution
Todo: Rewrite
Wikipedia: Binomial Distribution
Suppose n independent Bernoulli trials are performed, each with the same success probability p. Let the random variable X be the number of successes.
The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1] denoted by X \thicksim \text{Bin}(n, p).
Thm. Binomial PMF
If X \thicksim \text{Bin}(n, p), then the PMF of X is
P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}
for k \in \N with k \leq n. If k > n, then P(X=k)=0.
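The PMF can be implemented directly from the formula; with exact rationals, summing it over the support must give exactly 1 (the parameter values below are arbitrary).

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """PMF of Bin(n, p); zero outside 0 <= k <= n."""
    if not 0 <= k <= n:
        return Fraction(0)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, Fraction(3, 10)

# A valid PMF is nonnegative and sums to 1 over the support.
assert sum(binom_pmf(k, n, p) for k in range(n + 1)) == 1
assert binom_pmf(n + 1, n, p) == 0
```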
Thm. ~
Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).
Thm. ~
Let X \thicksim \text{Bin}(n, \frac{1}{2}) with n even. Then the distribution of X is symmetric about \frac{n}{2}, i.e.
P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)
for all j \geq 0.
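The symmetry is easy to confirm for a concrete even n (n = 8 below is an arbitrary choice):

```python
from fractions import Fraction
from math import comb

n, p = 8, Fraction(1, 2)

def pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(X = n/2 + j) = P(X = n/2 - j) for all j >= 0.
for j in range(n // 2 + 1):
    assert pmf(n // 2 + j) == pmf(n // 2 - j)
```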
Thm. Hypergeometric Distribution
Todo: Further define
Wikipedia: Hypergeometric Distribution
If X \thicksim \text{HGeom}(w, b, n), then the PMF of X is
P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}
for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.
Thm. ~
The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.
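This identity (the "capture-recapture" symmetry) can be verified numerically for concrete parameters; w, b, n below are arbitrary.

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    """PMF of HGeom(w, b, n); zero outside the valid range."""
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 7, 5, 6

# HGeom(w, b, n) and HGeom(n, w+b-n, w) have identical PMFs.
for k in range(min(w, n) + 1):
    assert hgeom_pmf(k, w, b, n) == hgeom_pmf(k, n, w + b - n, w)
```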
Def. Discrete Uniform Distribution
Wikipedia: Discrete Uniform Distribution
Let C be a finite, nonempty set of numbers. The PMF of X \thicksim \text{DUnif}(C) is
P(X=x) = \dfrac{1}{|C|}
for x \in C and 0 otherwise.
Def. Cumulative Distribution Function
The cumulative distribution function (CDF) of a random variable X (not necessarily discrete) is the function F_X where
F_X(x) =P(X \leq x)
Thm. Valid CDFs
For any CDF F_X, or simply F, we have
- x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
- F(a) = \lim_{x \to a^+} F(x)
- \lim_{x \to - \infty} F(x) = 0
- \lim_{x \to \infty} F(x) = 1
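For a discrete r.v. these properties are easy to check concretely: the CDF of a fair die roll is a step function, increasing from 0 to 1 and flat between the support points (hence right-continuous at the jumps).

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # one roll of a fair die

def F(x):
    """CDF: F(x) = P(X <= x)."""
    return sum(p for v, p in pmf.items() if v <= x)

assert F(0) == 0 and F(6) == 1                   # limits are reached here
assert F(2.5) == F(2)                            # flat between support points
assert all(F(a) <= F(b) for a, b in [(0, 1), (1, 3.5), (3.5, 6), (6, 100)])
```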
Def. Function of a Random Variable
For a random variable X in the sample space S and a function h: \R \to \R the random variable h(X) maps s \in S to h(X(s)).