Table of Contents

Probability Theory

For now this note, unlike my others, is just a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference itself might be a better idea.

References

Notation

1. Probability and Conditional Probability

Thm. Identities

\binom{n}{k} = \binom{n}{n - k}

n\binom{n-1}{k-1} = k\binom{n}{k}

Wikipedia: Vandermonde’s identity

\binom{m+n}{k} = \sum_{j=0}^{k} \binom{m}{j} \binom{n}{k - j}

Wikipedia: Generalized Vandermonde’s identity

\binom{m_1 + \cdots + m_p}{k} = \sum_{j_1 + \dots + j_p = k} \binom{m_1}{j_1} \binom{m_2}{j_2} \cdots \binom{m_p}{j_p}
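Both identities are easy to sanity-check numerically. A small Python check using `math.comb` (the values of m, n, k here are arbitrary):

```python
from itertools import product
from math import comb

# Vandermonde: C(m+n, k) == sum_j C(m, j) * C(n, k-j)
m, n, k = 5, 7, 6
assert comb(m + n, k) == sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))

# Generalized Vandermonde with p = 3 groups:
# C(m1+m2+m3, k) == sum over j1+j2+j3 = k of C(m1,j1) C(m2,j2) C(m3,j3)
ms, k = (3, 4, 5), 6
total = sum(
    comb(ms[0], j1) * comb(ms[1], j2) * comb(ms[2], k - j1 - j2)
    for j1, j2 in product(range(k + 1), repeat=2)
    if k - j1 - j2 >= 0
)
assert total == comb(sum(ms), k)
```

Note that `math.comb(n, k)` returns 0 when k > n, which matches the convention that out-of-range binomial coefficients vanish.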

Def. General Definition of Probability

Todo: Better definition

A probability space consists of a sample space S and a probability function P which takes an event A \subseteq S as input and returns P(A) \in [0, 1] as output. The function P must satisfy the following axioms:

P(\emptyset) = 0, \quad P(S) = 1

P \left( \bigcup_{j=1}^{\infty} A_j \right) = \sum_{j=1}^{\infty} P(A_j) \enspace \text{for disjoint events} \enspace A_1, A_2, \dots

Thm. Properties of Probability

For any events A and B:

P(A^c) = 1 - P(A)

A \subseteq B \implies P(A) \leq P(B)

P(A \cup B) = P(A) + P(B) - P(A \cap B)

Thm. Inclusion-Exclusion

Wikipedia: Inclusion-Exclusion Principle

\begin{aligned} P \left( \bigcup_{i=1}^{n} A_i \right) =\ & \sum_{i} P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) \\ & - \cdots + (-1)^{n+1} \> P(A_1 \cap \cdots \cap A_n) \end{aligned}
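The alternating sum can be evaluated exactly on a small finite sample space. A Python sketch using one roll of a fair die (the three events are arbitrary illustrations):

```python
from fractions import Fraction
from itertools import combinations

# Sample space: one roll of a fair die; events are sets of outcomes.
S = set(range(1, 7))
A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {4, 5, 6}
events = [A1, A2, A3]

def prob(event):
    return Fraction(len(event), len(S))

# Inclusion-exclusion: alternate signs over intersections of each size r.
incl_excl = sum(
    (-1) ** (r + 1) * sum(prob(set.intersection(*c)) for c in combinations(events, r))
    for r in range(1, len(events) + 1)
)
assert incl_excl == prob(A1 | A2 | A3)
```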

Def. Conditional Probability

P(A | B) \triangleq \frac{P(A \cap B)}{P(B)}, \quad \text{for} \enspace P(B) > 0

Thm. Conditional Probability

P(A \cap B) = P(B) P(A|B) = P(A) P(B|A)

P(A_1, \dots, A_n) = P(A_1) P(A_2|A_1) P(A_3|A_2,A_1) \cdots P(A_n | A_{n-1}, \dots, A_1)

P(A_1, A_2, A_3) = P(A_1) P(A_2|A_1) P(A_3|A_2,A_1) = P(A_2) P(A_3|A_2) P(A_1 | A_2, A_3)
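A classic instance of the chain rule: the probability of drawing two aces from a standard deck without replacement. A minimal Python computation:

```python
from fractions import Fraction

# Chain rule: P(A1 ∩ A2) = P(A1) * P(A2 | A1).
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)   # one ace already removed
p_both = p_first_ace * p_second_ace_given_first
assert p_both == Fraction(1, 221)
```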

Thm. Bayes’ Theorem

Wikipedia: Bayes’ Theorem

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

where P(A) is the prior and P(A|B) is the posterior.

Def. Odds

\text{odds}(A) \triangleq \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}

Thm. Odds Bayes

\begin{array}{ccccc} \dfrac{P(A|B)}{P(A^c | B)} &=& \dfrac{P(B|A)}{P(B|A^c)} & \cdot & \dfrac{P(A)}{P(A^c)} \\ \\ \small \text{Posterior} & & \small \text{Likelihood} & & \small \text{Prior} \\ \small \text{Odds} & & \small \text{Ratio} & & \small \text{Odds} \\ \end{array}
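A small Python sketch of the odds form of Bayes' theorem. The helper names `posterior_odds` and `odds_to_prob` are mine, and the numbers are arbitrary:

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    # Odds form of Bayes: posterior odds = likelihood ratio * prior odds.
    return likelihood_ratio * prior_odds

def odds_to_prob(odds):
    # P(A) = odds(A) / (1 + odds(A))
    return odds / (1 + odds)

prior = Fraction(1, 99)            # P(A) = 1/100, i.e. odds of 1 : 99
lr = Fraction(9, 1)                # B is 9x as likely under A as under A^c
post = posterior_odds(prior, lr)   # 9/99 = 1/11
assert odds_to_prob(post) == Fraction(1, 12)
```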

Thm. Law of Total Probability (LOTP)

Let A_1, ..., A_n partition the sample space S with P(A_i) > 0 for all i; then

P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B|A_i) P(A_i)

Ex. Sensitivity and Specificity

Wikipedia: Sensitivity and Specificity

Let D denote the event that the disease is present, and let T denote the event that the test is positive. P(T|D) is called the sensitivity or true positive rate; P(T^c|D^c) is called the specificity or true negative rate.
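Combining LOTP and Bayes' theorem gives the standard disease-testing computation. A Python sketch with illustrative (made-up) numbers:

```python
from fractions import Fraction

sensitivity = Fraction(95, 100)   # P(T | D)
specificity = Fraction(90, 100)   # P(T^c | D^c)
prevalence = Fraction(1, 100)     # prior P(D)

# LOTP: P(T) = P(T|D) P(D) + P(T|D^c) P(D^c)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
# Bayes: P(D | T)
posterior = sensitivity * prevalence / p_pos
assert posterior == Fraction(19, 217)  # still small despite a good test
```

The posterior is only about 8.8%: with a rare disease, most positives are false positives.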

Remark. Conditional Probabilities Are Probabilities

For a fixed event E with P(E) > 0, the function P(\cdot \> | E) satisfies the probability axioms, so conditional probability is itself a probability. Conversely, any probability can be seen as a conditional probability, conditioned on the whole sample space S.

Thm. Bayes with Extra Condition

Provided that P(A \cap E) > 0 and P(B \cap E) > 0, we have

P(A|B, E) = \frac{P(B|A, E) P(A|E)}{P(B|E)}

Thm. LOTP with Extra Condition

Let A_1, ..., A_n partition S and P(A_i \cap E) > 0 for all i, then

P(B|E) = \sum_{i=1}^{n} P(B|A_i, E)P(A_i | E)

Def. Independence

Two events A and B are independent if P(A \cap B) = P(A)P(B).

Note that independence is completely different from disjointness. Disjoint events A and B can be independent only if P(A)=0 or P(B)=0.

Thm. ~

TFAE if P(A)>0 and P(B)>0:

P(A \cap B) = P(A) P(B)

P(A|B) = P(A)

P(B|A) = P(B)

So, knowing that B occurred gives us no information about A. This is not the case with disjointness: if A and B are disjoint, knowing that B occurred tells us that A did not.

Thm. ~

If A and B are independent events, then so are A and B^c, A^c and B, and A^c and B^c.

Def. 3-independence

Events A, B and C are independent if

$$
\begin{array}{ll} P(A \cap B) &= P(A) P(B) \\ P(A \cap C) &= P(A) P(C) \\ P(B \cap C) &= P(B) P(C) \\ P(A \cap B \cap C) &= P(A) P(B) P(C) \end{array}
$$

Beware that pairwise independence does not imply independence!
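The classic counterexample: flip two fair coins and let A = first flip is heads, B = second flip is heads, C = the two flips match. A Python check by exact enumeration:

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips; all four outcomes equally likely.
S = list(product("HT", repeat=2))

def prob(event):
    return Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] == "H"          # first flip heads
B = lambda s: s[1] == "H"          # second flip heads
C = lambda s: s[0] == s[1]         # the two flips match

both = lambda e1, e2: (lambda s: e1(s) and e2(s))
# Pairwise independent:
assert prob(both(A, B)) == prob(A) * prob(B)
assert prob(both(A, C)) == prob(A) * prob(C)
assert prob(both(B, C)) == prob(B) * prob(C)
# ...but not independent: P(A ∩ B ∩ C) = 1/4, not 1/8.
triple = lambda s: A(s) and B(s) and C(s)
assert prob(triple) != prob(A) * prob(B) * prob(C)
```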

Def. n-independence

Events A_1, ..., A_n are independent if they are pairwise independent, 3-independent, and so on up to n-independence; that is, the product rule P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k}) holds for every subset of the events.

Def. Conditional Independence

Events A and B are called conditionally independent given event E if

P(A \cap B |E) = P(A|E) \> P(B|E)

Remark. ~

Independence does NOT imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they may not be conditionally independent given E^c.

2. (Discrete) Random Variables

Def. Random Variable

A random variable X (r.v.) is a function from the sample space S to the set of real numbers \R.

So, a random variable is a function from the elements (not events) of S to a real number. Since X is a function, X(s) \in \R is defined for all s \in S.

Def. Discrete Random Variable

TODO: Revise this

A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.

If X is a discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.

Def. Probability Mass Function

The probability mass function (PMF) of a discrete r.v. X is the function p_X given by

p_X (x) = P(X=x)

Note that this is positive if x is in the support of X and 0 otherwise.
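As an illustration, the PMF of the sum of two fair dice can be built by counting outcomes. A Python sketch:

```python
from fractions import Fraction
from itertools import product

# PMF of X = sum of two fair dice, built by counting outcomes.
outcomes = list(product(range(1, 7), repeat=2))
pmf = {}
for a, b in outcomes:
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, len(outcomes))

assert pmf[7] == Fraction(1, 6)   # 7 is the most likely sum
assert sum(pmf.values()) == 1     # a valid PMF sums to 1 over the support
```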

Remark. Notation

We use X = x to denote the event \{s \in S \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.

Thm. Valid PMFs

TODO: Rewrite

Let X be a discrete random variable with countable support x_1, x_2, ... (where each x_i is distinct for notational simplicity). The PMF p_X of X must satisfy the following:

Nonnegative: p_X(x) > 0 if x = x_j for some j, and p_X(x) = 0 otherwise

Sums to 1: \sum_{j} p_X(x_j) = 1

3. Discrete Distributions

Def. Bernoulli Distribution

Wikipedia: Bernoulli Distribution

A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.

We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as “is distributed as”.

The parameter p is often called the success probability of the \text{Bern}(p) distribution.

Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution, with p = P(X = 1).

Def. Indicator Random Variable

The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.

Note that I_A \thicksim \text{Bern}(p) with p = P(A).

Def. Binomial Distribution

Todo: Rewrite

Wikipedia: Binomial Distribution

Suppose n independent Bernoulli trials are performed, each with the same success probability p. Let the random variable X be the number of successes.

The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1] denoted by X \thicksim \text{Bin}(n, p).

Thm. Binomial PMF

If X \thicksim \text{Bin}(n, p), then the PMF of X is

P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}

for k \in \N with k \leq n. If k > n, then P(X=k)=0.
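A direct Python implementation of this PMF (the helper name `binom_pmf` is mine), checking that it sums to 1 and, for p = 1/2, is symmetric about n/2:

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Bin(n, p); zero outside 0 <= k <= n.
    if not 0 <= k <= n:
        return Fraction(0)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, Fraction(1, 2)
assert sum(binom_pmf(k, n, p) for k in range(n + 1)) == 1  # valid PMF
assert binom_pmf(3, n, p) == binom_pmf(7, n, p)            # symmetry about n/2
```

Using `Fraction` for p keeps all arithmetic exact, so the equalities hold with no floating-point tolerance.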

Thm. ~

Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).

Thm. ~

Let X \thicksim \text{Bin}(n, \frac{1}{2}) and n even. Then the distribution of X is symmetric about \frac{n}{2} such that

P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)

for all j \geq 0.

Thm. Hypergeometric Distribution

Todo: Further define

Wikipedia: Hypergeometric Distribution

If X \thicksim \text{HGeom}(w, b, n), then the PMF of X is

P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}

for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.

Thm. ~

The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.
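Both the PMF and this identity can be checked numerically. A Python sketch (the helper name `hgeom_pmf` is mine; the parameters are arbitrary):

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    # P(X = k) for X ~ HGeom(w, b, n); zero outside the support.
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 5, 8, 6
# PMF sums to 1 over the support.
assert sum(hgeom_pmf(k, w, b, n) for k in range(n + 1)) == 1
# HGeom(w, b, n) and HGeom(n, w+b-n, w) have identical PMFs.
assert all(hgeom_pmf(k, w, b, n) == hgeom_pmf(k, n, w + b - n, w)
           for k in range(max(w, n) + 1))
```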

Def. Discrete Uniform Distribution

Wikipedia: Discrete Uniform Distribution

For a finite, nonempty set of numbers C, the PMF of X \thicksim \text{DUnif}(C) is

P(X=x) = \dfrac{1}{|C|}

for x \in C and 0 otherwise.

Def. Cumulative Distribution Function

The cumulative distribution function (CDF) of a random variable X (not necessarily discrete) is the function F_X where

F_X(x) =P(X \leq x)
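For a discrete r.v., the CDF is a step function obtained by summing the PMF up to x. A Python sketch using a fair die:

```python
from fractions import Fraction

# CDF of a fair die: F_X(x) = P(X <= x), a right-continuous step function.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    return sum(p for k, p in pmf.items() if k <= x)

assert cdf(0) == 0                  # below the support
assert cdf(3.5) == Fraction(1, 2)   # constant between support points
assert cdf(6) == 1                  # reaches 1
```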

Thm. Valid CDFs

For any CDF F_X, or simply F, we have:

Increasing: x_1 \leq x_2 \implies F(x_1) \leq F(x_2)

Right-continuous: F(a) = \lim_{x \to a^+} F(x)

Convergence to 0 and 1: \lim_{x \to -\infty} F(x) = 0 and \lim_{x \to \infty} F(x) = 1

Def. Function of a Random Variable

For a random variable X on the sample space S and a function h: \R \to \R, the random variable h(X) maps s \in S to h(X(s)).