Probability Theory
For now this note, unlike my others, is only a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference itself might be a better idea.
References
- All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
- Introduction to Probability, 2nd Ed. by Joseph K. Blitzstein and Jessica Hwang
Notation
- 0 \in \N and \N^+ = \N \setminus \{0\}.
1. Probability
Def. Probability
A probability space consists of a sample space \Omega and a probability function (also called a probability distribution or probability measure) P that maps each event A \subseteq \Omega to a number P(A) \in [0, 1]. The function P must satisfy the following axioms:
- P(\varnothing) = 0 and P(\Omega)=1.
- If A_1, A_2 ... are disjoint events (mutually exclusive), then P\left(\bigcup_{j} A_j\right) = \sum_{j} P(A_j)
Elements of \Omega are called sample outcomes, realizations, or elements.
So an event is simply a subset of the sample space, and the probability measure satisfies a few simple yet very powerful axioms.
Thm. Basic Probability Properties
For any events A and B:
- P(A^c) = 1 - P(A).
- A \subseteq B \implies P(A) \leq P(B).
- P(A \cup B) = P(A) + P(B) - P(A \cap B)
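These properties are easy to verify on a toy finite probability space. Below is a minimal sketch in Python (my own, not from the references); the three-outcome space and its weights are arbitrary choices for illustration.

```python
from fractions import Fraction

# An arbitrary (non-uniform) distribution over a 3-element sample space.
p_outcome = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
omega = set(p_outcome)

def prob(event):
    # P(A) is the sum of the outcome probabilities over A.
    return sum(p_outcome[s] for s in event)

A, B = {"a"}, {"a", "b"}
assert prob(set()) == 0 and prob(omega) == 1            # axioms
assert prob(omega - A) == 1 - prob(A)                   # complement rule
assert A <= B and prob(A) <= prob(B)                    # monotonicity
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)   # two-event inclusion-exclusion
```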
Def. Monotone Increase/Decrease
A sequence of sets A_1, A_2, ... is said to be monotone increasing if
A_1 \subseteq A_2 \subseteq \cdots
and monotone decreasing if
A_1 \supseteq A_2 \supseteq \cdots
In the former case we define the limit
\lim_{n \to \infty} A_n = \bigcup_{i = 1}^{\infty} A_i
and for the latter case we define
\lim_{n \to \infty} A_n = \bigcap_{i = 1}^{\infty} A_i
Either case is denoted with A_n \to A.
Thm. Continuity of Probabilities
If A_n \to A, then P(A_n) \to P(A) as n \to \infty.
Def. Uniform Probability Distribution
If the sample space \Omega is finite and if each outcome is equally likely, then
P(A) = \dfrac{|A|}{|\Omega|}
and P is called the uniform probability distribution.
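For example, with two fair dice the uniform measure reduces probability to counting. A quick Python check (my own example):

```python
from fractions import Fraction

omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}  # two fair dice

def prob(event):
    # Uniform distribution: P(A) = |A| / |Omega|.
    return Fraction(len(event), len(omega))

doubles = {s for s in omega if s[0] == s[1]}
print(prob(doubles))   # 1/6
```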
Thm. Inclusion-Exclusion
\begin{aligned} P\left(\bigcup_{i=1}^{n} A_i\right) =& \sum_{i} P(A_i) \\ &- \sum_{i < j} P(A_i \cap A_j) \\ &+ \sum_{i < j < k} P(A_i \cap A_j \cap A_k) \\ &- \cdots + (-1)^{n+1} P(A_1 \cap ... \cap A_n) \end{aligned}
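The alternating sum can be checked mechanically by iterating over all non-empty index subsets. A small sketch (the choice of events, multiples of 2, 3 and 5 in \{1, ..., 100\}, is mine):

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations

omega = set(range(1, 101))

def prob(event):
    return Fraction(len(event), len(omega))

events = [{k for k in omega if k % m == 0} for m in (2, 3, 5)]

# Alternating sum over all non-empty sub-collections of events.
total = Fraction(0)
for r in range(1, len(events) + 1):
    for sub in combinations(events, r):
        total += (-1) ** (r + 1) * prob(reduce(set.intersection, sub))

assert total == prob(set().union(*events))   # equals P(A_1 ∪ A_2 ∪ A_3)
```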
Def. Conditional Probability
Let A and B be events with P(B) > 0; then we define the conditional probability of A given B as
P(A \mid B) := \frac{P(A \cap B)}{P(B)}
Thm. Conditional Probability
P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)
More generally (the chain rule),
P(A_1, ... \>, A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2, A_1) \cdots P(A_n \mid A_{n-1}, ... \>, A_1)
and the factorization can be done in any order; for example, with n = 3,
P(A_1, A_2, A_3) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2, A_1) = P(A_2) P(A_3 \mid A_2) P(A_1 \mid A_2, A_3)
Thm. Bayes’ Theorem
Let A and B be events with P(B) > 0; then we have
P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}
where P(A) is called the prior and P(A \mid B) is called the posterior probability of A.
Thm. Law of Total Probability (LOTP)
Let A_1, ..., A_n partition the sample space \Omega, with P(A_i) > 0 so the conditional probabilities are defined; then
P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i)
Thm. Generalized Bayes’ Theorem
Let A_1, ..., A_n be a partition of the sample space \Omega such that each A_i and B have positive probability; then
P(A_i \mid B) = \dfrac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^n P(B \mid A_j) P(A_j)}
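A standard illustration combining Bayes' theorem with LOTP is diagnostic testing; the numbers below are made up for the sketch:

```python
# Assumed numbers: prevalence 1%, sensitivity 95%, false-positive rate 2%.
p_d = 0.01                # P(D), the prior
p_pos_given_d = 0.95      # P(+ | D)
p_pos_given_dc = 0.02     # P(+ | D^c)

# LOTP with the partition {D, D^c}: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c).
p_pos = p_pos_given_d * p_d + p_pos_given_dc * (1 - p_d)

# Bayes' theorem: the posterior P(D | +).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))   # ≈ 0.3242: a positive test is far from conclusive
```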
Def. Odds
\text{odds}(A) := \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}
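The two maps are inverse to each other, as a quick numeric check confirms:

```python
def odds(p):
    # odds(A) = P(A) / (1 - P(A))
    return p / (1 - p)

def prob_from_odds(o):
    # P(A) = odds(A) / (1 + odds(A))
    return o / (1 + o)

p = 0.75
print(odds(p))                                   # 3.0, i.e. odds of 3 to 1
assert abs(prob_from_odds(odds(p)) - p) < 1e-12  # round-trips back to p
```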
Remark. Conditional Probabilities Are Probabilities
- 0 \leq P(A \mid E) \leq 1.
- P(\varnothing \mid E) = 0 and P(\Omega \mid E) = 1.
- If A_1, A_2, ... are disjoint events then P(\bigcup_j A_j \mid E) = \sum_{j} P(A_j \mid E).
- P(A^c \mid E) = 1 - P(A \mid E).
- (Inclusion-Exclusion) P(A \cup B \mid E) = P(A \mid E) + P(B \mid E) - P(A \cap B \mid E).
So conditional probability is itself a probability. Conversely, any unconditional probability can be seen as a conditional probability, since P(A) = P(A \mid \Omega).
Thm. Bayes with Extra Condition
Provided P(A \cap E) > 0 and P(B \cap E) > 0 we have
P(A \mid B, E) = \frac{P(B \mid A, E) P(A \mid E)}{P(B \mid E)}
Thm. LOTP with Extra Condition
Let A_1, ..., A_n partition \Omega with P(A_i \cap E) > 0 for all i; then
P(B \mid E) = \sum_{i=1}^{n} P(B \mid A_i, E)P(A_i \mid E)
Def. Independence
Two events A and B are called independent if (and only if)
P(A \cap B) = P(A)P(B)
Note that independence is completely different from disjointness. Disjoint events A and B can be independent only if P(A)=0 or P(B)=0.
Therefore, just recall that “disjoint events with positive probability are not independent”.
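To see the distinction concretely: two fair coin flips give events that are independent yet overlap (my own toy check):

```python
from fractions import Fraction

omega = {(c1, c2) for c1 in "HT" for c2 in "HT"}   # two fair coin flips

def prob(event):
    return Fraction(len(event), len(omega))

A = {s for s in omega if s[0] == "H"}   # first flip is heads
B = {s for s in omega if s[1] == "H"}   # second flip is heads

assert prob(A & B) == prob(A) * prob(B)   # independent...
assert A & B                              # ...but not disjoint
```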
Thm. TFAE
The following are equivalent if P(A)>0 and P(B)>0:
- A and B are independent.
- P(A \mid B)=P(A).
- P(B \mid A)=P(B).
So if A and B are independent, knowing A gives us no information about B. Disjointness is the opposite: if A and B are disjoint, knowing that A occurred tells us that B did not.
Thm. Independence of Complements
If A and B are independent events then so are
- A and B^c,
- A^c and B,
- A^c and B^c.
Def. 3-independence
Events A, B and C are independent if
\begin{aligned} P(A \cap B) &= P(A) P(B) \\ P(A \cap C) &= P(A) P(C) \\ P(B \cap C) &= P(B) P(C) \\ P(A \cap B \cap C) &= P(A) P(B) P(C) \end{aligned}
Beware that pairwise independence does not imply independence!
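The classic counterexample uses two fair coin flips with C = "the flips agree": every pair is independent, but the triple condition fails. A quick enumeration:

```python
from fractions import Fraction

omega = {(c1, c2) for c1 in "HT" for c2 in "HT"}   # two fair coin flips

def prob(event):
    return Fraction(len(event), len(omega))

A = {s for s in omega if s[0] == "H"}    # first flip is heads
B = {s for s in omega if s[1] == "H"}    # second flip is heads
C = {s for s in omega if s[0] == s[1]}   # the two flips agree

# All three pairs are independent...
assert prob(A & B) == prob(A) * prob(B)
assert prob(A & C) == prob(A) * prob(C)
assert prob(B & C) == prob(B) * prob(C)
# ...but knowing A and B determines C, so the triple condition fails.
assert prob(A & B & C) != prob(A) * prob(B) * prob(C)
```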
Def. n-independence
Events A_1, ..., A_n are independent if they are:
- pairwise independent,
- triplewise independent,
- quadruplewise independent,
- …
- n-wise independent, i.e., P(A_1 \cap ... \cap A_n) = P(A_1) ... P(A_n).
Def. Conditional Independence
Events A and B are called conditionally independent given an event E with P(E) > 0 if
P(A \cap B \mid E) = P(A \mid E) \> P(B \mid E)
Independence does not imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they need not be conditionally independent given E^c.
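For instance, with two fair dice, A = "first die is 6" and B = "second die is 6" are independent, but not conditionally independent given E = "the sum is 7", since E forces at most one of them to occur. A small enumeration check:

```python
from fractions import Fraction

omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}   # two fair dice

def prob(event):
    return Fraction(len(event), len(omega))

def cond(event, given):
    # P(event | given) = P(event ∩ given) / P(given)
    return prob(event & given) / prob(given)

A = {s for s in omega if s[0] == 6}          # first die shows 6
B = {s for s in omega if s[1] == 6}          # second die shows 6
E = {s for s in omega if s[0] + s[1] == 7}   # the sum is 7

assert prob(A & B) == prob(A) * prob(B)            # independent...
assert cond(A & B, E) != cond(A, E) * cond(B, E)   # ...but not given E
```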
Remarks
TODO: Sigma Algebra, Field and Borel (p. 26)
2. Random Variables
Def. Random Variable (R.V.)
A random variable X is just a map (function)
X: \Omega \to \R
which maps the elements (called outcomes) of \Omega to the real line \R.
Technically it must be a measurable function.
Def. Cumulative Distribution Function (CDF)
Given a random variable X, the cumulative distribution function F_X is defined as
\def\arraystretch{1.25} \begin{array}{rcl} F_X: \enspace \R &\to& [0, 1] \subseteq \R \\ x &\mapsto& P(X \leq x) \end{array}
Def. Discrete Random Variable
A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.
If X is a discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.
Def. Probability Mass Function
The probability mass function (PMF) of a discrete r.v. X is the function p_X given by
p_X (x) = P(X=x)
Note that this is positive if x is in the support of X and 0 otherwise.
Remark. Notation
We use X = x to denote the event \{s \in \Omega \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.
Thm. Valid PMFs
Let X be a discrete random variable with support x_1, x_2, ... (where the x_j are distinct). The PMF p_X of X must satisfy the following:
- p_X(x) > 0 if x = x_j for some j, and p_X(x) = 0 otherwise.
- \sum_{j} p_X(x_j) = 1
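As a concrete example, here is the PMF of the sum of two fair dice, built by counting outcomes and checked against both conditions (my own sketch):

```python
from collections import Counter
from fractions import Fraction

# X = sum of two fair dice; count how many of the 36 outcomes give each sum.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

assert all(p > 0 for p in pmf.values())   # positive exactly on the support {2, ..., 12}
assert sum(pmf.values()) == 1             # the probabilities sum to 1
print(pmf[7])                             # 1/6, the most likely value
```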
3. Discrete Distributions
Def. Bernoulli Distribution
Wikipedia: Bernoulli Distribution
A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.
We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as “is distributed as”.
The parameter p is often called the success probability of the \text{Bern}(p) distribution.
Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution with p = P(X=1).
Def. Indicator Random Variable
The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.
Note that I_A \thicksim \text{Bern}(p) with p = P(A).
Def. Binomial Distribution
Wikipedia: Binomial Distribution
Suppose n independent Bernoulli trials are performed, each with the same success probability p, and let the random variable X be the number of successes.
The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1], denoted by X \thicksim \text{Bin}(n, p).
Thm. Binomial PMF
If X \thicksim \text{Bin}(n, p), then the PMF of X is
P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}
for k \in \N with k \leq n; if k > n, then P(X=k)=0.
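A direct Python transcription using math.comb, plus a check that the PMF sums to 1 (the parameter values are arbitrary):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Bin(n, p); zero outside {0, ..., n}.
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12
print(binom_pmf(3, n, p))   # ≈ 0.2668
```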
Thm. ~
Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).
Thm. ~
Let X \thicksim \text{Bin}(n, \frac{1}{2}) with n even. Then the distribution of X is symmetric about \frac{n}{2}, i.e.,
P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)
for all j \geq 0.
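The symmetry is quick to verify numerically (n = 10 is an arbitrary even choice):

```python
from fractions import Fraction
from math import comb

def binom_half_pmf(k, n):
    # P(X = k) for X ~ Bin(n, 1/2), computed exactly.
    return Fraction(comb(n, k), 2**n)

n = 10   # even, so the PMF is symmetric about n/2 = 5
for j in range(n // 2 + 1):
    assert binom_half_pmf(n // 2 + j, n) == binom_half_pmf(n // 2 - j, n)
```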
Thm. Hypergeometric Distribution
Wikipedia: Hypergeometric Distribution
Consider an urn with w white and b black balls. Draw n balls at random without replacement, so that all \binom{w+b}{n} samples of size n are equally likely, and let X be the number of white balls drawn. Then X has the Hypergeometric distribution with parameters w, b and n, written X \thicksim \text{HGeom}(w, b, n).
If X \thicksim \text{HGeom}(w, b, n), then the PMF of X is
P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}
for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.
Thm. ~
The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.
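Both the PMF and the swap identity above can be checked exactly with math.comb (the values w = 5, b = 7, n = 4 are arbitrary):

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    # P(X = k) for X ~ HGeom(w, b, n); comb returns 0 when k exceeds w,
    # so out-of-range k automatically gets probability 0.
    if k < 0 or n - k < 0:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 5, 7, 4
assert sum(hgeom_pmf(k, w, b, n) for k in range(n + 1)) == 1
# HGeom(w, b, n) and HGeom(n, w + b - n, w) assign the same probabilities:
assert all(hgeom_pmf(k, w, b, n) == hgeom_pmf(k, n, w + b - n, w)
           for k in range(n + 1))
```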
Def. Discrete Uniform Distribution
Wikipedia: Discrete Uniform Distribution
Let C be a finite, nonempty set. The PMF of X \thicksim \text{DUnif}(C) is
P(X=x) = \dfrac{1}{|C|}
for x \in C and 0 otherwise.
Def. Cumulative Distribution Function
The cumulative distribution function of a random variable X (not necessarily discrete) is the function F_X where
F_X(x) =P(X \leq x)
Thm. Valid CDFs
For any CDF F_X, or simply F, we have
- x_1 \leq x_2 \implies F(x_1) \leq F(x_2) (F is non-decreasing)
- F(a) = \lim_{x \to a^+} F(x) (F is right-continuous)
- \lim_{x \to - \infty} F(x) = 0
- \lim_{x \to \infty} F(x) = 1
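These properties are visible in the step-function CDF of a fair die; a small sketch (the spot checks are mine):

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}   # fair six-sided die

def F(x):
    # F(x) = P(X <= x): a non-decreasing, right-continuous step function.
    return sum(p for k, p in pmf.items() if k <= x)

assert F(0) == 0 and F(6) == 1             # tends to 0 on the left, 1 on the right
assert F(2) == F(2.5) == Fraction(1, 3)    # flat between the jumps
assert F(3) - F(2.999) == Fraction(1, 6)   # jump of size P(X = 3) at x = 3
```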
Def. Function of a Random Variable
For a random variable X on the sample space \Omega and a function h: \R \to \R, the random variable h(X) maps s \in \Omega to h(X(s)).
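In code, h(X) is literally the composition h ∘ X; a tiny illustration (the particular h is an arbitrary choice):

```python
omega = range(1, 7)               # outcomes of a die roll

def X(s):
    return s                      # X: the number rolled

def h(x):
    return (x - 3.5) ** 2         # squared deviation from 3.5

def hX(s):
    return h(X(s))                # the random variable h(X): s -> h(X(s))

print({s: hX(s) for s in omega})  # {1: 6.25, 2: 2.25, ..., 6: 6.25}
```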