
Probability Theory

For now this note, unlike my others, is only a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference directly might be a better idea.

References

Notation

1. Probability

Def. Probability

A probability space consists of a sample space \Omega and a probability function (or probability distribution or probability measure) P that maps each event A \subseteq \Omega to P(A) \in [0, 1]. The function P must satisfy the following axioms:

- P(A) \geq 0 for every event A
- P(\Omega) = 1
- If A_1, A_2, ... are disjoint events, then P \left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i)

Elements of \Omega are called sample outcomes, realizations, or elements.

So basically an event is a subset of the sample space and the probability measure satisfies some simple yet very powerful axioms.

Thm. Basic Probability Properties

For any events A and B:

- P(\varnothing) = 0
- A \subseteq B \implies P(A) \leq P(B)
- P(A^c) = 1 - P(A)
- P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B)

Def. Monotone Increase/Decrease

A sequence of sets A_1, A_2, ... is said to be monotone increasing if

A_1 \subseteq A_2 \subseteq \cdots

and monotone decreasing if

A_1 \supseteq A_2 \supseteq \cdots

In the former case we define the limit

\lim_{n \to \infty} A_n = \bigcup_{i = 1}^{\infty} A_i

and for the latter case we define

\lim_{n \to \infty} A_n = \bigcap_{i = 1}^{\infty} A_i

In either case we write A_n \to A.

Thm. Continuity of Probabilities

Let A_n \to A, then P(A_n) \to P(A) as n \to \infty.

Def. Uniform Probability Distribution

If the sample space \Omega is finite and if each outcome is equally likely, then

P(A) = \dfrac{|A|}{|\Omega|}

and P is called the uniform probability distribution.
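As a quick sanity check, the classic two-dice example (my own illustration, not from the reference):

```python
from fractions import Fraction

# Sample space: all ordered pairs of two fair six-sided dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# Event A: the dice sum to 7.
A = [(i, j) for (i, j) in omega if i + j == 7]

# Uniform distribution: P(A) = |A| / |Omega|.
p_A = Fraction(len(A), len(omega))
print(p_A)  # 1/6
```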

Thm. Inclusion-Exclusion

\begin{array}{lll} P \left( \bigcup_{i=1}^{n} A_i \right) &=& \sum_{i} P(A_i) \\ && {} - \sum_{i < j} P(A_i \cap A_j) \\ && {} + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) \\ && {} - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n) \end{array}
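The identity can be verified by brute force on a small example (a single fair die with three hand-picked overlapping events; illustration only):

```python
from fractions import Fraction
from itertools import combinations

# Uniform measure on a fair die.
omega = set(range(1, 7))
P = lambda event: Fraction(len(event), len(omega))

# Three overlapping events.
events = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]

# Right-hand side: alternating sum over all k-wise intersections.
rhs = Fraction(0)
for k in range(1, len(events) + 1):
    sign = (-1) ** (k + 1)
    for combo in combinations(events, k):
        rhs += sign * P(set.intersection(*combo))

# Left-hand side: probability of the union.
lhs = P(set.union(*events))
assert lhs == rhs  # both are 5/6
```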

Def. Conditional Probability

Let A and B be events with P(B) > 0; then we define the conditional probability of A given B as

P(A \mid B) := \frac{P(A \cap B)}{P(B)}

Thm. Conditional Probability

P(A \cap B) = P(B) \, P(A \mid B) = P(A) \, P(B \mid A)

P(A_1, \ldots, A_n) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_2, A_1) \cdots P(A_n \mid A_{n-1}, \ldots, A_1)

P(A_1, A_2, A_3) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_2, A_1) = P(A_2) \, P(A_3 \mid A_2) \, P(A_1 \mid A_2, A_3)

Thm. Bayes’ Theorem

Let A and B be events with P(B) > 0; then we have

P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}

where P(A) is called the prior and P(A \mid B) is called the posterior probability of A.
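A standard numeric illustration (the diagnostic-test numbers are hypothetical, chosen only to exercise the formula):

```python
from fractions import Fraction

# Hypothetical numbers: prior P(A), likelihood P(B|A),
# and false-positive rate P(B|A^c).
p_A = Fraction(1, 100)
p_B_given_A = Fraction(9, 10)
p_B_given_Ac = Fraction(5, 100)

# P(B) via the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Posterior P(A|B) by Bayes' theorem.
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 2/13
```

Note how a small prior keeps the posterior modest even when the likelihood is strong.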

Thm. Law of Total Probability (LOTP)

Let A_1, ..., A_n partition the sample space S, then

P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i)

Thm. Generalized Bayes’ Theorem

Let A_1, ..., A_n be a partition of the sample space \Omega such that each A_i and B has a positive probability, then

P(A_i \mid B) = \dfrac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^n P(B \mid A_j) P(A_j)}

Def. Odds

\text{odds}(A) := \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}
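A round-trip check of the two conversion formulas (a sketch with my own example values):

```python
from fractions import Fraction

def odds(p):
    """odds(A) = P(A) / (1 - P(A))."""
    return p / (1 - p)

def prob_from_odds(o):
    """Invert: P(A) = odds(A) / (1 + odds(A))."""
    return o / (1 + o)

p = Fraction(3, 4)
assert odds(p) == 3                  # 3-to-1 odds
assert prob_from_odds(odds(p)) == p  # round trip recovers P(A)
```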

Remark. Conditional Probabilities Are Probabilities

So conditional probability is itself a probability: for a fixed event E with P(E) > 0, the map A \mapsto P(A \mid E) satisfies the probability axioms. Conversely, any probability can be seen as a conditional probability, namely one conditioned on \Omega.

Thm. Bayes with Extra Condition

Provided P(A \cap E) > 0 and P(B \cap E) > 0 we have

P(A \mid B, E) = \frac{P(B \mid A, E) P(A \mid E)}{P(B \mid E)}

Thm. LOTP with Extra Condition

Let A_1, ..., A_n partition S and P(A_i \cap E) > 0 for all i, then

P(B \mid E) = \sum_{i=1}^{n} P(B \mid A_i, E)P(A_i \mid E)

Def. Independence

Two events A and B are called independent if (and only if)

P(A \cap B) = P(A)P(B)

Note that independence is completely different from disjointness. Disjoint events A and B can be independent only if P(A)=0 or P(B)=0.

Therefore, just recall that “disjoint events with positive probability are not independent”.
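Both points can be checked exactly on two dice (my own example; the event choices are arbitrary):

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda e: Fraction(len(e), len(omega))
inter = lambda x, y: [w for w in x if w in y]

A = [w for w in omega if w[0] % 2 == 0]  # first die even
B = [w for w in omega if sum(w) == 7]    # dice sum to 7
C = [w for w in omega if w[0] == w[1]]   # doubles

# A and B are independent: P(A ∩ B) = P(A) P(B) = 1/12.
assert P(inter(A, B)) == P(A) * P(B)

# B and C are disjoint (no doubles sum to 7) and both have
# positive probability, hence they are NOT independent.
assert P(inter(B, C)) == 0 != P(B) * P(C)
```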

Thm. TFAE

The following are equivalent if P(A)>0 and P(B)>0:

- P(A \cap B) = P(A) P(B)
- P(A \mid B) = P(A)
- P(B \mid A) = P(B)

So, knowing A gives us no information about B. This may not be the case with disjointness.

Thm. Independence of Complements

If A and B are independent events, then so are

- A and B^c
- A^c and B
- A^c and B^c

Def. 3-independence

Events A, B and C are independent if

\begin{array}{ll} P(A \cap B) &= P(A) \, P(B) \\ P(A \cap C) &= P(A) \, P(C) \\ P(B \cap C) &= P(B) \, P(C) \\ P(A \cap B \cap C) &= P(A) \, P(B) \, P(C) \end{array}

Beware that pairwise independence does not imply independence!

Def. n-independence

Events A_1, ..., A_n are independent if every sub-collection satisfies the product rule; that is, for all indices 1 \leq i_1 < \cdots < i_k \leq n with k \geq 2,

P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})

Def. Conditional Independence

Events A and B are called conditionally independent given an event E if

P(A \cap B \mid E) = P(A \mid E) \> P(B \mid E)

Independence does not imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they may not be conditionally independent given E^c.
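The first direction can be seen with two fair coins (my own example): A and B below are independent, but conditioning on E makes them dependent.

```python
from fractions import Fraction

# Two independent fair coins; 1 = heads.
omega = [(a, b) for a in (0, 1) for b in (0, 1)]
P = lambda e: Fraction(len(e), len(omega))
cond = lambda x, given: Fraction(len([w for w in x if w in given]), len(given))

A = [w for w in omega if w[0] == 1]     # first coin heads
B = [w for w in omega if w[1] == 1]     # second coin heads
E = [w for w in omega if w[0] == w[1]]  # the coins match

# Unconditionally independent:
assert P([w for w in A if w in B]) == P(A) * P(B)

# But not conditionally independent given E:
lhs = cond([w for w in A if w in B], E)  # P(A ∩ B | E) = 1/2
rhs = cond(A, E) * cond(B, E)            # P(A|E) P(B|E) = 1/4
assert lhs != rhs
```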

Remarks

TODO: Sigma Algebra, Field and Borel (p. 26)

2. Random Variables

Def. Random Variable (R.V.)

A random variable X is just a map (function)

X: \Omega \to \R

which maps the elements (called outcomes) of \Omega to the real line \R.

Technically it must be a measurable function.

Notation. Random Variable

Def. Cumulative Distribution Function (CDF)

Given a random variable X, the cumulative distribution function F_X is defined as

\def\arraystretch{1.25} \begin{array}{rcl} F_X: \enspace \R &\to& [0, 1] \subseteq \R \\ x &\mapsto& P(X \leq x) \end{array}
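For a discrete X the CDF is a step function obtained by summing the PMF; a minimal sketch for a fair die (my own example):

```python
from fractions import Fraction

# PMF of a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x), summing the PMF over the support."""
    return sum(p for k, p in pmf.items() if k <= x)

print(cdf(0))    # 0   (below the support)
print(cdf(3))    # 1/2
print(cdf(6.5))  # 1   (above the support)
```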

Def. Discrete Random Variable

TODO: Revise this

A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.

If X is a discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.

Def. Probability Mass Function

The probability mass function (PMF) of a discrete r.v. X is the function p_X given by

p_X (x) = P(X=x)

Note that this is positive if x is in the support of X and 0 otherwise.

Remark. Notation

We use X = x to denote the event \{s \in S \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.

Thm. Valid PMFs

TODO: Rewrite

Let X be a discrete random variable with countable support x_1, x_2, ... (where each x_i is distinct for notational simplicity). The PMF p_X of X must satisfy the following:

- p_X(x_j) > 0 for every x_j in the support, and p_X(x) = 0 otherwise
- \sum_{j} p_X(x_j) = 1

3. Discrete Distributions

Def. Bernoulli Distribution

Wikipedia: Bernoulli Distribution

A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.

We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as “is distributed as”.

The parameter p is often called the success probability of the \text{Bern}(p) distribution.

Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution with p = P(X = 1).

Def. Indicator Random Variable

The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.

Note that I_A \thicksim \text{Bern}(p) with p = P(A).

Def. Binomial Distribution

Todo: Rewrite

Wikipedia: Binomial Distribution

Suppose n independent Bernoulli trials are performed, each with the same success probability p. Let the random variable X be the number of successes.

The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1] denoted by X \thicksim \text{Bin}(n, p).

Thm. Binomial PMF

If X \thicksim \text{Bin}(n, p), then the PMF of X is

P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}

for k \in \N with k \leq n. If k > n, then P(X=k)=0.
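The PMF is easy to compute exactly with math.comb (a sketch; the parameter values are arbitrary):

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ Bin(n, p); zero outside 0 <= k <= n."""
    if not 0 <= k <= n:
        return Fraction(0)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, Fraction(1, 3)
# A valid PMF: the masses sum to 1 over the support.
assert sum(binom_pmf(n, p, k) for k in range(n + 1)) == 1
print(binom_pmf(n, p, 2))  # 80/243
```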

Thm. ~

Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).

Thm. ~

Let X \thicksim \text{Bin}(n, \frac{1}{2}) with n even. Then the distribution of X is symmetric about \frac{n}{2}, in the sense that

P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)

for all j \geq 0.

Thm. Hypergeometric Distribution

Todo: Further define

Wikipedia: Hypergeometric Distribution

If X \thicksim \text{HGeom}(w, b, n), then the PMF of X is

P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}

for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.
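The same exact-arithmetic approach works here (a sketch; w = 7 white, b = 3 black, n = 5 draws are arbitrary example values):

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(w, b, n, k):
    """P(X = k) for X ~ HGeom(w, b, n); zero outside the support."""
    if not (0 <= k <= w and 0 <= n - k <= b):
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

# The masses sum to 1 (Vandermonde's identity in disguise).
assert sum(hgeom_pmf(7, 3, 5, k) for k in range(6)) == 1
print(hgeom_pmf(7, 3, 5, 4))  # 5/12
```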

Thm. ~

The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.

Def. Discrete Uniform Distribution

Wikipedia: Discrete Uniform Distribution

The PMF of X \thicksim \text{DUnif}(C) is

P(X=x) = \dfrac{1}{|C|}

for x \in C and 0 otherwise.

Def. Cumulative Distribution Function

The cumulative distribution function of a random variable X (not necessarily discrete) is the function F_X where

F_X(x) =P(X \leq x)

Thm. Valid CDFs

For any CDF F_X, or simply F, we have

- F is non-decreasing: x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
- F is right-continuous: F(x) = \lim_{t \to x^+} F(t)
- \lim_{x \to -\infty} F(x) = 0 and \lim_{x \to \infty} F(x) = 1

Def. Function of a Random Variable

For a random variable X on the sample space S and a function h: \R \to \R, the random variable h(X) maps s \in S to h(X(s)).