Preliminaries

For now, this note, unlike my others, is just a brief summary of the primary reference below. So if you are trying to learn the topic, reading the reference directly might be a better idea.

Resources Used

  • All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
  • Introduction to Probability, 2nd Ed. by Joseph K. Blitzstein and Jessica Hwang

Notation

  • 0 \in \N and \N^+ = \N \setminus \{0\}.

1. Probability

Def. Probability

A probability space consists of a sample space \Omega and a probability function (also called a probability distribution or probability measure) P that maps each event A \subseteq \Omega to a number P(A) \in [0, 1]. The function P must satisfy the following axioms:

  • P(\varnothing) = 0 and P(\Omega)=1.
  • If A_1, A_2, \ldots are disjoint (mutually exclusive) events, then
P\left(\bigcup_{j} A_j\right) = \sum_{j} P(A_j)

Elements of \Omega are called sample outcomes, realizations, or elements.

So an event is a subset of the sample space, and the probability measure satisfies a few simple yet very powerful axioms.
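As a sanity check, the axioms can be verified numerically on a tiny finite space. The sketch below (my own example, not from the references) models a fair six-sided die in Python, with events as subsets of \Omega:

```python
from fractions import Fraction

# A finite probability space: a fair six-sided die.
omega = frozenset(range(1, 7))

def P(event):
    """Uniform probability measure on omega; `event` is any subset of omega."""
    return Fraction(len(event & omega), len(omega))

even = frozenset({2, 4, 6})

# Axioms: P(empty set) = 0 and P(omega) = 1.
assert P(frozenset()) == 0 and P(omega) == 1
# Additivity for the disjoint events {2, 4, 6} and {1}.
assert P(even | frozenset({1})) == P(even) + P(frozenset({1}))
# A derived property: P(A^c) = 1 - P(A).
assert P(omega - even) == 1 - P(even)
```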

Thm. Basic Probability Properties

For any events A and B:

  • P(A^c) = 1 - P(A).
  • A \subseteq B \implies P(A) \leq P(B).
  • P(A \cup B) = P(A) + P(B) - P(A \cap B)

Def. Monotone Increase/Decrease

A sequence of sets A_1, A_2, ... is said to be monotone increasing if

A_1 \subseteq A_2 \subseteq \cdots

and monotone decreasing if

A_1 \supseteq A_2 \supseteq \cdots

In the former case we define the limit

\lim_{n \to \infty} A_n = \bigcup_{i = 1}^{\infty} A_i

and for the latter case we define

\lim_{n \to \infty} A_n = \bigcap_{i = 1}^{\infty} A_i

Either case is denoted by A_n \to A.

Thm. Continuity of Probabilities

Let A_n \to A; then P(A_n) \to P(A) as n \to \infty.

Def. Uniform Probability Distribution

If the sample space \Omega is finite and if each outcome is equally likely, then

P(A) = \dfrac{|A|}{|\Omega|}

and this P is called the uniform probability distribution.
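Under the uniform distribution, computing a probability reduces to counting. A small sketch (my own example): the probability that two fair dice sum to 7.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of rolls of two fair dice (36 equally likely outcomes).
omega = list(product(range(1, 7), repeat=2))

# Event A: the two rolls sum to 7.
A = [s for s in omega if sum(s) == 7]

# P(A) = |A| / |omega| under the uniform distribution.
p = Fraction(len(A), len(omega))
assert p == Fraction(1, 6)
```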

Thm. Inclusion-Exclusion

\begin{array}{llll} P \left(\> \bigcup_{i=1}^{n} A_i \right) =& +& \sum_{i} & P(A_i) \\ &-& \sum_{i \> < \> j} & P(A_i \cap A_j) \\ &+& \sum_{i \> < \> j \> < \> k} & P(A_i \cap A_j \cap A_k) \\ &\vdots& & \\ &+& (-1)^{n+1} & P(A_1 \cap \cdots \cap A_n) \end{array}
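The alternating sum can be checked mechanically against a direct union computation. A sketch (my own example, not from the references) over arbitrary finite sets:

```python
from fractions import Fraction
from itertools import combinations

def union_prob(events, omega):
    """Inclusion-exclusion: P(union of A_i) as an alternating sum over intersections."""
    total = Fraction(0)
    n = len(events)
    for k in range(1, n + 1):
        sign = (-1) ** (k + 1)
        for subset in combinations(events, k):
            inter = frozenset.intersection(*subset)
            total += sign * Fraction(len(inter), len(omega))
    return total

omega = frozenset(range(1, 13))  # uniform space with 12 outcomes
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})
C = frozenset({4, 6, 8})

direct = Fraction(len(A | B | C), len(omega))
assert union_prob([A, B, C], omega) == direct  # both give 7/12
```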

Def. Conditional Probability

Let A and B be events with P(B) > 0; then we define the conditional probability of A given B as

P(A \mid B) := \frac{P(A \cap B)}{P(B)}

Thm. Conditional Probability

P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)

P(A_1, \ldots, A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2, A_1) \cdots P(A_n \mid A_{n-1}, \ldots, A_1)

P(A_1, A_2, A_3) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2, A_1) = P(A_2) P(A_3 \mid A_2) P(A_1 \mid A_2, A_3)

Thm. Bayes' Theorem

Let A and B be events with P(A) > 0 and P(B) > 0; then we have

P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}

where P(A) is called the prior and P(A \mid B) is called the posterior probability of A.

Thm. Law of Total Probability (LOTP)

Let A_1, ..., A_n partition the sample space S with P(A_i) > 0 for all i; then

P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i)

Thm. Generalized Bayes' Theorem

Let A_1, ..., A_n be a partition of the sample space \Omega such that each A_i and B has a positive probability, then

P(A_i \mid B) = \dfrac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^n P(B \mid A_j) P(A_j)}
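A small numerical sketch of the generalized Bayes formula (the numbers are hypothetical, my own example): three machines produce 50%, 30%, and 20% of all items, with defect rates 1%, 2%, and 3%; observing a defective item (event B), we update the prior over machines.

```python
# Hypothetical numbers: the events A_i ("item came from machine i") partition
# the sample space, and B is "the item is defective".
priors = [0.50, 0.30, 0.20]   # P(A_i)
likes  = [0.01, 0.02, 0.03]   # P(B | A_i)

# LOTP gives the denominator P(B); Bayes gives each posterior P(A_i | B).
p_b = sum(l * p for l, p in zip(likes, priors))
posteriors = [l * p / p_b for l, p in zip(likes, priors)]

assert abs(p_b - 0.017) < 1e-12
assert abs(sum(posteriors) - 1) < 1e-12   # the posteriors form a distribution
```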

Def. Odds

\text{odds}(A) := \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)} \implies P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}
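The two formulas are inverses of each other, which a quick round-trip check makes concrete (my own sketch):

```python
def odds(p):
    """odds(A) = P(A) / (1 - P(A)), for 0 <= p < 1."""
    return p / (1 - p)

def prob_from_odds(o):
    """Invert: P(A) = odds(A) / (1 + odds(A))."""
    return o / (1 + o)

p = 0.75
assert odds(p) == 3.0                              # "3-to-1 odds"
assert abs(prob_from_odds(odds(p)) - p) < 1e-12    # round trip recovers p
```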

Remark. Conditional Probabilities Are Probabilities

  • 0 \leq P(A \mid E) \leq 1.
  • P(\varnothing \mid E) = 0 and P(S \mid E) = 1.
  • If A_1, A_2, ... are disjoint events then P(\bigcup_j A_j \mid E) = \sum_{j} P(A_j \mid E).
  • P(A^c \mid E) = 1 - P(A \mid E).
  • (Inclusion-Exclusion) P(A \cup B \mid E) = P(A \mid E) + P(B \mid E) - P(A \cap B \mid E).

So, the conditional probability is also a probability. Similarly, we can see probability as a conditional probability.

Thm. Bayes with Extra Condition

Provided P(A \cap E) > 0 and P(B \cap E) > 0 we have

P(A \mid B, E) = \frac{P(B \mid A, E) P(A \mid E)}{P(B \mid E)}

Thm. LOTP with Extra Condition

Let A_1, ..., A_n partition S and P(A_i \cap E) > 0 for all i, then

P(B \mid E) = \sum_{i=1}^{n} P(B \mid A_i, E)P(A_i \mid E)

Def. Independence

Two events A and B are called independent if (and only if)

P(A \cap B) = P(A)P(B)

Note that independence is completely different from disjointness: disjoint events A and B can be independent only if P(A) = 0 or P(B) = 0.

So just remember that disjoint events with positive probability are never independent.
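The contrast can be checked by enumeration. In the sketch below (my own example), with two fair dice, "first die even" and "sum is 7" overlap yet are independent, while "first die even" and "first die is 1" are disjoint with positive probability and hence not independent:

```python
from fractions import Fraction
from itertools import product

omega = frozenset(product(range(1, 7), repeat=2))  # two fair dice

def P(E):
    return Fraction(len(E), len(omega))

A = frozenset(s for s in omega if s[0] % 2 == 0)   # first die even
B = frozenset(s for s in omega if sum(s) == 7)     # sum equals 7
# A and B overlap yet are independent:
assert P(A & B) == P(A) * P(B)

C = frozenset(s for s in omega if s[0] == 1)       # disjoint from A
assert A & C == frozenset() and P(A) > 0 and P(C) > 0
assert P(A & C) != P(A) * P(C)                     # disjoint but NOT independent
```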

Thm. TFAE

The following are equivalent if P(A)>0 and P(B)>0,

  • A and B are independent.
  • P(A \mid B)=P(A).
  • P(B \mid A)=P(B).

So, knowing A gives us no information about B. This may not be the case with disjointness.

Thm. Independence of Complements

If A and B are independent events then so are

  • A and B^c,
  • A^c and B,
  • A^c and B^c.

Def. 3-independence

Events A, B and C are independent if

\begin{array}{ll} P(A \cap B) &= P(A) P(B) \\ P(A \cap C) &= P(A) P(C) \\ P(B \cap C) &= P(B) P(C) \\ P(A \cap B \cap C) &= P(A) P(B) P(C) \end{array}

Beware that pairwise independence does not imply independence!
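The classic counterexample (two fair coin flips; my own rendering of a standard example) can be verified by enumeration: the three events are pairwise independent, but the triple product condition fails.

```python
from fractions import Fraction
from itertools import product

omega = frozenset(product("HT", repeat=2))         # two fair coin flips

def P(E):
    return Fraction(len(E), len(omega))

A = frozenset(s for s in omega if s[0] == "H")     # first flip heads
B = frozenset(s for s in omega if s[1] == "H")     # second flip heads
C = frozenset(s for s in omega if s[0] == s[1])    # both flips agree

# All three pairs are independent...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but the triple condition fails, so A, B, C are not independent.
assert P(A & B & C) != P(A) * P(B) * P(C)
```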

Def. n-independence

Events A_1, ..., A_n are independent if they are:

  • pairwise independent,
  • triplewise independent,
  • quadruplewise independent,
  • ...
  • and finally n-wise independent: P(A_1 \cap ... \cap A_n) = P(A_1) ... P(A_n).

Def. Conditional Independence

Events A and B are called conditionally independent given event E if

P(A \cap B \mid E) = P(A \mid E) \> P(B \mid E)

Independence does not imply conditional independence, and vice versa. Also, if A and B are conditionally independent given E, they may not be conditionally independent given E^c.
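A standard illustration (my own hypothetical numbers): pick one of two coins with equal probability, where coin 1 lands heads with probability 1/2 and coin 2 with probability 3/4, then flip the chosen coin twice. Given which coin was picked, the flips are independent; unconditionally, they are not (the first flip carries information about the coin, hence about the second flip).

```python
from fractions import Fraction
from itertools import product

heads_prob = {1: Fraction(1, 2), 2: Fraction(3, 4)}  # P(heads) for each coin

# Joint distribution over outcomes (coin, flip1, flip2).
joint = {}
for coin, f1, f2 in product(heads_prob, "HT", "HT"):
    q = heads_prob[coin]
    pr = Fraction(1, 2)                  # coin chosen uniformly
    pr *= q if f1 == "H" else 1 - q
    pr *= q if f2 == "H" else 1 - q
    joint[(coin, f1, f2)] = pr

def P(pred):
    return sum(pr for s, pr in joint.items() if pred(s))

A = lambda s: s[1] == "H"                # first flip heads
B = lambda s: s[2] == "H"                # second flip heads
E = lambda s: s[0] == 2                  # coin 2 was picked
both = lambda *ps: (lambda s: all(p(s) for p in ps))

# Conditionally independent given E:
assert P(both(A, B, E)) / P(E) == (P(both(A, E)) / P(E)) * (P(both(B, E)) / P(E))
# ...but NOT independent unconditionally:
assert P(both(A, B)) != P(A) * P(B)
```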

Remarks

TODO: Sigma Algebra, Field and Borel (p. 26)

2. Random Variables

Def. Random Variable (R.V.)

A random variable X is just a map (function)

X: \Omega \to \R

which maps the elements (called outcomes) of \Omega to the real line \R.

Technically it must be a measurable function.

Def. Cumulative Distribution Function (CDF)

Given a random variable X, the cumulative distribution function F_X is defined as

\def\arraystretch{1.25} \begin{array}{rcl} F_X: \enspace \R &\to& [0, 1] \subseteq \R \\ x &\mapsto& P(X \leq x) \end{array}

Def. Discrete Random Variable

TODO: Revise this

A random variable X is said to be discrete if there is a countable (finite or countably infinite) list of values a_1, a_2, ... such that P(X=a_j \enspace \text{for some} \enspace j)=1.

If X is discrete r.v., then the countable set of values x such that P(X=x) > 0 is called the support of X.

Def. Probability Mass Function

The probability mass function (PMF) of a discrete r.v. X is the function p_X given by

p_X (x) = P(X=x)

Note that this is positive if x is in the support of X and 0 otherwise.

Remark. Notation

We use X = x to denote the event \{s \in S \> | \> X(s) = x \}. We cannot take the probability of a random variable, only of an event.
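Treating X literally as a function makes this notation concrete. In the sketch below (my own example) the sample space is three fair coin flips, X counts heads, and p_X(x) is the probability of the event \{s \in S \mid X(s) = x\}:

```python
from fractions import Fraction
from itertools import product

# Sample space for three fair coin flips; X maps an outcome to its number of heads.
S = list(product("HT", repeat=3))
X = lambda s: s.count("H")

def pmf(x):
    """p_X(x) = P(X = x) = P({s in S : X(s) = x}) under the uniform measure."""
    event = [s for s in S if X(s) == x]
    return Fraction(len(event), len(S))

assert pmf(0) == Fraction(1, 8)
assert pmf(2) == Fraction(3, 8)
assert sum(pmf(x) for x in range(4)) == 1   # a valid PMF sums to 1
```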

Thm. Valid PMFs

TODO: Rewrite

Let X be a discrete random variable with countable support x_1, x_2, ... (where each x_j is distinct for notational simplicity). The PMF p_X of X must satisfy the following:

  • p_X(x) > 0 if x = x_j for some j, and p_X(x) = 0 otherwise.
  • \sum_{j} p_X(x_j) = 1

3. Discrete Distributions

Def. Bernoulli Distribution

Wikipedia: Bernoulli Distribution

A random variable X is said to have the Bernoulli Distribution with parameter p if P(X = 1) = p and P(X = 0) = 1 - p, where 0 < p < 1.

We write this as X \thicksim \text{Bern}(p). The symbol \thicksim is read as "is distributed as".

The parameter p is often called the success probability of the \text{Bern}(p) distribution.

Any random variable whose possible values are 0 and 1 has a \text{Bern}(p) distribution, with p = P(X = 1).

Def. Indicator Random Variable

The indicator random variable or Bernoulli random variable of an event A is the random variable which equals 1 if A occurs and 0 otherwise. We will denote the indicator random variable of A by I_A.

Note that I_A \thicksim \text{Bern}(p) with p = P(A).

Def. Binomial Distribution

TODO: Rewrite

Wikipedia: Binomial Distribution

Suppose n independent Bernoulli trials are performed, each with the same success probability p. Let the random variable X be the number of successes.

The distribution of X is called the Binomial distribution with parameters n \in \N^+ and p \in [0, 1], denoted by X \thicksim \text{Bin}(n, p).

Thm. Binomial PMF

If X \thicksim \text{Bin}(n, p), then the PMF of X is

P(X = k) = \dbinom{n}{k} p^k (1-p)^{n-k}

for k \in \N with k \leq n. If k > n, then P(X = k) = 0.
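The PMF is easy to compute exactly with binomial coefficients. A sketch (my own example) using exact rational arithmetic, including the check that the PMF sums to 1:

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ Bin(n, p); zero outside 0 <= k <= n."""
    if not 0 <= k <= n:
        return 0 * p                     # zero of the same numeric type as p
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, Fraction(3, 10)
assert sum(binom_pmf(n, p, k) for k in range(n + 1)) == 1   # valid PMF
assert binom_pmf(n, p, 11) == 0                             # k > n
```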

Thm. ~

Let X \thicksim \text{Bin}(n, p) and q = 1 -p. Then n - X \thicksim \text{Bin}(n, q).

Thm. ~

Let X \thicksim \text{Bin}(n, \frac{1}{2}) and n even. Then the distribution of X is symmetric about \frac{n}{2} such that

P(X = \frac{n}{2} + j) = P(X = \frac{n}{2} - j)

for all j \geq 0.

Thm. Hypergeometric Distribution

Wikipedia: Hypergeometric Distribution

Consider an urn with w white and b black balls, from which n balls are drawn without replacement, and let X be the number of white balls drawn. Then X \thicksim \text{HGeom}(w, b, n), and the PMF of X is

P(X = k) = \dfrac{\dbinom{w}{k} \dbinom{b}{n- k}}{\dbinom{w+b}{n}}

for 0 \leq k \leq w and 0 \leq n-k \leq b, and P(X=k)=0 otherwise.

Thm. ~

The distributions \text{HGeom}(w, b, n) and \text{HGeom}(n, w + b -n, w) are identical.
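Both the validity of the PMF and this identity can be verified numerically. A sketch (my own example) with exact arithmetic:

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(w, b, n, k):
    """P(X = k) for X ~ HGeom(w, b, n); zero outside the support."""
    if not (0 <= k <= w and 0 <= n - k <= b):
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 6, 4, 5
# The PMF sums to 1 (Vandermonde's identity).
assert sum(hgeom_pmf(w, b, n, k) for k in range(n + 1)) == 1
# HGeom(w, b, n) and HGeom(n, w + b - n, w) assign the same probabilities.
for k in range(n + 1):
    assert hgeom_pmf(w, b, n, k) == hgeom_pmf(n, w + b - n, w, k)
```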

Def. Discrete Uniform Distribution

Wikipedia: Discrete Uniform Distribution

Let C be a finite, nonempty set. The PMF of X \thicksim \text{DUnif}(C), the discrete uniform distribution on C, is

P(X=x) = \dfrac{1}{|C|}

for x \in C and 0 otherwise.

Def. Cumulative Distribution Function

The cumulative distribution function of a random variable X (not necessarily discrete) is the function F_X where

F_X(x) =P(X \leq x)

Thm. Valid CDFs

For any CDF F_X, or simply F, we have

  • x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
  • F(a) = \lim_{x \to a^+} F(x)
  • \lim_{x \to - \infty} F(x) = 0
  • \lim_{x \to \infty} F(x) = 1
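The four properties above can be spot-checked on a concrete CDF. A sketch (my own example): the CDF of a fair six-sided die, which is a right-continuous step function.

```python
from fractions import Fraction

def F(x):
    """CDF of a fair six-sided die: F(x) = P(X <= x)."""
    k = min(6, int(x // 1)) if x >= 1 else 0   # number of faces <= x
    return Fraction(k, 6)

xs = [-1, 0.5, 1, 2.5, 3, 6, 100]
# Monotone increasing:
assert all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))
# Limits at the ends:
assert F(-10**9) == 0 and F(10**9) == 1
# Right-continuity at a jump point, e.g. x = 3: F(3 + eps) equals F(3).
assert F(3 + 1e-12) == F(3) == Fraction(1, 2)
```

Note that F jumps at each face value but is flat in between, which is typical of discrete random variables.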

Def. Function of a Random Variable

For a random variable X on the sample space S and a function h: \R \to \R, the random variable h(X) maps s \in S to h(X(s)).