Probability and Distributions
The Probability Set Function
Sigma-Algebra
Definition
A sigma-algebra is a Family of Sets satisfying the following properties
where is a universal set
Intersection of Sigma-Fields
Consider a set of sigma-fields on ,
- is a sigma-field on .
- is a sigma-field on .
- is a sigma-field on
Facts
If a set of subsets of a universal set is both a Pi-System and a Lambda-System, then it is a sigma-algebra on .
Probability Measure
Definition
The Measure with normality
Inclusion-Exclusion Formula
Definition
Increasing and Decreasing Sequence of Events
Definition
Facts
Let be an increasing sequence of events, then
Let be a decreasing sequence of events, then
Boole's Inequality
Definition
Conditional Probability
Definition
Facts
Bayes Theorem
Definition
Discrete Case
where by Law of Total Probability
is called a prior probability, and is called a posterior probability
Continuous Case
where is not a constant but an unknown parameter that follows a certain distribution with a parameter .
is called a prior probability, is called a likelihood, is called an evidence or marginal likelihood, and is called a posterior probability
Parameter-Centric Notation
Examples
Consider a random variable that follows a Binomial Distribution and a prior distribution that follows a Beta Distribution, where .
The PDFs are defined as and . Then, by Bayes' theorem, the posterior also follows a Beta distribution. Under Squared Error Loss, the Bayes Estimator is the mean of the posterior distribution.
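A small numerical sketch of this Beta-Binomial update (the prior parameters and the observed data below are illustrative assumptions, not values from the text):

```python
from scipy import stats

# Beta-Binomial conjugate update; alpha, beta, n, x are assumed values.
alpha, beta = 2.0, 2.0        # Beta(alpha, beta) prior on p
n, x = 10, 7                  # x successes observed in n Binomial trials

# By conjugacy, the posterior is Beta(alpha + x, beta + n - x).
posterior = stats.beta(alpha + x, beta + n - x)

# Under squared error loss, the Bayes estimator is the posterior mean.
bayes_estimator = (alpha + x) / (alpha + beta + n)
print(bayes_estimator, posterior.mean())   # both equal (alpha+x)/(alpha+beta+n)
```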
Statistical Independence
Definition
Statistical Independence of Events
Suppose a Probability Space . The events are independent if
Statistical Independence of Two Random Variables
Rigorous
Suppose a Probability Space . random variables are independent if are independent,
Casual
Two random variables are independent if and only if
Statistical Independence of Random Variables
Mutually Independent
A collection of random variables are mutually independent if and only if where is marginal CDF of
Or equivalently, a collection of random variables are mutually independent if and only if where is marginal PDF of
Pointwise Independent
A collection of random variables are Pointwise independent if and only if where is marginal CDF of and is joint pdf of and .
Statistical Independence of Stochastic Processes
Facts
Random variables are independent
where is a function of only, and is a function of only.
Random variables are independent
Random variables are independent
Random variables are independent
If Moment Generating Function exists and Random variables are independent
Mutually independence pairwise independence
If is mutually independent, then
If random variables are independent, are Borel measurable, so are
Let be random variables. and are independent
If and follow normal distributions, and are independent
Random Variables
Random Variable
Definition
A random variable is a function whose inverse function is -measurable for the two measurable spaces and .
The inverse image of an arbitrary Borel set of Codomain is an element of sigma field .
Notations
Consider a probability space
- Outcomes:
- Set of outcomes (Sample space):
- Events:
- Set of events (Sigma-Field):
- Probabilities:
- Random variable:
For a random variable on a Probability Space
- if and only if the Distribution of is
- if and only if the Distribution Function of is
For a random variable on Probability Space and another random variable on Probability Space
Transclude of Density-Function
Distribution Function
Definition
A distribution function is a function for the Random Variable on Probability Space
Facts
Proof: By definition, . So, defining is equivalent to defining
Since is a Pi-System, is uniquely determined by by Extension from Pi-System
Now, is a Measurable Space induced by . Therefore, defining on is equivalent to the defining on
Distribution function(CDF) has the following properties
- Monotonic increasing:
- Right-continuous:
If a function satisfies the following properties, then is a distribution function(CDF) of some Random Variable
- Monotonic increasing:
- Right-continuous:
Transformation of Random Variable
Definition
A new Random Variable can be defined by applying a function to the outcomes of a Random Variable
Discrete
Transformation Technique
Suppose is Bijective function
Continuous
Let be the PDF of Random Variable with a support , be a differentiable Bijective function, and
Transformation Technique
where
CDF Technique
If is increasing, then
If is decreasing, then
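A quick Monte Carlo check of the transformation technique; the monotone map Y = exp(X) with X ~ N(0, 1) is an assumed example, not one from the text:

```python
import numpy as np
from scipy import stats

# Change of variables for Y = exp(X), X ~ N(0, 1):
# f_Y(y) = f_X(g^{-1}(y)) |d g^{-1}(y)/dy| = phi(ln y) / y
rng = np.random.default_rng(0)
y = np.exp(rng.standard_normal(200_000))

hist, edges = np.histogram(y, bins=60, range=(0.2, 4.0), density=True)
centers = (edges[:-1] + edges[1:]) / 2
f_y = stats.norm.pdf(np.log(centers)) / centers   # transformed density

print(float(np.max(np.abs(hist - f_y))))   # small for large samples
```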
Expectation of a Random Variable
Expected Value
Definition
The expected value of the Random Variable on Probability Space
Continuous
where is a PDF of Random Variable
The expected value of the Random Variable when satisfies absolute continuity over ,
Discrete
where is a PMF of Random Variable
The expected value of the Random Variable when satisfies absolute continuity over . In other words, has at most countably many jumps.
Expected Value of a Function
Continuous
Discrete
Properties
Linearity
Random Variables
If , then
Matrix of Random Variables
Let be matrices of random variables, be matrices of constants, and be a matrix of constants. Then,
Notations
| Expression | Discrete | Continuous |
| --- | --- | --- |
| Expression for the event and the probability | | |
| Expression for the measurable space and the distribution | | |
Some Special Expectations
Moment
Definition
-th moment
Facts
Central Moment
Definition
-th central moment
A Moment of a random variable about its mean
Moment Generating Function
Definition
Univariate
Let be a Random Variable
Then, is called the moment generating function of the
Calculations of Moments
Moment can be calculated using the mgf
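A symbolic sketch of this calculation; the mgf M(t) = (1 - t)^{-1} of an Exp(1) variable is an assumed illustration:

```python
import sympy as sp

# Moments from the mgf: E[X^k] = M^(k)(0).
t = sp.symbols("t")
M = 1 / (1 - t)               # mgf of Exp(1), valid for t < 1

for k in range(1, 4):
    print(k, sp.diff(M, t, k).subs(t, 0))   # 1, 2, 6, i.e. E[X^k] = k!
```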
Uniqueness of MGF
Let be random variables with mgfs , respectively. Then, where
Let be a Random Variable. If , where , then determines the distribution uniquely.
Multivariate
Let be a Random Vector where is a constant vector.
Calculations of Multivariate Moments
Moment can be calculated using the mgf
Marginal MGF
Facts
The mgf may not exist.
Let be a Random Vector. If exists and can be expressed as then and are statistically independent.
Let be a Random Vector, , and , then and are independent, where and are functions.
Characteristic Function
Definition
Cumulant Generating Function
Definition
Calculations of Cumulants
the -th cumulant can be calculated using cgf
Important Inequalities
Markov's Inequality
Definition
Let be a non-negative Random Variable, and , then where serves as an upper bound for the probability
Extended Version for Non-negative Functions
Let be a non-negative Function of a Random Variable , and , then where serves as an upper bound for the probability
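A quick empirical check of the bound; the Exp(1) variable and the thresholds are assumed illustrative choices:

```python
import numpy as np

# Markov's inequality: P(X >= a) <= E[X] / a for non-negative X.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)

for a in (1.0, 2.0, 5.0):
    lhs = np.mean(x >= a)     # empirical P(X >= a)
    rhs = x.mean() / a        # Markov upper bound
    print(a, lhs, rhs)        # lhs <= rhs in every row
```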
Chebyshev's Inequality
Definition
Jensen's Inequality
Definition
Let be a Convex Function on an interval , be a Random Variable with support , and , then
Facts
By Jensen’s Inequality, the following relation is satisfied: arithmetic mean $\geq$ geometric mean $\geq$ harmonic mean
Multivariate Distribution
Random Vector
Definition
Column vector whose components are random variables .
Multivariate Distribution
Definition
Joint CDF
The Distribution Function of a Random Vector is defined as where
Joint PDF
The joint probability mass function of a discrete Random Vector
The joint probability density function of a continuous Random Vector
Expected Value of a Multivariate Function
Continuous
$$E[g(\mathbf{X})] = \idotsint g(\mathbf{x})\,f(\mathbf{x})\,dx_{1} \cdots dx_{n}$$
Discrete
Marginal Distribution of a Multivariate Function
Marginal CDF
Marginal PDF
$$f_{X_{1}}(x_{1}) = \idotsint f(x_{1}, x_{2}, \dots, x_{n})\,dx_{2} \cdots dx_{n}$$
Conditional Distribution of a Multivariate Function
Properties
Linearity
If , then
Transformation of Random Vector
Definition
Discrete
Let be a Random Vector with joint pmf and support , and be a vector Bijective transformation
Continuous
Let be a Random Vector with joint pdf and support , be a vector Bijective transformation, and
One-to-One Transformation Case
where is a Jacobian Matrix, and
Many-to-One Transformation Case
Let be a many-to-one transformation, and be a partition of the support such that the transformation restricted to each piece is one-to-one. Then, where is a Jacobian Matrix for each piece, and , and is a union of the pairwise disjoint sets
CDF Technique
If is increasing, then
If is decreasing, then
Conditional Distribution
Conditional Distribution
Definition
where
Conditional Expectation
Definition
Conditional Expectation of a Random Variable
Conditional Expectation of a Function
Conditional Expectation with respect to a Sub-Sigma-Algebra
Consider a Probability Space , a Random Variable , and a sub-Sigma-Algebra . A conditional expectation of given , denoted as is a function which satisfies:
Existence of Conditional Distribution
Define a measure and its restriction where , and . Then, ( is absolutely continuous with respect to ), and by the Radon–Nikodym Theorem there exists a derivative satisfying , which is called the conditional expectation of given .
Conditional Expectation with respect to a Random Variable
Suppose a Probability Space , and random variables and . The conditional expectation of given , denoted as is defined as where is the sigma-field generated by random variable
Facts
If is -measurable, then , where is a sub-Sigma-Field
Since is -measurable, there exists a -measurable function s.t. . By the property of conditional expectation, .
Conditional Variance
Definition
Law of Total Expectation
Definition
Proof
By the definition of conditional expectation, . Since is a sub-Sigma-Field, and hold.
Then, by the definition of Expected Value,
Law of Total Variance
Definition
The Correlation Coefficient
Covariance
Definition
Properties
Covariance with Itself
Covariance of Linear Combinations
Pearson Correlation Coefficient
Definition
Facts
Let be linear in , that is , then the Conditional Expectation and Conditional Variance will be
Independent Random Variable
Independent and Identically Distributed Random Variable
Definition
A Random Vector is independent and identically distributed if each Random Variable has the same Distribution and are mutually independent.
Facts
If is an i.i.d. Random Vector, then
Extension to Several Random Variables
Covariance Matrix
Definition
Variance-Covariance Matrix
Let be an dimensional Random Vector with finite variance , and , then the variance-covariance matrix of is
Cross-Covariance Matrix
Let be dimensional random vectors with finite variance , and , then the cross-covariance matrix of is
Properties
Let be a Random Vector with finite variance , and be a matrix of constants, then
Let and be random vectors. where and are matrices of constants.
Facts
Every variance-covariance matrix is Positive Semi-Definite Matrix
Let be a Random Vector such that no element of is a linear combination of the remaining elements. Then, the Variance-Covariance Matrix of the random vector is a Positive-Definite Matrix.
Some Special Distributions
Uniform Distribution
Definition
where and are the minimum and maximum values.
Properties
CDF
Mean
Variance
The Binomial and Related Distributions
Bernoulli Distribution
Definition
where is a probability of success
The number of successes in a single trial with success probability
Bernoulli Process
The i.i.d. Random Vector with Bernoulli distribution
Properties
Mean
Variance
Binomial Distribution
Definition
where is the length of the Bernoulli process, and is the probability of success
The number of successes in a length- Bernoulli process with success probability
Properties
Mean
Variance
MGF
Facts
Let be independent random variables following the binomial distribution, then
Negative Binomial Distribution
Definition
where is the number of successes, and is the probability of success in each experiment
The number of failures in a Bernoulli process before the -th success
Properties
Mean
Variance
MGF
Geometric Distribution
Definition
where is the probability of success in each experiment
The number of failures before the 1st success
Properties
Mean
Variance
Trinomial Distribution
Definition
where is the number of trials, is a probability of category , and is a probability of category
The distribution that extends the binomial distribution to three outcome categories
Properties
MGF
Marginal PDF
Conditional PDF
Multinomial Distribution
Definition
where is the number of trials, is a probability of category
Distribution that describes the probability of observing a specific combination of outcomes
Properties
where , , and
MGF
Marginal PDF
Each one-variable marginal pdf is Binomial Distribution, each two-variables marginal pdf is Trinomial Distribution, and so on.
Hypergeometric Distribution
Definition
where is the population size, is the number of successes, and is the number of trials
Distribution that describes the probability of successes in draws, without replacement, from a population that contains successes.
The draws-without-replacement version of the Binomial Distribution
Properties
Mean
Variance
Poisson Distribution
Poisson Distribution
Definition
where is the average number of occurrences in a fixed interval of time
The number of occurrences in a fixed interval of time with mean
Properties
where
Mean
Variance
MGF
Summation
Let , and ‘s are independent, then
The Gamma, Chi-Squared, and Beta Distribution
Gamma Function
Definition
where
where
The Gamma function is used as an extension of the factorial function
Properties
Gamma Distribution
Definition
where is the shape parameter, and is the scale parameter
Distribution that models the waiting time until the -th event occurs in a Poisson process with mean
Properties
where is the Gamma Function
Mean
Variance
MGF
where and
Sum of independent Gamma distributions
Let ‘s be independent gamma-distributed random variables, then
Scaling
Let , then,
Let , then,
Relationship with Poisson Distribution
Let be the time until the -th event in a poisson process with rate , then, the cdf and pdf of is
Exponential Distribution
Definition
where is the average number of events per fixed unit of time
Time interval between events in a Poisson process with mean
Properties
CDF
Mean
Variance
Memorylessness Property
Chi-squared Distribution
Definition
where is the degrees of freedom
The sum of squares of independent standard normal random variables
Properties
Mean
Variance
MGF
Additivity
Let ‘s be independent chi-squared random variables, then
Facts
Let and , and is independent of , then
Let , where are quadratic forms in , and each element of is a Random Sample from . If , then
- are independent
Beta Distribution
Definition
where are the shape parameters
Properties
Mean
Variance
Derivation from Gamma Distribution
Let be independent gamma-distributed random variables, then
Dirichlet Distribution
Definition
where is the number of categories, and ‘s are the concentration parameters
A multivariate generalization of the Beta Distribution
Properties
Facts
If , then the Dirichlet distribution reduces to a beta distribution.
The Normal Distribution
Normal Distribution
Definition
where is the location parameter (mean), and is the scale parameter (variance)
Standard Normal Distribution
Properties
Mean
Variance
MGF
Higher Order Moments
Sum of Normally Distributed Random Variables
Let be independent random variables following normal distribution, then
Relationship with Chi-squared Distribution
Let be a standard normal distribution, then
Facts
Let , then
The Multivariate Normal Distribution
Multivariate Normal Distribution
Definition
where is the number of dimensions, is the vector of location parameters, and is the matrix of scale parameters
Standard Multivariate Normal Distribution
MGF
Properties
Mean
Variance
MGF
Affine Transformation
Let be a Random Variable following multivariate normal distribution, be a matrix, and be a dimensional vector, then
Relationship with Chi-squared Distribution
Suppose be a Random Variable following multivariate normal distribution, then
Facts
Let , be an , and be a -dimensional vector, then
Let , , , , where is and is vectors. Then, and are independent
Let , , and , where and are matrices. Then and are independent
t-and F-Distribution
Student's t-Distribution
Definition
Let be a standard normal distribution, be a Chi-squared Distribution, and be independent, then where is the degrees of freedom
Properties
Mean
Variance
F-Distribution
Definition
Let be independent random variables following Chi-squared distributions, then where are the degrees of freedom
Properties
PDF

$$f(x) = \frac{\Gamma\left(\frac{r_{1}+r_{2}}{2}\right)\left(\frac{r_{1}}{r_{2}}\right)^{r_{1}/2}x^{r_{1}/2-1}}{\Gamma\left(\frac{r_{1}}{2}\right)\Gamma\left(\frac{r_{2}}{2}\right)\left(1+\frac{r_{1}}{r_{2}}x\right)^{(r_{1}+r_{2})/2}}, \quad x > 0$$

Mean

$$E(X) = \frac{r_{2}}{r_{2}-2}, \quad r_{2} > 2$$

Variance

$$\operatorname{Var}(X) = \frac{2r_{2}^{2}(r_{1}+r_{2}-2)}{r_{1}(r_{2}-2)^{2}(r_{2}-4)}, \quad r_{2} > 4$$
Student's Theorem
Definition
Let be i.i.d. Random Vector following Normal Distribution, , and , then
- and are independent
Consistency, and Limiting Distributions
Convergence in Probability
Convergence in Probability
Definition
Let be a Sequence of random variables and be a Random Variable, then converges in probability to if , and denoted by
Facts
If the Sequence of random variables converges Almost Surely to , then it also converges in probability to .
Convergence in Probability implies convergence in distribution
Convergence in probability has Linearity
Continuous Mapping Theorem
Definition
Continuous functions preserve convergences (in probability, Almost Surely, or in distribution) of a Sequence of random variables to limits.
Consider a Sequence of random variables defined on same Probability Space, and a Continuous Function on the space. Then,
- Convergence in Probability to a Constant:
- Convergence in Probability:
- Almost Sure Convergence:
- Convergence in Distribution:
Consistency
Definition
A Statistic is called a consistent estimator of if it converges in probability to
Converge in Distribution
Convergence in Distribution
Definition
Let be a Sequence of random variables with CDF , be a Random Variable with cdf , and be a set of every point at which is continuous, then converges in distribution to if , and denoted by . is called the limiting distribution of or asymptotic distribution of
Facts
Convergence in Probability implies convergence in distribution
Continuous Mapping Theorem
Definition
Continuous functions preserve convergences (in probability, Almost Surely, or in distribution) of a Sequence of random variables to limits.
Consider a Sequence of random variables defined on same Probability Space, and a Continuous Function on the space. Then,
- Convergence in Probability to a Constant:
- Convergence in Probability:
- Almost Sure Convergence:
- Convergence in Distribution:
Slutzky Theorem
Definition
Let be Sequence of random variables, be a Random Variable, be constants, then
Let be a sequence of distributions. Then: the convergence of the KL-Divergence to zero implies that the JS-Divergence also converges to zero; the convergence of the JS-Divergence to zero is equivalent to the convergence of the Total Variation Distance to zero; the convergence of the Total Variation Distance to zero implies that the Wasserstein Distance also converges to zero; and the convergence of the Wasserstein Distance to zero is equivalent to the Convergence in Distribution of the sequence.
Boundedness in Probability
Definition
Facts
Let be a Random Vector, and be a Random Variable, then . If , then is bounded in probability.
Big O Notation
Definition
Let be the rate of convergence, and be a Sequence, then
Big O in Probability Notation
Definition
Extension of Big O Notation for Random Variable
Let be the rate of convergence, and be a Sequence of random variables, then
Facts
If , then is said to be bounded in probability
Delta Method
Definition
Univariate Delta Method
Let be a sequence of random variables satisfying , be a differentiable function at , and , then
Proof
By Taylor Series approximation
where by the assumption. By the continuous mapping theorem, , which is a function of , also converges to a random variable, so it is bounded in probability by the property of converging random vectors.
Therefore, by the property of sequences of random variables bounded in probability,
Examples
Estimation of the sample variance of Bernoulli Distribution
Let by CLT
by Delta method
Let , then
Therefore, the sample mean and variance follow such distributions
Visualization (figure omitted): the plot compares, over the x-axis, the variance-by-mean curve, the sample mean, the sample variances computed from the sample mean, and the first-order approximation line at the true mean.
If the sample size is large, the variance of the sample mean is small, so the sample variance can be well approximated by the first-order approximation.
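A simulation sketch of this delta-method example; the choices of p, n, and the number of replications are assumptions:

```python
import numpy as np

# Delta method for g(p) = p(1 - p) with X_i ~ Bernoulli(p):
# sqrt(n)(g(p_hat) - g(p)) -> N(0, (1 - 2p)^2 p(1 - p)).
rng = np.random.default_rng(0)
p, n, reps = 0.3, 2_000, 20_000

p_hat = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hat * (1 - p_hat) - p * (1 - p))

print(z.var(), (1 - 2 * p) ** 2 * p * (1 - p))   # nearly equal
```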
MGF Technique
Definition
Let be a Sequence of random variables with MGF and be a Random Variable with MGF , then
A technique calculating limiting distribution using Moment Generating Function
Facts
Central Limit Theorem
Central Limit Theorem
Definition
Let be i.i.d. Random Sample from a distribution with mean , variance , then
For i.i.d random variables that have finite variance, the sample mean converges to Normal Distribution.
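A minimal simulation of the theorem; the Exp(1) population (so mu = sigma = 1) and the sample sizes are assumed choices:

```python
import numpy as np

# Standardized means of i.i.d. Exp(1) samples are approximately N(0, 1).
rng = np.random.default_rng(0)
n, reps = 200, 20_000

means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (means - 1.0) / 1.0   # (x_bar - mu) / (sigma / sqrt(n))

print(z.mean(), z.var())   # approximately 0 and 1
```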
Asymptotics for Multivariate Distributions
Asymptotics for Multivariate Distributions
Definition
Convergence in Probability
Let be a Sequence of random vectors and be a Random Vector, then converges in probability to if , and denoted by
Let be a Sequence of random vectors and be a Random Vector, then
Convergence in Distribution
Let be a Sequence of random vectors with CDF , be a Random Variable with cdf , and be a set of every point at which is continuous, then converges in distribution to if , and denoted by . is also called limiting distribution of or asymptotic distribution of
Continuous Mapping Theorem
Let be a Continuous Function
MGF Technique
Let be a Sequence of random vectors with MGF and be a Random Vector with MGF , then
Central Limit Theorem
Let be i.i.d. Sequence of random vectors from a distribution with mean , variance-covariance matrix , then
For i.i.d random variables that have finite variance, the sample mean converges to Normal Distribution.
Delta Method
Let be a Sequence of p-dimensional random vectors with , , be a matrix, and , then
Facts
Let , then
Let be a Sequence of p-dimensional random vectors, , be a constant matrix, and be a dimensional constant vector, then
Some Elementary Statistical Inferences
Sample and Statistic
Random Sample
Definition
A realization of i.i.d. Random Vector
Properties
Let be random samples, be constant matrix, and be constant matrix, then the Expected Value and Covariance Matrix of random vectors become
Statistic
Definition
A quantity computed from values in a Random Sample
Bias of an Estimator
Definition
Let be a Random Sample from where is a parameter, and be a Statistic.
is an unbiased estimator. An estimator is unbiased if its bias is equal to zero for all values of the parameter .
Order Statistic
Order Statistic
Definition
Let be Random Sample from a population with a PDF and CDF . And let be the smallest of ‘s, be the 2nd smallest of ‘s , , and be the largest of ‘s, then is called the order statistic of
Properties
Joint PDF
where
Marginal PDF
Joint PDF of and
Suppose
Joint PDF of
Suppose where
Quantiles
Definition
Let be a Random Variable with cdf , and be a -th quantile of
Properties
Estimator of Quantile
Let be order statistics, then we can define an estimator of quantile using the order statistic where
and call it the -th sample quantile
Confidence Intervals of Quantiles
where
Select satisfying the equation
Tolerance Limits for Distributions
Tolerance Limits for Distributions
Definition
Let be Random Sample from a distribution with cdf , and be Order Statistic, then If , then is tolerance limits for of the probability for the distribution of
Properties
Joint PDF
Computation of
where
A probability that the tolerance interval contains of the probability
Introduction to Hypothesis Testing
Hypothesis Testing
Definition
Types of Errors
| | is true | is true |
| --- | --- | --- |
| Not reject | correct | type 2 error () |
| Reject | type 1 error () | correct |

Let the set where is rejected be the rejection region and a statistic used for testing be a test statistic. The size is the maximum probability of a type 1 error. The power of a test of size is , and the probability of a type 2 error is .
p-value
The probability of obtaining test results at least as extreme as the result actually observed, under
Goodness-of-Fit Test for Multinomial Distribution
Definition
Let be a Multinomial Distribution and consider a null hypothesis . Then, the test Statistic, which follows Chi-squared Distribution, is defined as
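A minimal sketch of the test with scipy; the observed counts and the equal-probability null below are made-up illustrative values:

```python
import numpy as np
from scipy import stats

observed = np.array([18, 25, 22, 35])       # category counts
p0 = np.array([0.25, 0.25, 0.25, 0.25])     # H0 probabilities
expected = p0 * observed.sum()

chi2, pvalue = stats.chisquare(observed, f_exp=expected)
print(chi2, pvalue)   # compare to chi-squared with k - 1 = 3 df
```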
Homogeneity Test for Multinomial Distribution
Definition
Let and consider a null hypothesis . Then, the test Statistic, which follows Chi-squared Distribution, is defined as where
Independence Test for Two Discrete Variables
Definition
Let and be categorical variables, and consider a null hypothesis that they are independent. Then, the test Statistic, which follows Chi-squared Distribution, is defined as where , , ,
The Method of Monte Carlo
Inverse Transform Sampling
Definition
We can generate with CDF by using the inverse of the CDF and a uniform random variable
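A minimal sketch for an exponential target, whose inverse CDF is F^{-1}(u) = -ln(1 - u)/rate (the rate value is an assumption):

```python
import numpy as np

# Inverse transform sampling: X = F^{-1}(U) with U ~ Uniform(0, 1).
rng = np.random.default_rng(0)
rate = 2.0

u = rng.uniform(size=100_000)
x = -np.log1p(-u) / rate          # inverse CDF of Exp(rate)

print(x.mean(), 1 / rate)         # sample mean vs theoretical 1/rate
```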
Facts
Let and be continuous CDFs, then
Box-Muller Transformation
Definition
Let , , and , then and
A method that generates pairs of i.i.d. normal random variables from two uniform random variables (or an Exponential Distribution)
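A minimal sketch of the transformation (sample size and seed are arbitrary):

```python
import numpy as np

# Box-Muller: two Uniform(0, 1) draws -> two independent N(0, 1) draws.
rng = np.random.default_rng(0)
u1 = rng.uniform(size=100_000)
u2 = rng.uniform(size=100_000)

r = np.sqrt(-2.0 * np.log(u1))        # radius; -2 ln U1 ~ chi-squared(2)
x = r * np.cos(2.0 * np.pi * u2)
y = r * np.sin(2.0 * np.pi * u2)

print(x.mean(), x.std(), y.mean(), y.std())   # approx 0, 1, 0, 1
```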
Accept Reject Generation Algorithm
Definition
Let be independent random variables, be a PDF from which we want to sample, and .
1. Generate and .
2. If , accept ; otherwise, go to step 1.
3. Repeat steps 1 and 2.
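A minimal sketch of the algorithm; sampling Beta(2, 2) through Uniform(0, 1) proposals with envelope constant M = 1.5 is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1.5                                 # max of the Beta(2, 2) density

def target_pdf(x):
    return 6.0 * x * (1.0 - x)          # Beta(2, 2) density

samples = []
while len(samples) < 10_000:
    y = rng.uniform()                   # step 1: draw Y from proposal g = 1
    u = rng.uniform()
    if u <= target_pdf(y) / M:          # step 2: accept with prob f(Y)/(M g(Y))
        samples.append(y)               # step 3: repeat until enough draws

samples = np.array(samples)
print(samples.mean(), samples.var())    # approx 0.5 and 0.05 for Beta(2, 2)
```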
Bootstrap Procedures
Bootstrapping
Definition
Let be a random variable with , be a Random Sample of , and be a point estimator of . Then, the bootstrap sample is an -dimensional sample vector drawn with replacement from the vector of original samples
Bootstrap Confidence Interval
Draw a large number of bootstrap samples and calculate the confidence interval using them: compute the point estimators from the bootstrap samples, and define order statistics for the estimators. Then, the bootstrap confidence interval for is , where
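A minimal percentile-bootstrap sketch; the data-generating choice, B, and alpha are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=200)   # the original sample
B, alpha = 5_000, 0.05

# Resample with replacement B times and recompute the point estimator.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
print(lo, hi)   # (1 - alpha) percentile bootstrap CI for the mean
```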
Maximum Likelihood Methods
Maximum Likelihood Estimation
Likelihood Function
Definition
Log-Likelihood Function
Maximum Likelihood Estimation
Definition
MLE is the method of estimating the parameters of an assumed Distribution
Let be Random Sample with PDF , where , then the MLE of is estimated as
Regularity Conditions
- R0: The pdfs are distinct, i.e.
- R1: The pdfs have same supports
- R2: The true value is an interior point in
- R3: The pdf is twice differentiable with respect to
- R4:
- R5: The pdf is three times differentiable with respect to , , and interior point
Properties
Functional Invariance
If is the MLE for , then is the MLE of
Consistency
Under the R0 ~ R2 Regularity Conditions, let be the true parameter and assume is differentiable with respect to ; then has a solution such that
Asymptotic Normality
Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , be a consistent Sequence of solutions of MLE equation , and , then where is the Fisher Information.
By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions
Asymptotic Confidence Interval
By the asymptotic normality of MLE, Thus, confidence interval of for is
Delta method for MLE Estimator
Under the R0 ~ R5 Regularity Conditions, let be a continuous function and , then
Facts
Under the R0 and R1 regularity conditions, let be the true parameter, then
Rao-Cramer Lower Bound and Efficiency
Bartlett Identities
Definition
First Bartlett Identity
where is a Likelihood Function and is a Score Function
Second Bartlett Identity
where is a Likelihood Function
Score Function
Definition
The gradient of the log-likelihood function with respect to the parameter vector. The score indicates the steepness of the log-likelihood function
Facts
The score vanishes at a local Extremum
Fisher Information
Definition
Fisher Information
$$\begin{aligned} I(\theta) &:= E\left[\left(\frac{\partial \ln f(X|\theta)}{\partial \theta}\right)^{2}\right] = \int_{\mathbb{R}} \left(\frac{\partial \ln f(x|\theta)}{\partial \theta}\right)^{2} f(x|\theta)\,dx\\ &= -E\left[\frac{\partial^{2} \ln f(X|\theta)}{\partial \theta^{2}}\right] = -\int_{\mathbb{R}} \frac{\partial^{2} \ln f(x|\theta)}{\partial \theta^{2}} f(x|\theta)\,dx \end{aligned}$$

by the second Bartlett identity

$$I(\theta) = \operatorname{Var}\left(\frac{\partial \ln f(X|\theta)}{\partial \theta}\right) = \operatorname{Var}(s(\theta|x))$$

where $s(\theta|x)$ is a Score Function

Fisher Information Matrix

Let $\mathbf{X}$ be a Random Vector with PDF $f(x|\boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \Omega \subset \mathbb{R}^{p}$, then the Fisher information matrix for $\boldsymbol{\theta}$ is a $p \times p$ matrix defined as

$$I(\boldsymbol{\theta}) := \operatorname{Cov}\left(\frac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta})\right) = E\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta})\right)^{\intercal}\right] = -E\left[\frac{\partial^{2}}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}^{\intercal}} \ln f(x|\boldsymbol{\theta})\right]$$

and the $jk$-th element of $I(\boldsymbol{\theta})$ is $I_{jk} = -E\left[\frac{\partial^{2}}{\partial \theta_{j}\,\partial \theta_{k}} \ln f(x|\boldsymbol{\theta})\right]$

Properties

Chain Rule

The information in a length-$n$ Random Sample $X_{1}, X_{2}, \dots, X_{n}$ is $n$ times the information in a single sample $X_{i}$: $I_{\mathbf{X}}(\theta) = n I_{X_{1}}(\theta)$

Facts

In a location model, the information does not depend on the location parameter:

$$I(\theta) = \int_{-\infty}^{\infty}\left(\frac{f'(z)}{f(z)}\right)^{2} f(z)\,dz$$
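A small Monte Carlo check that the two expressions agree; the Bernoulli model with I(p) = 1/(p(1 - p)) is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)              # d/dp ln f(x|p)
hess = -x / p**2 - (1 - x) / (1 - p) ** 2      # d^2/dp^2 ln f(x|p)

# E[score^2], -E[hessian], and the closed form all match.
print(np.mean(score**2), -np.mean(hess), 1 / (p * (1 - p)))
```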
Rao-Cramer Lower Bound
Definition
Let be Random Sample with PDF with the R0 ~ R4 regularity conditions and be a Statistic with , then
Facts
Let be an unbiased estimator of , then . Thus, by the Rao-Cramer Lower Bound,
Efficiency
Definition
Efficient Estimator
where is a Fisher Information
An unbiased estimator is called an efficient estimator if its variance is the Rao-Cramer Lower Bound
Efficiency
Let be an unbiased estimator, then the efficiency of , is
Asymptotic Efficiency
Let be i.i.d. random variables with PDF , and be an estimator satisfying , then
- the asymptotic efficiency of is
- If , then is called asymptotically efficient
- Assume that another estimator such that Then, the asymptotic relative efficiency (ARE) of to is
Examples
In a Laplace distribution, , so the sample median is times more efficient than the sample mean as an estimator of the location parameter.
In a Normal distribution, , so the sample mean is times more efficient than the sample median as an estimator of the location parameter.
Newton's Method
Definition
An iterative algorithm for finding the roots of a differentiable function, which are solutions to the equation
Algorithm
Find the next point such that the first-order Taylor approximation at the given point is 0. First-order Taylor approximation: . The point where the approximation is 0:
multivariate version:
In convex optimization,
Find the minimum point (where the derivative is 0) of the Taylor quadratic approximation. Taylor quadratic approximation: . The derivative of the quadratic approximation: . The minimum point of the quadratic approximation: . Multivariate version:
Examples
Solution of a linear system
Solve with an MSE loss
The cost function is and its gradient and hessian are ,
Then, the solution is . If is invertible, is the Least Squares solution.
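A minimal sketch of the iteration; the target equation x^2 - 2 = 0 and the starting point are assumptions:

```python
import numpy as np

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0), np.sqrt(2))
```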
Fisher's Scoring Method
Definition
Fisher’s scoring method is a variation of Newton–Raphson method that uses Fisher Information instead of the Hessian Matrix
where is the Fisher Information.
Maximum Likelihood Tests
Likelihood Ratio Test
Definition
Let be Random Sample with PDF where , , where . is called a likelihood ratio, where are the MLEs under and respectively. If is true, then will be close to , while if is not true, then will be close to .
Therefore, the LRT rejects when , where is determined by the level alpha condition.
By the Wilks’ Theorem, the likelihood ratio test Statistic for the hypothesis is given by
Visualization
The comparison of the Likelihood Ratio Test, Wald Test, and Score Test
Wilks' Theorem
Definition
Single Parameter
Under the R0 ~ R5 regularity conditions, let , where is the likelihood ratio and is the MLE of
Multiparameter
Under the R0 ~ R5 regularity conditions, let , where is the likelihood ratio and is the MLE of without constraints.
Wald Test
Definition
The Wald test Statistic for the hypothesis is given by
Score Test
Definition
The score test Statistic for the hypothesis is given by
Multiparameter Case
Multiparametric Maximum Likelihood Estimation
Definition
Multiparameter MLE
Let be Random Sample with PDF , where , then the MLE of is estimated as
Properties
Multiparameter Asymptotic Normality
Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , then
- has solution such that
Delta Method for Multiparameter MLE
Multiparameter Rao-Cramer Lower Bound
Let be Random Sample with PDF , be an unbiased estimator of , then . Thus, by a Rao-Cramer Lower Bound, where is -th diagonal element of
If holds, then is called an efficient estimator of
In a location-scale model, the information of the parameters does not depend on the location parameter.
Multiparametric Likelihood Ratio Test
Definition
Let be Random Sample from PDF , where , be the number of parameters, be the number of constraints (restrictions) , where is differentiable function , where the dimension of is .
Now, consider a test statistic to test the hypothesis . If is true, then will be close to , while if is not true, then will be close to . Therefore, the LRT rejects when , where is determined by the level alpha condition.
By the Wilks’ Theorem, the likelihood ratio test Statistic for the hypothesis is given by
EM Algorithm
Expectation-Maximization Algorithm
Definition
Let be the observed data, be an unobserved (latent) variable, be independent, be a joint pdf of , be a joint pdf of , and be a conditional pdf of given
By the definition of a conditional pdf, we have the identity
The goal of the EM algorithm is maximizing the observed likelihood using the complete likelihood .
Using the definition of the conditional pdf, we derive the following identity for an arbitrary but fixed
$$\begin{aligned} \ln L(\boldsymbol{\theta}|\mathbf{X}) &= \int \ln[h(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})]\, k(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}_{0})\, d\mathbf{Z} - \int \ln[k(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})]\, k(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}_{0})\, d\mathbf{Z}\\ &= E_{\boldsymbol{\theta}_{0}}[\ln L^{c}(\boldsymbol{\theta}|\mathbf{X}, \mathbf{Z})\,|\,\boldsymbol{\theta}_{0}, \mathbf{X}] - E_{\boldsymbol{\theta}_{0}}[\ln k(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})\,|\,\boldsymbol{\theta}_{0}, \mathbf{X}] \end{aligned}$$

Let the first term of the RHS be a quasi-likelihood function

$$Q(\boldsymbol{\theta}|\boldsymbol{\theta}_{0}, \mathbf{X}) := E_{\boldsymbol{\theta}_{0}}[\ln L^{c}(\boldsymbol{\theta}|\mathbf{X}, \mathbf{Z})\,|\,\boldsymbol{\theta}_{0}, \mathbf{X}]$$

The EM algorithm maximizes $Q(\boldsymbol{\theta}|\boldsymbol{\theta}_{0}, \mathbf{X})$ instead of maximizing $\ln L(\boldsymbol{\theta}|\mathbf{X})$

Algorithm

1. Expectation Step: Compute $$Q(\boldsymbol{\theta}|\hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X}) := E_{\hat{\boldsymbol{\theta}}^{(m)}}[\ln L^{c}(\boldsymbol{\theta}|\mathbf{X}, \mathbf{Z})\,|\,\hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X}]$$ where $m = 0, 1, \dots$, and the expectation is taken under the conditional pdf $k(\mathbf{Z}|\mathbf{X}, \hat{\boldsymbol{\theta}}^{(m)})$
2. Maximization Step: $$\hat{\boldsymbol{\theta}}^{(m+1)} = \underset{\boldsymbol{\theta}}{\operatorname{arg\,max}}\; Q(\boldsymbol{\theta}|\hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X})$$

Properties

Convergence

The Sequence of estimates $\hat{\boldsymbol{\theta}}^{(m)}$ satisfies $$L(\hat{\boldsymbol{\theta}}^{(m+1)}|\mathbf{X}) \geq L(\hat{\boldsymbol{\theta}}^{(m)}|\mathbf{X})$$ Therefore the Sequence of EM estimates converges to an (at least local) optimum
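A minimal EM sketch for a two-component Gaussian mixture with unit variances; the starting values and simulated data are assumptions, and the normalizing constants cancel in the E-step ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

pi, mu1, mu2 = 0.5, -1.0, 1.0              # initial parameter guesses
for _ in range(100):
    # E-step: responsibility of component 1 for each observation.
    d1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
    d2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
    w = d1 / (d1 + d2)
    # M-step: maximize the expected complete log-likelihood Q.
    pi = w.mean()
    mu1 = np.sum(w * x) / np.sum(w)
    mu2 = np.sum((1 - w) * x) / np.sum(1 - w)

print(pi, mu1, mu2)   # approaches 0.3, -2, 2
```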
Sufficiency
Decision Function
Definition
Let be Random Sample with PDF , where , be a Statistic for the parameter , and be a function of the observed value of the statistic . Then the function is called a decision function or a decision rule.
Loss Function
Definition
Let be a parameter, be a Statistic for the parameter , and be a Decision Function
A loss function is a non-negative function defined as
It indicates the difference or discrepancy between and
Examples
- Absolute Error Loss
- Squared Error Loss
- Sum of Squared Errors Loss
- Cross-Entropy Loss
- Goal Post Error Loss
- Huber Loss
- Binary Loss
- Triplet Loss
- Pairwise Loss
Risk Function
Definition
The risk function is the expectation of the Loss Function
Minimum Variance Unbiased Estimator
Definition
An estimator satisfying the following is the minimum variance unbiased estimator (MVUE) for where is an unbiased estimator
An Unbiased Estimator that has lower variance than any other unbiased estimator for the parameter
Facts
A minimum variance unbiased estimator does not always exist.
If some unbiased estimator’s variance is equal to the Rao-Cramer Lower Bound for all , then it is a minimum variance unbiased estimator.
Minimax Estimator
Definition
An estimator satisfying the following is the minimax estimator of
A Sufficient Statistic for a Parameter
Sufficient Statistic
Definition
Let be a Random Sample with PDF , where , and be a Statistic with PDF . is a sufficient statistic for if and only if
No other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter.
Facts
Any one-to-one transformation of a sufficient statistic is also a sufficient statistic
The MLE is a function of a sufficient statistic
Let be a Random Sample with PDF , where , be a Statistic with PDF , and be a unique MLE of , then is a function of
Neyman's Factorization Theorem
Definition
Let be a Random Sample with PDF , where , then The statistic is a Sufficient Statistic for if and only if where are non-negative functions, and does not depend on
Rao-Blackwell Theorem
Definition
Let be a Random Sample with PDF , where , be a Sufficient Statistic for , and be an Unbiased Estimator for . Define , then is an Unbiased Estimator, and its variance is no larger than that of
Completeness and Uniqueness
Complete Statistic
Definition
Let be a Random Variable with PDF , where . A complete statistic for satisfies the condition . Then is called a complete family.
Uniformly Minimum Variance Unbiased Estimator
Definition
where is an unbiased estimator, and is the parameter.
An Unbiased Estimator that has lower variance than any other unbiased estimator for all possible values of the parameter
An MVUE for all possible values of the parameter
Lehmann-Scheffe Theorem
Definition
Let be a Random Sample with PDF , where , be a complete sufficient Statistic for , then If , then is the unique UMVUE of
Facts
Let be a complete sufficient Statistic for , and be an Unbiased Estimator of
- , then is a unique UMVUE of
- By the Rao-Blackwell Theorem, is a unique UMVUE of
The Exponential Class of Distribution
Regular Exponential Class
Definition
A PDF of form is said to be a member of the regular exponential class if
- , support of , does not depend on
- is a non-trivial (non-constant) continuous function of
- If is a continuous random variable, then and is a continuous function of
- If is a discrete random variable, then is non-trivial function of
Properties
Joint PDF
Let be a Random Sample from a regular exponential class, then the joint pdf of is
Monotone Likelihood Ratio
Let be a Random Sample from a regular exponential class. If is a monotone function of , then the likelihood ratio has the Monotone Likelihood Ratio property in
Facts
Let be a Random Sample from a regular exponential class, and be a Statistic, then
- The pdf of is where is a function of only.
- is a complete sufficient Statistic
Function of a Parameter
Multiparametric Sufficiency
Definition
Joint Sufficient Statistic
Let be a Random Sample with PDF , where , , where , be a Statistic, and be a pdf of . is jointly sufficient for if and only if where does not depend on
Completeness
Let be a family of pdfs of random variables If , then the family of pdfs is called complete family, and are complete statistics for
m-Dimensional Regular Exponential Class
Let be a Random Variable with PDF , where , and be a support of the . If can be written as then, it is called -dimensional exponential class. Further, it is called a regular case if the following are satisfied
- , support of , does not depend on
- contains m-dimensional open rectangle
- are functionally independent and continuous function of
- If is a continuous random variable, then and are continuous functions
- If is a discrete random variable, then are non-trivial function of
Let be a Random Sample from a m-dimensional regular exponential class, then the joint pdf of is
, where , is a joint complete sufficient Statistic for
the joint pdf of is , where does not depend on
k-Dimensional Random Vector with p-Dimensional Parameters
Exponential Class
Let be a -dimensional random vector with pdf , where
Minimum Sufficiency and Ancillary Statistic
Minimal Sufficient Statistic
Definition
A Sufficient Statistic is called minimal sufficient statistic (MSS) if it has the minimum dimension among all sufficient statistics
Facts
A complete sufficient Statistic is a minimal sufficient statistic
Ancillary Statistic
Definition
Let be a Random Sample with PDF , where . is called an ancillary statistic for if its distribution does not depend on
Location Parameter
Definition
Let , where , be a Random Sample from a pdf . If , then is called a location parameter
Location Invariant Statistic
Definition
Let be a Random Sample from a location model, and be a statistic such that , then . So, the distribution of does not depend on . In this case, is an Ancillary Statistic for , and it is called a location-invariant statistic.
Scale Parameter
Definition
Let , where , be a Random Sample from a pdf . If , then is called a scale parameter
Scale-Invariant Statistic
Definition
Let be a Random Sample from a scale model, and be a statistic such that , then . So, the distribution of does not depend on . In this case, is an Ancillary Statistic for , and it is called a scale-invariant statistic.
Location-Scale Family
Definition
Let , where , be a Random Sample from a pdf . If , then is called a location parameter and is called a scale parameter
Location- and Scale-Invariant Statistic
Definition
Let be a Random Sample from a location-scale model, and be a statistic such that , then . So, the distribution of does not depend on . In this case, is an Ancillary Statistic for , and it is called a location- and scale-invariant statistic.
Sufficiency, Completeness, and Independence
Basu's Theorem
Definition
Let be a Random Sample with PDF , where , be a complete sufficient Statistic for , and be an Ancillary Statistic for , then and are independent.
Optimal Tests of Hypothesis
Most Powerful Test
Best Critical Region
Definition
Let be a subset of the sample space. is a best critical region of size for testing the simple hypotheses if
- (level alpha condition)
- (most powerful)
Most Powerful Test
Definition
A test with the Best Critical Region is called a most powerful (MP) test
Neyman-Pearson Theorem
Definition
The Neyman-Pearson theorem states that the Most Powerful Test for choosing between two simple hypotheses is based on the likelihood ratio.
Let be random samples with PDF , where , then the Likelihood Function of is defined as . Consider the simple null hypothesis against the simple alternative hypothesis , and define the likelihood ratio test statistic.
Let , and be a subset of the sample space such that
- where
Then, is a Best Critical Region of size for the hypothesis test.
Uniformly Most Powerful Tests
Uniformly Most Powerful Critical Region
Definition
If is a Best Critical Region of size for every in , then is called a uniformly most powerful critical region
Uniformly Most Powerful Test
Definition
A test defined by Uniformly Most Powerful Critical Region is called a uniformly most powerful (UMP) test
Simple vs Composite
Consider a simple hypothesis and a composite hypothesis . Fix some satisfying , obtain a most powerful test using the Neyman-Pearson Theorem, and generalize it to all satisfying the condition
Composite vs Composite
Monotone Likelihood Ratio
Definition
has the monotone likelihood ratio (MLR) property in if is a monotone function of
Karlin-Rubin Theorem
Definition
Consider the hypothesis and , and assume that the likelihood ratio has the MLR property in some Sufficient Statistic . If the likelihood ratio is decreasing (increasing) in the statistic, then the Uniformly Most Powerful Test of level for the hypothesis rejects if , where is determined by
The Sequential Probability Ratio Test
Sequential Probability Ratio Test
Definition
Let be a Random Sample with PDF , where , and be the Likelihood Function. Consider a simple test . We observe the likelihood ratio sequentially, i.e. , and reject if and only if , where , and do not reject if and only if , where
i.e. we continue to observe as long as , and stop otherwise.
Choice of and
Let and be the probabilities of type 1 and type 2 errors, respectively, then . Hence, . Therefore, . By this inequality, we choose
Minimax and Classification Procedures
Minimax Procedures
Definition
We want to test . Let be a Loss Function such that and , be a rejection region, be an acceptance region, and be a Risk Function. Then, the minimax procedure is to find such that is minimized.
The minimax solution is the region , where satisfies
Inferences About Normal Linear Models
Quadratic Form
Definition
where is a Symmetric Matrix
A mapping where is a Module on Commutative Ring that has the following properties.
Matrix Expressions
Facts
Let be a Random Vector and be a symmetric matrix of constants. If and , then the expectation of the quadratic form is
Let be a Random Vector and be a symmetric matrix of constants. If and , , and , where , then the variance of the quadratic form is where is the column vector of diagonal elements of .
If and ‘s are independent, then If and ‘s are independent, then
Let , , where is a Symmetric Matrix and , then the MGF of is where ‘s are non-zero eigenvalue of
Let , where is Positive-Definite Matrix, then
Let , , where is a Symmetric Matrix and , then where
Let , , where are symmetric matrices, then are independent if and only if
Let , where are quadratic forms in Random Sample from If and is non-negative, then
- are independent
Let , , where , where , then
One-way ANOVA
One-way ANOVA
One-way ANOVA is used to analyze the significance of differences of means between groups. Let the -th response of the -th group be , where
We want to test the null hypothesis , i.e. that there is no treatment effect, where is the total number of observations, represents the mean of the -th group, represents the overall mean, the numerator indicates the between-group variance (SSB), and the denominator indicates the within-group variance (SSW)
The Likelihood Ratio Test rejects if
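A minimal sketch using scipy's built-in F test; the three simulated groups and their mean shifts are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, size=30)
g2 = rng.normal(0.5, 1.0, size=30)
g3 = rng.normal(1.0, 1.0, size=30)

f_stat, pvalue = stats.f_oneway(g1, g2, g3)   # F = MSB / MSW
print(f_stat, pvalue)    # small p-value -> reject equal group means
```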
Noncentral Chi-squared and F distributions
Noncentral Chi-squared Distribution
Definition
Let be independent normal random variables , then where is the degrees of freedom, and is the non-centrality parameter.
Properties
MGF
Noncentral F-Distribution
Definition
Let be independent random variables following the Noncentral Chi-squared Distribution and the Chi-squared Distribution, respectively, then where are the degrees of freedom and is the non-centrality parameter.
Properties
Multiple Comparisons
Multiple Comparisons
Definition
Assume that is rejected and we are interested in the confidence interval for some linear combination of parameters
Let ‘s be random samples from We want to compute the confidence interval of , where
Type 1 Error
Consider the multiple testing problem , and let be the event of rejecting even though is true (type 1 error). If a test satisfies for level , then the test is said to be controlled at . Under independence of the tests, the probability of a type 1 error in multiple testing is . Thus, the probability of a type 1 error approaches as increases.
Fixed Constants Case
The ‘s are fixed constants. The confidence interval for each is where
Scheffe’s Simultaneous Confidence Interval
The ‘s are allowed to have any real numbers. The simultaneous confidence interval for all is where
The Analysis of Variance
Two-way ANOVA
Definition
Two-way ANOVA Without Replications
Let ‘s be Random Sample from . The can be decomposed as sum of means (global mean -th level effect of factor -th level effect of factor ). In this setup, we assume that
We want to test . where , and
The Likelihood Ratio Test rejects if
Under , the follows Noncentral F-Distribution where
Therefore, the power of the test is
Two-way ANOVA With Equal Replications
Let ‘s be Random Sample from . The can be decomposed as sum of means (global mean + -th level effect of factor + -th level effect of factor + interaction effect of -th level of factor and the -th level of factor ). In this setup, we assume that
We want to test . where , , and
The Likelihood Ratio Test rejects if
Under , the follows Noncentral F-Distribution where
Therefore, the power of the test is
If is not rejected, then we continue to test or
Two-way ANOVA with a Regression Model
where and are dummy variables representing categories of the two factors, is the number of categories for the first factor, and is the number of categories for the second factor
In this setting, a corner-point constraint is used for both factors.
The null hypotheses can be tested by Deviance
Deviance Test for Two-way ANOVA
For a two-way ANOVA with factors and , we have three null hypotheses to test:
- i.e. there is no treatment effect of factor
- i.e. there is no treatment effect of factor
- i.e. there is no interaction effect between factor and They can be tested with the Deviance.
If is known, the test statistic for is defined as And reject the if
If is unknown, the test statistic for is defined as And reject the if
If is known, the test statistic for is defined as And reject the if
If is unknown, the test statistic for is defined as And reject the if
If is known, the test statistic for is defined as And reject the if
If is unknown, the test statistic for is defined as And reject the if
Nonparametric Statistics
Functional (Statistics)
Definition
is called a functional if it is a function of the CDF or PDF
Let be Random Sample from a cdf and be functional, then the empirical Distribution Function of at is , and is called an induced estimator of .
Location Model
Definition
Location Functional
Let be continuous Random Sample with CDF . If satisfies the following, then it is called location functional
Location Model
Let be a location functional, then is said to follow a location model with if
Facts
Mean functional is a locational functional
Let be a Random Variable with CDF and PDF which is symmetric about . If is a location functional, then
Sample Median and Sign Test
Sign Test
Definition
Let , where the ‘s are i.i.d. with CDF and PDF with median , i.e. the median of the ‘s is .
We want to test , which is called the sign test. Consider the statistic , which is called the sign statistic. We reject if .
Distribution of Sign Statistic
under . Therefore, . Also, by CLT,
Consider a sequence of local alternatives The efficacy of the sign statistic is
Composite Sign Test
Consider a composite hypothesis Since the power function is non-decreasing, the sign test of size is
Facts
Consider a test and the test statistic , then Also, the power function of the test is a non-decreasing function of
Consider a sequence of hypothesis , where is called a local alternative, then where is a CDF of standard normal distribution, and
Let be random variables from a location model, where the ‘s are i.i.d. with CDF , PDF , and median 0, and let be the sample median of , then where
Efficacy of Sign Test
Definition
Let be a test statistic and , then the efficacy of is defined as where can be interpreted as the rate of change of the mean of
Asymptotic Relative Efficiency
Definition
Let be two test statistics, then the asymptotic relative efficiency of to is the ratio between those efficacies
Signed-Rank Wilcoxon
Signed-Rank Wilcoxon Test
Definition
Let be Random Sample from cdf , symmetric pdf , and median . Consider a Sign Test
The signed-rank Wilcoxon test statistic for the test is defined as where is the rank of , among
We reject if , where is determined by the level
takes one of . So, we can rewrite the statistic as where is an observation such that
Let be the sum of the ranks of the positive ‘s, then . So,
can be written as where are called the Walsh averages
In general, if the median of is , then
Consider a sequence of local alternatives The efficacy of the modified signed-rank Wilcoxon test statistic is
Facts
Assume that is symmetric, then under , are independent of
Let be Random Sample from cdf , symmetric pdf , and median , and , then under
- is distribution free
Consider a sequence of hypothesis , then where is a CDF of standard normal distribution, and
The estimator , which can be obtained as a solution of the equation , is called the Hodges-Lehmann estimator. The Hodges-Lehmann estimator estimates the median of the Walsh averages .
Let be random variables from a location model, where the ‘s are i.i.d. with CDF , symmetric PDF , and median 0, then where
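A minimal sketch of the test via scipy; the Laplace sample with a shift of 0.3 is an assumed example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.laplace(loc=0.3, scale=1.0, size=60)   # symmetric about 0.3

w_stat, pvalue = stats.wilcoxon(x)   # signed-rank test of H0: median = 0
print(w_stat, pvalue)
```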
Mann-Whitney-Wilcoxon Procedure
Mann-Whitney-Wilcoxon Procedure
Definition
Let be a Random Sample from a distribution with cdf , and be a Random Sample from a distribution with cdf . Consider a test
The Mann-Whitney-Wilcoxon test statistic for the test is defined as where is the rank of in the combined (pooled) sample of size
We reject when , where is defined by the level
Therefore, the Mann-Whitney-Wilcoxon test statistic is decomposed as where
So, We reject when
Also, the power function of the test is defined as
Consider a sequence of local alternatives , and assume that , where . The efficacy of the modified Mann-Whitney-Wilcoxon test statistic is
Facts
Let be Random Sample from a distribution with cdf , be Random Sample from a distribution with cdf , and then under
- is distribution free
Consider a sequence of hypotheses , then where is the CDF of the standard normal distribution, and
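A minimal sketch of the procedure via scipy; the two normal samples with a location shift of 0.5 are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=40)
y = rng.normal(0.5, 1.0, size=50)

u_stat, pvalue = stats.mannwhitneyu(x, y, alternative="two-sided")
print(u_stat, pvalue)   # small p-value -> evidence of a location shift
```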
Measures of Association
Kendall's Tau
Definition
Let be a Random Sample from a bivariate distribution with cdf . We want to test : are independent
Let and be independent pairs with the same bivariate distribution. If , then we say the pairs are concordant. If , then the pairs are discordant.
Kendall’s is defined as
is bounded by . If are independent, then
Consider another statistic
, so it is an unbiased estimator of
Facts
Let be a Random Sample from a bivariate distribution with cdf , and , then under : are independent
- is distribution free with a symmetric pdf
Spearman's Rho
Definition
Let be a Random Sample from a bivariate distribution with cdf
Instead of the sample Pearson Correlation Coefficient, we define another statistic using the ranks of the samples, which is called Spearman’s
is bounded by . If are independent, then
Facts
Let be a Random Sample from a bivariate distribution with cdf , and , then under : are independent
- is distribution free with a symmetric pdf
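A minimal sketch computing both association measures; the simulated linear dependence is an assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)   # dependent bivariate sample

tau, p_tau = stats.kendalltau(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(tau, p_tau)   # Kendall's tau and its p-value
print(rho, p_rho)   # Spearman's rho and its p-value
```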