Matrix Theory
Matrix Theory Note
Basic Theories
Matrix
Definition
Operations
Addition, Subtraction
Scalar Multiplication
Transpose
Matrix Multiplication
$(\mathbf{A}\mathbf{B})_{ij} = \sum_{k} a_{ik}b_{kj}$, where $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{B}$ is an $n \times p$ matrix
Vector-Matrix Multiplication
Link to original
Trace
Definition
The sum of the elements on the main diagonal: $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}$
Facts
$\operatorname{tr}(\mathbf{ABC}) = \operatorname{tr}(\mathbf{BCA}) = \operatorname{tr}(\mathbf{CAB})$^[Cyclic property]
The Trace of the matrix is equal to the sum of the eigenvalues of the matrix.
Proof Since only the product of the diagonal terms of $\lambda\mathbf{I} - \mathbf{A}$ contributes to the $\lambda^{n-1}$ term (every cofactor expansion that uses an off-diagonal entry deletes a row and a column, so it contains at most $n-2$ diagonal factors), the coefficient of $\lambda^{n-1}$ in $\det(\lambda\mathbf{I} - \mathbf{A})$ is $-\sum_{i} a_{ii} = -\operatorname{tr}(\mathbf{A})$. Also, since $\lambda_{1}, \dots, \lambda_{n}$ are the solutions of the Characteristic Polynomial, the expression is factorized as $\prod_{i}(\lambda - \lambda_{i})$ and the coefficient of $\lambda^{n-1}$ becomes $-\sum_{i}\lambda_{i}$. Therefore, $\operatorname{tr}(\mathbf{A}) = \sum_{i}\lambda_{i}$.
Determinant
Definition
Computation
$2 \times 2$ matrix: $\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$
$3 \times 3$ matrix: $\det(\mathbf{A}) = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$
Laplace Expansion
Definition
Let $\mathbf{C} = (C_{ij})$ be the cofactor matrix of $\mathbf{A}$
Along the $j$-th column gives $\det(\mathbf{A}) = \sum_{i=1}^{n} a_{ij}C_{ij}$
Along the $i$-th row gives $\det(\mathbf{A}) = \sum_{j=1}^{n} a_{ij}C_{ij}$
Link to original
Facts
The volume of box^[parallelepiped] is expressed by a determinant
$\det(\mathbf{A})$ changes sign when two rows are exchanged (permuted)
$\det(\mathbf{A})$ depends linearly on the 1st row
If two row or column vectors in $\mathbf{A}$ are equal, then $\det(\mathbf{A}) = 0$
Gauss Elimination does not change $\det(\mathbf{A})$
If a matrix has a zero row or column vector, then $\det(\mathbf{A}) = 0$
If a matrix $\mathbf{A}$ is diagonal or triangular, then the determinant of $\mathbf{A}$ is the product of the diagonal elements, $\det(\mathbf{A}) = \prod_{i} a_{ii}$, where $a_{ii}$ is the $i$-th diagonal element of the matrix $\mathbf{A}$
The Determinant of the matrix is equal to the product of the eigenvalues of the matrix.
where the dimensions of the matrices are and respectively.
Link to original
Idempotent Matrix
Definition
A matrix whose square is itself: $\mathbf{A}^{2} = \mathbf{A}$
Facts
Eigenvalues of an idempotent matrix are either $0$ or $1$.
Let $\mathbf{A}$ be a symmetric idempotent matrix with $\operatorname{rank}(\mathbf{A}) = r$. Then, $\mathbf{A}$ has $r$ eigenvalues equal to $1$ and the rest zero.
Let $\mathbf{A}$ be an idempotent matrix, then $\operatorname{rank}(\mathbf{A}) = \operatorname{tr}(\mathbf{A})$
If $\mathbf{A}$ is idempotent, then $\mathbf{I} - \mathbf{A}$ is also idempotent.
Inverse Matrix
Rank of Matrix
Definition
The rank of matrix is the number of linearly independent column vectors, or the number of non-zero pivots in Gauss Elimination
Facts
For non-singular matrices $\mathbf{P}$ and $\mathbf{Q}$, $\operatorname{rank}(\mathbf{PAQ}) = \operatorname{rank}(\mathbf{A})$
If a matrix $\mathbf{A}$ is a Symmetric Matrix, then $\operatorname{rank}(\mathbf{A})$ is the number of non-zero eigenvalues.
Inverse Matrix
Definition
$\mathbf{A}^{-1}$ satisfies $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ for a given square matrix $\mathbf{A}$
Computation
$2 \times 2$ matrix
$\mathbf{A}^{-1} = \cfrac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$ where the matrix $\mathbf{A} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, and $ad - bc$ is a Determinant of the matrix $\mathbf{A}$
Using Cofactor
$\mathbf{A}^{-1} = \cfrac{1}{\det(\mathbf{A})}\mathbf{C}^{\intercal}$, where the matrix $\mathbf{C}$ is the cofactor matrix, which is formed by all the cofactors of a given matrix $\mathbf{A}$. Then, by the definition of inverse matrix, $\mathbf{A}\mathbf{C}^{\intercal} = \det(\mathbf{A})\mathbf{I}$
Sherman–Morrison Formula
Definition
Suppose $\mathbf{A}$ is an invertible $n \times n$ matrix and $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{n}$ are column vectors. Then, $\mathbf{A} + \mathbf{u}\mathbf{v}^{\intercal}$ is invertible if and only if $1 + \mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u} \neq 0$. In this case,
$(\mathbf{A} + \mathbf{u}\mathbf{v}^{\intercal})^{-1} = \mathbf{A}^{-1} - \cfrac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^{\intercal}\mathbf{A}^{-1}}{1 + \mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u}}$
Proof
Multiplying $\mathbf{A} + \mathbf{u}\mathbf{v}^{\intercal}$ to the RHS gives
$(\mathbf{A} + \mathbf{u}\mathbf{v}^{\intercal})\left( \mathbf{A}^{-1} - \cfrac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^{\intercal}\mathbf{A}^{-1}}{1 + \mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u}} \right) = \mathbf{I} + \mathbf{u}\mathbf{v}^{\intercal}\mathbf{A}^{-1} - \cfrac{\mathbf{u}\mathbf{v}^{\intercal}\mathbf{A}^{-1} + \mathbf{u}(\mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u})\mathbf{v}^{\intercal}\mathbf{A}^{-1}}{1 + \mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u}}$
Since $\mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u}$ is a scalar, the numerator of the last term can be expressed as $(1 + \mathbf{v}^{\intercal}\mathbf{A}^{-1}\mathbf{u})\,\mathbf{u}\mathbf{v}^{\intercal}\mathbf{A}^{-1}$
So, the last two terms cancel and the product equals $\mathbf{I}$.
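A quick numerical check of the formula (a minimal numpy sketch with arbitrary illustrative values, not part of the original note):

```python
import numpy as np

# Verify the Sherman-Morrison formula on a random well-conditioned matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)    # invertible A
u = rng.normal(size=(4, 1))
v = rng.normal(size=(4, 1))

A_inv = np.linalg.inv(A)
denom = 1.0 + (v.T @ A_inv @ u).item()         # must be non-zero
sm_inv = A_inv - (A_inv @ u @ v.T @ A_inv) / denom

print(np.allclose(sm_inv, np.linalg.inv(A + u @ v.T)))  # True
```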
Examples
Updating Fitted Least Square Estimator
Sherman–Morrison formula can be used for updating fitted Least Square estimator
Let be the least square estimator of and the matrices with a new data be and Then,
Let for convenience Then, by Sherman-Morrison formula
So, where by expansion
So, can be obtained without additional inverse matrix calculation.
Facts
Sherman–Morrison formula is a special case of the Woodbury Formula
Matrix Inversion Lemma
Definition
$(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}\mathbf{A}^{-1}$ where the sizes of the matrices are: $\mathbf{A}$ is $n \times n$, $\mathbf{U}$ is $n \times k$, $\mathbf{C}$ is $k \times k$, and $\mathbf{V}$ is $k \times n$; $\mathbf{A}$ and $\mathbf{C}$ should be invertible.
The inverse of a rank-k correction of some matrix can be computed by doing a rank-k correction to the inverse of the original matrix
Proof
Start by the matrix
The decomposition of the original matrix become Inverting both sides gives^[Diagonalizable Matrix]
The decomposition of the original matrix become Again inverting both sides,
The first-row first-column entry of RHS of (1) and (2) above gives the Woodbury formula
Moore-Penrose Inverse
Definition
For a matrix $\mathbf{A}$, the Moore-Penrose inverse of $\mathbf{A}$ is the matrix $\mathbf{A}^{+}$ satisfying the following conditions
- $\mathbf{A}\mathbf{A}^{+}\mathbf{A} = \mathbf{A}$
- $\mathbf{A}^{+}\mathbf{A}\mathbf{A}^{+} = \mathbf{A}^{+}$
- $(\mathbf{A}\mathbf{A}^{+})^{*} = \mathbf{A}\mathbf{A}^{+}$
- $(\mathbf{A}^{+}\mathbf{A})^{*} = \mathbf{A}^{+}\mathbf{A}$
Facts
The pseudoinverse is defined and unique for all matrices whose entries are real or complex numbers.
If the matrix $\mathbf{A}$ is a square matrix, then $\mathbf{A}^{+}$ is also a square matrix, and $\mathbf{A}^{+} = \mathbf{A}^{-1}$ when $\mathbf{A}$ is invertible.
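A minimal numpy sketch (illustrative matrix, not from the note) checking the four Penrose conditions with `np.linalg.pinv`:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 x 2, rank 2
A_plus = np.linalg.pinv(A)                           # 2 x 3 pseudoinverse

print(np.allclose(A @ A_plus @ A, A))            # A A+ A = A
print(np.allclose(A_plus @ A @ A_plus, A_plus))  # A+ A A+ = A+
print(np.allclose((A @ A_plus).T, A @ A_plus))   # A A+ is symmetric
print(np.allclose((A_plus @ A).T, A_plus @ A))   # A+ A is symmetric
```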
Footnotes
Partitioned Matrix
Block Matrix
Definition
A matrix that is interpreted as having been broken into sections called blocks or sub-matrices
Operations
Addition, Subtraction
Scalar Multiplication
Matrix Multiplication
Determinant
Let or be invertible matrices
Eigenvalues and Eigenvectors
Eigendecomposition
Definition
$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ where $\mathbf{A}$ is an $n \times n$ matrix, $\lambda$ is a scalar, and $\mathbf{v} \neq \mathbf{0}$
If $\mathbf{A}\mathbf{v}$ is a scalar multiple of the non-zero vector $\mathbf{v}$, then $\lambda$ is the eigenvalue and $\mathbf{v}$ is the eigenvector.
Characteristic Polynomial
The values $\lambda$ satisfying the characteristic polynomial $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$ are the eigenvalues of the matrix $\mathbf{A}$
Eigenspace
The set of all eigenvectors of corresponding to the same eigenvalue, together with the zero vector.
The Kernel of the matrix $\mathbf{A} - \lambda\mathbf{I}$
Eigenvector: non-zero vector in the eigenspace
Algebraic Multiplicity
Let $\lambda_{i}$ be an eigenvalue of an $n \times n$ matrix $\mathbf{A}$. The algebraic multiplicity of the eigenvalue is its multiplicity as a root of the Characteristic Polynomial, that is, the largest $k$ such that $(\lambda - \lambda_{i})^{k}$ divides the polynomial.
Geometric Multiplicity
Let $\lambda_{i}$ be an eigenvalue of an $n \times n$ matrix $\mathbf{A}$. The geometric multiplicity of the eigenvalue is the dimension of the Eigenspace associated with the eigenvalue.
Computation
- Find the solution^[eigenvalues] of the Characteristic Polynomial.
- Find the solution^[eigenvectors] of the Under-Constrained System using the found eigenvalue.
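A small numpy sketch of this computation (illustrative symmetric matrix); it also checks the trace and determinant facts listed below:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(A)      # eigh: symmetric/Hermitian case

print(np.allclose(A @ eigvecs, eigvecs * eigvals))   # A v = lambda v, column-wise
print(np.isclose(np.trace(A), eigvals.sum()))        # trace = sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # det = product of eigenvalues
```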
Facts
There exists at least one eigenvector corresponding to the eigenvalue
Eigenvectors corresponding to different eigenvalues are always linearly independent.
When is a normal or real Symmetric Matrix, the eigendecomposition is called Spectral Decomposition
The Trace of the matrix is equal to the sum of the eigenvalues of the matrix.
Proof Since only the product of the diagonal terms of $\lambda\mathbf{I} - \mathbf{A}$ contributes to the $\lambda^{n-1}$ term (every cofactor expansion that uses an off-diagonal entry deletes a row and a column), the coefficient of $\lambda^{n-1}$ in $\det(\lambda\mathbf{I} - \mathbf{A})$ is $-\operatorname{tr}(\mathbf{A})$. Also, since $\lambda_{1}, \dots, \lambda_{n}$ are the solutions of the Characteristic Polynomial, the expression is factorized as $\prod_{i}(\lambda - \lambda_{i})$ and the coefficient of $\lambda^{n-1}$ becomes $-\sum_{i}\lambda_{i}$. Therefore, $\operatorname{tr}(\mathbf{A}) = \sum_{i}\lambda_{i}$.
The Determinant of the matrix is equal to the product of the eigenvalues of the matrix.
Not all matrices have linearly independent eigenvectors.
When holds,
- the eigenvalues of are and the eigenvectors of the matrix are the same as .
- the eigenvalues of is
- the eigenvalues of is
- the eigenvalues of is
- the eigenvalues of is
An eigenvalue’s Geometric Multiplicity cannot exceed its Algebraic Multiplicity
The matrix is diagonalizable if and only if the Geometric Multiplicity is equal to the Algebraic Multiplicity for all eigenvalues
Let $\mathbf{A}$ be a symmetric idempotent matrix with $\operatorname{rank}(\mathbf{A}) = r$. Then, $\mathbf{A}$ has $r$ eigenvalues equal to $1$ and the rest zero.
The non-zero eigenvalues of $\mathbf{A}\mathbf{B}$ are the same as those of $\mathbf{B}\mathbf{A}$.
Quadratic Forms and Positive Definite Matrix
Quadratic Form
Definition
$Q(\mathbf{x}) = \mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}$ where $\mathbf{A}$ is a Symmetric Matrix
A mapping where is a Module on Commutative Ring that has the following properties.
Matrix Expressions
Facts
Let $\mathbf{X}$ be a Random Vector and $\mathbf{A}$ be a symmetric matrix of constants. If $E(\mathbf{X}) = \boldsymbol{\mu}$ and $\operatorname{Var}(\mathbf{X}) = \boldsymbol{\Sigma}$, then the expectation of the quadratic form is $E(\mathbf{X}^{\intercal}\mathbf{A}\mathbf{X}) = \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}^{\intercal}\mathbf{A}\boldsymbol{\mu}$
Let be a Random Vector and be a symmetric matrix of constants. If and , , and , where , then the variance of the quadratic form is where is the column vector of diagonal elements of .
If and ‘s are independent, then If and ‘s are independent, then
Let , , where is a Symmetric Matrix and , then the MGF of is where ‘s are non-zero eigenvalue of
Let , where is Positive-Definite Matrix, then
Let , , where is a Symmetric Matrix and , then where
Let , , where are symmetric matrices, then are independent if and only if
Let , where are quadratic forms in Random Sample from If and is non-negative, then
- are independent
Let , , where , where , then
Link to original
Positive-Definite Matrix
Definition
A matrix $\mathbf{A}$, for which $\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}$ is positive for every non-zero column vector $\mathbf{x}$, is a positive-definite matrix
Facts
Let , then
- where is a matrix.
- The diagonal elements of are positive.
- For a Symmetric Matrix , for sufficiently small
where is a non-singular matrix.
all the leading minor determinants of are positive.
Link to original
Projection and Decomposition of Matrix
Projection Matrix
Definition
For some vector , is the projection of onto
Facts
The projection matrix is symmetric Idempotent Matrix
Consider a Symmetric Matrix , then is idempotent with rank if and only if eigenvalues are and eigenvalues are .
The projection matrix is Positive Semi-Definite Matrix
If and are projection matrices, and is Positive Semi-Definite Matrix, then
- is a projection matrix.
QR Decomposition
Definition
Decomposition of a matrix $\mathbf{A}$ into a product of an orthonormal matrix $\mathbf{Q}$ and an upper triangular matrix $\mathbf{R}$: $\mathbf{A} = \mathbf{Q}\mathbf{R}$.
Computation
Using the Gram Schmidt Orthonormalization
QR decomposition can be computed by Gram Schmidt Orthonormalization, where $\mathbf{Q}$ is the matrix of orthonormal column vectors obtained by the orthonormalization and $\mathbf{R} = \mathbf{Q}^{\intercal}\mathbf{A}$ is upper triangular.
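A minimal sketch of QR via classical Gram Schmidt Orthonormalization (assumes full column rank; the matrix and names are illustrative):

```python
import numpy as np

def gram_schmidt_qr(A):
    """QR decomposition via classical Gram-Schmidt (assumes full column rank)."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # projection coefficient
            v -= R[i, j] * Q[:, i]        # remove the component along q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]             # normalize
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))  # True True
```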
Cholesky Decomposition
Definition
Decomposition of a Positive-Definite Matrix into the product of lower triangular matrix and its Conjugate Transpose.
Computation
Let be Positive-Definite Matrix. Then, By setting the first-row first-column entry , can find other entries using substitution. By summarizing, ,
Link to originalSpectral Theorem
Definition
$\mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{*}$ where $\mathbf{U}$ is a Unitary Matrix, and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues
A matrix $\mathbf{A}$ is a Hermitian Matrix if and only if $\mathbf{A}$ is unitarily diagonalizable with real eigenvalues.
Facts
Every Hermitian Matrix is diagonalizable, and has real-valued Eigenvalues and orthonormal eigenvector matrix.
For the Hermitian Matrix , the every eigenvalue is real.
Proof For the eigendecomposition , . Since is hermitian, and norm is always real. Therefore, every eigenvalue is real.
For a Hermitian Matrix, the eigenvectors from different eigenvalues are orthogonal.
Proof Let the eigendecomposition where . By the property of hermitian matrix, and So, . Since , . Therefore, are orthogonal.
Singular Value Decomposition
Definition
An arbitrary $m \times n$ matrix $\mathbf{A}$ can be decomposed to $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\intercal}$.
For where and and is Diagonal Matrix
For where and and is Diagonal Matrix
Calculation
For matrix
is the matrix of orthonormal eigenvectors of is the matrix of orthonormal eigenvectors of is the Diagonal Matrix made of the square roots of the non-zero eigenvalues of and sorted in descending order.
If the eigendecomposition is , then the eigenvalues and the eigenvectors are orthonormal by Spectral Theorem. If the , then . Where are called the singular values.
Now, the Orthonormal Matrix is calculated using the singular values. Where is the Orthonormal Matrix of eigenvectors corresponding to the non-zero eigenvalues and is the Orthonormal Matrix of eigenvectors corresponding to zero eigenvalues where each , and is a rectangular Diagonal Matrix.
Since is an orthonormal matrix and is a Diagonal Matrix, . Now, the Orthonormal Matrix is calculated using the linear system and the null space of , Where is the Orthonormal Matrix of the vectors obtained from the system equation And is the Orthonormal Matrix of the vectors . which is corresponding to zero eigenvalues
The Orthonormal Matrix can also be formed by the eigenvectors of similarly to calculating of .
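A short numpy sketch (illustrative matrix) confirming $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\intercal}$ and that the squared singular values are the non-zero eigenvalues of $\mathbf{A}^{\intercal}\mathbf{A}$:

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))                      # A = U Sigma V^T

eigvals = np.linalg.eigvalsh(A.T @ A)                      # eigenvalues of A^T A
print(np.allclose(np.sort(s**2), np.sort(eigvals)[-2:]))   # s_i^2 = non-zero eigenvalues
```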
Facts
$\mathbf{U}$ and $\mathbf{V}$ are Unitary Matrices
Let be a real symmetric Positive Semi-Definite Matrix Then, the Eigendecomposition(Spectral Decomposition) of and the singular value decomposition of are equal.
where and are non-negative and the shape of the every matrix is
Visualization
every matrix can be decomposed as a composition of a rotation/reflection, a scaling, and another rotation/reflection, as listed below
Link to original
- $\mathbf{V}^{\intercal}$: rotation and reflection
- $\boldsymbol{\Sigma}$: scaling
- $\mathbf{U}$: rotation and reflection
Miscellanea in Matrix
Centering Matrix
Definition
$\mathbf{C}_{n} = \mathbf{I}_{n} - \frac{1}{n}\mathbf{J}_{n}$ where $\mathbf{J}_{n}$ is an $n \times n$ matrix where all elements are $1$
Summing Vector
Facts
$\frac{1}{n-1}\mathbf{x}^{\intercal}\mathbf{C}_{n}\mathbf{x} = s_{x}^{2}$, where $s_{x}^{2}$ is a sample variance of $\mathbf{x}$
Derivatives of Matrix
Definition
Differentiation by vector
where
Differentiation by Matrix
where
Calculation
Let be a vector function, be an -dimensional vector, then By the definition of derivative, we hold where
Examples
Let $\mathbf{a}, \mathbf{x}$ be $n$-dimensional vectors, then $\cfrac{\partial \mathbf{a}^{\intercal}\mathbf{x}}{\partial \mathbf{x}} = \cfrac{\partial \mathbf{x}^{\intercal}\mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}$
Let be an -dimensional vector and be an matrix, then
Let $\mathbf{x}$ be an $n$-dimensional vector and $\mathbf{A}$ be an $n \times n$ Symmetric Matrix, then the derivative of the quadratic form is $\cfrac{\partial \mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{A}\mathbf{x}$
Hessian Matrix
Definition
A square matrix of second-order partial derivative of a scalar-valued function
Vectorization
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix, then $\operatorname{vec}(\mathbf{A})$ is the $mn \times 1$ column vector obtained by stacking the columns of $\mathbf{A}$
A linear transformation which converts the matrix into a vector.
Facts
Let $\mathbf{A}, \mathbf{B}, \mathbf{C}$ be matrices, then $\operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^{\intercal} \otimes \mathbf{A})\operatorname{vec}(\mathbf{B})$
Kronecker Product
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix and $\mathbf{B}$ be a $p \times q$ matrix, then the Kronecker product $\mathbf{A} \otimes \mathbf{B}$ is the $mp \times nq$ block matrix $\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{pmatrix}$
Facts
Let $\mathbf{A}, \mathbf{B}$ be matrices, then
where $\mathbf{A}$ and $\mathbf{B}$ are square matrices, each eigenvalue of $\mathbf{A} \otimes \mathbf{B}$ is the product of an eigenvalue of $\mathbf{A}$ and an eigenvalue of $\mathbf{B}$
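A small numpy sketch (illustrative matrices) checking the vectorization identity above and the eigenvalue fact for the Kronecker product:

```python
import numpy as np

# vec(ABC) = (C^T kron A) vec(B)
rng = np.random.default_rng(1)
A, B, C = rng.normal(size=(2, 3)), rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
vec = lambda M: M.reshape(-1, order="F")          # stack columns
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))   # True

# Eigenvalues of a Kronecker product of square matrices are products of eigenvalues.
P = np.array([[2.0, 1.0], [1.0, 3.0]])
Q = np.diag([1.0, 2.0, 5.0])
ev_kron = np.linalg.eigvalsh(np.kron(P, Q))
ev_prod = np.sort(np.outer(np.linalg.eigvalsh(P), np.linalg.eigvalsh(Q)).ravel())
print(np.allclose(ev_kron, ev_prod))                           # True
```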
Norms
Norm
Definition
A real-valued function with the following properties
- Positive definiteness: $\|x\| \geq 0$ and $\|x\| = 0 \iff x = 0$
- Absolute Homogeneity: $\|\alpha x\| = |\alpha|\,\|x\|$
- Sub-additivity or Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$, where $x, y$ are vectors on a vector space and $\alpha$ is a scalar
Vector
Real
Let $\mathbf{x}$ be an $n$-dimensional vector, then the $L_{p}$ norm of the $\mathbf{x}$ is defined as $\|\mathbf{x}\|_{p} = \left( \sum_{i=1}^{n} |x_{i}|^{p} \right)^{1/p}$
Complex
Let $\mathbf{z}$ be an $n$-dimensional complex vector, then the $L_{p}$ norm of the $\mathbf{z}$ is defined as $\|\mathbf{z}\|_{p} = \left( \sum_{i=1}^{n} |z_{i}|^{p} \right)^{1/p}$ where $|z_{i}|$ is the complex modulus
Function
: norm of
Facts
a norm can be induced by an Inner product: $\|x\| = \sqrt{\langle x, x \rangle}$
Schatten Norm
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix, then the Schatten $p$-norm of $\mathbf{A}$ is defined as $\|\mathbf{A}\|_{p} = \left( \sum_{i} \sigma_{i}^{p}(\mathbf{A}) \right)^{1/p}$ where $\sigma_{i}(\mathbf{A})$ is the $i$-th singular value of $\mathbf{A}$
Facts
Frobenius Norm
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix, then the Frobenius norm of $\mathbf{A}$ is defined as $\|\mathbf{A}\|_{F} = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^{2}}$
Matrix p-Norm
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix, then the matrix p-norm of $\mathbf{A}$ is defined as $\|\mathbf{A}\|_{p} = \sup_{\mathbf{x} \neq \mathbf{0}} \cfrac{\|\mathbf{A}\mathbf{x}\|_{p}}{\|\mathbf{x}\|_{p}}$
Facts
When $p = 2$, the norm is a Spectral Norm.
Spectral Norm
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix, then the spectral norm of $\mathbf{A}$ is defined as $\|\mathbf{A}\|_{2} = \sigma_{\max}(\mathbf{A})$, the largest singular value of $\mathbf{A}$
Nuclear Norm
Definition
Let $\mathbf{A}$ be an $m \times n$ matrix, then the nuclear norm of $\mathbf{A}$ is defined as $\|\mathbf{A}\|_{*} = \sum_{i} \sigma_{i}(\mathbf{A})$, the sum of the singular values
L-pq Norm
Definition
norm is an entry-wise matrix norm. where
Facts
When $p = q = 1$, the norm is the sum of the absolute values of every entry and is called the $L_{1,1}$ matrix norm.
Rayleigh-Ritz Theorem
Definition
Let $\mathbf{A}$ be an $n \times n$ Hermitian Matrix with eigenvalues sorted in descending order $\lambda_{1} \geq \lambda_{2} \geq \dots \geq \lambda_{n}$, then $\lambda_{n} \leq \cfrac{\mathbf{x}^{*}\mathbf{A}\mathbf{x}}{\mathbf{x}^{*}\mathbf{x}} \leq \lambda_{1}$ for every $\mathbf{x} \neq \mathbf{0}$
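A quick numerical illustration (arbitrary symmetric matrix and random vectors) that the Rayleigh quotient stays between the extreme eigenvalues:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam = np.linalg.eigvalsh(A)                  # ascending order
rng = np.random.default_rng(2)
for _ in range(5):
    x = rng.normal(size=3)
    r = (x @ A @ x) / (x @ x)                # Rayleigh quotient
    assert lam[0] - 1e-12 <= r <= lam[-1] + 1e-12
print("all Rayleigh quotients lie in [lambda_min, lambda_max]")
```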
Link to original
Elementary Statistical Theory
Random Variable and Random Sampling
Random Variable
Definition
A random variable is a function whose inverse function is -measurable for the two measurable spaces and .
The inverse image of an arbitrary Borel set of Codomain is an element of sigma field .
Notations
Consider a probability space
- Outcomes:
- Set of outcomes (Sample space):
- Events:
- Set of events (Sigma-Field):
- Probabilities:
- Random variable:
For a random variable on a Probability Space
- if and only if the Distribution of is
- if and only if the Distribution Function of is
For a random variable on Probability Space and another random variable on Probability Space
Link to original
Expected Value
Definition
The expected value of the Random Variable on Probability Space
Continuous
where is a PDF of Random Variable
The expected value of the Random Variable when satisfies absolute continuous over ,
Discrete
where is a PMF of Random Variable
The expected value of the Random Variable when satisfies absolute continuous over , In other words, has at most countable jumps.
Expected Value of a Function
Continuous
Discrete
Properties
Linearity
Random Variables
If , then
Matrix of Random Variables
Let be a matrices of random variables, be matrices of constants, and a matrix of constant. Then,
Notations
Link to original
Expression Discrete Continuous Expression for the event and the probability Expression for the measurable space and the distribution
Multivariate Distribution
Definition
Joint CDF
The Distribution Function of a Random Vector is defined as where
Joint PDF
The joint probability Distribution Function of discrete Random Vector
The joint probability Distribution Function of continuous Random Vector
Expected Value of a Multivariate Function
Continuous
$$E[g(\mathbf{X})] = \idotsint\limits_{x_{n} \dots x_{1}} g(\mathbf{x})f(\mathbf{x})\,dx_{1} \dots dx_{n}$$
Discrete
Marginal Distribution of a Multivariate Function
Marginal CDF
Marginal PDF
$$f_{X_{1}}(x_{1}) = \idotsint\limits_{x_{n} \dots x_{2}} f(x_{1}, x_{2}, \dots, x_{n})\,dx_{2} \dots dx_{n}$$
Conditional Distribution of a Multivariate Function
Properties
Linearity
If , then
Link to original
Covariance
Definition
Properties
Covariance with Itself
Covariance of Linear Combinations
Link to original
Distribution Function
Definition
A distribution function is a function for the Random Variable on Probability Space
Facts
Proof By definition, So, defining is the Equivalence Relation to defining
Since is a Pi-System, is uniquely determined by by Extension from Pi-System
Now, is a Measurable Space induced by . Therefore, defining on is equivalent to the defining on
Distribution function(CDF) has the following properties
- Monotonic increasing:
- Right-continuous:
If a function satisfies the following properties, then is a distribution function(CDF) of some Random Variable
- Monotonic increasing:
- Right-continuous:
Link to original
Bernoulli Distribution
Definition
where is a probability of success
number of success in a single trial with success probability
Bernoulli Process
The i.i.d. Random Vector with Bernoulli distribution
Properties
Mean
Variance
Link to original
Binomial Distribution
Definition
where is the length of bernoulli process, and is a probability of success
The number of successes in length bernoulli process with success probability
Properties
Mean
Variance
MGF
Facts
Link to originalLet be independent random variables following binomial distribution, then
Multinomial Distribution
Definition
where is the number of trials, is a probability of category
Distribution that describes the probability of observing a specific combination of outcomes
Properties
where , , and
MGF
Marginal PDF
Each one-variable marginal pdf is Binomial Distribution, each two-variables marginal pdf is Trinomial Distribution, and so on.
Link to original
Poisson Distribution
Definition
where is the average number of occurrences in a fixed interval of time
The number of occurrences in a fixed interval of time with mean
Properties
where
Mean
Variance
MGF
Summation
Let , and ‘s are independent, then
Link to original
Normal Distribution
Definition
where is the location parameter(mean), and is the scale parameter(variance)
Standard Normal Distribution
Properties
Mean
Variance
MGF
Higher Order Moments
Sum of Normally Distributed Random Variables
Let be independent random variables following normal distribution, then
Relationship with Chi-squared Distribution
Let be a standard normal distribution, then
Facts
Link to originalLet , then
Chi-squared Distribution
Definition
where is the degrees of freedom
squared sum of independent standard normal distributions
Properties
Mean
Variance
MGF
Additivity
Let ‘s are independent chi-squared distributions
Facts
Let and , and is independent of , then
Link to originalLet , where are quadratic forms in , where each element of the is a Random Sample from If , then
- are independent
Student's t-Distribution
Definition
Let be a standard normal distribution, be a Chi-squared Distribution, and be independent, then where is the degrees of freedom
Properties
Mean
Variance
Link to original
F-Distribution
Definition
Let be independent random variables following Chi-squared distributions, then where are the degrees of freedoms
Properties
PDF
$$f(x) = \frac{\Gamma\left(\frac{r_{1} + r_{2}}{2}\right) \left(\frac{r_{1}}{r_{2}}\right)^{\frac{r_{1}}{2}} x^{\frac{r_{1}}{2}-1}} {\Gamma\left(\frac{r_{1}}{2}\right) \Gamma\left(\frac{r_{2}}{2}\right) \left(\frac{r_{1}}{r_{2}}x+1\right)^{(r_{1} + r_{2})/2}}$$
Mean
$$E(X) = \frac{r_{2}}{r_{2} - 2},\quad r_{2}>2$$
Variance
$$\operatorname{Var}(X) = \frac{2r_{2}^{2}(r_{1}+r_{2}-2)}{r_{1} (r_{2}-2)^{2}(r_{2}-4)},\quad r_{2}>4$$
Random Vector
Definition
Column vector whose components are random variables .
Link to original
Multivariate Normal Distribution
Definition
where is the number of dimensions, is the vector of location parameters, and is the vector of scale parameters
Standard Multivariate Normal Distribution
MGF
Properties
Mean
Variance
MGF
Affine Transformation
Let be a Random Variable following multivariate normal distribution, be a matrix, and be a dimensional vector, then
Relationship with Chi-squared Distribution
Suppose be a Random Variable following multivariate normal distribution, then
Facts
Let , be an , and be a -dimensional vector, then
Let , , , , where is and is vectors. Then, and are independent
Link to originalLet , , and , where is , is matrices. and are independent
Statistical Estimation and Testing
Bias of an Estimator
Definition
Let be a Random Sample from where is a parameter, and be a Statistic.
is unbiased estimator An estimator is unbiased if its bias is equal to zero for all values of parameter .
Link to original
Consistency
Definition
A Statistic is called consistent estimator of if converges in probability to
Link to original
Minimum Variance Unbiased Estimator
Definition
An estimator satisfying the following is the minimum variance unbiased estimator (MVUE) for where is an unbiased estimator
An Unbiased Estimator that has lower variance than any other unbiased estimator for the parameter
Facts
A minimum variance unbiased estimator does not always exist.
If some unbiased estimator’s variance is equal to the Rao-Cramer Lower Bound for all , then it is a minimum variance unbiased estimator.
Least Squares Estimator
Definition
Simple Linear Regression Case
Consider a Simple Linear Regression model The least square estimator is the estimator that minimizes
The least square estimator of the model is where and
Estimation of
where ‘s are residuals
Multiple Linear Regression Case
Consider a Multiple Linear Regression model The least square estimator is the estimator that minimizes
The least square estimator of the model is
Fitted response vector is expressed as where is called the hat matrix.
Estimation of
where , ‘s are residuals, and is the number of the explanatory variables
Facts
The least square estimator is a linear combination of ‘s Let , then
The fitted line always go through
Link to originalLet , where ‘s are independent, , and be the 2nd, 3rd, and 4th Central Moment of respectively. Then, is the unique non-negative quadratic Unbiased Estimator of with minimum variance when the excess kurtosis is or when the diagonal elements of the hat matrix are equal.
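A minimal numpy sketch of the multiple-regression least squares computation (simulated data; names such as `beta_true` are illustrative):

```python
import numpy as np

# beta_hat = (X^T X)^{-1} X^T y, hat matrix H = X (X^T X)^{-1} X^T,
# and sigma^2 estimated from the residuals.
rng = np.random.default_rng(3)
n, p = 50, 2                                  # p explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                  # least squares estimator
H = X @ XtX_inv @ X.T                         # hat matrix (projection onto col(X))
resid = y - H @ y
sigma2_hat = resid @ resid / (n - p - 1)      # unbiased estimator of sigma^2
print(beta_hat, sigma2_hat)
```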
Maximum Likelihood Estimation
Definition
MLE is the method of estimating the parameters of an assumed Distribution
Let be Random Sample with PDF , where , then the MLE of is estimated as
Regularity Conditions
- R0: The pdfs are distinct, i.e.
- R1: The pdfs have same supports
- R2: The true value is an interior point in
- R3: The pdf is twice differentiable with respect to
- R4:
- R5: The pdf is three times differentiable with respect to , , and interior point
Properties
Functional Invariance
If is the MLE for , then is the MLE of
Consistency
Under R0 ~ R2 Regularity Conditions, let be a true parameter, is differentiable with respect to , then has a solution such that
Asymptotic Normality
Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , be a consistent Sequence of solutions of MLE equation , and , then where is the Fisher Information.
By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions
Asymptotic Confidence Interval
By the asymptotic normality of MLE, Thus, confidence interval of for is
Delta method for MLE Estimator
Under the R0 ~ R5 Regularity Conditions, let be a continuous function and , then
Facts
Under R0 and R1 regularity conditions, let be a true parameter, then
Taylor Approximation
Taylor Series
Definition
Taylor Series for One Variable
Let $f$ be an infinitely differentiable function at the point $c$, then $f(x) = \sum\limits_{n=0}^{\infty} \cfrac{f^{(n)}(c)}{n!}(x-c)^{n}$
Taylor Series for Two Variables
Let $f(t, x)$ be an infinitely differentiable function at the point $(t_{0}, x_{0})$, then
$$\begin{aligned} f(t, x) = f(t_{0}, x_{0}) &+ \frac{\partial f}{\partial t}(t_{0}, x_{0})(t-t_{0}) + \frac{\partial f}{\partial x}(t_{0}, x_{0})(x-x_{0})\\ &+ \frac{1}{2}\frac{\partial^{2} f}{\partial t^{2}}(t_{0}, x_{0})(t-t_{0})^{2} + \frac{1}{2}\frac{\partial^{2} f}{\partial x^{2}}(t_{0}, x_{0})(x-x_{0})^{2} + \frac{\partial^{2} f}{\partial t \partial x}(t_{0}, x_{0})(x-x_{0})(t-t_{0})\\ &+ \dots \end{aligned}$$
Approximation
Let $f: \mathbb{R} \to \mathbb{R}$ be a $k$-th differentiable function at the point $c \in \mathbb{R}$ and $|x-c| \to 0$, then
$$f(x) = \sum\limits_{n=0}^{k} \cfrac{f^{(n)}(c)}{n!}(x-c)^{n} + o(|x-c|^{k})$$
where $o$ is [[Big O Notation|little oh]] notation
Let $f: \mathbb{R} \to \mathbb{R}$ be a $k+1$-th differentiable function at any point and $|f^{(k+1)}(x)| \leq M < \infty$, then by [[Rolle's Theorem]]
$$f(x) = \sum\limits_{n=0}^{k} \cfrac{f^{(n)}(c)}{n!}(x-c)^{n} + \frac{1}{(k+1)!}f^{(k+1)}(\xi)(x-c)^{k+1}$$
where $\xi \in (x, c)$
Maclaurin Series
$$f(x) = \sum\limits_{n=0}^{\infty} \cfrac{f^{(n)}(0)}{n!}x^{n} = f(0)+f'(0)x + \cfrac{f''(0)}{2!}x^{2}+ \dots +\cfrac{f^{(m)}(0)}{m!}x^m + \dots$$
Taylor Series with $c=0$
Derivation from the [[Fundamental Theorem of Calculus]]
Let $f$ be an infinitely [[Differentiability|differentiable]] function. By applying [[Fundamental Theorem of Calculus|FTC]] $n$ times, we can expand the function $f(x)$ as follows:
$$\begin{aligned} f(x) &= \int\limits_{c}^{x} f'(x_{1}) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} f''(x_{2}) dx_{2} + f'(c) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} f''(x_{2}) dx_{2} dx_{1} + \int\limits_{c}^{x} f'(c) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\int\limits_{c}^{x_{2}} f'''(x_{3}) dx_{3} dx_{2} dx_{1} + \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} f''(c) dx_{2} dx_{1} + \int\limits_{c}^{x} f'(c) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{n-1}} f^{(n)}(x_{n}) dx_{n} dx_{n-1} \dots dx_{1} + \sum_{i=0}^{n} \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-1}} f^{(i)}(c) dx_{i} dx_{i-1} \dots dx_{1} \end{aligned}$$
The integrals inside the summation can be simplified
$$\begin{aligned} \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-1}} f^{(i)}(c) dx_{i} dx_{i-1} \dots dx_{1} &= f^{(i)}(c)\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-1}} 1\, dx_{i} dx_{i-1} \dots dx_{1} \\ &= f^{(i)}(c)\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-2}} (x_{i-1} - c) dx_{i-1} dx_{i-2} \dots dx_{1} \\ &= f^{(i)}(c)\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-3}} \frac{1}{2}((x_{i-2} - c)^2 - (c - c)^2) dx_{i-2} dx_{i-3} \dots dx_{1} \\ &\dots \\ &= f^{(i)}(c)\frac{(x - c)^i}{i!} \end{aligned}$$
Therefore, we have
$$f(x) = \underbrace{\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{n-1}} f^{(n)}(x_{n}) dx_{n} dx_{n-1} \dots dx_{1}}_{\text{Error term}} + \underbrace{\sum_{i=0}^{n} f^{(i)}(c)\frac{(x - c)^i}{i!}}_{\text{Polynomial terms}}$$
Delta Method
Definition
Univariate Delta Method
Let be a sequence of random variables satisfying , be a differentiable function at , and , then
Proof
By Taylor Series approximation
Where by the assumption. By the continuous mapping theorem, , which is a function of , also converges to random variable, so it is Boundedness in Probability by the property of converging random vector
Therefore, by the property of sequence of random variables bounded in probability
Examples
Estimation of the sample variance of Bernoulli Distribution
Let by CLT
by Delta method
Let , then
Therefore, the sample mean and variance follow such distributions
Visualization
- x-axis:
- y-axis:
- y1: ^[variance by mean]
- y2: ^[sample mean]
- y3, y4: ^[sample variance that calculated by the sample mean]
- y5: first order approximated line at
If sample size , variance of sample mean . So, sample variance can be well approximated by first-order approximation.
Link to original
Newton's Method
Definition
An iterative algorithm for finding the roots of a differentiable function, which are solution to the equation
Algorithm
Find the next point such that the Taylor series of the given point is 0 Taylor first approximation: The point such that the Taylor series is 0:
multivariate version:
In convex optimization,
Find the minimum point^[its derivative is 0] of Taylor quadratic approximation. Taylor quadratic approximation: The derivative of the quadratic approximation: The minimum point of the quadratic approximation^[the point such that the derivative of the quadratic approximation is 0]: multivariate version:
Examples
Solution of a linear system
Solve with an MSE loss
The cost function is and its gradient and hessian are ,
Then, solution is If is invertible, is a Least Square solution.
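A small sketch (simulated system) showing that a single Newton step on the quadratic MSE cost reaches the least squares solution, since the gradient is $\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{x} - \mathbf{b})$ and the Hessian is $\mathbf{A}^{\intercal}\mathbf{A}$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))
b = rng.normal(size=6)

x = np.zeros(3)                               # initial point
grad = A.T @ (A @ x - b)                      # gradient of ||Ax - b||^2 / 2
hess = A.T @ A                                # Hessian
x = x - np.linalg.solve(hess, grad)           # one Newton update

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference least-squares solution
print(np.allclose(x, x_ls))                   # True
```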
Link to original
Multiple Testing
Family-Wise Error Rate
Definition
Consider a result of Multiple Testing
fact \ test result do not reject reject total is true is true total where is the number of rejected null hypothesis, and is the total number of hypotheses tested.
The family-wise error rate (FWER) is the probability of making at least one type 1 error in the family Therefore, by assuring , the probability of making one or more type 1 errors in the family is controlled at level
Link to original
Bonferroni's Method
Definition
Consider Multiple Testing problem , and let be an event of type 1 error for . Then, by the Bonferroni Inequality, Therefore, to control Family-Wise Error Rate at , we need to control
The Bonferroni method is too conservative and consequently, it is not a powerful test.
Link to original
Sidak Method
Definition
Consider Multiple Testing problem , and let be an event of type 1 error for By the definition of the type 1 error in multiple testing, Therefore, to control Family-Wise Error Rate at , Sidak method uses instead of the
Link to original
False Discovery Rate
Definition
where is the number of false discoveries and is the number of discoveries
The false discovery rate (FDR) is the expected proportion of falsely rejected we control FDR at some ,
Facts
If Family-Wise Error Rate is controlled, then FDR also be controlled.
Link to original
Let $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$ be the ordered $p$-values of $m$ hypotheses, and $q$ is given. If $k = \max\left\{ i : p_{(i)} \leq \frac{i}{m}q \right\}$, then the FDR-controlling (Benjamini-Hochberg) procedure rejects the hypotheses corresponding to $p_{(1)}, \dots, p_{(k)}$
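A toy comparison (made-up $p$-values) of Bonferroni's Method and the Benjamini-Hochberg step-up rule described above:

```python
import numpy as np

p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.76])
m, alpha = len(p), 0.05

bonf_reject = p <= alpha / m                       # Bonferroni: reject if p_i <= alpha/m

order = np.argsort(p)
below = p[order] <= (np.arange(1, m + 1) / m) * alpha
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True                        # BH: reject the k smallest p-values

print(bonf_reject.sum(), bh_reject.sum())          # BH typically rejects more (here 1 vs 2)
```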
Simple Linear Regression Model
Simple Linear Regression Model
Simple Linear Regression
Definition
where ‘s are i.i.d. error terms, with and
Link to original
Estimation of Regression Coefficients
Least Squares Estimator
Definition
Simple Linear Regression Case
Consider a Simple Linear Regression model The least square estimator is the estimator that minimizes
The least square estimator of the model is where and
Estimation of
where ‘s are residuals
Multiple Linear Regression Case
Consider a Multiple Linear Regression model The least square estimator is the estimator that minimizes
The least square estimator of the model is
Fitted response vector is expressed as where is called the hat matrix.
Estimation of
where , ‘s are residuals, and is the number of the explanatory variables
Facts
The least square estimator is a linear combination of ‘s Let , then
The fitted line always go through
Link to originalLet , where ‘s are independent, , and be the 2nd, 3rd, and 4th Central Moment of respectively. Then, is the unique non-negative quadratic Unbiased Estimator of with minimum variance when the excess kurtosis is or when the diagonal elements of the hat matrix are equal.
Least Absolute Deviation Estimator
Definition
Consider a Simple Linear Regression model The least absolute deviation estimator are the estimator that minimize
Link to original
Gauss-Markov Theorem
Definition
Simple Linear Regression Case
Consider a Simple Linear Regression model Let and , then the Least Squares Estimator has the minimum variance among the all unbiased linear estimators, i.e. BLUE
Multiple Linear Regression Case
Consider a Multiple Linear Regression model Let , then the Least Squares Estimator has the minimum variance among all the unbiased and linear estimators, i.e. BLUE
Link to original
Maximum Likelihood Estimation
Definition
MLE is the method of estimating the parameters of an assumed Distribution
Let be Random Sample with PDF , where , then the MLE of is estimated as
Regularity Conditions
- R0: The pdfs are distinct, i.e.
- R1: The pdfs have same supports
- R2: The true value is an interior point in
- R3: The pdf is twice differentiable with respect to
- R4:
- R5: The pdf is three times differentiable with respect to , , and interior point
Properties
Functional Invariance
If is the MLE for , then is the MLE of
Consistency
Under R0 ~ R2 Regularity Conditions, let be a true parameter, is differentiable with respect to , then has a solution such that
Asymptotic Normality
Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , be a consistent Sequence of solutions of MLE equation , and , then where is the Fisher Information.
By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions
Asymptotic Confidence Interval
By the asymptotic normality of MLE, Thus, confidence interval of for is
Delta method for MLE Estimator
Under the R0 ~ R5 Regularity Conditions, let be a continuous function and , then
Facts
Under R0 and R1 regularity conditions, let be a true parameter, then
Goodness-of-Fit of the Regression Line
Decomposition of Sum of Squares
Definition
Simple Linear Regression Case
Consider a Simple Linear Regression model The total sum of squares is expressed by the sum of the unexplained variation and the explained variation where
- is the total sum of squares and has the degree of freedom ,
- is the error sum of squares and has the degree of freedom , and
- is the regression sum of squares and has the degree of freedom
Multiple Linear Regression Case
Consider a Multiple Linear Regression model
- has the degree of freedom ,
- has the degree of freedom , and
- has the degree of freedom
Facts
Link to original
Coefficient of Determination
Definition
The coefficient of determination of the linear regression model is defined as
Facts
The coefficient of determination is a squared value of Pearson Correlation Coefficient
Analysis of Variance
One-way ANOVA
One-way ANOVA is used to analyze the significance of differences of means between groups. Let an -th response of the -th group be where
We want to test the null hypothesis i.e. there is no treatment effect. where is the total number of observations, represents the mean of the -th group and represents the overall mean, of the numerator indicates between-group variance (SSB) and of the denominator indicates within-group variance (SSW)
The Likelihood Ratio Test rejects if
Link to original
Statistical Inference
Confidence Interval for the Mean in Simple Linear Regression
Consider a Simple Linear Regression model The confidence region for the mean response , when is given, is defined as where
Link to original
Prediction interval for a New Response in Linear Regression
Definition
Consider a Simple Linear Regression model The prediction interval for a new response , when is given, is defined as where
Facts
The prediction interval is always wider than the Confidence Interval.
Multiple Linear Regression Model
Multiple Linear Regression Model
Multiple Linear Regression
Definition
where ‘s are i.i.d. error terms, with and
Matrix Notations
Let and , then the regression model can be express the model as
Let , , , then the regression model can be express as where , and
Link to original
Statistical Inference
Distribution of Regression Coefficient
Definition
Distribution of Regression Coefficient
Assume that The Least Squares Estimator also follows Normal Distribution. where is the number of explanatory variables.
Marginal Distribution of Regression Coefficient
The marginal distribution of the Multivariate Normal Distribution is univariate Normal Distribution where
Link to original
Confidence Interval for Regression Coefficient
Definition
Assume that The confidence interval for is defined as where is the Standard Error for Regression Coefficient , and is the number of explanatory variables.
Link to original
Standard Error for Regression Coefficient
Definition
Assume that The standard error for regression coefficient is calculated as where , , and is the number of explanatory variables.
Link to original
Hypothesis testing for Regression Coefficient
Definition
Assume that The null hypothesis can be tested with the test statistic where is the Standard Error for Regression Coefficient .
Link to original
Joint Confidence Region for Regression Coefficient
Definition
Assume that The join confidence region for is defined as where and is the number of explanatory variables.
The joint confidence region is ellipsoidal shape whose center is by the inequality.
Link to original
Simultaneous Confidence Interval for Regression Coefficient
Definition
Simultaneous confidence interval is used when computing confidence intervals for parameters simultaneously.
Joint confidence region provides the accurate elliptical area as the confidence region, but the calculations and interpretations are complex. To obtain a rectangular-shaped confidence interval, the simultaneous confidence interval is used. When multiplying the confidence intervals of each coefficient, the resulting confidence region becomes smaller than the desired area. Therefore, to obtain the “at least ” confidence region, correction method is used. Conservative methods like the Bonferroni’s Method, provide a confidence region much larger than our desired , which satisfies the condition of being at least but reduces the power of the test. To address this issue, other methods have been devised.
Bonferroni’s Method
Assume that The Bonferroni confidence interval for is defined as where is the Standard Error for Regression Coefficient and is the number of explanatory variables.
Scheffe Method
Assume that The Scheffe confidence interval for is defined as where is the Standard Error for Regression Coefficient and is the number of explanatory variables.
Maximum Modulus Method
Assume that The Maximum Modulus confidence interval for is defined as where is a Random Variable representing the maximum of the absolute of independent random variables following , is the Standard Error for Regression Coefficient , and is the number of explanatory variables.
Facts
Maximum Modulus method < Bonferroni’s method < Scheffe’s method Here, the second inequality holds only when the degree of freedom of comparisons is relatively small compared to the number of groups being compared.
Confidence Interval for the Mean in Multiple Linear Regression
Consider a Multiple Linear Regression model The confidence region for the mean response , when is given, is defined as where and is the number of explanatory variables
Link to original
Partial F-Test
Definition
Consider a Multiple Linear Regression model and the hypothesis
Define the full model as and under , define the reduced model as
We reject if is large, where and are the regression sum of squares of the full and reduced model, respectively. By the equality , we can test the instead of the
and . Therefore, where and
Hence, we reject if and we call it partial F-test
Link to original
The General Linear Test
Definition
Consider a Multiple Linear Regression model The general linear test can be written as where is matrix with rank and is vector.
e.g. for the hypothesis ,
and . Therefore, where
Hence, we reject if
Method of Lagrange Multiplier
By the method of Lagrange multiplier, Therefore, we can calculate the F-statistic without fitting the reduced model
Link to original
Lack of Fit Test
Lack of Fit Test
Definition
Consider a Multiple Linear Regression model We want to test the linearity of the model
Assume that is the -th replication at , then The is decomposed to the sum of the pure error sum of squares and the lack-of-fit sum of squares .
The degree of freedoms are , , and , where
Also, the mean squares are obtained by dividing the sum of squares by its degree of freedom
Under , the is an unbiased estimator of , and is always an unbiased estimator of , regardless of . Therefore, we can test the hypothesis by comparing ratios of both estimators.
and . Therefore,
Hence, we reject if
Facts
The lack of fit test uses the data’s replications. Therefore, there should be more than one observation (replication) for some $x_{i}$’s.
Miscellanea
Multicollinearity
Definition
If two or more covariates are highly correlated, in a multiple regression model, the model has a multicollinearity The multicollinearity makes the estimation of coefficient very unstable. So, the estimated coefficients are not reliable.
Link to original
Regression Diagnostics
Residual
Studentized Residual
Definition
A studentized residual is a technique used to detect outliers
Internally Studentized Residual
where
Externally Studentized Residual
where is the unbiased estimator of based on observations after deleting the -th observation
Facts
Link to original
Leverage (Statistics)
Consider a Multiple Linear Regression model
is the -th leverage, the distance between and
A measure of how far away the independent variable value of an observation is from those of the other observations It is used to detect outlier
Facts
$\sum_{i=1}^{n} h_{ii} = p + 1$, where $p$ is the number of explanatory variables
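A short numpy sketch (simulated regression) computing the hat matrix, leverages, and internally studentized residuals; it also checks that the leverages sum to the number of parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                   # leverages
e = y - H @ y                                    # residuals
s2 = e @ e / (n - p - 1)                         # MSE
r = e / np.sqrt(s2 * (1 - h))                    # internally studentized residuals

print(np.isclose(h.sum(), p + 1), np.abs(r).max())
```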
Influence Measures
Cook's Distance
Definition
Single Observation
Consider a Multiple Linear Regression model where is the number of explanatory variables, and is the Leverage (Statistics)
is the Cook’s distance. The more influential the data point, the larger the Cook’s distance
Set of Observations
Consider a Multiple Linear Regression model where is the number of explanatory variables, and is the matrix of leverages
Link to original
Andrews-Pregibon Statistic
Definition
Single Observation
Consider a Multiple Linear Regression model where is the Leverage (Statistics) of the matrix
is the Andrews-Pregibon statistic. The more influential the data point, the smaller the Andrews-Pregibon statistic
Set of Observations
Consider a Multiple Linear Regression model where is the matrix of leverages
Link to original
DFBETAS
Definition
The DFBETAS considers the influence of the $i$-th observation on the estimated coefficient $\hat{\beta}_{j}$
Single Observation
Consider a Multiple Linear Regression model where is the externally studentized residual and is the element of the matrix
Set of Observations
Consider a Multiple Linear Regression model where is the matrix of leverages and is the element of the matrix
Link to original
DFFITS
Definition
The DFFITS consider the influence of the -th observation on
Single Observation
Consider a Multiple Linear Regression model where is the externally studentized residual
Set of Observations
Consider a Multiple Linear Regression model where is the externally studentized residual and is the matrix of leverages
Link to original
COVRATIO
Definition
The COVRATIO consider the influence of the -th observation on
Single Observation
Consider a Multiple Linear Regression model
Set of Observations
Consider a Multiple Linear Regression model
Link to original
Multicollinearity
Variance Inflation Factor
Definition
Consider a Multiple Linear Regression model where is the number of explanatory variables.
The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is increased because of Multicollinearity
Link to original
Condition Number
Definition
Consider a matrix $\mathbf{A}$ with singular values $\sigma_{\max} = \sigma_{1} \geq \sigma_{2} \geq \dots \geq \sigma_{n} = \sigma_{\min} > 0$
$\kappa(\mathbf{A}) = \cfrac{\sigma_{\max}}{\sigma_{\min}}$ is the condition number of $\mathbf{A}$
Facts
In linear regression, if the condition number of the design matrix $\mathbf{X}$ is large, the model has a Multicollinearity problem.
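A small sketch (simulated, nearly collinear covariates) computing the condition number of the design matrix and variance inflation factors; the helper `vif` is illustrative, not a library routine:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])

print(np.linalg.cond(X))                      # large => multicollinearity

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X_j on the other columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    y_hat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
    return 1 / (1 - r2)

print(vif(X, 0), vif(X, 1))                   # both large here
```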
Autocorrelation and Durbin-Watson Test
Durbin-Watson Test
Definition
Test used to detect the presence of autocorrelation at lag in the residual
Link to originalAssume is the residual structure of where is the number of observations and
indicates no autocorrelation
Selection of Regression Models
Adjusted R_squared Value
Definition
where is the number of explanatory variables.
Link to original
Mallow's Cp
Definition
where is the number of explanatory variables, is a sample variance under the full model, and
Mallow’s is used to assess the fit of a regression model. A small value of means that the model is relatively precise.
Link to original
PRESS statistic
Definition
where is an estimated value of calculated by the data excluding
The predicted residual error sum of squares (PRESS) is a form of cross validation used in regression analysis
Link to original
Model Validation
Cross Validation
Definition
Partition a set of data into sets and denote a function such that . For the dataset , let be the estimator based on observations except , then the cross validation estimation for the prediction error is defined as where is a loss function
Facts
If the data is partitioned into $K$ groups with equal size, then it is called a $K$-fold cross validation.
Miscellanea in Model Selection
Jensen's Inequality
Definition
Let be a Convex Function on an interval , be a Random Variable with support , and , then
Facts
Link to original
By Jensen’s Inequality, the following relation is satisfied: arithmetic mean $\geq$ geometric mean $\geq$ harmonic mean
Kullback-Leibler Divergence
Definition
Assume that two Probability Distributions and are given. Then the Kullback-Leibler divergence between and is defined as
Kullback-Leibler divergence measures how different two distributions are.
It can also be expressed as a difference between the cross entropy (difference between the distributions $P$ and $Q$) and the entropy (inherent uncertainty of $P$).
Facts
Let $(P_{n})$ be a sequence of distributions. Then, the convergence of the KL-Divergence to zero implies that the JS-Divergence also converges to zero. The convergence of the JS-Divergence to zero is equivalent to the convergence of the Total Variation Distance to zero. The convergence of the Total Variation Distance to zero implies that the Wasserstein Distance also converges to zero. The convergence of the Wasserstein Distance to zero is equivalent to the Convergence in Distribution of the sequence.
Akaike Information Criterion
Definition
where is the MLE of the postulated model’s parameter
Link to original
Bayesian Information Criterion
Definition
Facts
BIC penalizes model complexity more heavily than AIC when $n$ is large
Link to original
Transformation of the Linear Regression Model
The Use of Dummy Variables
Dummy Variable
Definition
A dummy variable is one that takes a binary variable. It is commonly used in regression analysis to represent categorical variables that have more than two categories.
Link to original
One-way ANOVA with a Regression Model
where are dummy variables represent each category, and is the number of categories
In the setting, a corner-point constraint is used.
The null hypothesis i.e. there is no treatment effect, yields same test statistic as the null hypothesis of the one-way ANOVA with equal replication
Link to original
Two-way ANOVA with a Regression Model
where and are dummy variables representing categories of the two factors, is the number of categories for the first factor, and is the number of categories for the second factor
In this setting, a corner-point constraint is used for both factors.
The null hypotheses can be tested by Deviance
Link to original
Polynomial Regression
Polynomial Regression
Definition
Consider explanatory variables and a response variable The -variables 2nd order polynomial regression model is defined as
The single variable -th order polynomial regression model is defined as
Facts
The column vectors of the design matrix in the polynomial regression are highly correlated, so a proper transformation is needed. Among the remedies, orthogonal polynomials are often used.
Response Surface Analysis
Definition
The goal of response surface analysis is finding the optimal condition for the response. It employs Polynomial Regression model to model response surface.
Consider explanatory variables and a response variable . The second-order response surface model is defined as where , , Let the fitted model is . Then, the stationary point is obtained by and the fitted value at the stationary point is . If is Positive-Definite Matrix, then is minimum If is Negative-Definite Matrix, then is maximum If is neither Positive-Definite Matrix nor Negative-Definite Matrix, then is saddle point.
Link to original
Weighted Least Squares Method
Weighted Least Squares
Definition
Weighted least squares (WLS) is a generalization of Least Square to cope with Heteroskedasticity.
We assume that where The weight matrix is positive definite, so there exists a non-singular matrix . Consider a transformed model Now, the new error term is i.i.d. And the weighted least squares estimator is obtained by
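A minimal numpy sketch of the WLS estimator $\hat{\boldsymbol{\beta}}_{W} = (\mathbf{X}^{\intercal}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{W}\mathbf{y}$ on simulated heteroskedastic data (the weights are assumed known here):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
x = rng.uniform(1, 5, size=n)
sigma = 0.2 * x                                   # error sd grows with x
y = 2.0 + 1.5 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / sigma**2)                       # weight matrix
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)                                   # roughly [2.0, 1.5]
```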
Link to original
Box-Cox Transformation Model
Box-Cox Transformation Model
Definition
Box-Cox transformation is useful when dealing with heteroskedasticity by stabilizing variance.
Box-Cox Transformation
Box-Cox Transformation Model
The model using the Box-Cox Transformed variable as a response variable is called a Box-Cox transformation model where
Link to original
Robust Regression
M-estimator
Definition
Consider an error (loss) function , The M-estimator is obtained by where is a robust estimator, and given by
If $\rho(e) = e^{2}$, then it is the LSE, and if $\rho(e) = |e|$, then it is the $L_{1}$-norm regression estimator. The derivative of $\rho$, denoted $\psi$, is called the influence function. Since $\psi$ is non-linear in $\boldsymbol{\beta}$, we cannot get an explicit solution. Instead, we use an iterative method called IRLS (Iteratively Reweighted Least Squares).
Iterative Reweighted Least Squares
Link to original
- Initialize , often obtained from Ordinary Least Squares.
- Calculate the weight If , then
- Update using Weighted Least Squares where
- Iterate step 1 to 3 until the estimator converges.
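A compact sketch of IRLS with Huber weights (the tuning constant $c = 1.345$ and the MAD scale estimate are conventional choices, assumed here rather than taken from the note):

```python
import numpy as np

def huber_weights(r, c=1.345):
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / a)           # psi(r)/r for the Huber loss

def irls(X, y, n_iter=50, tol=1e-8):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # step 0: OLS start
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12 # robust scale estimate
        w = huber_weights(r / s)
        Xw = X * w[:, None]                       # W X
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=50)
y[:3] += 10                                       # a few outliers
print(irls(X, y))                                 # close to [1, 2] despite the outliers
```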
Inverse Regression
Inverse Regression
Definition
The problem predicting for a given response is called inverse regression (calibration, discrimination).
Consider a simple linear regression model The point estimation of for a given is obtained by Also, the confidence interval for is obtained by the second-order inequality where and . Let be solutions for the inequality, and we have a confidence interval
Link to original
Biased Estimation
James-Stein Shrinkage Method
James-Stein Estimator
Definition
The James–Stein estimator is a biased estimator of the mean of correlated Multivariate Normal Distribution.
Let . The James–Stein estimator is defined as
The Least Squares Estimator is MLE and UMVUE. However, in terms of Mean Squared Error, the James-Stein estimator is better than the least squares estimator for $p \geq 3$ cases.
Link to original
Ridge Regression
Ridge Regression
Definition
where is a complexity parameter that controls the amount of shrinkage.
Ridge regression is particularly useful to mitigate the problem of Multicollinearity in linear regression
Facts
$\hat{\mathbf{y}}_{\text{ridge}} = \sum_{j} \mathbf{u}_{j} \cfrac{d_{j}^{2}}{d_{j}^{2} + \lambda} \mathbf{u}_{j}^{\intercal}\mathbf{y}$, where $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal}$ by Singular Value Decomposition
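A minimal sketch of the ridge closed form $(\mathbf{X}^{\intercal}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{\intercal}\mathbf{y}$ on simulated data, showing the shrinkage of the coefficient norm relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 60, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0                                          # complexity (shrinkage) parameter
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))  # ridge norm is smaller
```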
Principal Component Regression
Principal Component Analysis
Definition
PCA is a linear dimensionality reduction technique. The correlated variables are linearly transformed onto a new coordinate system such that the directions capturing the largest variance in the data.
Population Version
Given a random vector , we find a such that is maximized: Equivalently, by the Method of Lagrange Multipliers with , By differentiation, the is given by the eigen value problem Thus the maximizing the variance of is the eigenvector corresponding to the largest Eigenvalue.
Sample Version
Given a data matrix , by Singular Value Decomposition, A matrix can be factorized as . By algebra, , where we call the -th principal component.
Facts
Link to originalSince and
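A small numpy sketch of sample PCA via the SVD of the centered data matrix (simulated data):

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)                            # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_var = s**2 / (len(X) - 1)                # eigenvalues of the sample covariance
scores = Xc @ Vt.T                                 # principal component scores
print(explained_var)
print(np.allclose(np.cov(Xc, rowvar=False), Vt.T @ np.diag(explained_var) @ Vt))  # True
```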
Principal Component Regression
Definition
where is a matrix whose columns are . In PCR, instead of regressing the dependent variables on explanatory variables directly, principal components(PC) of the explanatory variables are used as regressors. One typically uses only first a few PCs for regression, making PCR a kind of regularized procedure.
Facts
Since the principal components are linear combinations of the original variables, we can express the solution in terms of coefficients of the original variables.
PLS: Partial Least Squares
Partial Least Squares Regression
Definition
where is the sample covariance matrix.
Unlike PCR maximizing only, PLS finds directions maximizing both and
Link to original
LASSO
Lasso Regression
Definition
Lasso model assume that the coefficients of the model are sparse.
Link to original
Bayes Estimator and Biased Estimators
Bayes Theorem
Definition
Discrete Case
where by Law of Total Probability
is called a prior probability, and is called a posterior probability
Continuous Case
where is not a constant, but an unknown parameter follows a certain distribution with a parameter .
is called a prior probability, is called a likelihood, is called an evidence or marginal likelihood, and is called a posterior probability
Parameter-Centric Notation
Examples
Consider a random variable follows Binomial Distribution and a prior distribution follows Beta Distribution where .
The PDFs are defined as and . Then, by Bayes theorem, Under Squared Error Loss, the Bayes Estimator is a mean of the posterior distribution.
Link to original
Loss Function
Definition
Let be a parameter, be a Statistic for the parameter , and be a Decision Function
A loss function is a non-negative function defined as
It indicates the difference or discrepancy between and
Examples
Link to original
- Absolute Error Loss
- Squared Error Loss
- Sum of Squared Errors Loss
- Cross-Entropy Loss
- Goal Post Error Loss
- Huber Loss
- Binary Loss
- Triplet Loss
- Pairwise Loss
Risk Function
Definition
Risk function is an expectation of Loss Function
Link to original
Bayes Risk
Definition
$$\begin{aligned} r(\theta, \delta) &= \int_{\Theta} R(\theta, \delta) \pi(\theta) d\theta = E_\theta[R(\theta, \delta)] = E_\theta[E_{x}[L(\theta, \delta(x))]]\\ &= \int_{\Theta} \int_{X} L(\theta, \delta(x))p(x|\theta) \pi(\theta) dx d\theta = \int_{X} \int_{\Theta} L(\theta, \delta(x))p(\theta|x)p(x) d\theta dx\\ &= \int_{X}E_\theta[L(\theta, \delta(x))|X=x]p(x)dx = \int_{X}\rho(x, \pi)p(x)dx \end{aligned}$$
where $L(\theta, \delta)$ is the [[Loss Function]], $R(\theta, \delta)$ is the [[Risk Function]], and $\rho(x, \pi)$ is a [[Posterior Risk]].
Bayes Estimator
Definition
Estimator that minimizes the Bayes Risk or Posterior Risk
Facts
Under Squared Error Loss, the Bayes estimator is a posterior mean, and a posterior mode under Absolute Error Loss.
Link to originalConsider a regression model and the prior distribution for , . Then, the Bayes estimator under the Squared Error Loss is obtained as If for some and , then the Bayes estimator is the same as ridge estimator If and , then the Bayes estimator is the James-Stein regression estimator
Empirical Bayes Estimator
Definition
Empirical Bayes estimator is a Bayes Estimator whose prior distribution is estimated from the data.
The parameter of the prior distribution is estimated by Maximum Likelihood Estimation. And a posterior distribution is calculated with the . Estimator is obtained using the posterior distribution.
Examples
Consider data following a Poisson Distribution and a prior distribution follows a Gamma Distribution where with known, unknown.
Then the marginal likelihood is defined as
$$\begin{aligned} p(x_{i}|\beta) &= \int \operatorname{Pois}(x|\lambda_{i})\Gamma(\lambda_{i};\alpha, \beta) d\lambda_{i} = \int \left[ \frac{e^{-\lambda_{i}}\lambda^{x_{i}}}{x_{i}!} \right]\left[ \frac{\beta^{\alpha}\lambda_{i}^{\alpha-1}e^{-\beta \lambda_{i}}}{\Gamma(\alpha)} \right]d\lambda_{i}\\ &= \binom{x_{i}+\alpha-1}{\alpha-1}\left( \frac{\beta}{\beta+1} \right)^{\alpha} \left( \frac{1}{\beta+1} \right)^{x_{i}} \sim NB\left( \alpha, \frac{\beta}{\beta+1} \right) \end{aligned}$$
And the [[Maximum Likelihood Estimation|MLE]] of $\beta$, $\hat{\beta}_\text{MLE}$, is
$$\hat{\beta}_\text{MLE} = \underset{\beta}{\operatorname{argmax}} \prod_{i=1}^{n}p(x_{i}|\beta) = \frac{\alpha}{\bar{X}}$$
The posterior distribution with $\hat{\beta}_\text{MLE}$ is defined as
$$p(\lambda_{i}|x; \hat{\beta}_{\text{MLE}}) \propto p(x|\lambda_{i})\pi(\lambda_{i};\alpha, \hat{\beta}) \sim \Gamma(x_{i}+\alpha, 1+\hat{\beta})$$
Under [[Squared Error Loss]], the [[Bayes Estimator]] is a mean of the posterior distribution $\Gamma(x_{i}+\alpha, 1+\hat{\beta})$.
$$\hat{\delta}_{\text{Bayes}} = \frac{\bar{X}(X_{i}+\alpha)}{\bar{X} + \alpha}$$
Facts
Assume that $\mathbf{z} \sim N_{p}(\boldsymbol{\mu}, \mathbf{I})$ and the prior distribution for $\boldsymbol{\mu}$ is $\boldsymbol{\mu} \sim N_{p}(\mathbf{0}, \sigma^{2}\mathbf{I})$. Then, the empirical Bayes estimator under the [[Squared Error Loss]] is the [[James-Stein Estimator]] $\hat{\boldsymbol{\mu}}_{JS} = \left( 1 - \cfrac{p-2}{\mathbf{z}^{\intercal}\mathbf{z}} \right)\mathbf{z}$
Generalized Linear Model
Exponential Family
Exponential Family
Definition
General Representation
Where:
- is the vector of natural parameters.
- is the vector of sufficient statistics.
- is the non-negative volume of
- is the normalizer, which refers to the measure of .
Canonical Form with Dispersion Parameter
where , , and are known functions, is called a variance function, and is called a dispersion parameter. This form is mainly used for GLMs.
Examples
| Distributions | Normal | Poisson | Binomial | Gamma | Inverse Gaussian |
| --- | --- | --- | --- | --- | --- |
| Notation | | | | | |
| Natural link | Identity | log | logit | inverse | |

Facts
Link to original
For an exponential family, and holds by Bartlett Identities
Bartlett Identities
Definition
First Bartlett Identity
where is a Likelihood Function and is a Score Function
Second Bartlett Identity
where is a Likelihood Function
Link to original
Score Function
Definition
The gradient of the log-likelihood function with respect to the parameter vector. The score indicates the steepness of the log-likelihood function
Facts
Link to original
The score will vanish at a local Extremum
Fisher Information
Definition
Fisher Information
$$\begin{aligned} I(\theta) &:= E\left[ \left( \frac{\partial \ln f(X|\theta)}{\partial \theta} \right)^{2} \right] = \int_{\mathbb{R}} \left( \frac{\partial \ln f(x|\theta)}{\partial \theta} \right)^{2} p(x, \theta)\,dx\\ &= -E\left[ \frac{\partial^{2} \ln f(X|\theta)}{\partial \theta^{2}} \right] = -\int_{\mathbb{R}} \frac{\partial^{2} \ln f(x|\theta)}{\partial \theta^{2}} p(x, \theta)\,dx \end{aligned}$$
By the [[Bartlett Identities#second-bartlett-identity|second Bartlett identity]], $$I(\theta) = \operatorname{Var}\left( \frac{\partial \ln f(X|\theta)}{\partial \theta} \right) = \operatorname{Var}(s(\theta|x))$$ where $s(\theta|x)$ is a [[Score Function]].
Fisher Information Matrix
![[Pasted image 20231224171415.png|800]]
Let $\mathbf{X}$ be a [[Random Vector]] with [[Density Function|PDF]] $f(x|\boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \Omega \subset R^{p}$. Then, the **Fisher information matrix** for $\boldsymbol{\theta}$ is a $p \times p$ matrix defined as $$I(\boldsymbol{\theta}) := \operatorname{Cov}\left( \cfrac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta}) \right) = E\left[ \left( \cfrac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta}) \right) \left( \cfrac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta}) \right)^\intercal \right] = -E\left[ \cfrac{\partial^{2}}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\intercal} \ln f(x|\boldsymbol{\theta}) \right]$$ and the $jk$-th element of $I(\boldsymbol{\theta})$ is $I_{jk} = - E\left[ \cfrac{\partial^{2}}{\partial \theta_{j} \partial\theta_{k}} \ln f(x|\boldsymbol{\theta}) \right]$
Properties
Chain Rule
The information in a length-$n$ [[Random Sample]] $X_{1}, X_{2}, \dots, X_{n}$ is $n$ times the information in a single sample $X_{i}$: $I_\mathbf{X}(\theta) = n I_{X_{1}}(\theta)$
Facts
In a location model, the information does not depend on the location parameter. $$I(\theta) = \int_{-\infty}^{\infty}\left( \frac{f'(z)}{f(z)} \right)^{2} f(z)dz$$
Link to original
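As a quick sanity check on the two equivalent expressions for Fisher information, here is a Monte Carlo sketch for the Bernoulli($p$) model, where the closed form is $I(p) = 1/\{p(1-p)\}$; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 0.3
n = 200_000
x = rng.binomial(1, p, size=n)

# Score for one Bernoulli observation: d/dp log f(x|p) = x/p - (1-x)/(1-p)
score = x / p - (1 - x) / (1 - p)

# Negative second derivative of the per-observation log-likelihood
neg_hess = x / p**2 + (1 - x) / (1 - p)**2

fisher_var = score.var()         # Var of the score = E[(d log f / dp)^2]
fisher_hess = neg_hess.mean()    # -E[d^2 log f / dp^2]
analytic = 1.0 / (p * (1 - p))   # closed form for the Bernoulli model

print(fisher_var, fisher_hess, analytic)  # all approximately equal
```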
Construction of GLMs
Generalized Linear Model
Definition
A generalized linear model (GLM) is a generalization of linear regression model.
The GLM consists of three elements.
- Response variable: The i.i.d. response variables follow Exponential Family.
- Linear predictor: is called a linear predictor
- Link function: There exists a link function satisfying , where is assumed to be monotone and differentiable.
Link Functions
Link functions for binomial response.
- Logit:
- Probit: where is a CDF of Normal Distribution
- Complementary log-log:
For an Exponential Family, the link function satisfying is called natural link.
Facts
Link to original
The parameter is estimated by MLE. Since the score function is non-linear, there is no explicit solution, so we use the Newton–Raphson method or Fisher's Scoring Method.
Estimation of Regression Coefficients
Newton's Method
Definition
An iterative algorithm for finding the roots of a differentiable function, which are solutions to the equation
Algorithm
Find the next point such that the first-order Taylor approximation at the given point is 0.
First-order Taylor approximation:
The point where the Taylor approximation is 0:
multivariate version:
In convex optimization,
Find the minimum point^[its derivative is 0] of the Taylor quadratic approximation.
Taylor quadratic approximation:
The derivative of the quadratic approximation:
The minimum point of the quadratic approximation^[the point such that the derivative of the quadratic approximation is 0]:
multivariate version:
Examples
Solution of a linear system
Solve with an MSE loss
The cost function is and its gradient and Hessian are ,
Then, the solution is
If is invertible, is the Least Squares solution.
Link to original
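Following the least squares example above, here is a minimal sketch of Newton's method on the quadratic cost $f(x) = \tfrac{1}{2}\lVert Ax - b \rVert^{2}$; because the cost is quadratic, a single Newton step already reaches the least squares solution. The data are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Solve A x ~= b in the least squares sense with Newton's method on
# f(x) = 0.5 * ||A x - b||^2,  grad = A'(A x - b),  Hessian = A'A.
A = rng.normal(size=(40, 4))
b = rng.normal(size=40)

x = np.zeros(4)                       # arbitrary starting point
for _ in range(3):                    # quadratic cost: converges in one step
    grad = A.T @ (A @ x - b)
    hess = A.T @ A
    x = x - np.linalg.solve(hess, grad)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ls))           # True
```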
Fisher's Scoring Method
Definition
Fisher’s scoring method is a variation of Newton–Raphson method that uses Fisher Information instead of the Hessian Matrix
where is the Fisher Information.
Link to original
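A minimal sketch of Fisher scoring for a logistic regression with the canonical (logit) link, where the expected information equals the observed information, so the update is the familiar iteratively reweighted least squares step; the simulated data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy binary data with an intercept and two covariates
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -2.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

# Fisher scoring: beta <- beta + I(beta)^{-1} s(beta)
# For the logit link: s = X'(y - mu), I = X' W X with W = diag(mu(1-mu))
beta = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - mu)
    info = X.T @ (X * (mu * (1 - mu))[:, None])
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)  # close to beta_true
```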
Goodness-of-Fit Measures for GLMs
Deviance
Definition
Deviance is a goodness-of-fit statistic for Generalized Linear Model.
Let be an estimator of under the maximal model . The deviance is defined as where is the log-likelihood, is the number of parameters of the current model, and is the sample size.
Consider the null hypothesis : the current model fits the data adequately. It can be tested with the deviance; we reject if
Examples
Link to original
| Distribution | Deviance |
| --- | --- |
| Normal | |
| Poisson | |
| Binomial | |
| Gamma | |
| Inverse Gaussian | |
Pearson's Chi-squared Statistic
Definition
The Pearson's statistic is a goodness-of-fit measure for GLMs defined as where , is the variance function, is the number of parameters of the current model, and is the sample size.
Facts
Link to original
Under the Gaussian distribution, the Deviance and Pearson's statistic are the same and follow a Chi-squared Distribution.
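As an illustration, assuming a Poisson response with fitted means already available, the deviance and Pearson $\chi^{2}$ statistics can be computed directly from their definitions; the observed counts and fitted means below are made up.

```python
import numpy as np

def poisson_deviance(y, mu):
    # 2 * sum[ y*log(y/mu) - (y - mu) ], with the convention y*log(y/mu) = 0 when y = 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def pearson_chi2(y, mu):
    # Sum of squared Pearson residuals; V(mu) = mu for the Poisson family
    return np.sum((y - mu) ** 2 / mu)

# Toy example: observed counts and fitted means from some Poisson GLM
y = np.array([2, 0, 5, 3, 1, 4, 7, 2])
mu = np.array([2.5, 0.8, 4.2, 3.1, 1.5, 3.9, 6.0, 2.2])

print(poisson_deviance(y, mu), pearson_chi2(y, mu))
```

Both statistics would be compared with a chi-squared distribution with n − p degrees of freedom.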
Testing and Residuals
Goodness-of-Fit Test with Deviance
Definition
Assume a current model with -parameters and consider a null hypothesis . Then, the test statistic is defined as where are the Deviances of the generalized linear models.
We reject if
Link to original
Pearson Residual
Definition
The Pearson residual is defined as where , is the variance function.
Facts
Link to original
where is the Pearson's Chi-squared Statistic
Anscombe Residual
Definition
Anscombe Transformation
where is the variance function.
Anscombe Residual
The Anscombe residual is the Anscombe-transformed Pearson Residual. The transformation makes the residual approximately follow a normal distribution. where , is the variance function.
Link to original
Deviance Residual
Definition
The deviance residual is defined as where , is the variance function, and the Deviance
Link to original
ANOVA Models
Constraints for Dummy Variable
Definition
When using Dummy Variables, the design matrix , where is the number of groups, is not of full rank. The problem can be solved by imposing some constraints.
Sum-to-Zero Constraint
Sum-to-zero constraint uses row-wise demeaned dummy variables.
With sum-to-zero constraint, each coefficient indicates discrepancy from overall mean, and the intercept term indicates overall mean.
Corner-Point Constraint
The corner-point constraint omits a base group from the model.
With corner-point constraint, each coefficient indicates discrepancy from base group, and the intercept term indicates mean of base group.
Link to original
Deviance Test for One-way ANOVA
The null hypothesis , i.e. there is no treatment effect, can be tested with the Deviance. If is known, the test statistic is defined as and we reject the null hypothesis if
If is unknown, the test statistic is defined as and we reject the null hypothesis if
Link to original
Deviance Test for Two-way ANOVA
For a two-way ANOVA with factors and , we have three null hypotheses to test:
- i.e. there is no treatment effect of factor
- i.e. there is no treatment effect of factor
- i.e. there is no interaction effect between factor and They can be tested with the Deviance.
If is known, the test statistic for is defined as and we reject the null hypothesis if
If is unknown, the test statistic for is defined as and we reject the null hypothesis if
If is known, the test statistic for is defined as and we reject the null hypothesis if
If is unknown, the test statistic for is defined as and we reject the null hypothesis if
If is known, the test statistic for is defined as and we reject the null hypothesis if
If is unknown, the test statistic for is defined as and we reject the null hypothesis if
Link to original
Logistic Regression Model
Logistic Regression
Definition
Logistic regression is a Generalized Linear Model with Logit link function. Logistic regression models the log-odds of an event as a linear combination of independent variables. The probability is calculated as
Link to original
Weighted Least Squares for Binomial Distribution
Definition
Let and be random variables and consider a variance stabilizing transformation . Then, by Taylor expansion and the Delta Method, and hold. Now, the Pearson is defined as
By minimizing , we can find the WLS estimator . It can be seen as a Weighted Least Squares with , , and
Widely Used Variance Stabilizing Transformations
Link to original
- Logit function:
- The arcsin squared-root function:
- Empirical logistic function:
Multinomial Logistic Regression
Definition
Multinomial logistic regression is used for polychotomous (multi-class) data.
Suppose the number of classes is . The probability is calculated as
Link to original
Proportional Odds Model
Definition
The proportional odds model is a Generalized Linear Model used for modeling the ordinal response.
Consider a response with possible categories, and let the cumulative probability of a response be .
Link to original
Log-Linear Model
Contingency Table
Definition
contingency table
A contingency table displays the multivariate frequency distribution of the variables.
Examples
Poisson and Multinomial Distribution Cases
2-Dimensional Case
Consider a contingency table.
No Margin Fixed
If ‘s are independent and , then the joint PDF becomes
Fixed Total
Given the total , the conditional distribution is Multinomial Distribution. where and
One Margin Fixed
Given one fixed margin (in this case, row) , the conditional distribution of frequency of -th row is Multinomial Distribution. And if each row is independent, the conditional distribution is product-Multinomial Distribution (joint multinomial distribution).
3-Dimensional Case
Consider a contingency table.
No Margin Fixed
If ‘s are independent and , then the joint PDF becomes
Fixed Total
Given the total , the conditional distribution is Multinomial Distribution. where and
One Margin Fixed
Given one fixed margin (in this case, ) , and each -th rows are independent, the conditional distribution is product-Multinomial Distribution (joint multinomial distribution).
Two Margin Fixed
If two margins are fixed (in this case, and ) , and each -th and -th rows are independent, the conditional distribution is product-Multinomial Distribution (joint multinomial distribution).
Link to original
Poisson Regression
Definition
Poisson Regression
The log-linear model is a Generalized Linear Model with a log link function; it models the log expected counts of an event as a linear combination of independent variables.
Log-Linear Model for Contingency Table
Consider a Contingency Table, then the model is defined as where is the overall effect, and are main effect, and are interaction effect.
The Deviance of the model is defined as and the Pearson’s Chi-squared Statistic is defined as
Link to original
Overdispersion
Overdispersion
Definition
Overdispersion is the presence of greater observed variance than what would be expected under a given statistical model. It can happen due to heterogeneity or lack of independence between trials, and clustering or grouping in the data.
Overdispersion on Binomial Distribution
Assume that there are clusters, and the number of observations in each cluster is . Let the Random Variable following a Binomial Distribution be the number of successes out of observations, where is also a Random Variable with mean and variance . Also, let be the total number of successes in all clusters, where . Then, the mean and variance of are defined as Here, is called a dispersion parameter. If , then overdispersion has occurred.
The Lexis Ratio is also used for detecting overdispersion where and
Overdispersion on Poisson Distribution
If in a Poisson Distribution setting, then overdispersion might occur. Let i.i.d. be the number of occurrences at the -th cluster, and the number of clusters is also a Random Variable. Also, let be the total number of occurrences in clusters, where follows a Poisson Distribution and is independent of . Then, the mean and variance of are defined as If , then overdispersion occurs.
Link to original
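A small simulation sketch of overdispersion in count data: a Gamma–Poisson mixture has marginal variance larger than its mean, unlike a plain Poisson sample. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Plain Poisson: variance approximately equals the mean
y_pois = rng.poisson(3.0, size=n)

# Gamma-Poisson mixture (negative binomial): rates vary across "clusters",
# which inflates the marginal variance above the marginal mean.
rates = rng.gamma(shape=2.0, scale=1.5, size=n)   # E[rate] = 3.0
y_mix = rng.poisson(rates)

print(y_pois.mean(), y_pois.var())   # roughly 3.0, 3.0
print(y_mix.mean(), y_mix.var())     # roughly 3.0, > 3.0 (overdispersed)
```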
Nonlinear Regression Model
Nonlinear Regression Model
Nonlinear Regression Model
Definition
Let be a response and be -dimensional explanatory variable. The non-linear regression model can be written as where is a known regression function and is a -dimensional parameter vector.
The parameter vector cannot be obtained analytically, so it is estimated numerically with methods such as the Newton–Raphson method or the Gauss-Newton Method.
Facts
Link to original
If the function is linear, then the model is a Multiple Linear Regression.
Estimation of Parameter
Gauss-Newton Method
Definition
The Gauss-Newton method is an iterative algorithm used to solve non-linear least squares problems. This method approximates the Hessian Matrix using the Jacobian Matrix.
Algorithm
Consider a non-linear regression model with -dimensional explanatory variable and -dimensional parameter vector The first order Taylor expansion gives the linear approximation of the model. where is the initial vector for given by domain knowledge, is the estimation vector with , and is the Jacobian Matrix of the function at .
Now, the approximated model is a Multiple Linear Regression. Since and are constants given , we can find the LSE of , and the updating formula is defined as
The update is repeated iteratively until it converges; converges to the minimizer of the Sum of Squared Errors Loss given a proper initial value .
Link to original
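A minimal Gauss–Newton sketch, assuming an exponential-decay mean function $f(x, \boldsymbol{\theta}) = \theta_{1}e^{-\theta_{2}x}$; the Jacobian of the mean function stands in for the Hessian in each update. The data and starting values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Nonlinear model: y = theta1 * exp(-theta2 * x) + noise
x = np.linspace(0.0, 4.0, 60)
theta_true = np.array([2.0, 1.3])
y = theta_true[0] * np.exp(-theta_true[1] * x) + rng.normal(scale=0.05, size=x.size)

def f(theta):
    return theta[0] * np.exp(-theta[1] * x)

def jacobian(theta):
    # Columns: df/dtheta1 and df/dtheta2 evaluated at every x
    e = np.exp(-theta[1] * x)
    return np.column_stack([e, -theta[0] * x * e])

theta = np.array([1.0, 0.5])          # initial guess (domain knowledge)
for _ in range(50):
    r = y - f(theta)                   # current residuals
    J = jacobian(theta)
    step, *_ = np.linalg.lstsq(J, r, rcond=None)   # solves (J'J) step = J'r
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(theta)  # close to theta_true
```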
Inference in the Nonlinear Model
Distribution of Parameter of Nonlinear Regression
Definition
Suppose a Nonlinear Regression Model . If the error term follows a Normal Distribution, the LSE is the same as the MLE.
For a large sample size, the distribution of the estimated parameter follows a Normal Distribution where is the Jacobian Matrix of and is the variance of the error term.
The variance of the error term can be estimated by
Link to original
Standard Error for Parameter of Nonlinear Regression
Definition
Suppose a Nonlinear Regression Model . The standard error for a parameter of the nonlinear regression is calculated as where , , and is the Jacobian Matrix of .
Link to original
Joint Confidence Region for Nonlinear Regression
Definition
Suppose a Nonlinear Regression Model . The joint confidence region for is obtained as where , is the number of explanatory variables, and is the Jacobian Matrix of .
Link to original
Confidence Interval for Parameter of Nonlinear Regression
Definition
Suppose a Nonlinear Regression Model . The confidence interval for is obtained as where is the Standard Error for Parameter of Nonlinear Regression , and is the number of explanatory variables.
Link to original
Hypothesis testing of Parameter of Nonlinear Regression
Definition
Suppose a Nonlinear Regression Model . The null hypothesis can be tested with the test statistic where is the Standard Error for Parameter of Nonlinear Regression , and is the number of explanatory variables.
Link to original
Non-Parametric Regression
Kernel Estimator
K-Nearest Neighbors Algorithm
Definition
Suppose is the distance between the training sample and the given point . Let
The k-NN algorithm determines a sample's class using the -nearest training data.
Facts
Link to original
and are hyperparameters.
Kernel Density Estimation
Definition
where is the kernel, is the scaled kernel, and is a smoothing parameter (bandwidth, or window width)
KDE is a non-parametric method to estimate the PDF of a random variable based on kernels and weights.
Multidimensional KDE
where is the kernel, is the scaled kernel, and is a positive-definite bandwidth matrix.
Facts
Link to original
Widely used kernels
Kernel Regression
Definition
Kernel regression is an extension of Kernel Density Estimation to estimate the conditional expectation of a random variable given . The kernel regression estimator is defined as where is the kernel, is the scaled kernel, and is a smoothing parameter
Link to original
Nadaraya–Watson Kernel Regression
Definition
Nadaraya–Watson kernel regression is a Kernel Regression in which the weights sum to . The Nadaraya–Watson kernel regression estimator is defined as where is the kernel, and is a smoothing parameter
Facts
Link to original
The Nadaraya–Watson kernel regression estimator is the same as Local Polynomial Regression with
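A minimal sketch of the Nadaraya–Watson estimator with a Gaussian kernel, to make the weighting explicit; the bandwidth here is picked by hand rather than by any principled rule, and the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: y = sin(x) + noise
x = np.sort(rng.uniform(0, 2 * np.pi, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def nadaraya_watson(x0, x, y, h):
    # Gaussian kernel weights; the estimate is a weighted average of the y's,
    # with the weights at each evaluation point normalized to sum to 1.
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ y

grid = np.linspace(0, 2 * np.pi, 50)
fit = nadaraya_watson(grid, x, y, h=0.3)
print(np.round(fit[:5], 3))              # smoothed values near sin(grid)
```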
Local Polynomial Regression
Definition
Local polynomial regression fits the polynomial function with the data near the given point. The polynomial function has a form similar to Taylor Series to exploit the relation between the coefficients and the differentiation at the given point.
The -th degree polynomial function at a point is defined as where
Local polynomial regression estimator is obtained by the relation . The coefficients are estimated by an optimization problem where , is a kernel function, and is a smoothing parameter.
In a matrix notation, the optimization problem can be written as where , is an matrix, and
A solution of the problem is obtained as where is a -vector whose -th element is and the others are .
Link to original
Series Estimator
Series Estimator
Definition
Series Estimator
Suppose a non-parametric regression function where and let a sequence of functions be a complete orthonormal basis. Then, the function can be written as a series. where is a parameter subject to estimation.
The regression function can be approximated with the estimated parameter and finite sum. It is called a series estimator where is a smoothing parameter.
The series estimator can be written as a form of a kernel estimator with the estimation of the parameter . where is a kernel.
Thresholding
In general, as gets larger, gets smaller. So the terms of the series can be selected with a thresholding algorithm.
The hard-thresholding estimator is defined as where is a thresholding parameter
And the soft-thresholding estimator is defined as
Link to original
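The two thresholding rules can be written in a few lines; the sketch below applies them to an arbitrary coefficient vector, with an illustrative threshold value.

```python
import numpy as np

def hard_threshold(coef, thr):
    # Keep a coefficient only if its magnitude exceeds the threshold
    return np.where(np.abs(coef) > thr, coef, 0.0)

def soft_threshold(coef, thr):
    # Shrink every coefficient toward zero by the threshold amount
    return np.sign(coef) * np.maximum(np.abs(coef) - thr, 0.0)

coef = np.array([2.3, -0.4, 0.05, -1.7, 0.6])
print(hard_threshold(coef, 0.5))   # [ 2.3  0.   0.  -1.7  0.6]
print(soft_threshold(coef, 0.5))   # [ 1.8 -0.   0.  -1.2  0.1]
```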
Multiresolution Analysis
Definition
Multiresolution analysis decomposes a signal into a coarse approximation and a series of details at different scales. where the first term represents the coarse approximation and the second term represents the sum of details. Here, is a scaling function (father wavelet), is a wavelet function (mother wavelet), is approximation coefficient, is a detail coefficient.
The set is an orthonormal basis of the function space.
Link to original
Wavelet Estimator
Definition
The wavelet estimator is a kind of Series Estimator using a wavelet basis. where and
To find an optimal resolution, Thresholding is used.
Link to original
Spline Estimator
Piecewise Polynomials
Definition
Suppose that knots are . Then, the basis functions for order-M spline are defined as: where
Examples
For cubic Polynomials with two knots , the basis functions are:
Facts
Order of piecewise polynomials
- constant: 1
- linear: 2
- quadratic: 3
- cubic: 4
Link to original
Natural Cubic Splines
Definition
Suppose that knots are . Then, the basis functions for natural cubic Spline are defined as: where
Link to original
B-Spline
Definition
B-splines are basis functions for spline functions of the same order, meaning that every spline function of that order can be expressed as a linear combination of B-splines.
Consider the knots sorted into non-decreasing order. For the given sequence of knots, the B-splines of degree are defined as
The higher-degree B-splines are defined by the recursion
The order spline function on a given set of knots can be expressed as
Link to original
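A direct, unoptimized sketch of the Cox–de Boor recursion described above: the degree-0 B-splines are knot-span indicators, and higher degrees are built recursively. The uniform knot vector is illustrative.

```python
import numpy as np

def bspline_basis(i, k, t, x):
    """B-spline basis B_{i,k}(x) of degree k on knot vector t (Cox-de Boor recursion)."""
    if k == 0:
        # Indicator of the half-open knot span [t_i, t_{i+1})
        return np.where((t[i] <= x) & (x < t[i + 1]), 1.0, 0.0)
    # Left and right terms; a zero-length knot span contributes nothing
    left = 0.0
    if t[i + k] > t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(i, k - 1, t, x)
    right = 0.0
    if t[i + k + 1] > t[i + 1]:
        right = (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * bspline_basis(i + 1, k - 1, t, x)
    return left + right

t = np.arange(8, dtype=float)                  # uniform knot vector
x = np.linspace(0.0, 7.0, 8, endpoint=False)   # evaluation points
for i in range(len(t) - 3 - 1):                # the cubic (degree-3) basis functions
    print(i, np.round(bspline_basis(i, 3, t, x), 3))
```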
Smoothing Splines
Definition
where is a smoothing parameter. If , the smoothing spline is any interpolating spline. If , the smoothing spline is the Least Squares line.
Smoothing splines are a Spline basis method that avoids the knot selection problem. Among all functions , find the one that minimizes the penalized RSS.
Facts
Link to original
The unique analytic solution of the smoothing spline is a Natural Cubic Spline with knots at the unique values of the 's.
Regression Model for Censored Data
Survival Function and Hazard Function
Survival Function
Definition
Let be a non-negative Random Variable representing survival time with PDF and CDF . The survival function of the random variable is defined as
The survival function is a function that gives the probability that a patient, device, or other object of interest will survive past a certain time.
Link to original
Hazard Function
Definition
Hazard Function
Let be a non-negative Random Variable representing survival time with PDF and CDF . The hazard function of the random variable is defined as where is the Survival Function.
The hazard function refers to the rate of occurring event at a given time .
Cumulative Hazard Function
The hazard function can alternatively be represented in terms of the cumulative hazard function, defined as
Facts
The Survival Function , the Cumulative Hazard Function, the density (PDF) , the Hazard Function, and the distribution function (CDF) of survival time are related through
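Writing $f$, $F$, $S$, $h$, and $\Lambda$ for the PDF, CDF, survival function, hazard function, and cumulative hazard function respectively (symbols assumed here), the standard relations for an absolutely continuous survival time are $$S(t) = 1 - F(t), \qquad h(t) = \frac{f(t)}{S(t)}, \qquad \Lambda(t) = \int_{0}^{t} h(u)\,du = -\ln S(t), \qquad S(t) = e^{-\Lambda(t)}, \qquad f(t) = h(t)\,S(t) = h(t)\,e^{-\Lambda(t)}$$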
Link to original
Censored Data
Right Censoring
Kinds
Type 1 Censoring
Definition
Type 1 Censoring
when (’s are constant)
Suppose that are i.i.d. random variables representing survival times with CDF , and are the censoring times. Let be the censoring indicator. We observe where .
In a type 1 censoring setting, the censored times are fixed constants, not random variables.
The PDF of the observation is derived as
Likelihood of Type 1 Censoring Data
The Likelihood Function of type 1 censoring data is defined as
Link to original
Type 2 Censoring
Definition
Type 2 Censoring
when
Suppose that are i.i.d. random variables representing survival times with CDF , and are the censoring times. Let be the censoring indicator. We observe where .
In a type 2 censoring setting, we observe the first out of experiments. In other words, for the order statistics of , we only observe , where are not constants but random variables.
Likelihood of Type 2 Censoring Data
The likelihood function of type 2 censored data can be computed using the same equation as for type 1 censored data, but computing the joint PDF of the Order Statistics is easier.
The joint PDF of is derived as
Link to original
Random Censoring
Definition
Random Censoring
Suppose that are i.i.d. random variables representing survival times with CDF , and are the censoring times, which can be either random variables or constants. Let be the censoring indicator.
In a random censoring setting, we only observe where , and the censoring times are i.i.d. Random Variables following PDF and CDF .
The PDF of the observation is derived as where is the parameter of interest, and is the nuisance parameter.
Likelihood of Random Censoring Data
The Likelihood Function of random censored data is defined as where is a constant.
Link to original
Left Censoring
Definition
Suppose that are i.i.d. random variables representing survival times, and are the censoring times, which can be either random variables or constants. Let be the censoring indicator.
In a left censoring setting, we observe where . In other words, the event has already occurred before the subject comes under observation.
Link to original
Interval-Censored Data
Definition
Suppose that are i.i.d. random variables representing survival times. Interval-censored data is given as an interval, not an exact point in time; we only observe the interval that includes . Interval-censored data is divided into four cases.
Case 1 Interval-Censored Data (Current Status Data)
Data is given the form of or , where is a fixed time point.
Case 2 Interval-Censored Data
Data is given the form of where .
Double Censored Data
Data is given the form of where .
If , then it is right-censored data, if , then it is left-censored data, and if , then it is uncensored data.
Panel Data
Observations are made at discrete time points. The period between these observations can be viewed as an interval.
Link to original
Mean Imputation Method
Definition
The mean imputation method substitutes the given interval of case 2 interval-censored data with the mean of the interval if the interval is finite. If , the data is substituted with the left endpoint and treated as right-censored data.
Link to original
Estimation of Survival Function
Kaplan-Meier Estimator
Definition
Kaplan-Meier Estimator
Consider a Random Censoring case . Assume that , and the distinct failure times are where . Let be the number of deaths at , and be the number alive at , where the set is called the risk set at . We only observe
The Kaplan-Meier estimator is derived from the expression where
General Case
As an estimator of , consider
The Kaplan-Meier estimator is defined with the estimated 's.
The cumulative hazard function is estimated by the Nelson-Aalen Estimator using the same logic.
No Ties Case
When there are no ties in the observations, , the failure times are equal to the observations , the death count is equal to the censoring indicator , and . Thus, the Kaplan-Meier estimator is defined as
Properties
Self-Consistency
An estimator is self-consistent if where
The Kaplan-Meier estimator is the unique self-consistent estimator for where is the largest observation.
Generalized MLE
The Kaplan-Meier estimator gives the Generalized Maximum Likelihood Estimation of the Survival Function .
Strong Consistency
The Kaplan-Meier estimator converges uniformly Almost Surely to
Proof
Consider a function and decompose it into the sum of the subsurvival functions and , where is the uncensored case and is the censored case.
Then, the survival function can be expressed as a function of the subsurvival functions.
Define the empirical subsurvival functions and . The Kaplan-Meier estimator also can be expressed as a function of the empirical subsurvival functions.
By the Glivenko-Cantelli theorem, and for all . Since is a continuous function of and ,
Asymptotic Normality
Kaplan-Meier estimator has asymptotic normality. where , , and .
The variance of the estimator is estimated by Greenwood's formula
For the no-ties case, the formula is
Examples
case where
Facts
The Kaplan-Meier estimator has Self-Consistency and Asymptotic Normality, and it is the generalized MLE.
Link to original
If there is no censoring, the Kaplan-Meier estimator is just the Empirical Survival Function.
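A compact sketch of the Kaplan–Meier estimator for right-censored data, including Greenwood's variance formula; the event times and censoring indicators below are made up for the illustration.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return distinct failure times, S_hat(t), and the Greenwood variance of S_hat."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    fail_times = np.unique(time[event == 1])

    surv, var_terms = [], []
    s, cum = 1.0, 0.0
    for t in fail_times:
        d = np.sum((time == t) & (event == 1))   # deaths at t
        n = np.sum(time >= t)                    # number at risk at t
        s *= 1.0 - d / n                         # product-limit update
        cum += d / (n * (n - d)) if n > d else 0.0
        surv.append(s)
        var_terms.append(s**2 * cum)             # Greenwood's formula
    return fail_times, np.array(surv), np.array(var_terms)

# Toy right-censored data: event = 1 for an observed death, 0 for censored
time = np.array([2.0, 3.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0])
event = np.array([1, 1, 0, 1, 0, 1, 1, 0])

t, s, v = kaplan_meier(time, event)
print(t)                # distinct failure times
print(np.round(s, 3))   # estimated survival probabilities
print(np.round(v, 4))   # Greenwood variance estimates
```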
Kernel Estimation for Survival Analysis
Definition
The Kaplan-Meier estimator is a step function, so it is difficult to calculate its quantile function and Density Function. Kernel Density Estimation is used to turn it into a smooth function.
Let be i.i.d. survival times with a distribution , and be i.i.d. censoring times with a distribution . We can observe where and is the censoring indicator.
Without Censoring
For the complete data, the Kernel Density Estimation is defined as where is the kernel, is the scaled kernel, and is a smoothing parameter.
The kernel estimator for the Distribution Function is defined as where
With Censoring
For censored data, the weight for each observation is defined as the jump size in the Kaplan-Meier Estimator, where is the jump size at in the Kaplan-Meier Estimator.
Thus, the Kernel Density Estimation for the Survival Function is
Link to original
Cox Regression Model
Cox Proportional Hazards Model
Definition
The Cox proportional hazards model assumes that covariates affect the Hazard Function.
Let be i.i.d. survival times, and be i.i.d. censoring times. We observe , where and is the censoring indicator, and each has covariates . Then, the Cox proportional hazards model is defined as where is called the baseline hazard function, i.e. the hazard at
Conditional Likelihood
Let (no ties case), and be the risk set. For each uncensored time ,
Therefore,
Taking the product of these conditional probabilities gives a conditional likelihood where is the indicator set for uncensored samples.
The is not a likelihood. However, Cox suggested treating the conditional likelihood as an ordinary likelihood to find the Maximum Likelihood Estimation.
Since there is no analytic solution for the MLE, iterative methods such as the Newton–Raphson method are used to estimate the coefficient .
The hazard ratio represents the relative change in Hazard Rate for a one-unit increase in the covariate .
Goodness-of-Fit Test
For testing the null hypothesis , Cox suggested the Rao Test.
Asymptotic Normality of MLE
where is the observed Fisher Information
Estimation of Survival Function
Under the Cox proportional hazards model,
To estimate , we can use for , but we still need to estimate , , or .
Breslow suggested the estimators of and as If
If , then is the Kaplan-Meier Estimator
It has a few drawbacks
- can take negative values.
Tsiatis suggested a non-negative version of where
Link suggested using the linear smooth of .
Discrete on Grouped Data
When data is discrete or grouped, there are ties at each failure. Denote the ordered discrete failure time by and let be the risk set at , be the death set at , and .
Cox suggested combining all possible permutations; however, it is computationally infeasible. where , and is the size subset of
Peto suggested an alternative likelihood that, instead of using all possible permutations, uses the same contribution.
Time Dependent Covariates
In this case, the covariate depends on time. We observe , and the conditional likelihood is defined as where is the indicator set for uncensored samples, and is the risk set.
Facts
Any two individuals have hazard functions that are constant multiples of one another.
The Survival Functions of the Cox proportional hazards model form a family of Lehmann alternatives, where .
Link to original
If , , where is the indicator set for sample 1, and there are no ties, then the Cox test is exactly equal to the Mantel-Haenszel Test.
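A bare-bones sketch of maximizing Cox's conditional (partial) likelihood in the no-ties case; plain gradient ascent with a fixed step is used here only to keep the code short (Newton–Raphson, as above, would converge faster). The simulated data and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulate survival data from a Cox model with one covariate and baseline hazard 1
n = 300
x = rng.normal(size=n)
beta_true = 0.8
t_event = rng.exponential(scale=np.exp(-beta_true * x))   # rate = exp(beta * x)
t_cens = rng.exponential(scale=2.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(float)

def partial_lik_grad(beta):
    # d log PL / d beta = sum over uncensored i of [x_i - weighted mean of x over risk set]
    grad = 0.0
    for i in np.where(event == 1)[0]:
        at_risk = time >= time[i]
        w = np.exp(beta * x[at_risk])
        grad += x[i] - np.sum(w * x[at_risk]) / np.sum(w)
    return grad

beta = 0.0
for _ in range(200):
    beta += 0.005 * partial_lik_grad(beta)    # fixed-step gradient ascent

print(beta)   # close to beta_true
```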
Linear Regression Model
Accelerated Life Model
Definition
Consider a random variable representing survival time with Hazard Function , with , , and . Assume that the survival time of an individual with covariate is defined as . If , then the covariate accelerates the time to failure. The model based on this assumption is called the accelerated failure time (AFT) model.
Under AFT model
Let , then where .
Assume that , where is a Random Variable representing the error term. Then, the AFT model becomes
Relationship between and . where
If then , and if , then where .
Link to original
Miller Estimator
Definition
Let be i.i.d. survival times, and be i.i.d. censoring times. We observe , where and is the censoring indicator.
Suppose a Simple Linear Regression model
With no censoring present, the least squares estimators of the parameters are obtained by minimizing where is the Empirical Distribution Function of where .
With censoring present, Miller proposed to minimize where is the Kaplan-Meier Estimator based on and the weights are its jump sizes.
If the last observation is censored, then . Hence, change the last observation to be uncensored, so that .
Link to original
Buckley-James Estimator
Definition
Let be i.i.d. survival times, and be i.i.d. censoring times. We observe , where and is the censoring indicator.
If we could observe the true survival time , we could fit the model . However, we observe only the censored , and Buckley and James proposed an Unbiased Estimator for . Since we also cannot observe , we estimate it again: where , is the Kaplan-Meier Estimator based on , and the weights are its jump sizes.
The variance of the estimator is estimated by where , , , and
Link to original