Matrix Theory

Matrix Theory Note

Matrix Theory

Basic Theories

Matrix

Definition

Operations

Addition, Subtraction

Scalar Multiplication

Transpose


Trace

Determinant

Matrix Multiplication

where ,

Vector-Matrix Multiplication

Link to original

Trace

Definition

The sum of elements on the main diagonal

Facts

^[Cyclic property]

The Trace of the matrix is equal to the sum of the eigenvalues of the matrix.

Proof Since the matrix only has terms on the diagonal, and the calculation of a cofactor deletes a row and a column, the coefficient of of is always . Also, since are the solutions of the Characteristic Polynomial, the expression is factorized as and the coefficient of becomes . Therefore, and .
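A quick numerical check of this fact; a minimal NumPy sketch on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))  # arbitrary square matrix

eigenvalues = np.linalg.eigvals(A)

# trace(A) equals the sum of the eigenvalues (up to floating-point error);
# complex eigenvalues of a real matrix come in conjugate pairs, so the sum is real
assert np.isclose(np.trace(A), eigenvalues.sum().real)
```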

Link to original

Link to original

Determinant

Definition

Computation

matrix

matrix

Laplace Expansion

Definition

Let be a cofactor matrix of

Along the th column gives

Along the th row gives

Link to original

Facts

The volume of a box^[parallelepiped] is expressed by a determinant

changes sign when two rows are exchanged (permuted)

depends linearly on the 1st row.

If two row or column vectors in are equal, then

Gauss Elimination does not change

If a matrix has zero row or column vectors, then

If a matrix is diagonal or triangular, then the determinant of is the product of diagonal elements where is the element of the matrix .

The Determinant of the matrix is equal to the product of the eigenvalues of the matrix.

Link to original

where the dimensions of the matrices are and respectively.

Link to original

Idempotent Matrix

Definition

A matrix whose square is itself.

Facts

The eigenvalues of an idempotent matrix are either or .

Let be a symmetric matrix. Then, has eigenvalues equal to and the rest zero .

Let be an idempotent matrix, then

If is idempotent, then is also idempotent.

Link to original

Inverse Matrix

Rank of Matrix

Definition

The rank of matrix is the number of linearly independent column vectors, or the number of non-zero pivots in Gauss Elimination

Facts

For the non-singular matrices and ,

Rank–Nullity Theorem

If a matrix is Symmetric Matrix, then is the number of non-zero eigenvalues.

Link to original

Inverse Matrix

Definition

satisfies for given matrix

Computation

2 x 2 matrix

where the matrix , and is a Determinant of the matrix

Using Cofactor

where the matrix is the cofactor matrix, which is formed by all the cofactors of a given matrix . Then, by the definition of inverse matrix,

Link to original

Sherman–Morrison Formula

Definition

Suppose is an invertible matrix and . Then, is invertible. In this case,

Proof

Multiplying to the RHS gives

Since is a scalar, the numerator of the last term can be expressed as

So,

Examples

Updating Fitted Least Square Estimator

Sherman–Morrison formula can be used for updating fitted Least Square estimator

Let be the least square estimator of and let the matrices with a new data point be and . Then,

Let for convenience. Then, by the Sherman–Morrison formula

So, where by expansion

So, can be obtained without additional inverse matrix calculation.
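A minimal NumPy sketch of this update; the new observation `(x_new, y_new)` and all other names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # current least squares estimator

x_new = rng.standard_normal(p)                 # new row of the design matrix
y_new = rng.standard_normal()                  # new response

# Sherman-Morrison: invert (X'X + x x') without a fresh matrix inversion
v = XtX_inv @ x_new
XtX_inv_new = XtX_inv - np.outer(v, v) / (1.0 + x_new @ v)
beta_new = XtX_inv_new @ (X.T @ y + x_new * y_new)

# agrees with refitting from scratch on the augmented data
X_full = np.vstack([X, x_new])
y_full = np.append(y, y_new)
assert np.allclose(beta_new, np.linalg.lstsq(X_full, y_full, rcond=None)[0])
```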

Facts

Sherman–Morrison formula is a special case of the Woodbury Formula

Link to original

Matrix Inversion Lemma

Definition

where the sizes of the matrices are: is , is , and is ; and must be invertible.

The inverse of a rank-k correction of some matrix can be computed by doing a rank-k correction to the inverse of the original matrix

Proof

Start by the matrix

The decomposition of the original matrix becomes . Inverting both sides gives^[Diagonalizable Matrix]

The decomposition of the original matrix becomes . Again inverting both sides,

The first-row, first-column entries of the RHS of (1) and (2) above give the Woodbury formula

Link to original

Moore-Penrose Inverse

Definition

For a matrix , the Moore-Penrose inverse of is a matrix satisfying the following conditions


Facts

The pseudoinverse is defined and unique for any matrix whose entries are real or complex numbers.

If the matrix is a square matrix, then is also a square matrix and
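The four Penrose conditions (their formulas did not survive above, so they are restated in the comments) can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))     # rectangular, so no ordinary inverse exists

A_pinv = np.linalg.pinv(A)          # Moore-Penrose pseudoinverse

assert np.allclose(A @ A_pinv @ A, A)            # 1. A A+ A = A
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)  # 2. A+ A A+ = A+
assert np.allclose((A @ A_pinv).T, A @ A_pinv)   # 3. A A+ is Hermitian
assert np.allclose((A_pinv @ A).T, A_pinv @ A)   # 4. A+ A is Hermitian
```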

Footnotes

  1. Hermitian Matrix

Link to original

Partitioned Matrix

Block Matrix

Definition

A matrix that is interpreted as having been broken into sections called blocks or sub-matrices

Operations

Addition, Subtraction

Scalar Multiplication

Matrix Multiplication

Determinant

Let or be invertible matrices

Link to original

Eigenvalues and Eigenvectors

Eigendecomposition

Definition

where is a matrix and

If is a scalar multiple of a non-zero vector , then is the eigenvalue and is the eigenvector.

Characteristic Polynomial

The values satisfying the characteristic polynomial, are the eigenvalues of the matrix

Eigenspace

The set of all eigenvectors of corresponding to the same eigenvalue, together with the zero vector.

The Kernel of the matrix

Eigenvector: non-zero vector in the eigenspace

Algebraic Multiplicity

Let be an eigenvalue of an matrix . The algebraic multiplicity of the eigenvalue is its multiplicity as a root of a Characteristic Polynomial, that is, the largest k such that

Geometric Multiplicity

Let be an eigenvalue of an matrix . The geometric multiplicity of the eigenvalue is the dimension of the Eigenspace associated with the eigenvalue.

Computation

  1. Find the solution^[eigenvalues] of the Characteristic Polynomial.
  2. Find the solution^[eigenvectors] of the Under-Constrained System using the found eigenvalue (a numerical sketch follows below).
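A minimal NumPy sketch of these two steps on a small symmetric example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Step 1: the eigenvalues are the roots of the characteristic polynomial
# det(A - lambda I) = lambda^2 - 4 lambda + 3 = (lambda - 1)(lambda - 3)
eigenvalues, eigenvectors = np.linalg.eig(A)

# Step 2: each column of `eigenvectors` solves the under-constrained
# system (A - lambda I) v = 0 for its eigenvalue
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```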

Facts

There exists at least one eigenvector corresponding to the eigenvalue

Eigenvectors corresponding to different eigenvalues are always linearly independent.

When is a normal or real Symmetric Matrix, the eigendecomposition is called Spectral Decomposition

The Trace of the matrix is equal to the sum of the eigenvalues of the matrix.

Proof Since the matrix only has terms on the diagonal, and the calculation of a cofactor deletes a row and a column, the coefficient of of is always . Also, since are the solutions of the Characteristic Polynomial, the expression is factorized as and the coefficient of becomes . Therefore, and .

The Determinant of the matrix is equal to the product of the eigenvalues of the matrix.

Not all matrices have linearly independent eigenvectors.

When holds,

  • the eigenvalues of are and the eigenvectors of the matrix are the same as .
  • the eigenvalues of is
  • the eigenvalues of is
  • the eigenvalues of is
  • the eigenvalues of is

An eigenvalue’s Geometric Multiplicity cannot exceed its Algebraic Multiplicity

The matrix is diagonalizable if and only if the Geometric Multiplicity is equal to the Algebraic Multiplicity for all eigenvalues

Let be a symmetric matrix. Then, has eigenvalues equal to and the rest zero .

Link to original

The non-zero eigenvalues of are the same as those of .

Link to original

Quadratic Forms and Positive Definite Matrix

Quadratic Form

Definition

where is a Symmetric Matrix

A mapping where is a Module on a Commutative Ring, that has the following properties.

Matrix Expressions

Facts

Let be a Random Vector and be a symmetric matrix of constants. If and , then the expectation of the quadratic form is

Let be a Random Vector and be a symmetric matrix of constants. If and , , and , where , then the variance of the quadratic form is where is the column vector of diagonal elements of .

If and ‘s are independent, then If and ‘s are independent, then

Let , , where is a Symmetric Matrix and , then the MGF of is where ‘s are non-zero eigenvalue of

Let , where is Positive-Definite Matrix, then

Let , , where is a Symmetric Matrix and , then where

Let , , where are symmetric matrices, then are independent if and only if

Let , where are quadratic forms in Random Sample from If and is non-negative, then

  • are independent

Let , , where , where , then

Link to original

Positive-Definite Matrix

Definition

A matrix for which is positive for every non-zero column vector is a positive-definite matrix

Facts

Let , then

  • where is a matrix.
  • The diagonal elements of are positive.
  • For a Symmetric Matrix , for sufficiently small

where is a non-singular matrix.

all the leading minor determinants of are positive.

Link to original

Projection and Decomposition of Matrix

Projection Matrix

Definition

For some vector , is the projection of onto

Facts

The projection matrix is a symmetric Idempotent Matrix

Consider a Symmetric Matrix , then is idempotent with rank if and only if eigenvalues are and eigenvalues are .

The projection matrix is a Positive Semi-Definite Matrix

If and are projection matrices, and is a Positive Semi-Definite Matrix, then

  • is a projection matrix.
Link to original

QR Decomposition

Definition

Decomposition of matrix into a product of an orthonormal matrix and an upper triangular matrix .

Computation

Using the Gram Schmidt Orthonormalization

QR decomposition can be computed by Gram Schmidt Orthonormalization, where is the matrix of orthonormal column vectors obtained by the orthonormalization and
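A minimal sketch of this construction with classical Gram-Schmidt (for serious numerical work, `np.linalg.qr` is preferable):

```python
import numpy as np

def gram_schmidt_qr(A):
    """QR decomposition via classical Gram-Schmidt orthonormalization."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # projection coefficient onto q_i
            v -= R[i, j] * Q[:, i]        # remove the component along q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]             # normalize the residual
    return Q, R

A = np.random.default_rng(3).standard_normal((5, 3))
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A)              # A = QR
assert np.allclose(Q.T @ Q, np.eye(3))    # columns of Q are orthonormal
```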

Link to original

Cholesky Decomposition

Definition

Decomposition of a Positive-Definite Matrix into the product of lower triangular matrix and its Conjugate Transpose.

Computation

Let be a Positive-Definite Matrix. Then, by setting the first-row, first-column entry , the other entries can be found using substitution. In summary, ,
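A minimal sketch of this substitution scheme, checked against `np.linalg.cholesky`:

```python
import numpy as np

def cholesky(A):
    """Lower-triangular L with A = L L^T for a positive-definite A."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for i in range(n):
        for j in range(i + 1):
            s = A[i, j] - L[i, :j] @ L[j, :j]     # subtract already-known products
            L[i, j] = np.sqrt(s) if i == j else s / L[j, j]
    return L

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])                         # positive definite
L = cholesky(A)
assert np.allclose(L @ L.T, A)
assert np.allclose(L, np.linalg.cholesky(A))
```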

Link to original

Spectral Theorem

Definition

where is a Unitary Matrix, and

A matrix is a Hermitian Matrix if and only if is Unitary Diagonalizable Matrix.

Facts

Every Hermitian Matrix is diagonalizable, and has real-valued Eigenvalues and orthonormal eigenvector matrix.

For the Hermitian Matrix , every eigenvalue is real.

Proof For the eigendecomposition , . Since is Hermitian, , and the norm is always real. Therefore, every eigenvalue is real.

For the Hermitian Matrix , the eigenvectors from different eigenvalues are orthogonal.

Proof Let the eigendecomposition be , where . By the property of the Hermitian matrix, and . So, . Since , . Therefore, are orthogonal.

Link to original

Singular Value Decomposition

Definition

An arbitrary matrix can be decomposed to .

For where and and is Diagonal Matrix

For where and and is Diagonal Matrix

Calculation

For matrix

is the matrix of orthonormal eigenvectors of is the matrix of orthonormal eigenvectors of is the Diagonal Matrix made of the square roots of the non-zero eigenvalues of and sorted in descending order.

If the eigendecomposition is , then the eigenvalues and the eigenvectors are orthonormal by the Spectral Theorem. If , then , where are called the singular values.

Now, the Orthonormal Matrix is calculated using the singular values, where is the Orthonormal Matrix of eigenvectors corresponding to the non-zero eigenvalues, is the Orthonormal Matrix of eigenvectors corresponding to zero eigenvalues, each , and is a rectangular Diagonal Matrix.

Since is an orthonormal matrix and is a Diagonal Matrix, . Now, the Orthonormal Matrix is calculated using the linear system and the null space of , where is the Orthonormal Matrix of the vectors obtained from the system equation and is the Orthonormal Matrix of the vectors corresponding to zero eigenvalues.

The Orthonormal Matrix can also be formed by the eigenvectors of , similarly to the calculation of .
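A minimal NumPy sketch tying the pieces together: the singular values are the square roots of the eigenvalues of `A.T @ A`, and `U`, `V` are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# rebuild the rectangular diagonal matrix of singular values
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ S @ Vt, A)                # A = U S V^T

# singular values squared = eigenvalues of A^T A, sorted in descending order
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
assert np.allclose(s**2, eigvals)
```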

Facts

are Unitary Matrices

Let be a real symmetric Positive Semi-Definite Matrix. Then, the Eigendecomposition(Spectral Decomposition) of and the singular value decomposition of are equal.

where and are non-negative and the shape of the every matrix is

Visualization

every matrix can be decomposed as a

  • : rotation and reflection
  • : scaling
  • : rotation and reflection
Link to original

Miscellanea in Matrix

Centering Matrix

Definition

where is matrix where all elements are

Summing Vector

Facts

where is a sample variance of

Link to original

Derivatives of Matrix

Definition

Differentiation by vector

where

Differentiation by Matrix

where

Calculation

Let be a vector function and be an -dimensional vector. Then, by the definition of the derivative, we have where

Examples

Let be -dimensional vectors, then

Let be an -dimensional vector and be an matrix, then

Let be an -dimensional vector and be an Symmetric Matrix, then the derivative of quadratic form is

Link to original

Hessian Matrix

Definition

A square matrix of second-order partial derivative of a scalar-valued function

Link to original

Vectorization

Definition

Let be an matrix, then

A linear transformation which converts the matrix into a vector.

Facts

Let be matrices, then

Link to original

Kronecker Product

Definition

Let be an matrix and be a matrix, then the Kronecker product is the block matrix

Facts

Let be matrices, then

, where are square matrices. The eigenvalues of are the products of the eigenvalues of and the eigenvalues of

Link to original

Norms

Norm

Definition

A real-valued function with the following properties

  • Positive definiteness:
  • Absolute Homogeneity:
  • Subadditivity or Triangle inequality: where on a vector space

Vector

Real

Let be an -dimensional vector, then the norm of the is defined as

Complex

Let be an -dimensional complex vector, then the norm of the is defined as

Function

: norm of

Facts

a norm can be induced by an Inner product

Link to original

Schatten Norm

Definition

Let be an matrix, then the Schatten norm of is defined as where is the -th singular value of

Facts

: Nuclear Norm

: Frobenius Norm

: Spectral Norm

Link to original

Frobenius Norm

Definition

Let be an matrix, then the Frobenius norm of is defined as

Link to original

Matrix p-Norm

Definition

Let be an matrix, then the matrix p-norm of is defined as

Facts

When , the norm is a Spectral Norm.

Link to original

Spectral Norm

Definition

Let be an matrix, then the spectral norm of is defined as

Link to original

Nuclear Norm

Definition

Let be an matrix, then the nuclear norm of is defined as

Link to original

L-pq Norm

Definition

norm is an entry-wise matrix norm. where

Facts

When , the norm is the sum of the absolute values of every entry and is called a matrix norm.

Link to original

Rayleigh-Ritz Theorem

Definition

Let be an Hermitian Matrix with eigenvalues sorted in descending order , where , then

Link to original

Link to original

Elementary Statistical Theory

Random Variable and Random Sampling

Random Variable

Definition

A random variable is a function whose inverse image is -measurable for the two measurable spaces and .

The inverse image of an arbitrary Borel set of Codomain is an element of sigma field .

Notations

Consider a probability space

  • Outcomes:
  • Set of outcomes (Sample space):
  • Events:
  • Set of events (Sigma-Field):
  • Probabilities:
  • Random variable:

For a random variable on a Probability Space

For a random variable on Probability Space and another random variable on Probability Space

Link to original

Expected Value

Definition

The expected value of the Random Variable on Probability Space

Continuous

where is a PDF of Random Variable

The expected value of the Random Variable when is absolutely continuous over ,

Discrete

where is a PMF of Random Variable

The expected value of the Random Variable when is absolutely continuous over . In other words, has at most countably many jumps.

Expected Value of a Function

Continuous

Discrete

Properties

Linearity

Random Variables

If , then

Matrix of Random Variables

Let be matrices of random variables, be matrices of constants, and be a matrix of constants. Then,

Notations

| Expression | Discrete | Continuous |
| --- | --- | --- |
| Expression for the event and the probability | | |
| Expression for the measurable space and the distribution | | |
Link to original

Multivariate Distribution

Definition

Joint CDF

The Distribution Function of a Random Vector is defined as where

Joint PDF

The joint probability Distribution Function of discrete Random Vector

The joint probability Distribution Function of continuous Random Vector

Expected Value of a Multivariate Function

Continuous

$$E[g(\mathbf{X})] = \idotsint\limits_{x_{n} \dots x_{1}} g(\mathbf{x})f(\mathbf{x})\,dx_{1} \dots dx_{n}$$

Discrete

Marginal Distribution of a Multivariate Function

Marginal CDF

Marginal PDF

$$f_{X_{1}}(x_{1}) = \idotsint\limits_{x_{n} \dots x_{2}} f(x_{1}, x_{2}, \dots, x_{n})\,dx_{2} \dots dx_{n}$$

Conditional Distribution of a Multivariate Function

Properties

Linearity

If , then

Link to original

Covariance

Definition

Properties

Covariance with Itself

Covariance of Linear Combinations

Link to original

Transclude of Density-Function

Distribution Function

Definition

A distribution function is a function for the Random Variable on Probability Space

Facts

Proof By definition, So, defining is the Equivalence Relation to defining

Since is a Pi-System, is uniquely determined by by Extension from Pi-System

Now, is a Measurable Space induced by . Therefore, defining on is equivalent to the defining on

Distribution function(CDF) has the following properties

  • Monotonic increasing:
  • Right-continuous:

If a function satisfies the following properties, then is a distribution function(CDF) of some Random Variable

  • Monotonic increasing:
  • Right-continuous:

Link to original

Bernoulli Distribution

Definition

where is a probability of success

number of success in a single trial with success probability

Bernoulli Process

The i.i.d. Random Vector with Bernoulli distribution

Properties

PDF

Mean

Variance

Link to original

Binomial Distribution

Definition

where is the length of the Bernoulli process, and is a probability of success

The number of successes in a length- Bernoulli process with success probability

Properties

PDF

Mean

Variance

MGF

Facts

Let be independent random variables following binomial distribution, then

Link to original

Multinomial Distribution

Definition

where is the number of trials, is a probability of category

Distribution that describes the probability of observing a specific combination of outcomes

Properties

PDF

where , , and

MGF

Marginal PDF

Each one-variable marginal pdf is a Binomial Distribution, each two-variable marginal pdf is a Trinomial Distribution, and so on.

Link to original

Poisson Distribution

Definition

where is the average number of occurrences in a fixed interval of time

The number of occurrences in a fixed interval of time with mean

Properties

PDF

where

Mean

Variance

MGF

Summation

Let , and ‘s are independent, then

Link to original

Normal Distribution

Definition

where is the location parameter(mean), and is the scale parameter(variance)

Standard Normal Distribution

Properties

PDF

Mean

Variance

MGF

Higher Order Moments

Sum of Normally Distributed Random Variables

Let be independent random variables following normal distribution, then

Relationship with Chi-squared Distribution

Let be a standard normal distribution, then

Facts

Let , then

Link to original

Chi-squared Distribution

Definition

where is the degrees of freedom

The sum of squares of independent standard normal random variables

Properties

PDF

Mean

Variance

MGF

Additivity

Let ‘s be independent chi-squared random variables

Facts

Let and , and is independent of , then

Let , where are quadratic forms in , where each element of the is a Random Sample from If , then

  • are independent
Link to original

Student's t-Distribution

Definition

Let be a standard normal distribution, be a Chi-squared Distribution, and be independent, then where is the degrees of freedom

Properties

PDF

Mean

Variance

Link to original

F-Distribution

Definition

Let be independent random variables following Chi-squared distributions, then where are the degrees of freedom

Properties

PDF

$$f(x) = \cfrac{\Gamma(\frac{r_{1} + r_{2}}{2}) (\frac{r_{1}}{r_{2}})^{\frac{r_{1}}{2}} x^{\frac{r_{1}}{2}-1}} {\Gamma(\frac{r_{1}}{2}) \Gamma(\frac{r_{2}}{2}) (\frac{r_{1}}{r_{2}}x+1)^{(r_{1} + r_{2})/2}}$$

Mean

$$E(X) = \frac{r_{2}}{r_{2} - 2},\quad r_{2}>2$$

Variance

$$\operatorname{Var}(X) = \frac{2r_{2}^{2}(r_{1}+r_{2}-2)}{r_{1} (r_{2}-2)^{2}(r_{2}-4)},\quad r_{2}>4$$

Link to original

Random Vector

Definition

Column vector whose components are random variables .

Link to original

Multivariate Normal Distribution

Definition

where is the number of dimensions, is the vector of location parameters, and is the vector of scale parameters

Standard Multivariate Normal Distribution

PDF

MGF

Properties

PDF

Mean

Variance

MGF

Affine Transformation

Let be a Random Variable following multivariate normal distribution, be a matrix, and be a dimensional vector, then

Relationship with Chi-squared Distribution

Suppose be a Random Variable following multivariate normal distribution, then

Facts

Let , be an , and be a -dimensional vector, then

Let , , , , where is and is vectors. Then, and are independent

Let , , and , where is , is matrices. and are independent

Link to original

Statistical Estimation and Testing

Bias of an Estimator

Definition

Let be a Random Sample from where is a parameter, and be a Statistic.

is an unbiased estimator. An estimator is unbiased if its bias is equal to zero for all values of the parameter .

Link to original

Consistency

Definition

A Statistic is called a consistent estimator of if converges in probability to

Link to original

Minimum Variance Unbiased Estimator

Definition

An estimator satisfying the following is the minimum variance unbiased estimator (MVUE) for where is an unbiased estimator

An Unbiased Estimator that has lower variance than any other unbiased estimator for the parameter

Facts

A minimum variance unbiased estimator does not always exist.

If some unbiased estimator’s variance is equal to the Rao-Cramer Lower Bound for all , then it is a minimum variance unbiased estimator.

Link to original

Least Squares Estimator

Definition

Simple Linear Regression Case

Consider a Simple Linear Regression model The least square estimator is the estimator that minimizes

The least square estimator of the model is where and

Estimation of

where ‘s are residuals

Multiple Linear Regression Case

Consider a Multiple Linear Regression model The least square estimator is the estimator that minimizes

The least square estimator of the model is

The fitted response vector is expressed as , where is called the hat matrix.

Estimation of

where , ‘s are residuals, and is the number of explanatory variables
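A minimal NumPy sketch of these formulas on simulated data (sizes and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 2                                   # n observations, p explanatory variables
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimator

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
y_fitted = H @ y                                # fitted response vector
residuals = y - y_fitted
sigma2_hat = residuals @ residuals / (n - p - 1)  # unbiased estimator of the error variance
```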

Facts

The least square estimator is a linear combination of ‘s. Let , then

The fitted line always goes through

Let , where ‘s are independent, , and be the 2nd, 3rd, and 4th Central Moment of respectively. Then, is the unique non-negative quadratic Unbiased Estimator of with minimum variance when the excess kurtosis is or when the diagonal elements of the hat matrix are equal.

Link to original

Maximum Likelihood Estimation

Definition

MLE is the method of estimating the parameters of an assumed Distribution

Let be Random Sample with PDF , where , then the MLE of is estimated as

Regularity Conditions

  • R0: The pdfs are distinct, i.e.
  • R1: The pdfs have same supports
  • R2: The true value is an interior point in
  • R3: The pdf is twice differentiable with respect to
  • R4:
  • R5: The pdf is three times differentiable with respect to , , and interior point

Properties

Functional Invariance

If is the MLE for , then is the MLE of

Consistency

Under R0 ~ R2 Regularity Conditions, let be a true parameter, is differentiable with respect to , then has a solution such that

Asymptotic Normality

Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , be a consistent Sequence of solutions of MLE equation , and , then where is the Fisher Information.

By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions

Asymptotic Confidence Interval

By the asymptotic normality of MLE, Thus, confidence interval of for is

Delta method for MLE Estimator

Under the R0 ~ R5 Regularity Conditions, let be a continuous function and , then

Facts

Under R0 and R1 regularity conditions, let be a true parameter, then

Link to original

Taylor Approximation

Taylor Series

Definition

Taylor Series for One Variable

Let be an infinitely differentiable function at the point , then

Taylor Series for Two Variables

Let $f$ be an infinitely differentiable function at the point $(t_{0}, x_{0})$, then

$$\begin{aligned} f(t, x) = f(t_{0}, x_{0}) &+ \frac{\partial f}{\partial t}(t_{0}, x_{0})(t-t_{0}) + \frac{\partial f}{\partial x}(t_{0}, x_{0})(x-x_{0})\\ &+ \frac{1}{2}\frac{\partial^{2} f}{\partial t^{2}}(t_{0}, x_{0})(t-t_{0})^{2} + \frac{1}{2}\frac{\partial^{2} f}{\partial x^{2}}(t_{0}, x_{0})(x-x_{0})^{2} + \frac{\partial^{2} f}{\partial t \partial x}(t_{0}, x_{0})(x-x_{0})(t-t_{0})\\ &+ \dots \end{aligned}$$

Approximation

Let $f: \mathbb{R} \to \mathbb{R}$ be a $k$-times differentiable function at the point $c \in \mathbb{R}$ and $|x-c| \to 0$, then $$f(x) = \sum\limits_{n=0}^{k} \cfrac{f^{(n)}(c)}{n!}(x-c)^{n} + o(|x-c|^{k})$$ where $o$ is [[Big O Notation|little oh]] notation

Let $f: \mathbb{R} \to \mathbb{R}$ be a $(k+1)$-times differentiable function at any point and $|f^{(k+1)}(x)| \leq M < \infty$, then by [[Rolle's Theorem]] $$f(x) = \sum\limits_{n=0}^{k} \cfrac{f^{(n)}(c)}{n!}(x-c)^{n} + \frac{1}{(k+1)!}f^{(k+1)}(\xi)(x-c)^{k+1}$$ where $\xi \in (x, c)$

Maclaurin Series

$$f(x) = \sum\limits_{n=0}^{\infty} \cfrac{f^{(n)}(0)}{n!}x^{n} = f(0)+f'(0)x + \cfrac{f''(0)}{2!}x^{2}+ \dots +\cfrac{f^{(m)}(0)}{m!}x^m + \dots$$

Taylor Series with $c=0$

Derivation from the [[Fundamental Theorem of Calculus]]

Let $f$ be an infinitely [[Differentiability|differentiable]] function. By applying [[Fundamental Theorem of Calculus|FTC]] $n$ times, we can expand the function $f(x)$ as follows:

$$\begin{aligned} f(x) &= \int\limits_{c}^{x} f'(x_{1}) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} f''(x_{2}) dx_{2} + f'(c) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} f''(x_{2}) dx_{2} dx_{1} + \int\limits_{c}^{x} f'(c) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\int\limits_{c}^{x_{2}} f'''(x_{3}) dx_{3} dx_{2} dx_{1} + \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} f''(c) dx_{2} dx_{1} + \int\limits_{c}^{x} f'(c) dx_{1} + f(c)\\ &= \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{n-1}} f^{(n)}(x_{n}) dx_{n} dx_{n-1} \dots dx_{1} + \sum_{i=0}^{n} \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-1}} f^{(i)}(c) dx_{i} dx_{i-1} \dots dx_{1} \end{aligned}$$

The integrals inside the summation can be simplified:

$$\begin{aligned} \int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-1}} f^{(i)}(c) dx_{i} dx_{i-1} \dots dx_{1} &= f^{(i)}(c)\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-1}} 1 dx_{i} dx_{i-1} \dots dx_{1} \\ &= f^{(i)}(c)\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-2}} (x_{i-1} - c) dx_{i-1} dx_{i-2} \dots dx_{1} \\ &= f^{(i)}(c)\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{i-3}} \frac{1}{2}((x_{i-2} - c)^2 - (c - c)^2) dx_{i-2} dx_{i-3} \dots dx_{1} \\ &\dots \\ &= f^{(i)}(c)\frac{(x - c)^i}{i!} \end{aligned}$$

Therefore, we have $$f(x) = \underbrace{\int\limits_{c}^{x} \!\int\limits_{c}^{x_{1}} \!\dots \!\int\limits_{c}^{x_{n-1}} f^{(n)}(x_{n}) dx_{n} dx_{n-1} \dots dx_{1}}_{\text{Error term}} + \underbrace{\sum_{i=0}^{n} f^{(i)}(c)\frac{(x - c)^i}{i!}}_{\text{Polynomial terms}}$$

Link to original

Delta Method

Definition

Univariate Delta Method

Let be a sequence of random variables satisfying , be a differentiable function at , and , then

Proof

By Taylor Series approximation

Where by the assumption. By the continuous mapping theorem, , which is a function of , also converges to random variable, so it is Boundedness in Probability by the property of converging random vector

Therefore, by the property of sequence of random variables bounded in probability

Examples

Estimation of the sample variance of Bernoulli Distribution

Let by CLT

by Delta method

Let , then

Therefore, the sample mean and variance follow such distributions

Visualization

  • x-axis:
  • y-axis:
  • y1: ^[variance by mean]
  • y2: ^[sample mean]
  • y3, y4: ^[sample variance that calculated by the sample mean]
  • y5: first order approximated line at

If the sample size , the variance of the sample mean . So, the sample variance can be well approximated by the first-order approximation.

Link to original

Newton's Method

Definition

An iterative algorithm for finding the roots of a differentiable function, which are solution to the equation

Algorithm

Find the next point such that the first-order Taylor approximation at the given point is 0. First-order Taylor approximation: The point such that the approximation is 0:

multivariate version:

In convex optimization,

Find the minimum point^[its derivative is 0] of the Taylor quadratic approximation. Taylor quadratic approximation: The derivative of the quadratic approximation: The minimum point of the quadratic approximation^[the point such that the derivative of the quadratic approximation is 0]: multivariate version:
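A minimal sketch of the root-finding iteration x_{k+1} = x_k - f(x_k)/f'(x_k); the convex-optimization variant replaces f with the gradient and f' with the Hessian:

```python
import numpy as np

def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f by Newton's iteration."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:           # stop once the update is negligible
            break
    return x

# root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
assert np.isclose(root, np.sqrt(2))
```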

Examples

Solution of a linear system

Solve with an MSE loss

The cost function is and its gradient and hessian are ,

Then, the solution is . If is invertible, is a Least Squares solution.

Link to original

Multiple Testing

Family-Wise Error Rate

Definition

Consider a result of Multiple Testing

| fact \ test result | do not reject | reject | total |
| --- | --- | --- | --- |
| is true | | | |
| is true | | | |
| total | | | |

where is the number of rejected null hypotheses, and is the total number of hypotheses tested.

The family-wise error rate (FWER) is the probability of making at least one type 1 error in the family. Therefore, by ensuring , the probability of making one or more type 1 errors in the family is controlled at level

Link to original

Bonferroni's Method

Definition

Consider Multiple Testing problem , and let be an event of type 1 error for . Then, by the Bonferroni Inequality, Therefore, to control Family-Wise Error Rate at , we need to control

The Bonferroni method is too conservative and consequently, it is not a powerful test.

Link to original

Sidak Method

Definition

Consider a Multiple Testing problem , and let be an event of type 1 error for . By the definition of the type 1 error in multiple testing, Therefore, to control the Family-Wise Error Rate at , the Sidak method uses instead of the

Link to original

False Discovery Rate

Definition

where is the number of false discoveries and is the number of discoveries

The false discovery rate (FDR) is the expected proportion of falsely rejected hypotheses. We control FDR at some ,

Facts

If Family-Wise Error Rate is controlled, then FDR also be controlled.

Let be the -values of hypotheses, and is given. If , then FDR rejects hypotheses corresponding to
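The rejection rule described here reads as the Benjamini-Hochberg step-up procedure; a minimal sketch under that assumption (the p-values are made-up examples):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # find the largest k with p_(k) <= (k/m) * q
    below = sorted_p <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True   # reject the hypotheses for p_(1), ..., p_(k)
    return reject

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p_values, q=0.05))   # rejects the two smallest here
```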

Link to original

Simple Linear Regression Model

Simple Linear Regression Model

Simple Linear Regression

Definition

where ‘s are i.i.d. error terms, with and

Link to original

Estimation of Regression Coefficients

Least Squares Estimator

Definition

Simple Linear Regression Case

Consider a Simple Linear Regression model The least square estimator is the estimator that minimizes

The least square estimator of the model is where and

Estimation of

where ‘s are residuals

Multiple Linear Regression Case

Consider a Multiple Linear Regression model The least square estimator is the estimator that minimizes

The least square estimator of the model is

The fitted response vector is expressed as , where is called the hat matrix.

Estimation of

where , ‘s are residuals, and is the number of explanatory variables

Facts

The least square estimator is a linear combination of ‘s. Let , then

The fitted line always goes through

Let , where ‘s are independent, , and be the 2nd, 3rd, and 4th Central Moment of respectively. Then, is the unique non-negative quadratic Unbiased Estimator of with minimum variance when the excess kurtosis is or when the diagonal elements of the hat matrix are equal.

Link to original

Least Absolute Deviation Estimator

Definition

Consider a Simple Linear Regression model The least absolute deviation estimator is the estimator that minimizes

Link to original

Gauss-Markov Theorem

Definition

Simple Linear Regression Case

Consider a Simple Linear Regression model Let and , then the Least Squares Estimator has the minimum variance among all unbiased linear estimators, i.e. BLUE

Multiple Linear Regression Case

Consider a Multiple Linear Regression model Let , then the Least Squares Estimator has the minimum variance among all the unbiased and linear estimators, i.e. BLUE

Link to original

Maximum Likelihood Estimation

Definition

MLE is the method of estimating the parameters of an assumed Distribution

Let be Random Sample with PDF , where , then the MLE of is estimated as

Regularity Conditions

  • R0: The pdfs are distinct, i.e.
  • R1: The pdfs have same supports
  • R2: The true value is an interior point in
  • R3: The pdf is twice differentiable with respect to
  • R4:
  • R5: The pdf is three times differentiable with respect to , , and interior point

Properties

Functional Invariance

If is the MLE for , then is the MLE of

Consistency

Under R0 ~ R2 Regularity Conditions, let be a true parameter, is differentiable with respect to , then has a solution such that

Asymptotic Normality

Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , be a consistent Sequence of solutions of MLE equation , and , then where is the Fisher Information.

By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions

Asymptotic Confidence Interval

By the asymptotic normality of MLE, Thus, confidence interval of for is

Delta method for MLE Estimator

Under the R0 ~ R5 Regularity Conditions, let be a continuous function and , then

Facts

Under R0 and R1 regularity conditions, let be a true parameter, then

Link to original

Goodness-of-Fit of the Regression Line

Decomposition of Sum of Squares

Definition

Simple Linear Regression Case

Consider a Simple Linear Regression model The total sum of squares is expressed by the sum of the unexplained variation and the explained variation where

  • is the total sum of squares and has the degree of freedom ,
  • is the error sum of squares and has the degree of freedom , and
  • is the regression sum of squares and has the degree of freedom

Multiple Linear Regression Case

Consider a Multiple Linear Regression model

  • has the degree of freedom ,
  • has the degree of freedom , and
  • has the degree of freedom

Facts

Link to original

Coefficient of Determination

Definition

The coefficient of determination of the linear regression model is defined as

Facts

The coefficient of determination is a squared value of Pearson Correlation Coefficient

Link to original

Analysis of Variance

One-way ANOVA

One-way ANOVA is used to analyze the significance of differences of means between groups. Let an -th response of the -th group be where

We want to test the null hypothesis i.e. there is no treatment effect. where is the total number of observations, represents the mean of the -th group and represents the overall mean, of the numerator indicates between-group variance (SSB) and of the denominator indicates within-group variance (SSW)

The Likelihood Ratio Test rejects if

Link to original

Statistical Inference

Confidence Interval for the Mean in Simple Linear Regression

Consider a Simple Linear Regression model The confidence region for the mean response , when is given, is defined as where

Link to original

Prediction interval for a New Response in Linear Regression

Definition

Consider a Simple Linear Regression model The prediction interval for a new response , when is given, is defined as where

Facts

The prediction interval is always wider than the Confidence Interval.

Link to original

Multiple Linear Regression Model

Multiple Linear Regression Model

Multiple Linear Regression

Definition

where ‘s are i.i.d. error terms, with and

Matrix Notations

Let and , then the regression model can be expressed as

Let , , , then the regression model can be expressed as where , and

Link to original

Statistical Inference

Distribution of Regression Coefficient

Definition

Distribution of Regression Coefficient

Assume that The Least Squares Estimator also follows Normal Distribution. where is the number of explanatory variables.

Marginal Distribution of Regression Coefficient

The marginal distribution of the Multivariate Normal Distribution is univariate Normal Distribution where

Link to original

Confidence Interval for Regression Coefficient

Definition

Assume that The confidence interval for is defined as where is the Standard Error for Regression Coefficient , and is the number of explanatory variables.

Link to original

Standard Error for Regression Coefficient

Definition

Assume that The standard error for regression coefficient is calculated as where , , and is the number of explanatory variables.

Link to original

Hypothesis testing for Regression Coefficient

Definition

Assume that The null hypothesis can be tested with the test statistic where is the Standard Error for Regression Coefficient .

Link to original

Joint Confidence Region for Regression Coefficient

Definition

Assume that The joint confidence region for is defined as where and is the number of explanatory variables.

The joint confidence region is ellipsoidal shape whose center is by the inequality.

Link to original

Simultaneous Confidence Interval for Regression Coefficient

Definition

Simultaneous confidence interval is used when computing confidence intervals for parameters simultaneously.

The joint confidence region provides an accurate elliptical confidence region, but the calculations and interpretations are complex. To obtain a rectangular confidence interval, the simultaneous confidence interval is used. When multiplying the confidence intervals of each coefficient, the resulting confidence region becomes smaller than the desired area. Therefore, to obtain an “at least ” confidence region, a correction method is used. Conservative methods like Bonferroni’s Method provide a confidence region much larger than the desired , which satisfies the condition of being at least but reduces the power of the test. To address this issue, other methods have been devised.

Bonferroni’s Method

Assume that The Bonferroni confidence interval for is defined as where is the Standard Error for Regression Coefficient and is the number of explanatory variables.

Scheffe Method

Assume that The Scheffe confidence interval for is defined as where is the Standard Error for Regression Coefficient and is the number of explanatory variables.

Maximum Modulus Method

Assume that The Maximum Modulus confidence interval for is defined as where is a Random Variable representing the maximum of the absolute of independent random variables following , is the Standard Error for Regression Coefficient , and is the number of explanatory variables.

Facts

Maximum Modulus method < Bonferroni’s method < Scheffe’s method. Here, the second inequality holds only when the degree of freedom of comparisons is relatively small compared to the number of groups being compared.

Link to original

Confidence Interval for the Mean in Multiple Linear Regression

Consider a Multiple Linear Regression model The confidence region for the mean response , when is given, is defined as where and is the number of explanatory variables

Link to original

Partial F-Test

Definition

Consider a Multiple Linear Regression model and the hypothesis

Define the full model as and under , define the reduced model as

We reject if is large, where and are the regression sum of squares of the full and reduced model, respectively. By the equality , we can test the instead of the

and . Therefore, where and

Hence, we reject if and we call it partial F-test

Link to original

The General Linear Test

Definition

Consider a Multiple Linear Regression model The general linear test can be written as where is matrix with rank and is vector.

e.g. for the hypothesis ,

and . Therefore, where

Hence, we reject if

Method of Lagrange Multiplier

By the method of Lagrange multiplier, Therefore, we can calculate the F-statistic without fitting the reduced model

Link to original

Lack of Fit Test

Lack of Fit Test

Definition

Consider a Multiple Linear Regression model We want to test the linearity of the model

Assume that is the -th replication at . Then the is decomposed into the sum of the pure error sum of squares and the lack-of-fit sum of squares .

The degrees of freedom are , , and , where

Also, the mean squares are obtained by dividing the sum of squares by its degree of freedom

Under , the is an unbiased estimator of , and is always an unbiased estimator of , regardless of . Therefore, we can test the hypothesis by comparing ratios of both estimators.

and . Therefore,

Hence, we reject if

Facts

The lack-of-fit test uses the data’s replication. Therefore, there should be more than one observation (replication) for some ‘s.

Link to original

Miscellanea

Multicollinearity

Definition

If two or more covariates in a multiple regression model are highly correlated, the model has multicollinearity. Multicollinearity makes the estimation of the coefficients very unstable, so the estimated coefficients are not reliable.

Link to original

Regression Diagnostics

Residual

Studentized Residual

Definition

A studentized residual is a technique used to detect outliers

Internally Studentized Residual

where

Externally Studentized Residual

where is the unbiased estimator of based on observations after deleting the -th observation

Facts

Link to original

Leverage (Statistics)

Consider a Multiple Linear Regression model

is the -th leverage, the distance between and

A measure of how far away the independent variable values of an observation are from those of the other observations. It is used to detect outliers.

Facts

, where is the number of explanatory variables

Link to original

Influence Measures

Cook's Distance

Definition

Single Observation

Consider a Multiple Linear Regression model where is the number of explanatory variables, and is the Leverage (Statistics)

is the Cook’s distance. The more influential the data point, the larger the Cook’s distance

Set of Observations

Consider a Multiple Linear Regression model where is the number of explanatory variables, and is the matrix of leverages

Link to original

Andrews-Pregibon Statistic

Definition

Single Observation

Consider a Multiple Linear Regression model where is the Leverage (Statistics) of the matrix

is the Andrews-Pregibon statistic. The more influential the data point, the smaller the Andrews-Pregibon statistic

Set of Observations

Consider a Multiple Linear Regression model where is the matrix of leverages

Link to original

DFBETAS

Definition

The DFBETAS considers the influence of the -th observation on

Single Observation

Consider a Multiple Linear Regression model where is the externally studentized residual and is the element of the matrix

Set of Observations

Consider a Multiple Linear Regression model where is the matrix of leverages and is the element of the matrix

Link to original

DFFITS

Definition

The DFFITS considers the influence of the -th observation on

Single Observation

Consider a Multiple Linear Regression model where is the externally studentized residual

Set of Observations

Consider a Multiple Linear Regression model where is the externally studentized residual and is the matrix of leverages

Link to original

COVRATIO

Definition

The COVRATIO considers the influence of the -th observation on

Single Observation

Consider a Multiple Linear Regression model

Set of Observations

Consider a Multiple Linear Regression model

Link to original

Multicollinearity

Variance Inflation Factor

Definition

Consider a Multiple Linear Regression model where is the number of explanatory variables.

The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is increased because of Multicollinearity

Link to original

Condition Number

Definition

Consider a matrix with singular values

is the condition number of

Facts

In linear regression, if the condition number of the matrix is large, where is a design matrix, the model has a Multicollinearity problem.

Link to original

Autocorrelation and Durbin-Watson Test

Durbin-Watson Test

Definition

Test used to detect the presence of autocorrelation at lag in the residual

Assume is the residual structure of where is the number of observations and

indicates no autocorrelation

Link to original

Selection of Regression Models

Adjusted R-squared Value

Definition

where is the number of explanatory variables.

Link to original

Mallow's Cp

Definition

where is the number of explanatory variables, is a sample variance under the full model, and

Mallow’s is used to assess the fit of a regression model. A small value of means that the model is relatively precise.

Link to original

PRESS statistic

Definition

where is an estimated value of calculated by the data excluding

The predicted residual error sum of squares (PRESS) is a form of cross validation used in regression analysis

Link to original

Model Validation

Cross Validation

Definition

Partition a set of data into sets and denote a function such that . For the dataset , let be the estimator based on observations except , then the cross validation estimation for the prediction error is defined as where is a loss function

Facts

If the data is partitioned into groups of equal size, then it is called a -fold cross validation.

Link to original

Miscellanea in Model Selection

Jensen's Inequality

Definition

Let be a Convex Function on an interval , be a Random Variable with support , and , then

Facts

By Jensen’s Inequality, the following relation is satisfied: arithmetic mean $\geq$ geometric mean $\geq$ harmonic mean

Link to original

Kullback-Leibler Divergence

Definition

Assume that two Probability Distributions and are given. Then the Kullback-Leibler divergence between and is defined as

Kullback-Leibler divergence measures how different two distributions are.

It can also be expressed as a difference between the cross entropy (the difference between distributions and ) and the entropy (the inherent uncertainty of ).

Facts

Let be a sequence of distributions. Then, The convergence of the KL-Divergence to zero implies that the JS-Divergence also converges to zero. The convergence of the JS-Divergence to zero is equivalent to the convergence of the Total Variation Distance to zero. The convergence of the Total Variation Distance to zero implies that the Wasserstein Distance also converges to zero. The convergence of the Wasserstein Distance to zero is equivalent to the Convergence in Distribution of the sequence.

Link to original

Akaike Information Criterion

Definition

where is the MLE of the postulated model’s parameter

Link to original

Bayesian Information Criterion

Definition

Facts

BIC penalizes more heavily than AIC when is large

Link to original

Transformation of the Linear Regression Model

The Use of Dummy Variables

Dummy Variable

Definition

A dummy variable is one that takes binary values. It is commonly used in regression analysis to represent categorical variables, including those that have more than two categories.

Link to original

One-way ANOVA with a Regression Model

where are dummy variables representing each category, and is the number of categories

In this setting, a corner-point constraint is used.

The null hypothesis , i.e. there is no treatment effect, yields the same test statistic as the null hypothesis of the one-way ANOVA with equal replication

Link to original

Two-way ANOVA with a Regression Model

where and are dummy variables representing categories of the two factors, is the number of categories for the first factor, and is the number of categories for the second factor

In this setting, a corner-point constraint is used for both factors.

The null hypotheses can be tested by Deviance

Link to original

Polynomial Regression

Polynomial Regression

Definition

Consider explanatory variables and a response variable The -variables 2nd order polynomial regression model is defined as

The single variable -th order polynomial regression model is defined as

Facts

The column vectors of the design matrix in polynomial regression are highly correlated, so a proper transformation is needed. Among them, orthogonal polynomials are often used.

Link to original

Response Surface Analysis

Definition

The goal of response surface analysis is finding the optimal condition for the response. It employs Polynomial Regression model to model response surface.

Consider explanatory variables and a response variable . The second-order response surface model is defined as where , , . Let the fitted model be . Then, the stationary point is obtained by , and the fitted value at the stationary point is . If is a Positive-Definite Matrix, then is a minimum. If is a Negative-Definite Matrix, then is a maximum. If is neither a Positive-Definite Matrix nor a Negative-Definite Matrix, then is a saddle point.

Link to original

Weighted Least Squares Method

Weighted Least Squares

Definition

Weighted least squares (WLS) is a generalization of Least Square to cope with Heteroskedasticity.

We assume that where The weight matrix is positive definite, so there exists a non-singular matrix . Consider a transformed model Now, the new error term is i.i.d. And the weighted least squares estimator is obtained by

Link to original

Box-Cox Transformation Model

Box-Cox Transformation Model

Definition

Box-Cox transformation is useful when dealing with heteroskedasticity by stabilizing variance.

Box-Cox Transformation

Box-Cox Transformation Model

The model using the Box-Cox Transformed variable as a response variable is called a Box-Cox transformation model where

Link to original

Robust Regression

M-estimator

Definition

Consider an error (loss) function , The M-estimator is obtained by where is a robust estimator, and given by

If , then it is the LSE, and if , then it is the -norm regression estimator. The derivative of , denoted , is called the influence function. Since is non-linear in , we cannot get an explicit solution. Instead, we use an iterative method called IRLS (Iterative Reweighted Least Squares).

Iterative Reweighted Least Squares

  1. Initialize , often obtained from Ordinary Least Squares.
  2. Calculate the weight If , then
  3. Update using Weighted Least Squares where
  4. Iterate steps 2 and 3 until the estimator converges (a minimal sketch follows below).
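A minimal sketch of the IRLS loop, assuming the Huber influence function and a median-based scale estimate (standard choices, but not specified above):

```python
import numpy as np

def irls_huber(X, y, k=1.345, n_iter=50, eps=1e-8):
    """M-estimation with Huber weights, fitted by IRLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # 1. initialize with OLS
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r)) / 0.6745 + eps   # robust scale estimate
        u = r / scale
        # 2. Huber weights: 1 for small residuals, k/|u| for large ones
        w = np.where(np.abs(u) <= k, 1.0, k / np.abs(u))
        # 3. weighted least squares update
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(50), rng.standard_normal(50)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(50)
y[:3] += 15.0                                         # gross outliers
print(irls_huber(X, y))                               # close to (1, 2) despite outliers
```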
Link to original

Inverse Regression

Inverse Regression

Definition

The problem predicting for a given response is called inverse regression (calibration, discrimination).

Consider a simple linear regression model The point estimation of for a given is obtained by Also, the confidence interval for is obtained by the second-order inequality where and . Let be solutions for the inequality, and we have a confidence interval

Link to original

Biased Estimation

James-Stein Shrinkage Method

James-Stein Estimator

Definition

The James–Stein estimator is a biased estimator of the mean of correlated Multivariate Normal Distribution.

Let . The James–Stein estimator is defined as

The Least Squares Estimator is the MLE and the UMVUE. However, in terms of Mean Squared Error, the James-Stein estimator is better than the least squares estimator for cases.

Link to original

Ridge Regression

Ridge Regression

Definition

where is a complexity parameter that controls the amount of shrinkage.

Ridge regression is particularly useful to mitigate the problem of Multicollinearity in linear regression

Facts

where by Singular Value Decomposition
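A minimal NumPy sketch of the closed-form ridge estimator on deliberately near-collinear data (all names illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(7)
X = rng.standard_normal((40, 3))
X[:, 2] = X[:, 1] + 0.01 * rng.standard_normal(40)   # near-collinear columns
y = X @ np.array([1.0, 1.0, 0.0]) + rng.standard_normal(40)

print(ridge(X, y, lam=0.0))    # ~OLS: unstable under multicollinearity
print(ridge(X, y, lam=10.0))   # shrunken, more stable coefficients
```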

Link to original

Principal Component Regression

Principal Component Analysis

Definition

PCA is a linear dimensionality reduction technique. The correlated variables are linearly transformed onto a new coordinate system such that the directions capturing the largest variance in the data.

Population Version

Given a random vector , we find a such that is maximized: Equivalently, by the Method of Lagrange Multipliers with , By differentiation, is given by the eigenvalue problem Thus the that maximizes the variance of is the eigenvector corresponding to the largest Eigenvalue.

Sample Version

Given a data matrix , by Singular Value Decomposition, A matrix can be factorized as . By algebra, , where we call the -th principal component.
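A minimal NumPy sketch of the sample version via SVD (the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((100, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.3, 1.0, 0.0],
                                              [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)             # center the data first

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                  # principal components (equivalently U * d)

# variance captured by each component: d_i^2 / (n - 1), in descending order
explained_var = d**2 / (len(X) - 1)
assert np.allclose(scores, U * d)
```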

Facts

Since and

Link to original

Principal Component Regression

Definition

where is a matrix whose columns are . In PCR, instead of regressing the dependent variables on the explanatory variables directly, principal components (PCs) of the explanatory variables are used as regressors. One typically uses only the first few PCs for regression, making PCR a kind of regularized procedure.

Facts

Since are linear combinations of the original , we can express the solution in terms of coefficients of the .

Link to original

PLS: Partial Least Squares

Partial Least Squares Regression

Definition

where is the sample covariance matrix.

Unlike PCR maximizing only, PLS finds directions maximizing both and

Link to original

LASSO

Lasso Regression

Definition

The lasso model assumes that the coefficients of the model are sparse.

Link to original

Bayes Estimator and Biased Estimators

Bayes Theorem

Definition

Discrete Case

where by Law of Total Probability

is called a prior probability, and is called a posterior probability

Continuous Case

where is not a constant but an unknown parameter that follows a certain distribution with a parameter .

is called a prior probability, is called a likelihood, is called an evidence or marginal likelihood, and is called a posterior probability

Parameter-Centric Notation

Examples

Consider a random variable follows Binomial Distribution and a prior distribution follows Beta Distribution where .

The PDFs are defined as and . Then, by Bayes' theorem, . Under the Squared Error Loss, the Bayes Estimator is the mean of the posterior distribution.

Link to original

Loss Function

Definition

Let be a parameter, be a Statistic for the parameter , and be a Decision Function

A loss function is a non-negative function defined as

It indicates the difference or discrepancy between and

Examples

Link to original

Risk Function

Definition

Risk function is an expectation of Loss Function

Link to original

Bayes Risk

Definition

$$\begin{aligned} r(\theta, \delta) &= \int_{\Theta} R(\theta, \delta) \pi(\theta) d\theta = E_\theta[R(\theta, \delta)] = E_\theta[E_{x}[L(\theta, \delta(x))]]\\ &= \int_{\Theta} \int_{X} L(\theta, \delta(x))p(x|\theta) \pi(\theta) dx d\theta = \int_{X} \int_{\Theta} L(\theta, \delta(x))p(\theta|x)p(x) d\theta dx\\ &= \int_{X}E_\theta[L(\theta, \delta(x))|X=x]p(x)dx = \int_{X}\rho(x, \pi)p(x)dx \end{aligned}$$

where $L(\theta, \delta)$ is a [[Loss Function]], $R(\theta, \delta)$ is a [[Risk Function]], and $\rho(x, \pi)$ is a [[Posterior Risk]].

Link to original

Bayes Estimator

Definition

Estimator that minimizes the Bayes Risk or Posterior Risk

Facts

Under Squared Error Loss, the Bayes estimator is a posterior mean, and a posterior mode under Absolute Error Loss.

Consider a regression model and the prior distribution for , . Then, the Bayes estimator under the Squared Error Loss is obtained as . If for some and , then the Bayes estimator is the same as the ridge estimator. If and , then the Bayes estimator is the James-Stein regression estimator.

Link to original

Empirical Bayes Estimator

Definition

Empirical Bayes estimator is a Bayes Estimator whose prior distribution is estimated from the data.

The parameter of the prior distribution is estimated by Maximum Likelihood Estimation, and the posterior distribution is calculated with the . The estimator is obtained using the posterior distribution.

Examples

Consider data following a Poisson Distribution and a prior distribution following a Gamma Distribution , where with known and unknown.

Then the marginal likelihood is defined as

$$\begin{aligned} p(x_{i}|\beta) &= \int \operatorname{Pois}(x|\lambda_{i})\Gamma(\lambda_{i};\alpha, \beta) d\lambda_{i} = \int \left[ \frac{e^{-\lambda_{i}}\lambda^{x_{i}}}{x_{i}!} \right]\left[ \frac{\beta^{\alpha}\lambda_{i}^{\alpha-1}e^{-\beta \lambda_{i}}}{\Gamma(\alpha)} \right]d\lambda_{i}\\ &= \binom{x_{i}+\alpha-1}{\alpha-1}\left( \frac{\beta}{\beta+1} \right)^{\alpha} \left( \frac{1}{\beta+1} \right)^{x_{i}} \sim NB\left( \alpha, \frac{\beta}{\beta+1} \right) \end{aligned}$$

And the [[Maximum Likelihood Estimation|MLE]] of $\beta$, $\hat{\beta}_\text{MLE}$, is $$\hat{\beta}_\text{MLE} = \underset{\beta}{\operatorname{argmax}} \prod_{i=1}^{n}p(x_{i}|\beta) = \frac{\alpha}{\bar{X}}$$

The posterior distribution with $\hat{\beta}_\text{MLE}$ is defined as $$p(\lambda_{i}|x; \hat{\beta}_{\text{MLE}}) \propto p(x|\lambda_{i})\pi(\lambda_{i};\alpha, \hat{\beta}) \sim \Gamma(x_{i}+\alpha, 1+\hat{\beta})$$

Under the [[Squared Error Loss]], the [[Bayes Estimator]] is the mean of the posterior distribution $\Gamma(x_{i}+\alpha, 1+\hat{\beta})$. $$\hat{\delta}_{\text{Bayes}} = \frac{\bar{X}(X_{i}+\alpha)}{\bar{X} + \alpha}$$

Facts

Assume that $\mathbf{z} \sim N_{p}(\boldsymbol{\mu}, \mathbf{I})$ and the prior distribution for $\boldsymbol{\mu}$ is $\boldsymbol{\mu} \sim N_{p}(\mathbf{0}, \sigma^{2}\mathbf{I})$. Then, the empirical Bayes estimator under the [[Squared Error Loss]] is the [[James-Stein Estimator]] $\hat{\boldsymbol{\mu}}_{JS} = \left( 1 - \cfrac{p-2}{\mathbf{z}^{\intercal}\mathbf{z}} \right)\mathbf{z}$

Link to original

Generalized Linear Model

Exponential Family

Exponential Family

Definition

General Representation

Where:

  • is the vector of natural parameters.
  • is the vector of sufficient statistics.
  • is the non-negative volume (base measure) of .
  • is the normalizer, which makes the density integrate to one with respect to the measure of .

Canonical Form with Dispersion Parameter

where , , and are known functions, is called a variance function, and is called a dispersion parameter. This form is mainly used for GLMs.

Examples

| Distribution | Normal | Poisson | Binomial | Gamma | Inverse Gaussian |
| --- | --- | --- | --- | --- | --- |
| Notation | | | | | |
| Natural link | identity | log | logit | inverse | $1/\mu^{2}$ |

Facts

For an exponential family, and hold by the Bartlett Identities

Link to original

Bartlett Identities

Definition

First Bartlett Identity

where is a Likelihood Function and is a Score Function

Second Bartlett Identity

where is a Likelihood Function

Link to original

Score Function

Definition

The gradient of the log-likelihood function with respect to the parameter vector. The score indicates the steepness of the log-likelihood function

Facts

The score will vanish at a local Extremum

Link to original

Fisher Information

Definition

Fisher Information

$$\begin{aligned} I(\theta) &:= E\left[ \left( \frac{\partial \ln f(X|\theta)}{\partial \theta} \right)^{2} \right] = \int_{\mathbb{R}} \left( \frac{\partial \ln f(x|\theta)}{\partial \theta} \right)^{2} p(x, \theta)dx\\ &= -E\left[ \frac{\partial^{2} \ln f(X|\theta)}{\partial \theta^{2}} \right] = -\int_{\mathbb{R}} \frac{\partial^{2} \ln f(x|\theta)}{\partial \theta^{2}} p(x, \theta)dx \end{aligned}$$

By the second Bartlett Identity,

$$I(\theta) = \operatorname{Var}\left( \frac{\partial \ln f(X|\theta)}{\partial \theta} \right) = \operatorname{Var}(s(\theta|x))$$

where $s(\theta|x)$ is a Score Function.

Fisher Information Matrix

Let $\mathbf{X}$ be a Random Vector with PDF $f(x|\boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \Omega \subset \mathbb{R}^{p}$. Then the Fisher information matrix for $\boldsymbol{\theta}$ is a $p \times p$ matrix defined as

$$I(\boldsymbol{\theta}) := \operatorname{Cov}\left( \cfrac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta}) \right) = E\left[ \left( \cfrac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta}) \right) \left( \cfrac{\partial}{\partial \boldsymbol{\theta}} \ln f(x|\boldsymbol{\theta}) \right)^\intercal \right] = -E\left[ \cfrac{\partial^{2}}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\intercal} \ln f(x|\boldsymbol{\theta}) \right]$$

and the $jk$-th element of $I(\boldsymbol{\theta})$ is $I_{jk} = - E\left[ \cfrac{\partial^{2}}{\partial \theta_{j} \partial\theta_{k}} \ln f(x|\boldsymbol{\theta}) \right]$

Properties

Chain Rule

The information in a length-$n$ Random Sample $X_{1}, X_{2}, \dots, X_{n}$ is $n$ times the information in a single sample $X_{i}$: $I_\mathbf{X}(\theta) = n I_{X_{1}}(\theta)$

Facts

In a location model, the information does not depend on the location parameter.

$$I(\theta) = \int_{-\infty}^{\infty}\left( \frac{f'(z)}{f(z)} \right)^{2} f(z)dz$$

Link to original

Construction of GLMs

Generalized Linear Model

Definition

A generalized linear model (GLM) is a generalization of linear regression model.

The GLM consists of three elements.

  1. Response variable: the i.i.d. response variables follow an Exponential Family.
  2. Linear predictor: is called the linear predictor.
  3. Link function: there exists a link function satisfying , where is assumed to be monotone and differentiable.

Typical link functions for a binomial response include the logit, probit, and complementary log-log links.

For an Exponential Family, the link function satisfying is called natural link.

Facts

The parameter is estimated by MLE. Since the score equations are non-linear, there is generally no closed-form solution, so the Newton–Raphson method or Fisher's Scoring Method is used.

Link to original

Estimation of Regression Coefficients

Newton's Method

Definition

An iterative algorithm for finding the roots of a differentiable function, i.e. the solutions of the equation

Algorithm

Find the next point at which the first-order Taylor approximation at the current point equals 0. First-order Taylor approximation: The point where the approximation equals 0:

multivariate version:

In convex optimization,

Find the minimum point^[its derivative is 0] of the Taylor quadratic approximation. Taylor quadratic approximation: The derivative of the quadratic approximation: The minimum point of the quadratic approximation^[the point such that the derivative of the quadratic approximation is 0]: multivariate version:

Examples

Solution of a linear system

Solve with an MSE loss

The cost function is and its gradient and hessian are ,

Then, the solution is . If is invertible, is the Least Square solution.
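A minimal NumPy sketch of Newton's method on this quadratic cost (the function name and convergence rule are my own). Because the Hessian $A^{\intercal}A$ is constant, the first step already lands on the least squares solution; the loop is kept only to show the general iteration.

```python
import numpy as np

def newton_least_squares(A, b, x0, max_iter=50, tol=1e-10):
    """Newton's method on f(x) = 0.5 * ||Ax - b||^2.
    Gradient: A.T @ (A @ x - b); Hessian: A.T @ A (constant)."""
    x = np.asarray(x0, dtype=float)
    H = A.T @ A                        # Hessian of the quadratic cost
    for _ in range(max_iter):
        g = A.T @ (A @ x - b)          # gradient at the current iterate
        step = np.linalg.solve(H, g)   # Newton step: H^{-1} g
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```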

Link to original

Fisher's Scoring Method

Definition

Fisher’s scoring method is a variation of Newton–Raphson method that uses Fisher Information instead of the Hessian Matrix

where is the Fisher Information.
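For the canonical link, the expected (Fisher) information equals the negative expected Hessian, so Fisher scoring coincides with Newton–Raphson (IRLS). A minimal NumPy sketch for logistic regression (names and stopping rule are my own):

```python
import numpy as np

def fisher_scoring_logistic(X, y, max_iter=25, tol=1e-8):
    """Fisher scoring for logistic regression (canonical logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))  # fitted probabilities
        w = mu * (1.0 - mu)                   # variance function values
        score = X.T @ (y - mu)                # score function s(beta)
        info = X.T @ (X * w[:, None])         # Fisher information X' W X
        step = np.linalg.solve(info, score)   # scoring step I^{-1} s
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta
```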

Link to original

Goodness-of-Fit Measures for GLMs

Deviance

Definition

Deviance is a goodness-of-fit statistic for Generalized Linear Model.

Let be an estimator of under the maximal (saturated) model . The deviance is defined as $D = 2\{\ell(\hat{\mu}_{\max}; y) - \ell(\hat{\mu}; y)\}$ where $\ell$ is the log-likelihood, is the number of parameters of the current model, and is the sample size.

The null hypothesis : the current model is adequate, can be tested with the deviance. We reject if

Examples

| Distribution | Deviance |
| --- | --- |
| Normal | $\sum_{i}(y_{i}-\hat{\mu}_{i})^{2}$ |
| Poisson | $2\sum_{i}\left[ y_{i}\ln(y_{i}/\hat{\mu}_{i}) - (y_{i}-\hat{\mu}_{i}) \right]$ |
| Binomial | $2\sum_{i}\left[ y_{i}\ln\cfrac{y_{i}}{\hat{\mu}_{i}} + (n_{i}-y_{i})\ln\cfrac{n_{i}-y_{i}}{n_{i}-\hat{\mu}_{i}} \right]$ |
| Gamma | $2\sum_{i}\left[ -\ln(y_{i}/\hat{\mu}_{i}) + (y_{i}-\hat{\mu}_{i})/\hat{\mu}_{i} \right]$ |
| Inverse Gaussian | $\sum_{i}(y_{i}-\hat{\mu}_{i})^{2}/(\hat{\mu}_{i}^{2}y_{i})$ |
Link to original

Pearson's Chi-squared Statistic

Definition

The Pearson statistic is a goodness-of-fit measure for a GLM defined as where , is the variance function, is the number of parameters of the current model, and is the sample size.

Facts

Under the Gaussian distribution, the Deviance and the Pearson statistic are the same and follow a Chi-squared Distribution

Link to original

Testing and Residuals

Goodness-of-Fit Test with Deviance

Definition

Assume a current model with -parameters and consider a null hypothesis . Then, the test Statistic is defined as where are the Deviances of the generalized linear models.

We reject if

Link to original

Pearson Residual

Definition

The Pearson residual is defined as where , is the variance function.

Facts

where is the Pearson’s Chi-squared Statistic

Link to original

Anscombe Residual

Definition

Anscombe Transformation

where is the variance function.

Anscombe Residual

The Anscombe residual is the Anscombe-transformed Pearson Residual. The transformation makes the residual approximately follow a normal distribution. where , is the variance function.

Link to original

Deviance Residual

Definition

The deviance residual is defined as where , is the variance function, and is the Deviance

Link to original

ANOVA Models

Constraints for Dummy Variable

Definition

When using Dummy Variables, the design matrix , where is the number of groups, does not have full rank. The problem can be solved by imposing constraints.

Sum-to-Zero Constraint

Sum-to-zero constraint uses row-wise demeaned dummy variables.

With sum-to-zero constraint, each coefficient indicates discrepancy from overall mean, and the intercept term indicates overall mean.

Corner-Point Constraint

Corner-point constraint omits a base group from the model.

With corner-point constraint, each coefficient indicates discrepancy from base group, and the intercept term indicates mean of base group.

Link to original

Deviance Test for One-way ANOVA

The null hypothesis , i.e. there is no treatment effect, can be tested with the Deviance. If is known, the test statistic is defined as and we reject if

If is unknown, the test statistic is defined as and we reject if

Link to original

Deviance Test for Two-way ANOVA

For a two-way ANOVA with factors and , we have three null hypotheses to test:

  • i.e. there is no treatment effect of factor
  • i.e. there is no treatment effect of factor
  • i.e. there is no interaction effect between factors and

They can be tested with the Deviance.

If is known, the test statistic for is defined as and we reject if

If is unknown, the test statistic for is defined as and we reject if

If is known, the test statistic for is defined as and we reject if

If is unknown, the test statistic for is defined as and we reject if

If is known, the test statistic for is defined as and we reject if

If is unknown, the test statistic for is defined as and we reject if

Link to original

Logistic Regression Model

Logistic Regression

Definition

Logistic regression is a Generalized Linear Model with Logit link function. Logistic regression models the log-odds of an event as a linear combination of independent variables. The probability is calculated as

Link to original

Weighted Least Squares for Binomial Distribution

Definition

Let and be random variables and consider a variance stabilizing transformation . Then, by Taylor expansion and the Delta Method, and hold. Now, the Pearson statistic is defined as . By minimizing it, we can find the WLS estimator . It can be seen as a Weighted Least Squares with , , and

Widely Used Variance Stabilizing Transformations

  • Logit function:
  • The arcsin squared-root function:
  • Empirical logistic function:
Link to original

Multinomial Logistic Regression

Definition

Multinomial logistic regression is used for polychotomous (multi-class) data.

Suppose the number of classes is . The probability is calculated as

Link to original

Proportional Odds Model

Definition

The proportional odds model is a Generalized Linear Model used for modeling the ordinal response.

Consider the response having possible categories. And the cumulative probability of a response is .

Link to original

Log-Linear Model

Contingency Table

Definition

contingency table

A contingency table is a table that displays the multivariate frequency distribution of the variables, with row and column marginal totals.

Examples

Poisson and Multinomial Distribution Cases

2-Dimensional Case

Consider a contingency table.

No Margin Fixed

If ‘s are independent and , then the joint PDF becomes

Fixed Total

Given the total , the conditional distribution is Multinomial Distribution. where and

One Margin Fixed

Given one fixed margin (in this case, row) , the conditional distribution of frequency of -th row is Multinomial Distribution. And if each row is independent, the conditional distribution is product-Multinomial Distribution (joint multinomial distribution).

3-Dimensional Case

Consider a contingency table.

No Margin Fixed

If ‘s are independent and , then the joint PDF becomes

Fixed Total

Given the total , the conditional distribution is Multinomial Distribution. where and

One Margin Fixed

Given one fixed margin (in this case, ) , and each -th rows are independent, the conditional distribution is product-Multinomial Distribution (joint multinomial distribution).

Two Margins Fixed

If two margins are fixed (in this case, and ) , and each -th and -th rows are independent, the conditional distribution is product-Multinomial Distribution (joint multinomial distribution).

Link to original

Poisson Regression

Definition

Poisson Regression

The log-linear model is a Generalized Linear Model with log link function. It models the log of the expected counts of an event as a linear combination of independent variables.

Log-Linear Model for Contingency Table

Consider a Contingency Table, then the model is defined as where is the overall effect, and are main effect, and are interaction effect.

The Deviance of the model is defined as and the Pearson’s Chi-squared Statistic is defined as

Link to original

Overdispersion

Overdispersion

Definition

Overdispersion is the presence of greater observed variance than what would be expected under a given statistical model. It can happen due to heterogeneity or lack of independence between trials, and clustering or grouping in the data.

Overdispersion on Binomial Distribution

Assume that there are clusters and the number of observations in each cluster is . Let the Random Variable following a Binomial Distribution be the number of successes out of observations, where is also a Random Variable with mean and variance . Also, let be the total number of successes in all clusters, where . Then, the mean and variance of are defined as Here, is called a dispersion parameter. If , then overdispersion has occurred.

The Lexis Ratio is also used for detecting overdispersion where and

Overdispersion on Poisson Distribution

If in the Poisson Distribution setting, then overdispersion might occur. Let i.i.d. be the number of occurrences at the -th cluster, where the number of clusters is also a Random Variable. Also, let be the total number of occurrences in clusters, where follows a Poisson Distribution and is independent of . Then, the mean and variance of are defined as If , then overdispersion occurs.

Link to original

Nonlinear Regression Model

Nonlinear Regression Model

Nonlinear Regression Model

Definition

Let be a response and be -dimensional explanatory variable. The non-linear regression model can be written as where is a known regression function and is a -dimensional parameter vector.

The parameter vector cannot be obtained analytically, so it is estimated numerically with methods such as the Newton–Raphson method or the Gauss-Newton Method.

Facts

If the function is linear, then the model is Multiple Linear Regression.

Link to original

Estimation of Parameter

Gauss-Newton Method

Definition

Gauss-Newton method is an iterative algorithm used to solve non-linear least squares problems. This method approximates the Hessian Matrix using the Jacobian Matrix.

Algorithm

Consider a non-linear regression model with -dimensional explanatory variable and -dimensional parameter vector The first order Taylor expansion gives the linear approximation of the model. where is the initial vector for given by domain knowledge, is the estimation vector with , and is the Jacobian Matrix of the function at .

Now, the approximated model is a Multiple Linear Regression. Since and are constants given , we can find the LSE of . The updating formula is defined as The update is repeated iteratively until it converges; converges to the minimizer of the Sum of Squared Errors Loss given a proper initial value .
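A minimal NumPy sketch of the iteration (the function names and the exponential-decay test model are my own assumptions):

```python
import numpy as np

def gauss_newton(f, jac, x, y, beta0, max_iter=50, tol=1e-8):
    """Gauss-Newton for nonlinear least squares.
    f(x, beta) returns fitted values; jac(x, beta) returns the n x p
    Jacobian of f in beta. The Hessian is approximated by J'J."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = y - f(x, beta)                          # current residuals
        J = jac(x, beta)                            # Jacobian at beta
        step = np.linalg.solve(J.T @ J, J.T @ r)    # LSE of the linearized model
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Example model: f(x; beta) = beta0 * exp(-beta1 * x)
f = lambda x, b: b[0] * np.exp(-b[1] * x)
jac = lambda x, b: np.column_stack([np.exp(-b[1] * x),
                                    -b[0] * x * np.exp(-b[1] * x)])
```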

Link to original

Inference in the Nonlinear Model

Distribution of Parameter of Nonlinear Regression

Definition

Suppose a Nonlinear Regression Model If the error term follows a Normal Distribution, LSE is the same as MLE.

For a large sample size, the distribution of the estimated parameter follows a Normal Distribution where is the Jacobian Matrix of and is the variance of the error term.

The variance of the error term can be estimated by

Link to original

Standard Error for Parameter of Nonlinear Regression

Definition

Suppose a Nonlinear Regression Model The standard error for parameter of nonlinear regression is calculated as where , , and is the Jacobian Matrix of .

Link to original

Joint Confidence Region for Nonlinear Regression

Definition

Suppose a Nonlinear Regression Model The joint confidence region for is obtained as where , is the number of explanatory variables, and is the Jacobian Matrix of .

Link to original

Confidence Interval for Parameter of Nonlinear Regression

Definition

Suppose a Nonlinear Regression Model The confidence interval for is obtained as where is the Standard Error for Parameter of Nonlinear Regression , and is the number of explanatory variables.

Link to original

Hypothesis testing of Parameter of Nonlinear Regression

Definition

Suppose a Nonlinear Regression Model The null hypothesis can be tested with the test statistic where is the Standard Error for Parameter of Nonlinear Regression , and is the number of explanatory variables.

Link to original

Non-Parametric Regression

Kernel Estimator

K-Nearest Neighbors Algorithm

Definition

Suppose is the distance between the training sample and the given point . Let

The k-NN determines a sample's class using the -nearest training data.

Facts

and are hyperparameters.
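A minimal sketch of k-NN classification (names are my own; Euclidean distance and majority voting are assumed):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)  # distances to all training samples
    nearest = np.argsort(d)[:k]              # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```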

Link to original

Kernel Density Estimation

Definition

where is the kernel, is the scaled kernel, and is a smoothing parameter (bandwidth, or window width)

KDE is a non-parametric method to estimate the PDF of a random variable based on kernels and weights.
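A minimal 1-D sketch with a Gaussian kernel (names are my own):

```python
import numpy as np

def kde(x_grid, data, h):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h), Gaussian K."""
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel values
    return K.sum(axis=1) / (len(data) * h)
```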

Multidimensional KDE

where is the kernel, is the scaled kernel, and is a positive-definite bandwidth matrix.

Facts

Widely used kernels

Link to original

Kernel Regression

Definition

Kernel regression is an extension of Kernel Density Estimation to estimate the conditional expectation of a random variable given . The kernel regression estimator is defined as where is the kernel, is the scaled kernel, and is a smoothing parameter

Link to original

Nadaraya–Watson Kernel Regression

Definition

Nadaraya–Watson kernel regression is a Kernel Regression whose weights sum to . The Nadaraya–Watson kernel regression estimator is defined as where is the kernel and is a smoothing parameter
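A minimal sketch (names are my own; a Gaussian kernel is assumed). Dividing by the row sums makes the weights at each point sum to one, which is the defining feature of the estimator:

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """m_hat(x): kernel-weighted average of Y, weights summing to one."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # unnormalized Gaussian kernel
    return (K * Y[None, :]).sum(axis=1) / K.sum(axis=1)
```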

Facts

Nadaraya–Watson kernel regression estimator is the same as Local Polynomial Regression with

Link to original

Local Polynomial Regression

Definition

Local polynomial regression fits the polynomial function with the data near the given point. The polynomial function has a form similar to Taylor Series to exploit the relation between the coefficients and the differentiation at the given point.

The -th degree polynomial function at a point is defined as where

Local polynomial regression estimator is obtained by the relation . The coefficients are estimated by an optimization problem where , is a kernel function, and is a smoothing parameter.

In a matrix notation, the optimization problem can be written as where , is an matrix, and

A solution of the problem is obtained as where is a -vector whose -th element is and the others are .
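A minimal local-linear sketch of this weighted least squares problem (names are my own; a Gaussian kernel is assumed). The first coefficient plays the role of $e_{1}^{\intercal}\hat{\boldsymbol{\beta}}$, the fitted value at $x_{0}$:

```python
import numpy as np

def local_polynomial(x0, X, Y, h, degree=1):
    """Fit Y on powers of (X - x0) by kernel-weighted least squares;
    the intercept estimates m(x0)."""
    B = np.vander(X - x0, degree + 1, increasing=True)  # 1, (X-x0), (X-x0)^2, ...
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)              # kernel weights K_h
    WB = B * w[:, None]
    coef = np.linalg.solve(B.T @ WB, WB.T @ Y)          # weighted normal equations
    return coef[0]                                      # e_1' beta_hat
```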

Link to original

Series Estimator

Series Estimator

Definition

Series Estimator

Suppose a non-parametric regression function where and let a sequence of functions be a complete orthonormal basis. Then, the function can be written as a series. where is a parameter subject to estimation.

The regression function can be approximated with the estimated parameter and finite sum. It is called a series estimator where is a smoothing parameter.

The series estimator can be written as a form of a kernel estimator with the estimation of the parameter . where is a kernel.

Thresholding

In general, as becomes larger, becomes smaller. So the terms of the series can be selected with a thresholding algorithm.

The hard-thresholding estimator is defined as where is a thresholding parameter

And the soft-thresholding estimator is defined as

Link to original

Multiresolution Analysis

Definition

Multiresolution analysis decomposes a signal into a coarse approximation and a series of details at different scales. where the first term represents the coarse approximation and the second term represents the sum of details. Here, is a scaling function (father wavelet), is a wavelet function (mother wavelet), is approximation coefficient, is a detail coefficient.

The set is an orthonormal basis of the function space.

Link to original

Wavelet Estimator

Definition

Wavelet estimator is a kind of Series Estimator using a wavelet basis. where and

To find an optimal resolution, Thresholding is used.

Link to original

Spline Estimator

Piecewise Polynomials

Definition

Suppose that knots are . Then, the basis functions for order-M spline are defined as: where

Examples

For cubic Polynomials with two knots , the basis functions are:

Facts

Order of piecewise polynomials

  • constant: 1
  • linear: 2
  • quadratic: 3
  • cubic: 4

Link to original

Natural Cubic Splines

Definition

Suppose that knots are . Then, the basis functions for natural cubic Spline are defined as: where

Link to original

B-Spline

Definition

B-splines are basis functions for spline functions of the same order, meaning that every spline function of that order can be expressed as a linear combination of B-splines.

Consider the knots sorted into non-decreasing order. For the given sequence of knots, the B-splines of degree are defined as And the higher-degree B-splines are defined by the recursion

The order spline function on a given set of knots can be expressed as
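A minimal sketch of the Cox–de Boor recursion above (names are my own; the half-open interval convention at degree 0 is assumed):

```python
def bspline_basis(i, k, t, x):
    """B_{i,k}(x) for knot vector t: degree 0 is an indicator of [t_i, t_{i+1}),
    higher degrees are built by the recursion with the 0/0 := 0 convention."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = 0.0 if t[i + k] == t[i] else \
        (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(i, k - 1, t, x)
    right = 0.0 if t[i + k + 1] == t[i + 1] else \
        (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * bspline_basis(i + 1, k - 1, t, x)
    return left + right
```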

Link to original

Smoothing Splines

Definition

where is a smoothing parameter. If , the smoothing spline is any interpolating spline. If , the smoothing spline is Least Square line.

Smoothing splines are a spline basis method that avoids the knot selection problem: among all functions , find the one that minimizes the penalized RSS.

Facts

The unique analytic solution of smoothing spline is a Natural Cubic Splines with knots at the unique values of the ‘s

Link to original

Regression Model for Censored Data

Survival Function and Hazard Function

Survival Function

Definition

Let  be a non-negative Random Variable representing survival time with PDF  and CDF . The survival function of the random variable is defined as

The survival function is a function that gives the probability that a patient, device, or other object of interest will survive past a certain time.

Link to original

Hazard Function

Definition

Hazard Function

Let  be a non-negative Random Variable representing survival time with PDF  and CDF . The hazard function of the random variable is defined as where is the Survival Function.

The hazard function gives the instantaneous rate at which the event occurs at a given time .

Cumulative Hazard Function

The hazard function can alternatively be represented in terms of the cumulative hazard function, defined as

Facts

The Survival Function , the Cumulative Hazard Function, the density (PDF) , the Hazard Function, and the distribution function (CDF) of survival time are related through
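Written out, these are the standard identities (with $f$, $F$, $S$, $h$, and $\Lambda$ as defined above):

$$h(t) = \frac{f(t)}{S(t)}, \qquad \Lambda(t) = \int_{0}^{t} h(u)du = -\ln S(t), \qquad S(t) = e^{-\Lambda(t)} = 1 - F(t), \qquad f(t) = h(t)e^{-\Lambda(t)}$$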

Link to original

Censored Data

Right Censoring

Kinds

Type 1 Censoring

Definition

Type 1 Censoring

when  (’s are constant)

Suppose that are i.i.d. random variables representing survival times with CDF , and are the censoring times. And let be the censoring indicator. We observe where .

In a type 1 censoring setting, the censored times are fixed constants, not random variables.

The PDF of the observation is derived as

Likelihood of Type 1 Censoring Data

The Likelihood Function of type 1 censoring data is defined as

Link to original

Type 2 Censoring

Definition

Type 2 Censoring

when

Suppose that are i.i.d. random variables representing survival times with CDF , and are the censoring times. And let be the censoring indicator. We observe where .

In a type 2 censoring setting, we observe the first out of experiments. In other words, for the order statistics of , we only observe . Here, are not constants, but random variables.

Likelihood of Type 2 Censoring Data

The likelihood function of type 2 censored data can be computed using the same equation used for type 1 censored data but computing the joint-PDF of the Order Statistic is easier.

The joint PDF of  is derived as

Link to original

Random Censoring

Definition

Random Censoring

Suppose that are i.i.d. random variables representing survival times with CDF , and are the censoring times, which can be either random variables or constants. And let be the censoring indicator.

In a random censoring setting, we only observe where , and the censoring times are i.i.d. Random Variables following PDF and CDF .

The PDF of the observation is derived as where is the parameter of interest, and is the nuisance parameter.

Likelihood of Random Censoring Data

The Likelihood Function of random censored data is defined as where is a constant.

Link to original

Link to original

Left Censoring

Definition

Suppose that are i.i.d. random variables representing survival times, and are the censoring times, which can be either random variables or constants. And let be the censoring indicator.

In a left censoring setting, we observe where . In other words, the event has already occurred before the subject comes under observation.

Link to original

Interval-Censored Data

Definition

Suppose that are i.i.d. random variables representing survival times. Interval-censored data is given as an interval, not an exact point of time: we only observe an interval that includes . Interval-censored data is divided into four cases.

Case 1 Interval-Censored Data (Current Status Data)

Data is given in the form of or , where is a fixed time point.

Case 2 Interval-Censored Data

Data is given in the form of where .

Double Censored Data

Data is given in the form of where .

If , then it is right-censored data, if , then it is left-censored data, and if , then it is uncensored data.

Panel Data

Observations are made at discrete time points. The period between these observations can be viewed as an interval.

Link to original

Mean Imputation Method

Definition

Mean imputation method substitutes the given interval of case 2 interval-censored data with the mean of the interval if the interval is finite. If , the data is substituted with the left value and treated as right-censored data.

Link to original

Estimation of Survival Function

Kaplan-Meier Estimator

Definition

Kaplan-Meier Estimator

Consider a Random Censoring case . Assume that , and the distinct failure times are where . Let be the number of deaths at , and be the number at risk (alive) at , where the set is called the risk set at . We only observe

The Kaplan-Meier estimator is derived from the expression where

General Case

As an estimator of , consider

The Kaplan-Meier estimator is defined with the estimated ‘s The cumulative hazard function is estimated by Nelson-Aalen Estimator in the same logic

No Ties Case

When there are no ties among the observations, , the failure times are equal to the observations , the death indicator is equal to the censoring indicator , and . Thus, the Kaplan-Meier estimator is defined as
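A minimal NumPy sketch of the product-limit computation (names are my own), which also accumulates the Greenwood variance used in the Asymptotic Normality section below:

```python
import numpy as np

def kaplan_meier(time, event):
    """time: observed times Z_i; event: 1 for a death, 0 for censoring.
    Returns distinct death times, S_hat(t), and the Greenwood variance."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    t_out, s_out, v_out = [], [], []
    s, gsum = 1.0, 0.0
    for t in np.unique(time[event == 1]):
        n_i = np.sum(time >= t)                    # size of the risk set at t
        d_i = np.sum((time == t) & (event == 1))   # number of deaths at t
        s *= 1.0 - d_i / n_i                       # product-limit update
        if n_i > d_i:
            gsum += d_i / (n_i * (n_i - d_i))      # Greenwood accumulator
        t_out.append(t); s_out.append(s); v_out.append(s * s * gsum)
    return np.array(t_out), np.array(s_out), np.array(v_out)
```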

Properties

Self-Consistency

An estimator is self-consistent if where

The Kaplan-Meier estimator is the unique self-consistent estimator for where is the largest observation.

Generalized MLE

The Kaplan-Meier estimator gives the Generalized Maximum Likelihood Estimation of the Survival Function .

Strong Consistency

The Kaplan-Meier estimator uniformly Almost Surely converges to

Proof

Consider a function and decompose it into the sum of the subsurvival functions and , where is the uncensored case and is the censored case.

Then, the survival function can be expressed as a function of the subsurvival functions.

Define the empirical subsurvival functions and . The Kaplan-Meier estimator can also be expressed as a function of the empirical subsurvival functions.

By Glivenko-Cantelli theorem, and for all . Since is a continuous function of and ,

Asymptotic Normality

Kaplan-Meier estimator has asymptotic normality. where , , and .

The variance of the estimator is estimated by Greenwood's formula . For the no-ties case, the formula is

Examples

case where

Facts

Kaplan-Meier estimator has Self-Consistency and Asymptotic Normality, and it is generalized MLE

If no censoring, Kaplan-Meier estimator is just the Empirical Survival Function.

Link to original

Kernel Estimation for Survival Analysis

Definition

The Kaplan-Meier estimator is a step function, so it is difficult to calculate its quantile function and Density Function. Kernel Density Estimation is used to turn it into a smooth function.

Let be i.i.d. survival time with a distribution , and be i.i.d. censoring time with a distribution . We can observe where and is censoring indicator.

Without Censoring

For the complete data, the Kernel Density Estimation is defined as where is the kernel, is the scaled kernel, and is a smoothing parameter.

The kernel estimator for the Distribution Function is defined as where

With Censoring

For the censored data, the weight for each observation is defined as the jump size in the Kaplan-Meier Estimator, where is the jump size at in the Kaplan-Meier Estimator.

Thus, the Kernel Density Estimation for the Survival Function is

Link to original

Cox Regression Model

Cox Proportional Hazards Model

Definition

The Cox proportional hazards model assumes that the covariates act multiplicatively on the Hazard Function.

Let be i.i.d. survival time, and be i.i.d. censoring time. We can observe , where and is censoring indicator, and have covariates . Then, the Cox proportional hazards model is defined as where is called the baseline hazard function, i.e. hazard at

Conditional Likelihood

Let (no ties case), and be the risk set. For each uncensored time , Therefore, Taking the product of these conditional probabilities gives a conditional likelihood where is the indicator set for uncensored samples.

The is not a true likelihood. However, Cox suggested treating the conditional likelihood as an ordinary likelihood to find the maximum likelihood estimate.

Since there is no analytic solution for the MLE, iterative methods such as the Newton–Raphson method are used to estimate the coefficient .
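A minimal NumPy sketch of the negative log conditional likelihood for the no-ties case (names are my own); it can be handed to a generic optimizer, e.g. scipy.optimize.minimize, in place of hand-coded Newton iterations:

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Each uncensored time t_i contributes
    exp(x_i' beta) / sum over the risk set of exp(x_j' beta)."""
    X, time, event = map(np.asarray, (X, time, event))
    eta = X @ beta
    nll = 0.0
    for i in np.where(event == 1)[0]:
        risk = time >= time[i]                           # risk set R(t_i)
        nll -= eta[i] - np.log(np.exp(eta[risk]).sum())  # log of i-th factor
    return nll
```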

The hazard ratio represents the relative change in Hazard Rate for a one-unit increase in the covariate .

Goodness-of-Fit Test

For testing the null hypothesis , Cox suggested the Rao Test.

Asymptotic Normality of MLE

where is the observed Fisher Information

Estimation of Survival Function

Under the Cox proportional hazards model, . To estimate , we can use for , but we still need to estimate , , or .

Breslow suggested the estimators of and as

If , then is the Kaplan-Meier Estimator

It has a few drawbacks

  • can take negative values.

Tsiatis suggested a non-negative version of where

Link suggested using the linear smooth of .

Discrete or Grouped Data

When data is discrete or grouped, there are ties at each failure. Denote the ordered discrete failure time by and let be the risk set at , be the death set at , and .

Cox suggested combining all possible permutations; however, it is computationally infeasible. where , and is the size- subset of

Peto suggested an alternative likelihood that, instead of enumerating all possible permutations, uses the same contribution for each.

Time Dependent Covariates

In this case, the covariate depends on time. We observe and the conditional likelihood is defined as where is the indicator set for uncensored samples, and is the risk set.

Facts

Any two individuals have hazard functions that are constant multiples of one another.

The Survival Function of the Cox proportional hazard model is a family of Lehmann alternatives. where .

If , , where is the indicator set for sample 1, and there are no ties, then Cox test is exactly equal to the Mantel-Haenszel Test.

Link to original

Linear Regression Model

Accelerated Life Model

Definition

Consider the random variable representing survival time with Hazard Function , with , , and . Assume that the survival time of an individual with covariate is defined as . If , then the covariate accelerates the time to failure. The model based on this assumption is called the accelerated failure time (AFT) model.

Under AFT model

Let , then where .

Assume that , where is a Random Variable representing the error term. Then, the AFT model becomes

Relationship between and . where

If then , and if , then where .

Link to original

Miller Estimator

Definition

Let be i.i.d. survival time, and be i.i.d. censoring time. We can observe , where and is censoring indicator

Suppose a Simple Linear Regression model

With no censoring present, the least squares estimators of the parameters are obtained by minimizing , where is the Empirical Distribution Function of and .

With censoring present, Miller proposed to minimize , where is the Kaplan-Meier Estimator based on and the weights are its jump sizes.

If the last observation is censored, then . Hence, change the last observation to be uncensored, so that .

Link to original

Buckley-James Estimator

Definition

Let be i.i.d. survival time, and be i.i.d. censoring time. We can observe , where and is censoring indicator

If we could observe the true survival times , we could fit the model . However, we observe only the censored , and Buckley and James proposed an Unbiased Estimator for . Since we also cannot observe , we estimate it again, where , is the Kaplan-Meier Estimator based on , and the weights are its jump sizes.

The variance of the estimator is estimated by where , , , and

Link to original