Preliminaries

Matrix Algebra

Orthonormal Matrix
Definition

$Q^{⊺} Q = Q Q^{⊺} = I$ and $Q^{⊺} = Q^{- 1}$. A real square matrix whose columns and rows are orthonormal vectors.

Examples

Rotation Matrix

Permutation Matrix

Facts

Every orthonormal matrix is a Unitary Matrix.

For a vector $y = x_{1} q_{1} + x_{2} q_{2} + \dots + x_{n} q_{n}$, expressed as a linear combination of the column vectors of the orthonormal matrix $Q$, the coefficients follow directly from inner products: $x_{i} = q_{i}^{⊺} y$. In matrix form, $y = Qx \Rightarrow Q^{⊺} y = Q^{⊺} Q x = I x = x$.
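As a small illustration of recovering coefficients by inner products, the sketch below uses a 2-D rotation matrix as the orthonormal matrix. The angle, the coefficients, and the helper `dot` are assumptions chosen for the example.

```python
import math

# Hypothetical example: Q is a 2-D rotation matrix, so its columns are orthonormal.
theta = math.pi / 6
q1 = (math.cos(theta), math.sin(theta))    # first column of Q
q2 = (-math.sin(theta), math.cos(theta))   # second column of Q

def dot(u, v):
    """Inner product of two vectors given as tuples."""
    return sum(a * b for a, b in zip(u, v))

# Build y = x1*q1 + x2*q2 from known coefficients.
x_true = (2.0, -1.5)
y = tuple(x_true[0] * a + x_true[1] * b for a, b in zip(q1, q2))

# Recover the coefficients with inner products only: x_i = q_i^T y.
x_rec = (dot(q1, y), dot(q2, y))
print(x_rec)
```

No linear system has to be solved: each coefficient costs one inner product.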


Eigendecomposition
Definition

$Av = λ v$ where $A$ is an $n \times n$ matrix and $v \neq 0$

If $Av$ is a scalar multiple of a non-zero vector $v$, then the scalar $λ$ is an eigenvalue of $A$ and $v$ is a corresponding eigenvector.

Characteristic Polynomial

$∣ A - λ I ∣ = 0$. The values $λ_{i}$ that satisfy this characteristic equation are the eigenvalues of the matrix $A$.

Eigenspace

$E = \{v : (A - λ I) v = 0\}$

The set of all eigenvectors of $A$ corresponding to the same eigenvalue, together with the zero vector.

The Kernel of the matrix $(A - λ I)$

Eigenvector: non-zero vector in the eigenspace

Algebraic Multiplicity

$μ_{A} (λ_{i})$

Let $λ_{i}$ be an eigenvalue of an $n \times n$ matrix $A$. The algebraic multiplicity $μ_{A} (λ_{i})$ of the eigenvalue is its multiplicity as a root of the Characteristic Polynomial, that is, the largest $k$ such that $(λ - λ_{i})^{k}$ divides the characteristic polynomial.

Geometric Multiplicity

$γ_{A} (λ_{i})$ Let $λ_{i}$ be an eigenvalue of an $n \times n$ matrix $A$ . The geometric multiplicity $γ_{A} (λ_{i})$ of the eigenvalue is the dimension of the Eigenspace associated with the eigenvalue.

Computation

Find the solutions (the eigenvalues) of the Characteristic Polynomial equation.

Find the solutions (the eigenvectors) of the Under-Constrained System $(A - λ_{i} I) v_{i} = 0$ for each eigenvalue found.
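The two computation steps can be sketched for the $2 \times 2$ case, where the characteristic polynomial is the quadratic $λ^{2} - tr (A) λ + ∣ A ∣ = 0$. The function name `eig_2x2` and the example matrix are illustrative assumptions; the sketch assumes real eigenvalues and a non-zero upper-right entry.

```python
import math

def eig_2x2(a, b, c, d):
    """Eigenpairs of [[a, b], [c, d]], assuming real eigenvalues and b != 0."""
    tr, det = a + d, a * d - b * c
    # Step 1: solve the characteristic polynomial lambda^2 - tr*lambda + det = 0.
    disc = math.sqrt(tr * tr - 4 * det)
    lams = [(tr + disc) / 2, (tr - disc) / 2]
    # Step 2: (A - lambda I) v = 0 is under-constrained; v = (b, lambda - a) is one solution.
    vecs = [(b, lam - a) for lam in lams]
    return lams, vecs

lams, vecs = eig_2x2(2.0, 1.0, 1.0, 2.0)
for lam, (v1, v2) in zip(lams, vecs):
    Av = (2.0 * v1 + 1.0 * v2, 1.0 * v1 + 2.0 * v2)
    print(lam, (v1, v2), Av)  # Av should equal lam * v
```

Any non-zero multiple of each returned vector is also an eigenvector, which is why the system is under-constrained.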

Facts

There exists at least one eigenvector corresponding to the eigenvalue $λ$

Eigenvectors corresponding to different eigenvalues are always linearly independent.

When $A$ is a normal or real Symmetric Matrix, the eigendecomposition is called Spectral Decomposition

$tr (A) = \sum_{i = 1}^{n} λ_{i}$ The Trace of the matrix is equal to the sum of the eigenvalues of the matrix.

Proof Write the characteristic polynomial as $det (λ I - A)$, which has the same roots as $det (A - λ I)$. Since $λ$ appears only on the diagonal of $λ I - A$, and every term of the determinant other than the product of the diagonal entries omits at least two diagonal entries, the coefficient of $λ^{n - 1}$ comes only from $(λ - a_{11}) (λ - a_{22}) \dots (λ - a_{nn})$ and is $- (a_{11} + a_{22} + \dots + a_{nn})$. Also, since the $λ_{i}$ are the roots of the Characteristic Polynomial, it factorizes as $det (λ I - A) = (λ - λ_{1}) (λ - λ_{2}) \dots (λ - λ_{n})$, whose coefficient of $λ^{n - 1}$ is $- (λ_{1} + λ_{2} + \dots + λ_{n})$. Therefore, $- (a_{11} + a_{22} + \dots + a_{nn}) = - (λ_{1} + λ_{2} + \dots + λ_{n})$ and $tr (A) = \sum_{i = 1}^{n} λ_{i}$.

$∣ A ∣ = \prod_{i = 1}^{n} λ_{i}$ The Determinant of the matrix is equal to the product of the eigenvalues of the matrix.

Not all matrices have $n$ linearly independent eigenvectors.

When $Av = λ v$ holds,

the eigenvalues of $A^{- 1}$ are $\frac{1}{λ}$ (provided $A$ is invertible), and the eigenvectors of $A^{- 1}$ are the same as those of $A$.

the eigenvalues of $A^{k}$ are $λ^{k}$

the eigenvalues of $c A$ are $c λ$

the eigenvalues of $A + c I$ are $λ + c$

the eigenvalues of $(A + c I)^{- 1}$ are $1/ (λ + c)$ (provided $λ + c \neq 0$)

$1 \leq γ_{A} (λ_{i}) \leq μ_{A} (λ_{i}) \leq n$

An eigenvalue’s Geometric Multiplicity cannot exceed its Algebraic Multiplicity

$\exists P s.t. A = P D P^{- 1} \Leftrightarrow \forall λ_{i} : γ_{A} (λ_{i}) = μ_{A} (λ_{i})$

the matrix is diagonalizable $\Leftrightarrow$ the Geometric Multiplicity is equal to the Algebraic Multiplicity for all eigenvalues

Let $A$ be a symmetric matrix. Then, $A$ has $r$ eigenvalues equal to $1$ and the rest zero $\Leftrightarrow A^{2} = A, rank (A) = r$ .


The non-zero eigenvalues of $A B$ are the same as those of $B A$ .


Projection Matrix
Definition

$P = A (A^{⊺} A)^{- 1} A^{⊺}$

For a vector $b$, $P b$ is the orthogonal projection of $b$ onto the column space of $A$ (assuming $A$ has full column rank, so that $A^{⊺} A$ is invertible).

Facts

The projection matrix $P$ is a symmetric Idempotent Matrix.

Consider a Symmetric Matrix $P$ , then $P$ is idempotent with rank $r$ if and only if $r$ eigenvalues are $1$ and $n - r$ eigenvalues are $0$ .

$tr (P) = rank (P)$

The projection matrix is a Positive Semi-Definite Matrix.

If $P_{1}$ and $P_{2}$ are projection matrices and $P_{1} - P_{2}$ is a Positive Semi-Definite Matrix, then

$P_{1} P_{2} = P_{2} P_{1} = P_{2}$

$P_{1} - P_{2}$ is a projection matrix.


QR Decomposition
Definition

$A = QR$. The decomposition of a matrix $A$ into the product of an orthonormal matrix $Q$ and an upper triangular matrix $R$.

Computation

Using the Gram Schmidt Orthonormalization

QR decomposition can be computed by Gram Schmidt Orthonormalization, where $Q = [q_{1}, \dots, q_{n}]$ is the matrix of orthonormal column vectors obtained by orthonormalizing the columns $a_{1}, \dots, a_{n}$ of $A$, and $R = \begin{bmatrix} q_{1}^{⊺} a_{1} & q_{1}^{⊺} a_{2} & q_{1}^{⊺} a_{3} & \dots & q_{1}^{⊺} a_{n} \\ 0 & q_{2}^{⊺} a_{2} & q_{2}^{⊺} a_{3} & \dots & q_{2}^{⊺} a_{n} \\ 0 & 0 & q_{3}^{⊺} a_{3} & \dots & q_{3}^{⊺} a_{n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & q_{n}^{⊺} a_{n} \end{bmatrix}$
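A minimal sketch of QR by classical Gram Schmidt Orthonormalization follows, assuming the columns of $A$ are linearly independent. The function name `gram_schmidt_qr` and the example columns are hypothetical, and this variant is not the numerically most stable one.

```python
import math

def gram_schmidt_qr(cols):
    """QR of a matrix given as a list of linearly independent columns.

    Returns (q_cols, R) with q_cols orthonormal and R[i][j] = q_i^T a_j for i <= j.
    """
    n = len(cols)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(cols):
        v = list(a)
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qk * ak for qk, ak in zip(q, a))    # q_i^T a_j
            v = [vk - R[i][j] * qk for vk, qk in zip(v, q)]   # remove the component along q_i
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))          # norm of the residual, equal to q_j^T a_j
        q_cols.append([vk / R[j][j] for vk in v])
    return q_cols, R

a_cols = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
q_cols, R = gram_schmidt_qr(a_cols)
```

Each column $a_{j}$ can be rebuilt as $\sum_{i \leq j} R_{ij} q_{i}$, which is the statement $A = QR$ read column by column.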

Cholesky Decomposition
Definition

$A = L L^{†}$

Decomposition of a Positive-Definite Matrix into the product of lower triangular matrix and its Conjugate Transpose.

Computation

Let $A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$ be a real Positive-Definite Matrix. Then, $A = L L^{⊺} = \begin{bmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{bmatrix} \begin{bmatrix} L_{11} & L_{21} & L_{31} \\ 0 & L_{22} & L_{32} \\ 0 & 0 & L_{33} \end{bmatrix} = \begin{bmatrix} L_{11}^{2} & & (symmetric) \\ L_{21} L_{11} & L_{21}^{2} + L_{22}^{2} & \\ L_{31} L_{11} & L_{31} L_{21} + L_{32} L_{22} & L_{31}^{2} + L_{32}^{2} + L_{33}^{2} \end{bmatrix}$

Starting from the first-row first-column entry $L_{11} = \sqrt{a_{11}}$, the other entries are found by substitution: $L_{21} = a_{21} / L_{11}$, $L_{31} = a_{31} / L_{11}$, $L_{22} = \sqrt{a_{22} - L_{21}^{2}}$, $L_{32} = (a_{32} - L_{31} L_{21}) / L_{22}$, $L_{33} = \sqrt{a_{33} - L_{31}^{2} - L_{32}^{2}}$

In general, $L_{ii} = \sqrt{a_{ii} - \sum_{k = 1}^{i - 1} L_{ik}^{2}}$ and, for $i > j$, $L_{ij} = \frac{1}{L_{jj}} (a_{ij} - \sum_{k = 1}^{j - 1} L_{ik} L_{jk})$
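The substitution scheme can be sketched in a few lines, using the standard Cholesky recurrences with a square root on each diagonal entry. The function name `cholesky` and the example matrix are assumptions for illustration; the input is assumed to be real, symmetric, and positive definite.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T, for a real symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))   # sum over already-computed entries
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)           # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]          # below-diagonal entry
    return L

A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

Entries are filled row by row, so every value on the right-hand side of a recurrence is already available when it is needed.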

Spectral Theorem
Definition

$A = A^{†} \Leftrightarrow A = Q Λ Q^{†}$ where $Q$ is a Unitary Matrix, and $Λ = diag (λ_{1}, λ_{2}, \dots, λ_{n})$ is real

A matrix $A$ is a Hermitian Matrix if and only if $A$ is a Unitary Diagonalizable Matrix with real eigenvalues.

Facts

Every Hermitian Matrix is diagonalizable, and has real-valued Eigenvalues and orthonormal eigenvector matrix.

For the Hermitian Matrix $A^{†} = A$, every eigenvalue is real.

Proof For an eigenpair $Ax = λ x$ with $x \neq 0$, $x^{†} Ax = x^{†} λ x = λ x^{†} x = λ \| x \|^{2} \Rightarrow λ = \frac{x^{†} Ax}{\| x \|^{2}}$. Since $A$ is Hermitian, $(x^{†} Ax)^{†} = x^{†} A^{†} x = x^{†} Ax$, so $x^{†} Ax \in R$, and the squared norm is always real. Therefore, every eigenvalue is real.

For the Hermitian Matrix $A^{†} = A$ , the eigenvectors from different eigenvalues are orthogonal.

Proof Let $A x_{1} = λ_{1} x_{1}$ and $A x_{2} = λ_{2} x_{2}$ where $λ_{1} \neq λ_{2}$. By the property of a Hermitian matrix, $(A x_{1})^{†} x_{2} = x_{1}^{†} A^{†} x_{2} = x_{1}^{†} A x_{2} = x_{1}^{†} λ_{2} x_{2} = λ_{2} x_{1}^{†} x_{2}$ and, because the eigenvalues are real, $(A x_{1})^{†} x_{2} = (λ_{1} x_{1})^{†} x_{2} = \bar{λ}_{1} x_{1}^{†} x_{2} = λ_{1} x_{1}^{†} x_{2}$. So, $λ_{2} x_{1}^{†} x_{2} = λ_{1} x_{1}^{†} x_{2} \Rightarrow (λ_{1} - λ_{2}) x_{1}^{†} x_{2} = 0$. Since $λ_{1} \neq λ_{2}$, $x_{1}^{†} x_{2} = 0$. Therefore, $x_{1}$ and $x_{2}$ are orthogonal.


Singular Value Decomposition
Definition

$A_{n \times m} = U_{n \times n} D_{n \times m} (V_{m \times m})^{⊺} = \sum_{i = 1}^{\min (n, m)} d_{i} U_{i} V_{i}^{⊺}$, where $U_{i}$ and $V_{i}$ are the $i$-th columns of $U$ and $V$. An arbitrary matrix $A_{n \times m}$ can be decomposed as $UD V^{⊺}$.

For $n \geq m$ $A_{n \times m} = U_{n \times m} D_{m \times m} (V_{m \times m})^{⊺}$ where $U^{⊺} U = I_{m}$ and $V^{⊺} V = V V^{⊺} = I_{m}$ and $D$ is Diagonal Matrix

For $n \leq m$ $A_{n \times m} = U_{n \times n} D_{n \times n} (V_{m \times n})^{⊺}$ where $U^{⊺} U = U U^{⊺} = I_{n}$ and $V^{⊺} V = I_{n}$ and $D$ is Diagonal Matrix

Calculation

For matrix $A_{m \times n} = UD V^{⊺}$

$V_{n \times n}$ is the matrix of orthonormal eigenvectors of $A^{⊺} A$.

$U_{m \times m}$ is the matrix of orthonormal eigenvectors of $A A^{⊺}$.

$D_{m \times n}$ is the rectangular Diagonal Matrix whose diagonal holds the square roots of the non-zero eigenvalues of $A^{⊺} A$ (equivalently, of $A A^{⊺}$), sorted in descending order.

If the eigendecomposition is $A^{⊺} Ax = λ x$, then the eigenvalues satisfy $λ \geq 0$ and the eigenvectors can be chosen orthonormal by the Spectral Theorem; those with $λ > 0$ lie in $C (A^{⊺})$. If $rank (A) = r \leq n$, then $λ_{1} \geq λ_{2} \geq \dots \geq λ_{r} > 0$ and $λ_{r + 1} = \dots = λ_{n} = 0$. The values $σ_{i} = \sqrt{λ_{i}}$, $i = 1, 2, \dots, r$, are called the singular values.

Now, the Orthonormal Matrix $V = [V_{1}, V_{2}]$ is assembled from these eigenvectors, where $V_{1} = [v_{1}, v_{2}, \dots, v_{r}]$ holds the eigenvectors corresponding to the non-zero eigenvalues and $V_{2} = [v_{r + 1}, v_{r + 2}, \dots, v_{n}]$ holds the eigenvectors corresponding to the zero eigenvalues, with each $v_{r + i} \in N (A^{⊺} A) = N (A)$, and $D_{m \times n} = diag (σ_{1}, σ_{2}, \dots, σ_{r}, 0, \dots, 0)$ is a rectangular Diagonal Matrix.

Since $V$ is an orthonormal matrix and $D$ is a Diagonal Matrix, $AV = UD \Rightarrow A [v_{1}, v_{2}, \dots, v_{n}] = [u_{1}, u_{2}, \dots, u_{m}] D \Rightarrow A v_{i} = σ_{i} u_{i} \Rightarrow u_{i} = \frac{1}{σ_{i}} A v_{i}$ for $i = 1, \dots, r$. Now, the Orthonormal Matrix $U = [U_{1}, U_{2}]$ is calculated from the relation $u_{i} = \frac{1}{σ_{i}} A v_{i}$ and the left null space $N (A^{⊺})$, where $U_{1} = [u_{1}, u_{2}, \dots, u_{r}]$ holds the vectors $u_{1}, \dots, u_{r} \in C (A)$ obtained from $u_{i} = \frac{1}{σ_{i}} A v_{i}$, and $U_{2} = [u_{r + 1}, u_{r + 2}, \dots, u_{m}]$ holds orthonormal vectors $u_{r + 1}, \dots, u_{m} \in N (A^{⊺})$, which correspond to the zero eigenvalues.

The Orthonormal Matrix $U = [u_{1}, u_{2}, \dots, u_{m}]$ can also be formed by the eigenvectors of $A A^{⊺}$ similarly to calculating of $V$ .
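The calculation can be checked numerically with NumPy (an assumed dependency). The sketch builds $V$ and the singular values from the eigendecomposition of $A^{⊺} A$ and then obtains $U_{1}$ from $u_{i} = \frac{1}{σ_{i}} A v_{i}$. The example matrix is hypothetical, and only the reduced factors for the non-zero singular values are formed.

```python
import numpy as np

A = np.array([[3.0, 0.0], [4.0, 5.0], [0.0, 0.0]])   # hypothetical 3x2 example with rank 2

# V and the singular values come from the eigendecomposition of A^T A.
lam, V = np.linalg.eigh(A.T @ A)          # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]             # re-sort in descending order
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))  # sigma_i = sqrt(lambda_i)

# u_i = A v_i / sigma_i for the non-zero singular values (this gives U_1 only).
r = int(np.sum(sigma > 1e-12))
U1 = A @ V[:, :r] / sigma[:r]

A_rebuilt = U1 @ np.diag(sigma[:r]) @ V[:, :r].T
print(np.allclose(A_rebuilt, A))
```

In practice `np.linalg.svd` is preferable, because forming $A^{⊺} A$ squares the condition number; the sketch only mirrors the derivation.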

Facts

$U, V$ are Unitary Matrices

Let $A$ be a real symmetric Positive Semi-Definite Matrix. Then, the Eigendecomposition (Spectral Decomposition) of $A$ and the singular value decomposition of $A$ are equal.

$A = PΛ P^{⊺} = UD V^{⊺}$ where $P = U = V$, the entries of $Λ = D$ are non-negative, and every matrix is $n \times n$

Visualization

Every matrix can be decomposed into a sequence of three transformations, applied in this order:

$V^{⊺}$ : rotation and reflection

$D$ : scaling

$U$ : rotation and reflection


Non-Negative Matrix Factorization
Definition

$X_{n \times p} \approx W_{n \times r} H_{r \times p}$ where $r \leq max (n, p)$

Non-negative matrix factorization (NMF) is an algorithm in which a non-negative matrix $X$ is factorized into two matrices $W$ and $H$ that have no negative elements.

The matrices $W$ and $H$ are found by maximizing $\sum_{i = 1}^{n} \sum_{j = 1}^{p} [x_{ij} \log (WH)_{ij} - (WH)_{ij}]$

Model Assessment and Selection

Bias-Variance Decomposition
Definition

Assume that $Y = f (X) + ϵ$ where $E (ϵ) = 0$ and $Var (ϵ) = σ_{ϵ}^{2}$. Then, under squared error loss, $\begin{aligned} Err (x_{0}) &= E [(Y - \hat{f} (x_{0}))^{2} ∣ X = x_{0}] \\ &= σ_{ϵ}^{2} + Bias^{2} (\hat{f} (x_{0})) + Var (\hat{f} (x_{0})) \\ &= \text{Irreducible error} + \text{Bias}^{2} + \text{Variance} \end{aligned}$

Mallow's Cp
Definition

$C_{p} = \frac{SSE_{p}}{s^{2}} - (n - 2 p)$ or, in the rescaled form that ranks models identically, $C_{p} = \overline{err} + \frac{2 p}{n} \hat{σ}^{2}$, where $p - 1$ is the number of explanatory variables, $s^{2}$ (or $\hat{σ}^{2}$) is the sample variance under the full model, and $\overline{err} = \frac{RSS}{n}$

Mallow’s $C_{p}$ is used to assess the fit of a regression model. A small value of $C_{p}$ means that the model is relatively precise.

Akaike Information Criterion
Definition

$AIC = - \frac{2}{n} \sum_{i = 1}^{n} \ln f (X_{i} ∣ \hat{θ}) + \frac{2 p}{n}$ where $\hat{θ}$ is the MLE of the postulated model’s parameter $θ$

Bayesian Information Criterion
Definition

$BIC = \frac{SSE_{p}}{n σ^{2}} + \frac{p \ln n}{n}$

Facts

BIC penalizes model complexity more heavily than AIC whenever $\ln n > 2$, and the difference in penalty grows with $p$.

Vapnik–Chervonenkis Dimension
Definition

Vapnik–Chervonenkis (VC) dimension is a measure of the size of a class of sets (or indicator functions). The VC dimension of the class $\{f (x, α)\}$ is the largest number of points that can be shattered by members of $\{f (x, α)\}$, that is, points in some configuration for which a perfect classifier can be found for every possible labeling.

Examples

The VC dimension of linear indicator functions in the plane is 3.

$sin (αx)$ has infinite VC dimension.


Cross Validation
Definition

Partition a set of data into $K$ sets $I_{j}, j = 1, 2, \dots, K$ and denote a function $κ : \{1, 2, \dots, n\} \to \{1, 2, \dots, K\}$ such that $κ (i) = j \Leftrightarrow i \in I_{j}$. For the dataset $\{(X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots, (X_{n}, Y_{n})\}$, let $\hat{f}_{- j}$ be the estimator based on all observations except those in $I_{j}$. Then the cross validation estimate of the prediction error is defined as $\hat{PE}_{CV} (\hat{f}) = \frac{1}{n} \sum_{i = 1}^{n} L (Y_{i}, \hat{f}_{- κ (i)} (X_{i}))$ where $L$ is a loss function

Facts

If the data is partitioned into $k$ groups of equal size, then it is called $k$-fold cross validation.


Leave-One-Out-Cross-Validation
Definition

$LOOCV (\hat{f}) = \frac{1}{n} \sum_{i = 1}^{n} L (Y_{i}, \hat{f}_{- i} (X_{i}))$. Cross Validation with $K = n$.

Generalized Cross Validation
A convenient approximation to Leave-One-Out-Cross-Validation for a linear model $\hat{y} = Sy$ under squared error loss: $GCV (\hat{f}) = \frac{1}{n} \sum_{i = 1}^{n} [\frac{y_{i} - \hat{f} (x_{i})}{1 - tr (S) / n}]^{2} \approx LOOCV (\hat{f})$
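The sketch below compares brute-force leave-one-out cross validation with GCV for ordinary least squares under squared error loss, using NumPy (an assumed dependency) and synthetic data. It also uses the known identity for linear smoothers that the leave-one-out residual equals $(y_{i} - \hat{f} (x_{i})) / (1 - S_{ii})$; GCV replaces each $S_{ii}$ by the average $tr (S) / n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])                  # design matrix with intercept
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(n)      # synthetic data (assumption)

S = X @ np.linalg.inv(X.T @ X) @ X.T                  # smoother (hat) matrix, y_hat = S y
resid = y - S @ y

# Brute-force LOOCV: refit without observation i, then predict it.
loo_errors = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_errors.append((y[i] - X[i] @ beta) ** 2)
loocv = float(np.mean(loo_errors))

# Shortcut using the diagonal of S, and the GCV approximation using tr(S) / n.
loocv_shortcut = float(np.mean((resid / (1.0 - np.diag(S))) ** 2))
gcv = float(np.mean((resid / (1.0 - np.trace(S) / n)) ** 2))
print(loocv, loocv_shortcut, gcv)
```

GCV needs only one fit and the trace of $S$, which is why it is convenient when $n$ refits are expensive.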

Maximum Likelihood Methods

Maximum Likelihood Estimation
Definition

MLE is the method of estimating the parameters of an assumed Distribution

Let $X_{1}, X_{2}, \dots, X_{n}$ be a Random Sample with PDF $f (x ∣ θ)$, where $θ \in Ω$. Then the MLE $\hat{θ}_{MLE}$ of $θ$ is $\hat{θ}_{MLE} = \underset{θ}{\operatorname{argmax}} L (θ ∣ x)$

Regularity Conditions

R0: The pdfs are distinct, i.e. $θ \neq θ^{'} \Rightarrow f (x_{i} ∣ θ) \neq f (x_{i} ∣ θ^{'})$

R1: The pdfs have same supports $\forall θ$

R2: The true value $θ_{0}$ is an interior point in $Ω$

R3: The pdf $f (x ∣ θ)$ is twice differentiable with respect to $θ \in Ω$

R4: $\frac{\partial^{2}}{\partial θ^{2}} \int f (x ∣ θ) d x = \int \frac{\partial^{2}}{\partial θ^{2}} f (x ∣ θ) d x$

R5: The pdf $f (x ∣ θ)$ is three times differentiable with respect to $θ \in Ω$, and there exist a constant $c > 0$ and a function $M (x)$ with $E_{θ_{0}} [M (X)] < \infty$ such that $\left| \frac{\partial^{3}}{\partial θ^{3}} \ln f (x ∣ θ) \right| \leq M (x)$ for all $∣ θ - θ_{0} ∣ < c$ and all $x$ in the support

Properties

Functional Invariance

If $\hat{θ}$ is the MLE for $θ$ , then $g (\hat{θ})$ is the MLE of $g (θ)$

Consistency

Under the R0 ~ R2 Regularity Conditions, let $θ_{0}$ be the true parameter and suppose $f (x ∣ θ)$ is differentiable with respect to $θ \in Ω$. Then the likelihood equation $\frac{\partial}{\partial θ} L (θ) = 0$ has a solution $\hat{θ}_{n}$ such that $\hat{θ}_{n} \overset{P}{\to} θ_{0}$

Asymptotic Normality

Under the R0 ~ R5 Regularity Conditions, let $X_{1}, X_{2}, \dots, X_{n}$ be a Random Sample with PDF $f (x ∣ θ)$, where $θ \in Ω$, let $\hat{θ}_{n}$ be a consistent Sequence of solutions of the MLE equation $\frac{\partial l (θ)}{\partial θ} = 0$, and let $0 < I (θ_{0}) < \infty$. Then $\sqrt{n} (\hat{θ}_{n} - θ_{0}) \overset{D}{\to} N (0, \frac{1}{I (θ_{0})})$ where $I (θ_{0})$ is the Fisher Information.

By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions

Asymptotic Confidence Interval

By the asymptotic normality of the MLE, $\sqrt{n I (\hat{θ})} (\hat{θ} - θ) \overset{D}{\to} N (0, 1)$. Thus, an approximate $100 (1 - α) \%$ confidence interval for $θ$ is $(\hat{θ} - z_{α /2} \frac{1}{\sqrt{n I (\hat{θ})}}, \hat{θ} + z_{α /2} \frac{1}{\sqrt{n I (\hat{θ})}})$
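As a worked instance of the asymptotic interval $\hat{θ} \pm z_{α /2} / \sqrt{n I (\hat{θ})}$, the sketch below uses a Bernoulli sample, for which the MLE is the sample proportion and the per-observation Fisher Information is $I (p) = 1 / (p (1 - p))$. The counts are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical Bernoulli sample: 62 successes in n = 200 trials.
n, successes = 200, 62
p_hat = successes / n                         # MLE of p

# Fisher information of one Bernoulli(p) observation, evaluated at the MLE.
fisher = 1.0 / (p_hat * (1.0 - p_hat))

alpha = 0.05
z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 quantile of N(0, 1)
half_width = z / math.sqrt(n * fisher)
ci = (p_hat - half_width, p_hat + half_width)
print(ci)
```

For this model the interval coincides with the familiar Wald interval $\hat{p} \pm z_{α /2} \sqrt{\hat{p} (1 - \hat{p}) / n}$.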

Delta method for MLE Estimator

Under the R0 ~ R5 Regularity Conditions, let $g$ be a function that is differentiable at $θ_{0}$ with $g^{'} (θ_{0}) \neq 0$. Then $\sqrt{n} (g (\hat{θ}_{n}) - g (θ_{0})) \overset{D}{\to} N (0, \frac{g^{'} (θ_{0})^{2}}{I (θ_{0})})$

Facts

Under the R0 and R1 regularity conditions, let $θ_{0}$ be the true parameter. Then $\forall θ \neq θ_{0}, \lim_{n \to \infty} P_{θ_{0}} [L (θ_{0}) > L (θ)] = 1$


The EM Algorithm

Expectation-Maximization Algorithm
Definition

Let $X = (X_{1}, X_{2}, \dots, X_{n})$ be an observed data, $Z = (Z_{1}, Z_{2}, \dots, Z_{n})$ be an unobserved (latent) variable, $X, Z$ are independent, $g (X ∣ θ)$ be a joint pdf of $X$ , $h (X, Z ∣ θ)$ be a joint pdf of $X, Z$ , $k (Z ∣ X, θ)$ be a conditional pdf of $Z$ given $X$

By the definition of a conditional pdf, we have the identity $k (Z ∣ X, θ) = \frac{h ( X , Z ∣ θ )}{g ( X ∣ θ )}$

The goal of the EM algorithm is maximizing the observed likelihood $L (θ ∣ X) = g (X ∣ θ)$ using the complete likelihood $L^{c} (θ ∣ X, Z) = h (X, Z ∣ θ)$ .

Using the definition of the conditional pdf, we derive the following identity for an arbitrary but fixed $θ_{0} \in Ω$:

$\begin{aligned} \ln L (θ ∣ X) &= \int \ln [g (X ∣ θ)] k (Z ∣ X, θ_{0}) d Z \\ &= \int \ln [h (X, Z ∣ θ)] k (Z ∣ X, θ_{0}) d Z - \int \ln [k (Z ∣ X, θ)] k (Z ∣ X, θ_{0}) d Z \\ &= E_{θ_{0}} [\ln L^{c} (θ ∣ X, Z) ∣ θ_{0}, X] - E_{θ_{0}} [\ln k (Z ∣ X, θ) ∣ θ_{0}, X] \end{aligned}$

Let the first term of the RHS be a quasi-likelihood function $Q (θ ∣ θ_{0}, X) := E_{θ_{0}} [\ln L^{c} (θ ∣ X, Z) ∣ θ_{0}, X]$. The EM algorithm maximizes $Q (θ ∣ θ_{0}, X)$ instead of maximizing $\ln L (θ ∣ X)$ directly.

Algorithm

1. Expectation Step: Compute $Q (θ ∣ \hat{θ}^{(m)}, X) := E_{\hat{θ}^{(m)}} [\ln L^{c} (θ ∣ X, Z) ∣ \hat{θ}^{(m)}, X]$ where $m = 0, 1, \dots$, and the expectation is taken under the conditional pdf $k (Z ∣ X, \hat{θ}^{(m)})$

2. Maximization Step: $\hat{θ}^{(m + 1)} = \underset{θ}{\operatorname{argmax}} Q (θ ∣ \hat{θ}^{(m)}, X)$

Properties

Convergence

The Sequence of estimates $\hat{θ}^{(m)}$ satisfies $L (\hat{θ}^{(m + 1)} ∣ X) \geq L (\hat{θ}^{(m)} ∣ X)$. Therefore the likelihood never decreases along the Sequence of EM estimates, which converges to an (at least local) optimum.
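To make the two steps concrete, here is a minimal sketch for a two-component Gaussian mixture with unit variances, where the latent $Z_{i}$ is the component label of $X_{i}$. The model, the data, the starting values, and the function names are all assumptions chosen for illustration; the closed-form M-step is specific to this model.

```python
import math

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(xs, w, mu1, mu2):
    """Observed-data log-likelihood ln L(theta | X) of the mixture."""
    return sum(math.log(w * normal_pdf(x, mu1) + (1 - w) * normal_pdf(x, mu2)) for x in xs)

def em_step(xs, w, mu1, mu2):
    # E-step: responsibilities P(Z_i = 1 | x_i, current theta).
    r = [w * normal_pdf(x, mu1) / (w * normal_pdf(x, mu1) + (1 - w) * normal_pdf(x, mu2))
         for x in xs]
    # M-step: closed-form maximizer of Q(theta | current theta, X) for this model.
    s = sum(r)
    w_new = s / len(xs)
    mu1_new = sum(ri * x for ri, x in zip(r, xs)) / s
    mu2_new = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - s)
    return w_new, mu1_new, mu2_new

xs = [-2.1, -1.8, -2.5, -1.6, -2.0, 2.9, 3.2, 3.4, 2.6, 3.0, 3.1]  # hypothetical data
theta = (0.5, -1.0, 1.0)               # initial (w, mu1, mu2)
history = [log_likelihood(xs, *theta)]
for _ in range(20):
    theta = em_step(xs, *theta)
    history.append(log_likelihood(xs, *theta))
print(theta)
```

The list `history` records the observed log-likelihood after every iteration, so the monotonicity property can be inspected directly.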

Bayesian Methods

Bayes Risk
Definition
$\begin{aligned} r (θ, δ) &= \int_{Θ} R (θ, δ) π (θ) d θ = E_{θ} [R (θ, δ)] = E_{θ} [E_{x} [L (θ, δ (x))]] \\ &= \int_{Θ} \int_{X} L (θ, δ (x)) p (x ∣ θ) π (θ) d x d θ = \int_{X} \int_{Θ} L (θ, δ (x)) p (θ ∣ x) p (x) d θ d x \\ &= \int_{X} E_{θ} [L (θ, δ (x)) ∣ X = x] p (x) d x = \int_{X} ρ (π, δ) p (x) d x \end{aligned}$

where $L (θ, δ)$ is the Loss Function, $R (θ, δ)$ is the Risk Function, and $ρ (π, δ)$ is the Posterior Risk given $X = x$.

Posterior Risk
Definition

$ρ (π, δ) = E_{θ} [L (θ, δ (x)) ∣ X = x] = \int L (θ, δ (x)) p (θ ∣ x) d θ$

where $L (θ, δ)$ is the Loss Function and $p (θ ∣ x)$ is the posterior density

Bayes Estimator
Definition

$\hat{δ}_{Bayes} = \underset{δ}{\operatorname{argmin}} \; r (θ, δ) = \underset{δ}{\operatorname{argmin}} \; ρ (π, δ)$

Estimator that minimizes the Bayes Risk $r (θ, δ)$ or Posterior Risk $ρ (π, δ)$

Facts

Under Squared Error Loss, the Bayes estimator is the posterior mean; under Absolute Error Loss, it is the posterior median.

Consider a regression model $y \sim N_{n} (X β, σ^{2} I)$ with the prior distribution $β \sim N_{p} (m, σ^{2} V)$. Then, the Bayes estimator under the Squared Error Loss is $\hat{β}_{Bayes} = (X^{⊺} X + V^{- 1})^{- 1} (V^{- 1} m + X^{⊺} y)$.

If $V = λ^{- 1} I_{p}$ for some $λ > 0$ and $m = 0$, then the Bayes estimator is the same as the ridge estimator: $\hat{β}_{Bayes} = \hat{β}_{ridge} = (X^{⊺} X + λ I)^{- 1} X^{⊺} y$.

If $V = λ^{- 1} (X^{⊺} X)^{- 1}$ and $m = 0$, then the Bayes estimator is the James-Stein regression estimator $\hat{β}_{Bayes} = (1 + λ)^{- 1} \hat{β}$, where $\hat{β} = (X^{⊺} X)^{- 1} X^{⊺} y$ is the least squares estimator.


Monte Carlo Integration
Definition

$\int_{a}^{b} h (x) d x = \int_{a}^{b} q (x) f (x) d x = E [q (X)] \approx \frac{1}{n} \sum_{i = 1}^{n} q (X_{i})$ where $h (x) := q (x) f (x)$, $f (x)$ is a PDF supported on $[a, b]$, and $X_{1}, \dots, X_{n}$ are i.i.d. draws from $f$
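A minimal sketch of the estimator, assuming the target integral is $\int_{0}^{1} x^{2} d x$ and taking $f$ to be the Uniform(0, 1) density so that $q (x) = h (x)$. The seed and sample size are arbitrary choices.

```python
import random

random.seed(0)

# Estimate the integral of h(x) = x^2 over [0, 1].
# Assumed choice: f is the Uniform(0, 1) pdf, so q(x) = h(x) / f(x) = x^2.
def q(x):
    return x * x

n = 100_000
estimate = sum(q(random.random()) for _ in range(n)) / n   # sample mean of q(X_i)
print(estimate)
```

The exact value is $1/3$, and the Monte Carlo error shrinks at the rate $1 / \sqrt{n}$.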

Gibbs Sampling
Definition

An MCMC algorithm for sampling from a specified Multivariate Distribution when direct sampling from the joint distribution is difficult, but sampling from the Conditional Distributions is more practical.

Algorithm

Begin with some initial value $X^{(0)} = (X_{1}^{(0)}, \dots, X_{n}^{(0)})$

At iteration $i$, generate $X^{(i + 1)}$ one component at a time: for $j = 1, \dots, n$, draw $X_{j}^{(i + 1)}$ from $p (X_{j} ∣ X_{1}^{(i + 1)}, \dots, X_{j - 1}^{(i + 1)}, X_{j + 1}^{(i)}, \dots, X_{n}^{(i)})$

Repeat the above step $k$ times.

Examples

Consider a bivariate random variable $(X, Y)$, and suppose we want to obtain samples from the marginals $f (x), f (y)$. Take samples alternately from the conditional distributions $x_{i} \sim f (x ∣ y = y_{i - 1})$ and $y_{i} \sim f (y ∣ x = x_{i})$. As $k \to \infty$, the samples $x_{k}, y_{k}$ are approximately samples from $f (x)$ and $f (y)$ respectively.
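The bivariate case can be sketched for a standard bivariate normal target with correlation $ρ$, for which both conditionals are known: $x ∣ y \sim N (ρ y, 1 - ρ^{2})$ and $y ∣ x \sim N (ρ x, 1 - ρ^{2})$. The target, $ρ = 0.8$, the seed, and the chain lengths are assumptions; each draw here conditions on the most recent value of the other coordinate.

```python
import math
import random

random.seed(0)
rho = 0.8                        # assumed correlation of the standard bivariate normal target
cond_sd = math.sqrt(1.0 - rho * rho)

x, y = 0.0, 0.0                  # initial value
burn_in, n_keep = 1_000, 20_000
xs, ys = [], []
for i in range(burn_in + n_keep):
    x = random.gauss(rho * y, cond_sd)   # x | y ~ N(rho * y, 1 - rho^2)
    y = random.gauss(rho * x, cond_sd)   # y | x ~ N(rho * x, 1 - rho^2), using the new x
    if i >= burn_in:                     # discard the burn-in period
        xs.append(x)
        ys.append(y)

mean_x = sum(xs) / n_keep
mean_y = sum(ys) / n_keep
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n_keep
var_x = sum((a - mean_x) ** 2 for a in xs) / n_keep
var_y = sum((b - mean_y) ** 2 for b in ys) / n_keep
corr = cov_xy / math.sqrt(var_x * var_y)
print(mean_x, var_x, corr)
```

The retained draws are correlated across iterations, so they carry less information than the same number of independent samples.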

Facts

It is common to discard some number of samples at the beginning (burn-in period).

Regression

Parametric Regression

Simple Linear Regression
Definition

$Y_{i} = β_{0} + β_{1} X_{i} + ϵ_{i}, i = 1, 2, \dots, n$ where the $ϵ_{i}$'s are i.i.d. error terms, with $E (ϵ_{i}) = 0$ and $Var (ϵ_{i}) = σ^{2}$

Multiple Linear Regression
Definition

$Y_{i} = β_{0} + β_{1} X_{i 1} + β_{2} X_{i 2} + \dots + β_{p - 1} X_{i, p - 1} + ϵ_{i}, i = 1, 2, \dots, n$ where the $ϵ_{i}$'s are i.i.d. error terms, with $E (ϵ_{i}) = 0$ and $Var (ϵ_{i}) = σ^{2}$

Matrix Notations

Let $β = (β_{0}, β_{1}, β_{2}, \dots, β_{p - 1})^{⊺}$ and $x_{i} = (1, X_{i 1}, X_{i 2}, \dots, X_{i, p - 1})^{⊺}$. Then the regression model can be expressed as $Y_{i} = x_{i}^{⊺} β + ϵ_{i}, i = 1, 2, \dots, n$

Let $y = [Y_{1}, Y_{2}, \dots, Y_{n}]^{⊺}$, $X = [x_{1}, x_{2}, \dots, x_{n}]^{⊺}$, $ϵ = [ϵ_{1}, ϵ_{2}, \dots, ϵ_{n}]^{⊺}$. Then the regression model can be expressed as $y = X β + ϵ$ where $E [ϵ] = 0$ and $Cov [ϵ] = σ^{2} I_{n}$

Ridge Regression
Definition

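$\begin{aligned} \hat{β}_{ridge} &= (X^{⊺} X + λ I)^{- 1} X^{⊺} y \\ &= \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) + λ β^{⊺} β \\ &= \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) \text{ subject to } β^{⊺} β \leq c \end{aligned}$ where $λ \geq 0$ is a complexity parameter that controls the amount of shrinkage, and each $λ$ corresponds to some bound $c$.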

Ridge regression is particularly useful to mitigate the problem of Multicollinearity in linear regression

Facts

$\hat{y}_{ridge} = X \hat{β}_{ridge} = X (X^{⊺} X + λ I)^{- 1} X^{⊺} y = \sum_{j = 1}^{p} u_{j} \frac{d_{j}^{2}}{d_{j}^{2} + λ} u_{j}^{⊺} y$ where $X = UD V^{⊺}$ by Singular Value Decomposition, $u_{j}$ is the $j$-th column of $U$, and $d_{j}$ is the $j$-th singular value
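The equality between the closed form and the SVD form of the ridge fit can be checked numerically. The sketch uses NumPy (an assumed dependency) with synthetic data; the sizes, coefficients, and $λ = 2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0                       # hypothetical sizes and penalty
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Closed form: (X^T X + lam I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_closed = X @ beta_ridge

# SVD form: sum_j u_j * d_j^2 / (d_j^2 + lam) * u_j^T y.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                 # shrinkage factor per singular direction
fit_svd = U @ (shrink * (U.T @ y))
print(np.allclose(fit_closed, fit_svd), shrink)
```

Directions with small $d_{j}$ get the smallest shrinkage factors, so ridge shrinks most along the low-variance directions of $X$.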


Lasso Regression
Definition

$\begin{aligned} \hat{β}_{lasso} &= \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) + λ \| β \|_{1} \\ &= \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) \text{ subject to } \| β \|_{1} \leq c \end{aligned}$

The lasso model assumes that the coefficients of the model are sparse.

Elastic Net Regularization
Definition

$\begin{aligned} \hat{β}_{elastic net} &= \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) + λ (α β^{⊺} β + (1 - α) \| β \|_{1}) \\ &= \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) \text{ subject to } α β^{⊺} β + (1 - α) \| β \|_{1} \leq c \end{aligned}$

Elastic net is a regularized regression method that linearly combines the penalties of the lasso and ridge regressions.

Principal Component Analysis
Definition

PCA is a linear dimensionality reduction technique. The correlated variables are linearly transformed onto a new coordinate system whose axes (the principal components) are the directions capturing the largest variance in the data.

Population Version

Given a random vector $x$ with covariance matrix $Σ$, we find a unit vector $α$ such that $Var (α^{⊺} x) = α^{⊺} Σ α$ is maximized: $\underset{α}{\operatorname{argmax}} \; Var (α^{⊺} x)$ s.t. $α^{⊺} α = 1$. Equivalently, by the Method of Lagrange Multipliers with the constraint $α^{⊺} α = 1$, $\underset{α}{\operatorname{argmax}} \; α^{⊺} Σ α - λ (α^{⊺} α - 1)$. Setting the derivative with respect to $α$ to zero gives the eigenvalue problem $Σ α = λ α$. At any such solution $Var (α^{⊺} x) = α^{⊺} Σ α = λ$, so the $α$ maximizing the variance of $α^{⊺} x$ is the eigenvector corresponding to the largest Eigenvalue.

Sample Version

Given a (column-centered) data matrix $X$, by Singular Value Decomposition, $X$ can be factorized as $X = UD V^{⊺}$. By algebra, $XV = UD =: Z \Rightarrow X v_{i} = d_{i} u_{i} =: z_{i}$, where we call $z_{i}$ the $i$-th principal component.

Facts

Since $Var (z_{i}) = Var (X v_{i}) = \frac{d_{i}^{2}}{n}$ and $d_{1} \geq d_{2} \geq \dots \geq d_{p} \geq 0$, we have $Var (z_{1}) \geq Var (z_{2}) \geq \dots \geq Var (z_{p}) \geq 0$
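The sample version can be sketched with NumPy (an assumed dependency): center the data, take the SVD, and form the principal components $z_{i} = X v_{i}$. The synthetic data and mixing matrix are arbitrary, and the variance is computed with divisor $n$ to match $d_{i}^{2} / n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
raw = rng.standard_normal((n, 3)) @ np.array([[2.0, 0.0, 0.0],
                                               [0.5, 1.0, 0.0],
                                               [0.0, 0.3, 0.2]])
X = raw - raw.mean(axis=0)            # center columns (assumed preprocessing)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                          # principal components z_i = X v_i = d_i u_i
var_z = Z.var(axis=0)                 # variance with divisor n
print(var_z, d**2 / n)
```

Keeping only the first few columns of `Z` gives the reduced representation of the data.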


Partial Least Squares Regression
Definition

$\hat{φ}_{m} = \underset{α}{\operatorname{argmax}} \; Corr^{2} (y, X α) Var (X α)$ subject to $\| α \| = 1, α^{⊺} S \hat{φ}_{l} = 0, l = 1, \dots, m - 1$ where $S = \frac{1}{n} X^{⊺} X$ is the sample covariance matrix.

Unlike PCR, which maximizes $Var (X α)$ only, PLS finds directions maximizing both $Var (X α)$ and $Corr^{2} (y, X α)$

Grouped Lasso

$\hat{β}_{grouped lasso} = \underset{β}{\operatorname{argmin}} \; (y - X β)^{⊺} (y - X β) + λ \sum_{l = 1}^{L} \sqrt{p_{l}} \| β_{l} \|_{2}$ where $L$ is the number of groups, $p_{l}$ is the number of variables within the group $l$, and $\| \cdot \|_{2}$ is the Euclidean Norm

In grouped lasso, the variables in each group share a regularization parameter.

Nonparametric Regression

Piecewise Polynomials
Definition

Suppose that the knots are $ξ_{j}, j = 1, \dots, K$. Then, the basis functions for an order-$M$ spline are defined as: $\begin{aligned} h_{j} (X) &= X^{j - 1}, & j &= 1, \dots, M \\ h_{M + l} (X) &= (X - ξ_{l})_{+}^{M - 1}, & l &= 1, \dots, K \end{aligned}$ where $(x)_{+} = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases}$

Examples

For cubic splines ($M = 4$) with two knots $ξ_{1}, ξ_{2}$, the basis functions are: $h_{1} (X) = 1, h_{2} (X) = X, h_{3} (X) = X^{2}, h_{4} (X) = X^{3}, h_{5} (X) = (X - ξ_{1})_{+}^{3}, h_{6} (X) = (X - ξ_{2})_{+}^{3}$
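The basis can be evaluated directly. The sketch below returns the $M + K$ basis function values at a single point; the function name `truncated_power_basis` and the knot locations are hypothetical, and the code assumes an order of at least 2.

```python
def truncated_power_basis(x, knots, order=4):
    """Evaluate the order-M truncated power basis at a scalar x (assumes order >= 2).

    Returns [x^0, ..., x^(M-1), (x - xi_1)_+^(M-1), ..., (x - xi_K)_+^(M-1)].
    """
    powers = [x ** j for j in range(order)]                            # global polynomial part
    truncated = [max(x - xi, 0.0) ** (order - 1) for xi in knots]      # one term per knot
    return powers + truncated

knots = [1.0, 2.0]                       # hypothetical knots xi_1, xi_2
row = truncated_power_basis(1.5, knots)  # cubic spline: M = 4, so 4 + 2 = 6 functions
print(row)
```

Stacking such rows for every observation gives a design matrix on which ordinary least squares fits the spline.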

Facts

Order of piecewise polynomials

constant: 1

linear: 2

quadratic: 3

cubic: 4


Natural Cubic Splines
Definition

Suppose that the knots are $ξ_{j}, j = 1, \dots, K$. Then, the basis functions for a natural cubic Spline are defined as: $N_{1} (X) = 1, N_{2} (X) = X, N_{k + 2} (X) = d_{k} (X) - d_{K - 1} (X)$ for $k = 1, \dots, K - 2$, where $d_{k} (X) = \frac{(X - ξ_{k})_{+}^{3} - (X - ξ_{K})_{+}^{3}}{ξ_{K} - ξ_{k}}$

Smoothing Splines
Definition

$RSS (f, λ) = \sum_{i = 1}^{N} (y_{i} - f (x_{i}))^{2} + λ \int \{f^{''} (t)\}^{2} d t$ where $λ \geq 0$ is a smoothing parameter. If $λ = 0$, the smoothing spline can be any function that interpolates the data. If $λ = \infty$, the smoothing spline is the Least Squares line.

A smoothing spline is a Spline basis method that avoids the knot selection problem. Among all functions $f (x) \in C^{2}$, find the one that minimizes the penalized RSS.

Facts

The unique analytic solution of the smoothing spline problem is a Natural Cubic Spline with knots at the unique values of the $x_{i}$'s


Classification

Introduction

Bayes Classifier
Definition

$\hat{f}_{Bayes} (x) = \underset{1 \leq k \leq K}{\operatorname{argmax}} \; P (Y = k ∣ X = x) = \underset{1 \leq k \leq K}{\operatorname{argmax}} \; p_{k} (x) π_{k}$ where $p_{k} (x) = P (x ∣ Y = k)$ is the conditional PDF of $x$ given $Y = k$ and $π_{k} = P (Y = k)$ is the prior

Properties

Optimality of Bayes Rule

Bayes classifier $\hat{f}_{Bayes} (x)$ minimizes the prediction risk over all classifiers

Facts

For a given $x$ , the Bayes classifier $\hat{f}_{Bayes} (x)$ is deterministic, i.e. it does not depend on the training sample. It can not be used in practice since $P (Y = k ∣ X = x)$ is unknown.

Bayes classifier $\hat{f}_{Bayes} (x)$ maximizes the posterior pdf.


Parametric Classifier

Linear Discriminant Analysis
Definition

$\underset{1 \leq k \leq K}{\operatorname{argmax}} \; \hat{f}_{k} (x) = \underset{1 \leq k \leq K}{\operatorname{argmax}} \; [\ln \hat{π}_{k} + x^{⊺} \hat{Σ}^{- 1} \hat{μ}_{k} - \frac{1}{2} \hat{μ}_{k}^{⊺} \hat{Σ}^{- 1} \hat{μ}_{k}]$ where $\hat{π}_{k} = \frac{n_{k}}{n}$ is the sample proportion of class $k$, $\hat{μ}_{k} = \frac{1}{n_{k}} \sum_{i : Y_{i} = k} X_{i}$ is the sample mean, and $\hat{Σ} = \frac{1}{n - K} \sum_{k = 1}^{K} \sum_{i : Y_{i} = k} (X_{i} - \hat{μ}_{k}) (X_{i} - \hat{μ}_{k})^{⊺}$ is the pooled sample variance-covariance matrix

LDA is a Bayes Classifier with an assumption that $p_{k} (x) \sim N_{p} (μ_{k}, Σ)$

Facts

LDA assumes homogeneity of variance-covariance among classes. Quadratic Discriminant Analysis may be used when the covariances are not equal.


Quadratic Discriminant Analysis
Definition

$\underset{1 \leq k \leq K}{\operatorname{argmax}} \; \hat{f}_{k} (x) = \underset{1 \leq k \leq K}{\operatorname{argmax}} \; [\ln \hat{π}_{k} - \frac{1}{2} \ln ∣ \hat{Σ}_{k} ∣ - \frac{1}{2} (x - \hat{μ}_{k})^{⊺} \hat{Σ}_{k}^{- 1} (x - \hat{μ}_{k})]$ where $\hat{π}_{k} = \frac{n_{k}}{n}$ is the sample proportion of class $k$, $\hat{μ}_{k} = \frac{1}{n_{k}} \sum_{i : Y_{i} = k} X_{i}$ is the sample mean, and $\hat{Σ}_{k} = \frac{1}{n_{k} - 1} \sum_{i : Y_{i} = k} (X_{i} - \hat{μ}_{k}) (X_{i} - \hat{μ}_{k})^{⊺}$ is the sample variance-covariance matrix of class $k$

Bayes Classifier with an assumption that $p_{k} (x) \sim N_{p} (μ_{k}, Σ_{k})$

Logistic Regression
Definition

Logistic regression is a Generalized Linear Model with the Logit link function. Logistic regression models the log-odds of an event as a linear combination of independent variables: $logit (π) = \ln (\frac{π}{1 - π}) = x^{⊺} β$. The probability is calculated as $π (x) = P (Y = 1 ∣ X = x) = \frac{\exp (x^{⊺} β)}{1 + \exp (x^{⊺} β)} = \frac{1}{1 + \exp (- x^{⊺} β)}$

Nonparametric Classifier

Kernel Density Estimation
Definition

$\hat{f}_{h} (x) = \frac{1}{n} \sum_{i = 1}^{n} K_{h} (x - X_{i}) = \frac{1}{nh} \sum_{i = 1}^{n} K (\frac{x - X_{i}}{h})$ where the kernel $K$ satisfies $\int_{- \infty}^{\infty} K (t) d t = 1$ and $\int_{- \infty}^{\infty} K^{2} (t) d t < \infty$, $K_{h} (x) := \frac{1}{h} K (\frac{x}{h})$ is the scaled kernel, and $h$ is a smoothing parameter (bandwidth, or window width)

KDE is a non-parametric method to estimate the PDF of a random variable based on kernels as weights.

Multidimensional KDE

$\hat{f}_{H} (x) = \frac{1}{n} \sum_{i = 1}^{n} K_{H} (x - x_{i}) = \frac{1}{n} ∣ H ∣^{- 1} \sum_{i = 1}^{n} K (H^{- 1} (x - x_{i}))$ where the kernel $K$ satisfies $\int K (u) d u = 1$ and $K \geq 0$, $K_{H} (x) := ∣ H ∣^{- 1} K (H^{- 1} x)$ is the scaled kernel, and $H$ is a positive-definite bandwidth matrix.

Facts

Widely used kernels


Kernel Based Classifier
Definition

$\underset{1 \leq k \leq K}{\operatorname{argmax}} \; \hat{f}_{k} (x) = \underset{1 \leq k \leq K}{\operatorname{argmax}} \; \hat{p}_{k} (x) \hat{π}_{k} = \underset{1 \leq k \leq K}{\operatorname{argmax}} \; (\frac{1}{n} ∣ H ∣^{- 1} \sum_{i : Y_{i} = k} K (H^{- 1} (x - X_{i})))$ where $\hat{p}_{k} (x) = \frac{1}{n_{k}} ∣ H ∣^{- 1} \sum_{i : Y_{i} = k} K (H^{- 1} (x - X_{i}))$ and $\hat{π}_{k} = \frac{n_{k}}{n}$

Bayes Classifier in which $p_{k} (x)$ is estimated by Kernel Density Estimation.

K-Nearest Neighbors Algorithm
Definition

Suppose $d_{i} (x) := d (x, X_{i})$ is the distance between the training sample $X_{i}$ and the given point $x$. Let $d_{(1)} (x) \leq d_{(2)} (x) \leq \dots \leq d_{(n)} (x)$ be the ordered distances. With $K$ classes and $k$ neighbors, the predicted class is $\underset{1 \leq j \leq K}{\operatorname{argmax}} \; \hat{p}_{j} (x) = \underset{1 \leq j \leq K}{\operatorname{argmax}} \; \frac{1}{k} \sum_{i = 1}^{n} I (d_{i} (x) \leq d_{(k)} (x)) I (Y_{i} = j)$

The k-NN determines a sample’s class by majority vote among its $k$ nearest training points.

Facts

$k$ and $d$ are hyperparameters.
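A minimal sketch of the classifier, assuming Euclidean distance for $d$ and a majority vote over the $k$ nearest training points. The function name `knn_predict` and the toy training set are hypothetical.

```python
import math
from collections import Counter

def knn_predict(x, train_X, train_y, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean d)."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])   # class counts among the k nearest
    return votes.most_common(1)[0][0]

train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [1, 1, 1, 2, 2, 2]
pred_a = knn_predict((0.05, 0.05), train_X, train_y, k=3)
pred_b = knn_predict((0.95, 1.0), train_X, train_y, k=3)
print(pred_a, pred_b)
```

Swapping `math.dist` for another metric changes the hyperparameter $d$ without touching the voting rule.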


Decision Tree
Definition

Decision tree consists of Classification Tree and Regression Tree.
Link to original

Regression Tree
Definition

The decision tree algorithm consists of a growing step and a pruning step. Suppose $R_{1} (j, s) = \{x \mid x_{j} \leq s\}, R_{2} (j, s) = \{x \mid x_{j} > s\}$

growing: seek the splitting variable $j$ and split point $s$ that solve $\min_{j, s} \left[ \min_{c_{1}} \sum_{i : x_{i} \in R_{1} (j, s)} (y_{i} - c_{1})^{2} + \min_{c_{2}} \sum_{i : x_{i} \in R_{2} (j, s)} (y_{i} - c_{2})^{2} \right]$ For a fixed $(j, s)$ , the inner minimizers $c_{1}, c_{2}$ are the means of $y_{i}$ in each region.

pruning: Suppose the number of terminal nodes in a tree $T$ is $| T |$ and $\{R_{j}\}_{j = 1, \dots, | T |}$ are the regions corresponding to the terminal nodes. For a tree $T$ , define the cost complexity function $C_{α} (T) = \sum_{j = 1}^{| T |} \sum_{i : x_{i} \in R_{j}} (Y_{i} - \bar{Y}_{j})^{2} + α | T |$ where $\bar{Y}_{j}$ is the mean response in $R_{j}$ and $α$ is a tuning parameter regularizing complexity.

The regression function is estimated by $\hat{f} (x) = \sum_{m = 1}^{M} \hat{c}_{m} I (x \in R_{m})$ This is a non-parametric regression that partitions the space of explanatory variables into a set of rectangles and takes the average of the response variables in each rectangle as the estimate of the regression function in that region.
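
The growing step can be sketched as an exhaustive search over $(j, s)$ . This is a minimal illustration on hypothetical toy data, not an efficient implementation; candidate split points are taken to be the observed feature values.

```python
def best_split(X, y):
    """Return (j, s, sse) minimizing the two-region sum of squared errors."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue  # skip splits that leave one region empty
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best

X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]  # toy inputs
y = [1.0, 1.0, 5.0, 5.0]
print(best_split(X, y))
```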
Link to original

Classification Tree
Definition

Suppose $Y \in \{1, 2, \dots, K\}$ , and $R_{1} (j, s) = \{x \mid x_{j} \leq s\}, R_{2} (j, s) = \{x \mid x_{j} > s\}$ Define $\hat{p}_{mk}$ , the proportion of class $k$ observations in a node $m$ : $\hat{p}_{mk} = \frac{1}{n_{m}} \sum_{i : x_{i} \in R_{m}} I (Y_{i} = k), k = 1, 2, \dots, K$ We classify the observations in a node $m$ to class $k (m) = \arg\max_{1 \leq k \leq K} \hat{p}_{mk}$ Suppose the number of terminal nodes in a tree $T$ is $| T |$ and $\{R_{m}\}_{m = 1, \dots, | T |}$ are the regions corresponding to the terminal nodes. For a tree $T$ , define the cost complexity function $C_{α} (T) = \sum_{m = 1}^{| T |} n_{m} Q_{m} (T) + α | T |$ where $Q_{m} (T)$ is an impurity function and $α$ is a tuning parameter regularizing complexity.

The tree that minimizes $C_{α} (T)$ is selected as optimal.

Impurity Functions

Misclassification error: $\frac{1}{n_{m}} \sum_{i : x_{i} \in R_{m}} I (Y_{i} \neq k (m)) = 1 - \hat{p}_{m k (m)}$

Gini index: $\sum_{k = 1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$

Cross-entropy or deviance: $- \sum_{k = 1}^{K} \hat{p}_{mk} \ln (\hat{p}_{mk})$
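
The three measures can be compared on a vector of class proportions $\hat{p}_{m1}, \dots, \hat{p}_{mK}$ . The function names below are chosen for this sketch, and terms with a zero proportion are skipped in the entropy, following the convention $0 \ln 0 = 0$ .

```python
import math

def misclassification(p):
    """1 minus the proportion of the majority class."""
    return 1.0 - max(p)

def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)

def cross_entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

for p in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(p, misclassification(p), gini(p), cross_entropy(p))
```

All three are zero for a pure node and largest when the classes are evenly mixed.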

Link to original

Information Gain
Definition

The information gain of a node $t$ of a decision tree is defined as $I G (t) = Q (t) - \frac{n_{L}}{n_{t}} Q (t_{L}) - \frac{n_{R}}{n_{t}} Q (t_{R})$ where $Q (t)$ is the impurity of node $t$ , $n_{t}$ is the number of samples in $t$ , $t_{L}, t_{R}$ are the child nodes of $t$ , and $n_{L}, n_{R}$ are the numbers of samples in $t_{L}$ and $t_{R}$ .
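
A small sketch of this quantity, assuming the Gini index as the impurity $Q$ and representing each node by its list of class labels (both choices are made for the example only):

```python
def gini_from_labels(labels):
    """Gini index computed from the labels that fall in a node."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return sum(p * (1.0 - p) for p in props)

def information_gain(parent, left, right, impurity=gini_from_labels):
    """IG(t) = Q(t) - n_L/n_t * Q(t_L) - n_R/n_t * Q(t_R)."""
    n_t = len(parent)
    return (impurity(parent)
            - len(left) / n_t * impurity(left)
            - len(right) / n_t * impurity(right))

print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # split into pure children
print(information_gain([0, 0, 1, 1], [0, 1], [0, 1]))  # uninformative split
```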
Link to original

Feature Importance
Definition

The feature importance of a feature $x_{i}$ used in a Decision Tree is defined as $I (x_{i}) = \frac{\sum_{t \in T (x_{i})} I G (t)}{\sum_{j = 1}^{p} \sum_{t \in T (x_{j})} I G (t)}$ where $I G (t)$ is the Information Gain of a node $t$ , and $T (x_{i})$ is the set of nodes that split on the feature $x_{i}$
Link to original

Confusion Matrix
Definition

| | Predicted Positive (PP) | Predicted Negative (PN) |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Metrics

Accuracy
Definition

$\frac{TP + TN}{TP + TN + FP + FN}$
Link to original

Recall
Definition

$\frac{TP}{TP + FN}$

Recall, also called sensitivity or the true positive rate, is the fraction of actual positive cases that are correctly predicted as positive.
Link to original

Precision
Definition

$\frac{TP}{TP + FP}$

Precision is the fraction of cases predicted as positive that are actually positive.
Link to original

Specificity
Definition

$\frac{TN}{FP + TN}$

Specificity, or the true negative rate, is the fraction of actual negative cases that are correctly predicted as negative.
Link to original

Type 1 Error
Definition

$\frac{FP}{FP + TN}$
Link to original

Type 2 Error
Definition

$\frac{FN}{TP + FN}$
Link to original

F-Score
Definition

F1 Score

$F_{1} = \frac{2}{\text{recall}^{- 1} + \text{precision}^{- 1}} = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ The harmonic mean of Precision and Recall.

F-beta score

$F_{β} = (1 + β^{2}) \frac{\text{precision} \cdot \text{recall}}{(β^{2} \cdot \text{precision}) + \text{recall}}$ where $β > 0$

Recall is considered $β$ times as important as Precision.
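
The metrics defined so far can be computed directly from the four cells of the confusion matrix. The function name and the counts below are hypothetical and serve only as an illustration.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Accuracy, recall, precision, specificity and F-beta from raw counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_beta": f_beta,
    }

print(classification_metrics(tp=40, fn=10, fp=20, tn=30))
```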
Link to original

Positive Predictive Value
Definition

$\frac{\text{Sensitivity} \times \text{Prevalence}}{(\text{Sensitivity} \times \text{Prevalence}) + ((1 - \text{Specificity}) \times (1 - \text{Prevalence}))}$ where $\text{Prevalence} = \frac{\text{positive cases}}{\text{total population}}$

Positive predictive value (in medical statistics and epidemiology) is the fraction of cases predicted as positive that are actually positive, expressed in terms of sensitivity, specificity, and prevalence.
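
The formula is an application of Bayes' theorem, which the following sketch evaluates. The input values describe a hypothetical screening test and are not taken from any real study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(actually positive | predicted positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# A fairly accurate test applied to a rare condition still gives a low PPV.
print(positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.01))
```

With prevalence equal to the positive fraction of the evaluated sample, the value coincides with Precision.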
Link to original
Link to original

Receiver Operating Characteristic Curve
Definition

A receiver operating characteristic curve (ROC curve) is a plot of the True Positive Rate against the False Positive Rate at each threshold setting.

AUC

The area under the ROC curve is called AUC (Area under curve)
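
As an illustration, the sketch below traces the curve by sweeping the threshold over the observed scores (true positive rate against false positive rate) and integrates it with the trapezoidal rule. The scores, labels, and function names are made up for the example.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]  # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = actual positive
print(auc(roc_points(scores, labels)))
```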
Link to original

Support Vector Machines

Hyperplane
Definition

Vector Equation for a hyperplane

$f (x) = β_{0} + β^{⊺} x = 0$

Facts

Suppose the dimension of $x$ is $p$

$p = 1$ : point in $1$ -dimensional space.

$p = 2$ : line in $2$ -dimensional space.

$p = 3$ : plane in $3$ -dimensional space.

$p > 3$ : hyperplane in $p$ -dimensional space.

$\forall x_{1}, x_{2} \in H, β^{⊺} (x_{1} - x_{2}) = 0 \Rightarrow β ⊥ H$

$β$ is perpendicular to the surface $H$

$\forall x \in R^{p}, d_{s} (x, H) = β^{* ⊺} (x - x_{0}) = \frac{1}{\| β \|} (β^{⊺} x + β_{0} - \underbrace{(β_{0} + β^{⊺} x_{0})}_{= 0}) = \frac{1}{\| β \|} (β^{⊺} x + β_{0}) = \frac{f (x)}{\| f^{'} (x) \|}$ where $d_{s} (\cdot)$ is a signed distance, $β^{*} := β / \| β \|$ , and $x_{0} \in H$
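
A quick numeric check of the signed distance $(β_{0} + β^{⊺} x) / \| β \|$ , using a hypothetical hyperplane in $R^{2}$ :

```python
import math

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane beta0 + beta^T x = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (beta0 + sum(b * xi for b, xi in zip(beta, x))) / norm

beta, beta0 = (3.0, 4.0), -5.0                     # assumed example values
print(signed_distance((3.0, 4.0), beta, beta0))   # on the side beta points to
print(signed_distance((0.0, 0.0), beta, beta0))   # on the opposite side
```

The sign tells on which side of $H$ the point lies, which is what the classification rule of a linear classifier uses.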

Link to original

Kernel Function
Definition

$K : R^{p} \times R^{p} \to R, K (x_{i}, x_{j}) = ⟨ ϕ (x_{i}), ϕ (x_{j})⟩$ where $ϕ : R^{p} \to H$ maps the inputs into a feature space $H$

The kernel function returns the result of Inner product in $H$ by using only the original input data.

Examples

Some common kernel functions $K (x_{i}, x_{j})$ are as follows:

Linear: $⟨ x_{i}, x_{j} ⟩$

Polynomial: $(⟨ x_{i}, x_{j} ⟩ + c)^{d}, c \geq 0, d \geq 1$

Gaussian radial basis function: $\exp (- γ \| x_{i} - x_{j} \|^{2}), γ > 0$

Sigmoid: $tanh (κ ⟨ x_{i}, x_{j} ⟩ + c), κ > 0, c < 0$
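
The kernels above can be written in a few lines. The sketch also checks, for $p = 2$ , $d = 2$ , $c = 0$ , that the polynomial kernel equals an inner product under the explicit feature map $ϕ (x) = (x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2})$ . The function names and default parameter values are assumptions for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, c=1.0, d=2):
    return (dot(u, v) + c) ** d

def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def sigmoid_kernel(u, v, kappa=1.0, c=-1.0):
    return math.tanh(kappa * dot(u, v) + c)

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel with c = 0."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

u, v = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(u, v, c=0.0, d=2), dot(phi(u), phi(v)))
```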

Link to original

Support Vector Machine
Definition

Hard-Margin Support Vector Machine

Assume that we have a learning set $L = \{(x_{i}, y_{i}) \mid i = 1, 2, \dots, n\}, x_{i} \in R^{p}, y_{i} \in \{- 1, + 1\}$ Suppose the two classes of data can be separated by a Hyperplane without error. Such a hyperplane is called a separating hyperplane (SH).

Define $d_{-} := \min_{i : y_{i} = - 1} dist (SH, x_{i})$ , $d_{+} := \min_{i : y_{i} = + 1} dist (SH, x_{i})$ , and the margin of SH $d := d_{-} + d_{+}$ An SH which maximizes its margin ( $d$ ) is called an optimal separating hyperplane (OSH). To find the OSH, require $β_{0}$ and $β$ to satisfy the linear constraints $\begin{cases} β_{0} + β^{⊺} x_{i} \geq + 1 & y_{i} = + 1 \\ β_{0} + β^{⊺} x_{i} \leq - 1 & y_{i} = - 1 \end{cases} \Leftrightarrow y_{i} (β_{0} + β^{⊺} x_{i}) \geq + 1, i = 1, \dots, n$ where the constant $1$ is an arbitrary normalization: rescaling $(β_{0}, β)$ changes it without changing the hyperplane.

Let $H_{+} : β_{0} + β^{⊺} x = + 1$ and $H_{-} : β_{0} + β^{⊺} x = - 1$ . Then the points lying on either $H_{+}$ or $H_{-}$ are called support vectors. If $x_{+} \in H_{+}$ and $x_{-} \in H_{-}$ , then $d_{+} = d_{-} = \frac{1}{\| β \|}$ and $d = \frac{2}{\| β \|}$

Maximizing $d = \frac{2}{\| β \|}$ is equivalent to minimizing $\frac{1}{2} \| β \|^{2}$ under the constraints above. Therefore, the OSH can be obtained by solving a Convex Optimization problem. It can be solved by the Method of Lagrange Multipliers with the primal function $L_{P} (β, β_{0}, α) = \frac{1}{2} \| β \|^{2} - \sum_{i = 1}^{n} α_{i} [y_{i} (β_{0} + β^{⊺} x_{i}) - 1]$ with $0 \leq α$

By Duality, the dual optimization problem is defined as $\max_{α} L_{D} (α) \text{ subject to } 0 \leq α, α^{⊺} y = 0$

And the dual Lagrangian function is defined as $L_{D} (α) = \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊺} x_{j} = 1_{n}^{⊺} α - \frac{1}{2} α^{⊺} H α$ where $H = (y_{i} y_{j} x_{i}^{⊺} x_{j})$

The primal optimization problem is convex and satisfies the KKT Conditions. Thus Strong Duality holds, and the solution of the dual problem yields the solution of the primal problem.

The optimal parameter is obtained as $\hat{β} = \sum_{i \in sv} \hat{α}_{i} y_{i} x_{i}, \hat{β}_{0} = \frac{1}{| sv |} \sum_{i \in sv} \left( \frac{1 - y_{i} x_{i}^{⊺} \hat{β}}{y_{i}} \right)$ where $sv$ is an index set of support vectors

The optimal hyperplane can be written as $\hat{f} (x) = \hat{β}_{0} + \hat{β}^{⊺} x$ and the classification rule is given by $C (x) = sign (\hat{f} (x))$

Soft-Margin Support Vector Machine

$\min \frac{1}{2} \| β \|^{2} + C \sum_{i = 1}^{n} ξ_{i} \text{ subject to } y_{i} (β_{0} + β^{⊺} x_{i}) \geq 1 - ξ_{i}, ξ_{i} \geq 0, i = 1, \dots, n$ where $C$ is the regularization parameter, and $ξ_{i}$ is called a slack variable. If $ξ_{i} = 0$ , the point lies on or outside the margin on the correct side. On the other hand, if $ξ_{i} > 0$ , the point violates the margin, and it is misclassified when $ξ_{i} > 1$ .

The Lagrangian primal function is defined as $L_{P} (β, β_{0}, α, η) = \frac{1}{2} \| β \|^{2} + C \sum_{i = 1}^{n} ξ_{i} - \sum_{i = 1}^{n} α_{i} [y_{i} (β_{0} + β^{⊺} x_{i}) - (1 - ξ_{i})] - \sum_{i = 1}^{n} η_{i} ξ_{i}$ with $0 \leq α$ and $0 \leq η$ .

And the dual function is defined as $L_{D} (α) = \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊺} x_{j} = 1_{n}^{⊺} α - \frac{1}{2} α^{⊺} H α$ with $α^{⊺} y = 0$ and $0 \leq α \leq C 1_{n}$ where $H = (y_{i} y_{j} x_{i}^{⊺} x_{j})$

and the dual optimization problem is defined as $\max_{α} L_{D} (α) \text{ subject to } α^{⊺} y = 0, 0 \leq α \leq C 1_{n}$ The primal optimization problem is convex and satisfies the KKT Conditions. Thus Strong Duality holds, and the solution of the dual problem yields the solution of the primal problem.

The optimal parameter is obtained as $\hat{β} = \sum_{i \in sv} \hat{α}_{i} y_{i} x_{i}, \hat{β}_{0} = \frac{1}{| sv |} \sum_{i \in sv} \left( \frac{1 - y_{i} x_{i}^{⊺} \hat{β}}{y_{i}} \right)$ where $sv$ is an index set of support vectors.

Non-Linear Support Vector Machine

Non-linear SVM finds an optimal separating Hyperplane in a high-dimensional feature space $H$ . This is accomplished by the kernel trick: instead of computing inner products in $H$ explicitly, they are computed by a non-linear Kernel Function evaluated in the input space.

Hard-Margin Non-Linear Support Vector Machine

If the data can be separated in $H$ , then the dual optimization problem is defined as $max_{α} (1_{n}^{⊺} α - \frac{1}{2} α^{⊺} H α) subject to 0 \leq α, α^{⊺} y = 0$ where $H = (y_{i} y_{j} K (x_{i}, x_{j}))$

The optimal separating Hyperplane in $H$ is $\hat{f} (x) = \hat{β}_{0} + \sum_{i \in sv} \hat{α}_{i} y_{i} K (x_{i}, x)$

and the decision rule is defined as $C (x) = sign (\hat{f} (x))$

Soft-Margin Non-Linear Support Vector Machine

In the non-separable case, the dual problem is defined as $max_{α} (1_{n}^{⊺} α - \frac{1}{2} α^{⊺} H α) subject to 0 \leq α \leq C 1_{n}, α^{⊺} y = 0$ where $H = (y_{i} y_{j} K (x_{i}, x_{j}))$

The optimal separating Hyperplane in $H$ is $\hat{f} (x) = \hat{β}_{0} + \sum_{i \in sv} \hat{α}_{i} y_{i} K (x_{i}, x)$

and the decision rule is defined as $C (x) = sign (\hat{f} (x))$
Link to original

Support Vector Regression
Definition

Support vector regression (SVR) depends only on a subset of the training data, because the loss function $L_{ϵ} = (∣ y - f (x) ∣ - ϵ)_{+}$ ignores any training data close to the model prediction, within a band or tube.
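
The $ϵ$ -insensitive loss is simple to evaluate directly. In this sketch the function name and the numbers are illustrative only.

```python
def eps_insensitive_loss(y, f_x, eps):
    """(|y - f(x)| - eps)_+ : zero inside the tube, linear outside it."""
    return max(abs(y - f_x) - eps, 0.0)

print(eps_insensitive_loss(1.0, 1.05, eps=0.1))  # inside the tube
print(eps_insensitive_loss(1.0, 1.5, eps=0.1))   # outside the tube
```

Points inside the tube contribute nothing to the objective, which is why the fitted model depends only on the remaining points.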

The primal optimization problem is defined as
$$ \begin{aligned} \min\ &\left( \cfrac{1}{2}||\boldsymbol{\beta}||^{2} + C\sum\limits_{i=1}^{n}(\xi_{i} + \xi'_{i}) \right)\\ \text{subject to}\ &y_{i} - (\beta_{0} + \boldsymbol{\beta}^{\intercal}\mathbf{x}_{i}) \leq \epsilon + \xi'_{i},\\ &(\beta_{0} + \boldsymbol{\beta}^{\intercal}\mathbf{x}_{i}) - y_{i} \leq \epsilon + \xi_{i},\\ &\ \xi'_{i} \geq 0, \ \xi_{i} \geq 0\quad i=1,\dots,n \end{aligned} $$
The term $\cfrac{1}{2}||\boldsymbol{\beta}||^{2}$ in the objective function, which appears in SVM with the intention of maximizing the margin, acts as a regularization term in SVR.

The Lagrangian primal function is defined as
$$ \begin{aligned} L_{P} &= \frac{1}{2}||\boldsymbol{\beta}||^{2} + C\sum_{i=1}^{n} (\xi_{i} + \xi'_{i}) \\ &+ \sum_{i=1}^{n} \alpha'_{i} \left( y_{i} - (\beta_{0} + \boldsymbol{\beta}^{\intercal} \mathbf{x}_{i}) - \epsilon - \xi'_{i} \right) + \sum_{i=1}^{n} \alpha_{i} \left( (\beta_{0} + \boldsymbol{\beta}^{\intercal} \mathbf{x}_{i}) - y_{i} - \epsilon - \xi_{i} \right)\\ &- \sum_{i=1}^{n} \mu'_{i} \xi'_{i} - \sum_{i=1}^{n} \mu_{i} \xi_{i} \end{aligned} $$
with $\boldsymbol{\alpha}, \boldsymbol{\alpha}', \boldsymbol{\mu}, \boldsymbol{\mu}' \geq \mathbf{0}$

The dual optimization problem is defined as
$$ \begin{aligned} \max\ &\left( \mathbf{y}^{\intercal} (\boldsymbol{\alpha}' - \boldsymbol{\alpha}) - \epsilon \mathbf{1}_{n}^{\intercal} (\boldsymbol{\alpha} + \boldsymbol{\alpha}') - \frac{1}{2} (\boldsymbol{\alpha} - \boldsymbol{\alpha}')^{\intercal} \mathbf{K} (\boldsymbol{\alpha} - \boldsymbol{\alpha}') \right) \\ \text{subject to}\ &\mathbf{1}_{n}^{\intercal} (\boldsymbol{\alpha} - \boldsymbol{\alpha}') = 0, \quad \mathbf{0} \leq \boldsymbol{\alpha}, \boldsymbol{\alpha}' \leq C\mathbf{1}_{n} \end{aligned} $$

And the dual Lagrangian function is defined as
$$ \begin{aligned} L_{D} = &\ \mathbf{y}^{\intercal} (\boldsymbol{\alpha}' - \boldsymbol{\alpha}) - \epsilon \mathbf{1}_{n}^{\intercal} (\boldsymbol{\alpha} + \boldsymbol{\alpha}') - \frac{1}{2} (\boldsymbol{\alpha} - \boldsymbol{\alpha}')^{\intercal} \mathbf{K} (\boldsymbol{\alpha} - \boldsymbol{\alpha}') \\ &+ \lambda \mathbf{1}_{n}^{\intercal} (\boldsymbol{\alpha} - \boldsymbol{\alpha}')\\ &- \boldsymbol{\mu}^{\intercal} \boldsymbol{\alpha} - \boldsymbol{\mu}'^{\intercal} \boldsymbol{\alpha}' + \boldsymbol{\mu}_{C}^{\intercal} (C\mathbf{1}_{n} - \boldsymbol{\alpha}) + \boldsymbol{\mu}_{C}'^{\intercal} (C\mathbf{1}_{n} - \boldsymbol{\alpha}') \end{aligned} $$
with $\boldsymbol{\mu}, \boldsymbol{\mu}', \boldsymbol{\mu}_{C}, \boldsymbol{\mu}_{C}' \geq \mathbf{0}$ where $\mathbf{K}$ is the $n \times n$ matrix with entries $K (x_{i}, x_{j})$ given by a Kernel Function.
Link to original

Ensemble Learning

Bootstrap Aggregating
Definition

Regression Setting

Given a training set $Z = \{(x_{i}, y_{i}) \mid i = 1, \dots, n\}$ , draw bootstrap samples $\{Z^{* b} \mid b = 1, \dots, B\}$ from the training data. The bagging estimator of $f$ at $x$ is defined as $\hat{f}_{bag} (x) = \frac{1}{B} \sum_{b = 1}^{B} \hat{f}^{* b} (x)$ where $\hat{f}^{* b} (x)$ is the prediction at $x$ based on the bootstrap training set $Z^{* b}$

Classification Setting

Given a training set $Z = \{(x_{i}, y_{i}) \mid i = 1, \dots, n\}$ , draw bootstrap samples $\{Z^{* b} \mid b = 1, \dots, B\}$ from the training data. The bagging estimator of $f$ at $x$ is defined as the majority vote $\hat{f}_{bag} (x) = \arg\max_{1 \leq k \leq K} \# \{b \mid \hat{f}^{* b} (x) = k\}$ where $\hat{f}^{* b} (x)$ is the prediction at $x$ based on the bootstrap training set $Z^{* b}$

Facts

Bagging in a classification setting with zero-one loss often fails. Bagging a good classifier can make it better, but bagging a bad classifier can make it worse.

Bagging reduces the variance of an estimated prediction function. It seems to work especially well for high-variance, low-bias procedures, such as trees.

Link to original

Boosting
Definition

Boosting builds a strong learner by combining many weak learners whose accuracies are slightly better than a random guess.

Examples

AdaBoost
Link to original

AdaBoost
Definition

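AdaBoost (Adaptive Boosting) forms a classifier as a weighted sum of weak learners, re-weighting the training samples at every round so that previously misclassified points count more. Because misclassified points keep gaining weight, it can be sensitive to outliers and label noise.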

Algorithm

Consider a 2-class problem with $Y \in \{- 1, 1\}$

Initialize the weights $w_{i} = \frac{1}{n}, i = 1, 2, \dots, n$

Repeat for $m = 1, 2, \dots, M$ :

Fit a weak classifier $δ$ that minimizes the weighted error rate $err = \sum_{i = 1}^{n} w_{i} I (y_{i} \neq δ (x_{i}))$ and call the fitted classifier $f_{m}$ , and its corresponding error rate $err_{m}$

Compute $c_{m} = \ln \left( \frac{1 - err_{m}}{err_{m}} \right) = logit (1 - err_{m})$

Update the weights $w_{i}$ by $w_{i} \leftarrow \frac{w_{i} \exp [c_{m} I (y_{i} \neq f_{m} (x_{i}))]}{\sum_{j = 1}^{n} w_{j} \exp [c_{m} I (y_{j} \neq f_{m} (x_{j}))]}$

Output the classifier $f (x) = sgn \left( \sum_{m = 1}^{M} c_{m} f_{m} (x) \right)$
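
The loop above can be sketched with one-dimensional decision stumps as the weak classifiers. The stump learner, the clipping of the error rate away from 0 and 1, and the toy data are assumptions made for this example.

```python
import math

def stump_predict(x, s, sign):
    """Threshold classifier: `sign` if x > s, otherwise -sign."""
    return sign if x > s else -sign

def fit_stump(X, y, w):
    """Stump (err, s, sign) with the smallest weighted error rate."""
    best = None
    for s in sorted(set(X)):
        for sign in (+1, -1):
            err = sum(wi for wi, x, yi in zip(w, X, y)
                      if stump_predict(x, s, sign) != yi)
            if best is None or err < best[0]:
                best = (err, s, sign)
    return best

def adaboost(X, y, M):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(M):
        err, s, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        c = math.log((1.0 - err) / err)
        ensemble.append((c, s, sign))
        # Multiply the weights of misclassified points by exp(c), then normalize.
        w = [wi * math.exp(c * (stump_predict(x, s, sign) != yi))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(c * stump_predict(x, s, sign) for c, s, sign in ensemble)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 4, 5, 6]        # toy inputs
y = [1, 1, -1, -1, 1, 1]      # not separable by a single stump
model = adaboost(X, y, M=3)
print([predict(model, x) for x in X])
```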

Facts

AdaBoost is equivalent to Gradient Boosting using the exponential loss $L (y, f (x)) = exp (- y f (x))$
Link to original

Gradient Boosting
Definition

Gradient boosting is a type of Forward Stagewise Additive Modeling in which each new weak learner is fitted to the negative gradient of the loss with respect to the current prediction (the pseudo-residuals).

Algorithm

Initialize the model $f_{0} (x) = \arg\min_{γ} \sum_{i = 1}^{n} L (y_{i}, γ)$

Repeat for $m = 1, 2, \dots, M$ :

Calculate the pseudo-residuals for each data point $r_{im} = - \left[ \frac{\partial L (y_{i}, f (x_{i}))}{\partial f (x_{i})} \right]_{f = f_{m - 1}}$ where $L$ is a Loss Function

Fit a weak learner to the residuals $h_{m} = \arg\min_{h} \sum_{i = 1}^{n} (r_{im} - h (x_{i}))^{2}$

Calculate a learning rate $γ_{m} = \arg\min_{γ} \sum_{i = 1}^{n} L (y_{i}, f_{m - 1} (x_{i}) + γ h_{m} (x_{i}))$

Update the model using the fitted weak learner $f_{m} (x) = f_{m - 1} (x) + γ_{m} h_{m} (x)$

Output the final model $\hat{f} (x) = f_{M} (x) = f_{0} (x) + \sum_{m = 1}^{M} γ_{m} h_{m} (x)$
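
A minimal sketch of these steps, assuming the squared error loss $L (y, f) = \frac{1}{2} (y - f)^{2}$ (so the pseudo-residuals are $y_{i} - f (x_{i})$ and the line search has a closed form) and one-dimensional regression stumps as weak learners. Names and data are hypothetical.

```python
def fit_stump(X, r):
    """Least-squares regression stump: (s, left_mean, right_mean)."""
    best = None
    for s in sorted(set(X))[:-1]:  # keep both sides non-empty
        left = [ri for x, ri in zip(X, r) if x <= s]
        right = [ri for x, ri in zip(X, r) if x > s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    return best[1:]

def gradient_boost(X, y, M):
    f0 = sum(y) / len(y)                # minimizer of the squared loss
    f = [f0] * len(y)
    learners = []
    for _ in range(M):
        r = [yi - fi for yi, fi in zip(y, f)]      # negative gradient
        s, lm, rm = fit_stump(X, r)
        h = [lm if x <= s else rm for x in X]
        denom = sum(hi * hi for hi in h)
        # Closed-form line search for the squared loss.
        gamma = sum(ri * hi for ri, hi in zip(r, h)) / denom if denom > 0 else 0.0
        f = [fi + gamma * hi for fi, hi in zip(f, h)]
        learners.append((gamma, s, lm, rm))
    return f0, learners

def predict(model, x):
    f0, learners = model
    return f0 + sum(g * (lm if x <= s else rm) for g, s, lm, rm in learners)

X = [1, 2, 3, 4, 5, 6]                   # toy inputs
y = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
print([round(predict(gradient_boost(X, y, 10), x), 2) for x in X])
```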

Link to original

Forward Stagewise Additive Modeling
Definition

The forward stagewise additive modeling sequentially adds new basis functions to the model of the previous step without adjusting the parameters and coefficients of those that have already been added.

Algorithm

Initialize $f_{0} (x) = 0$

For $m = 1, 2, \dots, M$

Compute $(β_{m}, γ_{m}) = \arg\min_{β, γ} \sum_{i = 1}^{n} L (y_{i}, f_{m - 1} (x_{i}) + β b (x_{i}; γ))$

Set $f_{m} (x) = f_{m - 1} (x) + β_{m} b (x; γ_{m})$ where $b (x; γ)$ is a basis function and $L (y, f (x))$ is a Loss Function

Link to original

Random Forest
Definition

Random forest constructs a multitude of de-correlated decision trees at training time and makes predictions by aggregating them as in Bagging.

Algorithm

For $b = 1, \dots, B$ :

Draw a bootstrap sample $Z^{* b}$ of size $n$ from the training data.

Grow a random-forest tree $T_{b}$ on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{min}$ is reached.

Select $m << p$ variables at random.

Pick the best variable and split point among the selected variables.

Split the node into two children nodes.

Output the ensemble of trees $\{T_{b}\}_{1}^{B}$ to make a prediction at a new point $x$

Regression: $\hat{f}_{rf}^{B} (x) = \frac{1}{B} \sum_{b = 1}^{B} T_{b} (x)$

Classification: $\hat{C}_{rf}^{B} (x) = \arg\max_{1 \leq k \leq K} \# \{b \mid \hat{C}_{b} (x) = k\}$ where $\hat{C}_{b} (x)$ is the class prediction of the $b$ th random-forest tree, so the forest predicts by majority vote.

Facts

Selection of hyperparameter $m$ is recommended as

For classification: $\max (⌊ \sqrt{p} ⌋, 1)$

For regression: $\max (⌊ \frac{p}{3} ⌋, 1)$

Link to original

Neural Networks

Projection Pursuit Regression
Definition

$f (x) = \sum_{m = 1}^{M} f_{m} (w_{m}^{⊺} x) + ϵ$ where the $f_{m}$ are unspecified functions estimated from the data

Facts

PPR is a single-layered Neural Network where each neuron uses a different activation function and has no bias term.

Link to original

Activation Function
Definition

The activation function of a node in an artificial neural network is a function that calculates the output of the node based on the linear combination of its inputs. It is used to add a non-linearity to the model.

Examples

Logistic Function
Definition

$σ (x) = logistic (x) = logit^{- 1} (x) = \frac{1}{1 + \exp (- x)}$

The logistic function is the inverse function of the Logit.

Facts

The sigmoid activation function is vulnerable to the vanishing gradient problem. $\frac{d}{d x} σ (x) = σ (x) (1 - σ (x))$ The image of the derivative of the sigmoid function is $(0, 0.25]$ . For this reason, the gradient shrinks by a factor of at most $0.25$ each time it passes back through a node with a sigmoid Activation Function.

Also, because the output of the sigmoid Activation Function is always positive, when all the inputs to a node are positive, the gradients with respect to that node's weights all have the same sign.
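
The bound on the derivative can be checked numerically. The sketch below also multiplies the largest possible derivative across a hypothetical stack of 10 sigmoid nodes to show how quickly the factor shrinks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

grid = [i / 10.0 for i in range(-100, 101)]
print(max(sigmoid_grad(x) for x in grid))  # attained at x = 0
print(sigmoid_grad(0.0) ** 10)             # best case after 10 sigmoid nodes
```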

Link to original

Hyperbolic Tangent Function
Definition

$tanh (x) = \frac{sinh x}{cosh x} = \frac{e ^{x} - e ^{- x}}{e ^{x} + e ^{- x}} = \frac{e ^{2 x} - 1}{e ^{2 x} + 1}$
Link to original

Rectified Linear Unit Function
Definition

$f (x) = max (0, x)$

Facts

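If the input to a ReLU node is negative, its gradient is zero, so a node whose input is negative for every sample is never updated (the dying ReLU problem).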

Link to original

ReLU6
Definition

$f (x) = min (max (0, x), 6)$
Link to original

Gaussian-Error Linear Unit
Definition

GELU is a smooth approximation of ReLU.

$f (x) = x Φ (x)$ where $Φ$ is the CDF of the standard normal distribution.
Link to original

Parametric ReLU
Definition

$f (x) = \max (αx, x)$ where $α \leq 1$ is the slope for negative inputs, which is learned together with the other parameters

Facts

If $α$ is fixed at $0.01$ , it is called a Leaky ReLU

Link to original

Exponential Linear Unit
Definition

$f (x) = \begin{cases} x & x \geq 0 \\ α (e^{x} - 1) & x < 0 \end{cases}$ where $α \geq 0$ is a hyperparameter
Link to original

Swish Function
Definition

$f (x) = x σ (β x) = \frac{x}{1 + e ^{- β x}}$ where $σ$ is Sigmoid Function, and $β$ is a hyperparameter

When $β = 1$ , the function is called a sigmoid linear unit (SiLU).
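
For reference, the ReLU-family activations defined above can be sketched as plain scalar functions. The default parameter values are assumptions for the example, and GELU is written with the error function, using $Φ (x) = \frac{1}{2} (1 + erf (x / \sqrt{2}))$ .

```python
import math

def relu(x):
    return max(0.0, x)

def relu6(x):
    return min(max(0.0, x), 6.0)

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prelu(x, alpha=0.01):
    return max(alpha * x, x)

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def swish(x, beta=1.0):
    return x / (1.0 + math.exp(-beta * x))

for f in (relu, relu6, gelu, prelu, elu, swish):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0, 10.0)])
```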
Link to original
Link to original

Sum of Squared Errors Loss
Definition

Suppose the number of data points is $n$ . Then the sum of squared errors loss (SSE) is defined as the sum of the Squared Error Loss over all data points. $R (θ) = \sum_{i = 1}^{n} (y_{i} - f (x_{i}))^{2}$
Link to original

Cross-Entropy Loss
Definition

Suppose the number of data points is $n$ and the number of classes is $K$ . Then the cross-entropy loss is defined as $R (θ) = - \sum_{i = 1}^{n} \sum_{k = 1}^{K} y_{ik} \ln f_{k} (x_{i})$ where $y_{ik}$ is $1$ if observation $i$ belongs to class $k$ and $0$ otherwise.
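
Both losses can be evaluated in a few lines. In this sketch the targets for the cross-entropy are assumed to be one-hot rows, `F` holds the predicted class probabilities, and all values are made up for illustration.

```python
import math

def sse_loss(y, f):
    """Sum of squared errors between targets y and predictions f."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f))

def cross_entropy_loss(Y, F):
    """-sum_i sum_k y_ik * ln f_k(x_i), skipping terms with y_ik = 0."""
    return -sum(y_ik * math.log(f_ik)
                for y_i, f_i in zip(Y, F)
                for y_ik, f_ik in zip(y_i, f_i)
                if y_ik > 0)

print(sse_loss([1.0, 2.0], [1.5, 2.5]))
print(cross_entropy_loss([[1, 0], [0, 1]], [[0.5, 0.5], [0.25, 0.75]]))
```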
Link to original

Neural Network
Definition

A neural network can be thought of as a non-linear generalization of the linear model.

The derived features $Z_{m}$ are constructed by applying an Activation Function to linear combinations of the inputs. $Z_{m} = σ (α_{0 m} + α_{m}^{⊺} X), m = 1, \dots, M$ where $σ$ is an Activation Function

The output nodes are linear combinations of $Z$ : $T_{k} = β_{0 k} + β_{k}^{⊺} Z, k = 1, \dots, K$ And the output is modeled by a function of these linear combinations: $f_{k} (X) = g_{k} (T), k = 1, \dots, K$ where $g_{k} (T)$ is called an output function.
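
The forward pass of this single-hidden-layer model can be sketched as follows, assuming a sigmoid Activation Function for $σ$ and the softmax as the output function $g_{k}$ . The argument names mirror the symbols above and the weights are arbitrary.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    m = max(t)                      # subtract the max for numerical stability
    e = [math.exp(v - m) for v in t]
    total = sum(e)
    return [v / total for v in e]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward(x, alpha0, alpha, beta0, beta):
    """Z_m = sigma(alpha_0m + alpha_m^T x), T_k = beta_0k + beta_k^T Z, f = g(T)."""
    z = [sigmoid(a0 + dot(a, x)) for a0, a in zip(alpha0, alpha)]
    t = [b0 + dot(b, z) for b0, b in zip(beta0, beta)]
    return softmax(t)

x = [1.0, 2.0]                                      # p = 2 inputs
alpha0, alpha = [0.1, -0.2, 0.0], [[0.5, -0.3], [0.2, 0.8], [-1.0, 0.4]]  # M = 3
beta0, beta = [0.0, 0.1], [[1.0, -1.0, 0.5], [-0.5, 0.3, 0.2]]             # K = 2
print(forward(x, alpha0, alpha, beta0, beta))
```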

Facts

The output function $g_{k} (T)$ varies by the problem. For regression, $g_{k}$ is the Identity Function, and for $K$ -class classification, the Softmax Function is used as $g_{k}$ .

For a regression problem, the Sum of Squared Errors Loss is used as the Loss Function. For a classification problem, we use the Cross-Entropy Loss.

With the softmax output function and the Cross-Entropy Loss, the neural network model is exactly a linear Logistic Regression model in the hidden units.

The parameters of a neural network are estimated by Backpropagation.

A neural network is especially effective in problems with a high signal-to-noise ratio.

Link to original

Dropout
Definition

Dropout is a regularization technique used for a Neural Network. It randomly drops out (omits) units during the training process of the Neural Network.
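
A minimal sketch of the idea, using the common "inverted dropout" variant in which the kept activations are rescaled by $1 / (1 - p)$ so that their expected value is unchanged. The rescaling, the function name, and the seed are assumptions of this example rather than part of the definition above.

```python
import random

def dropout(activations, p_drop, rng):
    """Zero each unit with probability p_drop and rescale the survivors."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is reproducible
out = dropout([1.0] * 10000, p_drop=0.5, rng=rng)
print(sum(out) / len(out))  # close to 1.0 on average
```

At prediction time no units are dropped, so with this variant the activations are used as they are.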
Link to original

Clustering Analysis

K-Means Clustering
Definition

The k-means clustering algorithm solves the following optimization problem $\min_{S} \sum_{k = 1}^{K} \sum_{x_{i} \in S_{k}} \| x_{i} - m_{k} \|^{2}$ where $K$ is the number of clusters, a Partition $S = \{S_{k} \mid k = 1, \dots, K\}$ is a set of clusters, and $m_{k} = \frac{1}{| S_{k} |} \sum_{x_{i} \in S_{k}} x_{i}$ is the mean of cluster $S_{k}$ . The k-means clustering algorithm aims to partition observations into clusters in which each observation belongs to the cluster with the nearest mean.

Algorithm

Initialize ${m_{i}^{(1)} ∣ i = 1, \dots, K}$ .

Assignment step: Assign each observation to the cluster with the nearest mean

$S_{i}^{(t)} = \{x_{p} \mid \| x_{p} - m_{i}^{(t)} \|^{2} \leq \| x_{p} - m_{j}^{(t)} \|^{2}, \forall j, 1 \leq j \leq K\}$ where each $x_{p}$ is assigned to exactly one $S_{i}^{(t)}$

Update step: Recalculate means for observations assigned to each cluster. $m_{i}^{(t + 1)} = \frac{1}{| S_{i}^{(t)} |} \sum_{x_{j} \in S_{i}^{(t)}} x_{j}$

Iterate the assignment and update steps until the within-cluster variation converges.
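
The assignment and update steps can be sketched as below. The initial means are supplied by the caller, an empty cluster keeps its previous mean, and the toy points are hypothetical; these choices are assumptions of the example.

```python
def kmeans(points, means, max_iter=100):
    """Alternate assignment and update steps until the means stop changing."""
    clusters = [[] for _ in means]
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(
                range(len(means)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, means[i])),
            )
            clusters[nearest].append(p)
        # Update step: recompute each mean from its assigned points.
        new_means = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else m
            for cl, m in zip(clusters, means)
        ]
        if new_means == means:
            break
        means = new_means
    return means, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, means=[(0, 0), (10, 10)])
print(means)
```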

Facts

The k-means clustering groups data by the Euclidean distance to the cluster means, so it works well only when the clusters are roughly spherical, convex sets.
Link to original

Self-Organizing Map
Definition

A self-organizing map (SOM) is a clustering method that produces a low-dimensional representation of a higher-dimensional data set while preserving the Topological Manifold structure of the data.

The algorithm fits a grid to high-dimensional data and assigns the data to the fitted nodes (prototypes) of the grid.
Link to original

Spectral Clustering
Definition

Spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering.

Algorithms

Apply Laplacian Eigenmap to the given data $x_{1}, x_{2}, \dots, x_{n} \in R^{l}$ . $x_{i} \in R^{l} \to [z_{2}, z_{3}, \dots, z_{m}]_{i} \in R^{m}$

Form an $n \times k$ sub matrix $T$ using the first $k$ -columns of the embedded matrix $Z$ , and normalize each row to norm $1$ . $T = [z_{1}, z_{2}, \dots, z_{k}]$

Apply clustering algorithm (e.g. K-Means Clustering) to $t_{1}, t_{2}, \dots, t_{n}$ , where $t_{i}$ is the $i$ -th row of $T$ .

Mathematical Background

Consider a cluster assignment vector $z_{i}$ for each data point. Then, clustering $n$ observations corresponds to estimating the $z_{i}$ 's. The entries $a_{ij}$ of the Adjacency Matrix represent the similarity between points $i$ and $j$ .

Under the assumption that close data points (large $a_{ij}$ ) have similar labels ( $z_{i} \approx z_{j}$ ), the optimization problem is set up to minimize the difference between the assignments of similar points. $\arg\min_{Z} \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} a_{ij} (z_{i} - z_{j})^{2} \text{ subject to } Z^{⊺} Z = I$ where the constraint is imposed to avoid the trivial solution $z_{i} = 0$ .

In matrix notation, the problem is $\arg\min_{Z} tr (Z^{⊺} LZ) \text{ subject to } Z^{⊺} Z = I$ where $L = D - A$ is the graph Laplacian, with $D$ the diagonal matrix of the row sums of the Adjacency Matrix $A$ .
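
The equivalence of the pairwise form and the matrix form can be checked numerically for a single assignment vector $z$ , assuming $L$ is the unnormalized graph Laplacian $D - A$ of a symmetric adjacency matrix $A$ . The matrix and vector below are arbitrary.

```python
def laplacian(A):
    """L = D - A, where D holds the row sums of A on its diagonal."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def pairwise_cost(A, z):
    n = len(A)
    return 0.5 * sum(A[i][j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(n))

def quadratic_form(L, z):
    n = len(L)
    return sum(z[i] * L[i][j] * z[j] for i in range(n) for j in range(n))

A = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.2],
     [0.5, 0.2, 0.0]]   # symmetric similarities
z = [1.0, -2.0, 0.5]
print(pairwise_cost(A, z), quadratic_form(laplacian(A), z))
```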

Pairs with large $a_{ij}$ are pushed toward $z_{i} \approx z_{j}$ , while pairs with small $a_{ij}$ may have $z_{i} \neq z_{j}$ at little cost. So, $a_{ij}$ works as the weight for each pair in the optimization.

The Lagrangian function is defined as $L_{P} = tr (Z^{⊺} LZ) - tr (Λ (Z^{⊺} Z - I))$ And the solution of the problem is given by an Eigenvalue problem. $\frac{\partial L_{P}}{\partial Z} = 2 LZ - 2 Z Λ = 0 \Rightarrow LZ = Z Λ \Rightarrow Z^{⊺} LZ = Λ$ The solution $Z$ is an $n \times n$ matrix of eigenvectors sorted in ascending order of their corresponding eigenvalues.
Link to original

Procrustes Analysis
Definition

Procrustes Transformation without Scaling

Consider two $n \times p$ matrices, $X_{1}$ and a target $X_{2}$ . We want to find the Procrustes transformation of $X_{1}$ whose result is closest to $X_{2}$ .

The optimization problem is defined as $\min_{μ, R} \| X_{2} - (X_{1} R + 1_{n} μ) \|_{F}$ where $R$ is a $p \times p$ Orthonormal Matrix (performing rotation and reflection), $μ_{1 \times p}$ acts as a location parameter, and $\| \cdot \|_{F}$ is the Frobenius Norm.

Let $\tilde{x}_{1}$ and $\tilde{x}_{2}$ be the columnwise mean vectors of the matrices, and $\tilde{X}_{1}$ and $\tilde{X}_{2}$ be the demeaned matrices. Consider the SVD $\tilde{X}_{1}^{⊺} \tilde{X}_{2} = UD V^{⊺}$ . Then, the solution of the optimization problem is given by $\hat{R} = U V^{⊺}, \hat{μ} = \tilde{x}_{2} - \hat{R} \tilde{x}_{1}$ and the minimal distance is referred to as the Procrustes distance.

Procrustes Transformation with Scaling

Consider demeaned matrices $X_{1}$ and $X_{2}$ . The Procrustes distance with scaling is obtained from more general optimization problem. $β, R min ∣∣ X_{2} - β X_{1} R ∣ ∣_{F}$ where $β > 0$ is a positive scalar.

The solution for $R$ is as before ( $\hat{R} = U V^{⊺}$ ), with $\hat{β} = tr (D) /∣∣ X_{1} ∣ ∣_{F}^{2}$

Procrustes Average

Consider demeaned and scaled matrices $\{X_{l} \mid l = 1, 2, \dots, L\}$ . The Procrustes average problem finds the shape $M$ closest in average squared Procrustes distance to all the given shapes. $\min_{\{R_{l}\}_{1}^{L}, M} \sum_{l = 1}^{L} \| X_{l} R_{l} - M \|_{F}^{2}$ This is solved by a simple algorithm

Initialize $M = X_{1}$

Solve the $L$ Procrustes rotation problems with $M$ fixed, yielding $X_{l}^{'} \leftarrow X_{l} \hat{R}_{l}$

Let $M \leftarrow \frac{1}{L} \sum_{l = 1}^{L} X_{l}^{'}$

Iterate the rotation and averaging steps until convergence.

Link to original

Kernel Principal Component Analysis
Definition

Consider an $n \times p$ demeaned matrix $X$ . By SVD, $X = UD V^{⊺}$ , where $U, V$ are orthonormal matrices and $D$ is a Diagonal Matrix. Then, $X X^{⊺} = UD V^{⊺} (UD V^{⊺})^{⊺} = U D^{2} U^{⊺}$ By PCA, $Z := XV = UD$ . For a linear kernel $K = X X^{⊺}$ , $K = Z D U^{⊺} \Rightarrow K (D U^{⊺})^{- 1} = Z$ . Since $U$ is an Orthonormal Matrix, $Z = KU D^{- 1} \Leftrightarrow z_{im} = \sum_{j = 1}^{n} \frac{u_{jm}}{d_{m}} K (x_{i}, x_{j})$

The kernel principal components are given by the solution of the optimization problem $\max_{g_{i} \in H_{K}} Var\, g_{i} (X) \text{ subject to } \| g_{i} \|_{H_{K}} = 1, ⟨ g_{i}, g_{j} ⟩_{H_{K}} = 0 \ \forall j < i$ where $g_{i} (x) = \sum_{j = 1}^{n} \frac{u_{ji}}{d_{i}} K (x, x_{j})$
Link to original
