Preliminaries

Matrix Algebra

Orthonormal Matrix

Definition

The matrix whose column and rows are orthonormal vectors.

Examples

Facts

Every orthonormal matrix is Unitary Matrix.

For a vector , which can be expressed as a linear combination of column vectors of the orthonormal matrix , we can easily find the coefficients of the linear combination using the Inner product.

Link to original

Eigendecomposition

Definition

where is a matrix and

If is a scalar multiple of non-zero vector , then is the eigenvalue and is the eigenvector.

Characteristic Polynomial

The values satisfying the characteristic polynomial, are the eigenvalues of the matrix

Eigenspace

The set of all eigenvectors of corresponding to the same eigenvalue, together with the zero vector.

The Kernel of the matrix

Eigenvector: non-zero vector in the eigenspace

Algebraic Multiplicity

Let be an eigenvalue of an matrix . The algebraic multiplicity of the eigenvalue is its multiplicity as a root of a Characteristic Polynomial, that is, the largest k such that

Geometric Multiplicity

Let be an eigenvalue of an matrix . The geometric multiplicity of the eigenvalue is the dimension of the Eigenspace associated with the eigenvalue.

Computation

  1. Find the solution^[eigenvalues] of the Characteristic Polynomial.
  2. Find the solution^[eigenvectors] of the Under-Constrained System using the found eigenvalue.

Facts

There exists at least one eigenvector corresponding to the eigenvalue

Eigenvectors corresponding to different eigenvalues are always linearly independent.

When is a normal or real Symmetric Matrix, the eigendecomposition is called Spectral Decomposition

The Trace of the matrix is equal to the sum of the eigenvalues of the matrix.

Proof Since the matrix only has term on diagonal, and the calculation of cofactor deletes a row and a column, the coefficient of of is always . Also, since are the solution of the Characteristic Polynomial, the expression is factorized as and the coefficient of become a Therefore, and .

The Determinant of the matrix is equal to the product of the eigenvalues of the matrix.

Not all matrices have linearly independent eigenvectors.

When holds,

  • the eigenvalues of are and the eigenvectors of the matrix are the same as .
  • the eigenvalues of is
  • the eigenvalues of is
  • the eigenvalues of is
  • the eigenvalues of is

An eigenvalue’s Geometric Multiplicity cannot exceed its Algebraic Multiplicity

the matrix is diagonalizable the Geometric Multiplicity is equal to the Algebraic Multiplicity for all eigenvalues

Let be a symmetric matrix. Then, has eigenvalues equal to and the rest zero .

Link to original

The non-zero eigenvalues of are the same as those of .

Link to original

Projection Matrix

Definition

For some vector , is the projection of onto

Facts

The projection matrix is symmetric Idempotent Matrix

Consider a Symmetric Matrix , then is idempotent with rank if and only if eigenvalues are and eigenvalues are .

The projection matrix is Positive Semi-Definite Matrix

If and are projection matrices, and is Positive Semi-Definite Matrix, then

  • is a projection matrix.
Link to original

QR Decomposition

Definition

Decomposition of matrix into a product of an orthonormal matrix and an upper triangular matrix .

Computation

Using the Gram Schmidt Orthonormalization

QR decomposition can be computed by Gram Schmidt Orthonormalization. Where is the matrix of orthonormal column vectors obtained by the orthonormalization and

Link to original

Cholesky Decomposition

Definition

Decomposition of a Positive-Definite Matrix into the product of lower triangular matrix and its Conjugate Transpose.

Computation

Let be Positive-Definite Matrix. Then, By setting the first-row first-column entry , can find other entries using substitution. By summarizing, ,

Link to original

Spectral Theorem

Definition

where is a Unitary Matrix, and

A matrix is a Hermitian Matrix if and only if is Unitary Diagonalizable Matrix.

Facts

Every Hermitian Matrix is diagonalizable, and has real-valued Eigenvalues and orthonormal eigenvector matrix.

For the Hermitian Matrix , the every eigenvalue is real.

Proof For the eigendecomposition , . Since is hermitian, and norm is always real. Therefore, every eigenvalue is real.

For the Hermitian Matrix , the eigenvectors from different eigenvalues are orthogonal.

Proof Let the eigendecomposition where . By the property of hermitian matrix, and So, . Since , . Therefore, are orthogonal.

Link to original

Singular Value Decomposition

Definition

An arbitrary matrix can be decomposed to .

For where and and is Diagonal Matrix

For where and and is Diagonal Matrix

Calculation

For matrix

is the matrix of orthonormal eigenvectors of is the matrix of orthonormal eigenvectors of is the Diagonal Matrix made of the square roots of the non-zero eigenvalues of and sorted in descending order.

If the eigendecomposition is , then the eigenvalues and the eigenvectors are orthonormal by Spectral Theorem. If the , then . Where are called the singular values.

Now, the Orthonormal Matrix is calculated using the singular values. Where is the Orthonormal Matrix of eigenvectors corresponding to the non-zero eigenvalues and is the Orthonormal Matrix of eigenvectors corresponding to zero eigenvalues where each , and is a rectangular Diagonal Matrix.

Since is an orthonormal matrix and is a Diagonal Matrix, . Now, the Orthonormal Matrix is calculated using the linear system and the null space of , Where is the Orthonormal Matrix of the vectors obtained from the system equation And is the Orthonormal Matrix of the vectors . which is corresponding to zero eigenvalues

The Orthonormal Matrix can also be formed by the eigenvectors of similarly to calculating of .

Facts

are the Unitary Matrix

Let be a real symmetric Positive Semi-Definite Matrix Then, the Eigendecomposition(Spectral Decomposition) of and the singular value decomposition of are equal.

where and are non-negative and the shape of the every matrix is

Visualization

every matrix can be decomposed as a

  • : rotation and reflection
  • : scaling
  • : rotation and reflection
Link to original

Non-Negative Matrix Factorization

Definition

where

Non-negative matrix factorization (NMF) is an algorithm where a matrix is factorized into two matrices and have no negative elements.

The matrices and are found by maximizing

Link to original

Model Assessment and Selection

Bias-Variance Decomposition

Definition

Assume that where and . Then, under squared error loss.

Link to original

Mallow's Cp

Definition

where is the number of explanatory variables, is a sample variance under the full model, and

Mallow’s is used to assess the fit of a regression model. A small value of means that the model is relatively precise.

Link to original

Akaike Information Criterion

Definition

where is the MLE of the postulated model’s parameter

Link to original

Bayesian Information Criterion

Definition

Facts

BIC more penalize than AIC when is large

Link to original

Vapnik–Chervonenkis Dimension

Definition

Vapnik–Chervonenkis (VC) dimension is measure of the size of a class of sets. The VC dimension of the class is the largest number of points that can be shattered (can always learn a perpect classifier for any labeling) by members of .

Examples

  • The VC dimension of linear indicator functions in the plane is 3.
  • has infinite VC dimension.
Link to original

Cross Validation

Definition

Partition a set of data into sets and denote a function such that . For the dataset , let be the estimator based on observations except , then the cross validation estimation for the prediction error is defined as where is a loss function

Facts

If the data is partitioned into group with equal size, then it is called a -fold cross validation.

Link to original

Leave-One-Out-Cross-Validation

Definition

Cross Validation with

Link to original

Generalized Croos Validation

Convenient approximation to Leave-One-Out-Cross-Validation of a linear model ,

Link to original

Maximum Likelihood Methods

Maximum Likelihood Estimation

Definition

MLE is the method of estimating the parameters of an assumed Distribution

Let be Random Sample with PDF , where , then the MLE of is estimated as

Regularity Conditions

  • R0: The pdfs are distinct, i.e.
  • R1: The pdfs have same supports
  • R2: The true value is an interior point in
  • R3: The pdf is twice differentiable with respect to
  • R4:
  • R5: The pdf is three times differentiable with respect to , , and interior point

Properties

Functional Invariance

If is the MLE for , then is the MLE of

Consistency

Under R0 ~ R2 Regularity Conditions, let be a true parameter, is differentiable with respect to , then has a solution such that

Asymptotic Normality

Under the R0 ~ R5 Regularity Conditions, let be Random Sample with PDF , where , be a consistent Sequence of solutions of MLE equation , and , then where is the Fisher Information.

By the asymptotic normality, the MLE estimator is asymptotically efficient under R0 ~ R5 Regularity Conditions

Asymptotic Confidence Interval

By the asymptotic normality of MLE, Thus, confidence interval of for is

Delta method for MLE Estimator

Under the R0 ~ R5 Regularity Conditions, let be a continuous function and , then

Facts

Under R0 and R1 regularity conditions, let be a true parameter, then

Link to original

The EM Algorithm

Expectation-Maximization Algorithm

Definition

Let be an observed data, be an unobserved (latent) variable, are independent, be a joint pdf of , be a joint pdf of , be a conditional pdf of given

By the definition of a conditional pdf, we have the identity

The goal of the EM algorithm is maximizing the observed likelihood using the complete likelihood .

Using the definition conditional pdf, we derive the identity for an arbitrary but fixed

&= \int \ln[h(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})] k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0}) d \mathbf{Z} - \int \ln[k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta})]k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0})d\mathbf{Z}\\ &= E_{\boldsymbol{\theta}_{0}}[\ln L^{c}(\boldsymbol{\theta}|\mathbf{X}, \mathbf{Z}) | \boldsymbol{\theta}_{0}, \mathbf{X}] - E_{\boldsymbol{\theta}_{0}}[\ln k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}) | \boldsymbol{\theta}_{0}, \mathbf{X}] \end{aligned}$$ Let the first term of RHS be a quasi-likelihood function $$Q(\boldsymbol{\theta} | \boldsymbol{\theta}_{0}, \mathbf{X}) := E_{\boldsymbol{\theta}_{0}}[\ln L^{c}(\boldsymbol{\theta}|\mathbf{X}, \mathbf{Z}) | \boldsymbol{\theta}_{0}, \mathbf{X}]$$ EM algorithm maximizes $Q(\boldsymbol{\theta} | \boldsymbol{\theta}_{0}, \mathbf{X})$ instead of maximizing $\ln L(\boldsymbol{\theta}|\mathbf{X})$ # Algorithm 1. Expectation Step: Compute $$Q(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X}) := E_{\hat{\boldsymbol{\theta}}^{(m)}}[\ln L^{c}(\boldsymbol{\theta}|\mathbf{X}, \mathbf{Z}) | \hat{\boldsymbol{\theta}}_{m}, \mathbf{X}]$$ where the $m = 0, 1, \dots$, and the expectation is taken under the conditional pdf $k(\mathbf{Z} | \mathbf{X}, \hat{\boldsymbol{\theta}}^{(m)})$ 2. Maximization Step: $$\hat{\boldsymbol{\theta}}^{(m+1)} = \underset{\boldsymbol{\theta}}{\operatorname{arg max}} Q(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X})$$ # Properties ## Convergence The [[Sequence]] of estimates $\hat{\boldsymbol{\theta}}^{(m)}$ satisfies $$L(\hat{\boldsymbol{\theta}}^{(m+1)}|\mathbf{X}) \leq L(\hat{\boldsymbol{\theta}}^{(m)}|\mathbf{X})$$ Therefore the [[Sequence]] of EM estimates converge to (at least local) optimalLink to original

Bayesian Methods

Bayes Risk

Definition

r(\theta, \delta) &= \int_{\Theta} R(\theta, \delta) \pi(\theta) d\theta = E_\theta[R(\theta, \delta)] = E_\theta[E_{x}[L(\theta, \delta(x))]]\\ &= \int_{\Theta} \int_{X} L(\theta, \delta(x))p(x|\theta) \pi(\theta) dx d\theta = \int_{X} \int_{\Theta} L(\theta, \delta(x))p(\theta|x)p(x) d\theta dx\\ &= \int_{X}E_\theta[L(\theta, \delta(x))|X=x]p(x)dx = \int_{X}\rho(x, \pi)p(x)dx \end{aligned}$$ where $L(\theta, \delta)$ is [[Loss Function]], $R(\theta, \delta)$ is [[Risk Function]], and $\rho(x, \pi)$ is a [[Posterior Risk]].Link to original

Posterior Risk

Definition

where is Loss Function, is Risk Function, and is a posterior probability

Link to original

Bayes Estimator

Definition

Estimator that minimizes the Bayes Risk or Posterior Risk

Facts

Under Squared Error Loss, the Bayes estimator is a posterior mean, and a posterior mode under Absolute Error Loss.

Consider a regression model and the prior distribution for , . Then, the Bayes estimator under the Squared Error Loss is obtained as If for some and , then the Bayes estimator is the same as ridge estimator If and , then the Bayes estimator is the James-Stein regression estimator

Link to original

Monte Carlo Integration

Definition

where and is a PDF

Link to original

Gibbs Sampling

Definition

A MCMC algorithm for sampling from a specified Multivariate Distribution when direct sampling from the joint distribution is difficult, but sampling from the Conditional Distribution is more practical.

Algorithm

  1. Begin with some initial value
  2. For generate from
  3. Repeat the above step times.

Examples

Consider a bivariate random variable , and suppose we want to obtain samples from the marginals . Take samples from the conditional distributions and . As , the samples are eventually samples from and respectively.

Facts

It is common to discard some number of samples at the beginning (burn-in period).

Link to original

Regression

Parametric Regression

Simple Linear Regression

Definition

where ‘s are i.i.d. error terms, with and

Link to original

Multiple Linear Regression

Definition

where ‘s are i.i.d. error terms, with and

Matrix Notations

Let and , then the regression model can be express the model as

Let , , , then the regression model can be express as where , and

Link to original

Ridge Regression

Definition

where is a complexity parameter that controls the amount of shrinkage.

Ridge regression is particularly useful to mitigate the problem of Multicollinearity in linear regression

Facts

where by Singular Value Decomposition

Link to original

Lasso Regression

Definition

Lasso model assume that the coefficients of the model are sparse.

Link to original

Elastic Net Regularization

Definition

Elastic net is a regularized regression method that linearly combined the penalties of the lasso and ridge regressions.

Link to original

Principal Component Analysis

Definition

PCA is a linear dimensionality reduction technique. The correlated variables are linearly transformed onto a new coordinate system such that the directions capturing the largest variance in the data.

Population Version

Given a random vector , we find a such that is maximized: Equivalently, by the Method of Lagrange Multipliers with , By differentiation, the is given by the eigen value problem Thus the maximizing the variance of is the eigenvector corresponding to the largest Eigenvalue.

Sample Version

Given a data matrix , by Singular Value Decomposition, A matrix can be factorized as . By algebra, , where we call the -th principal component.

Facts

Since and

Link to original

Partial Least Squares Regression

Definition

where is the sample covariance matrix.

Unlike PCR maximizing only, PLS finds directions maximizing both and

Link to original

Grouped Lasso

where is the number of groups, is the number of variables within the group , and is Euclidean Norm

In grouped lasso, the variables in each group share a regularization parameter.

Link to original

nonparametricNonparametric Regression

Piecewise Polynomials

Definition

Suppose that knots are . Then, the basis functions for order-M spline are defined as: where

Examples

For cubic Polynomials with two knots , the basis functions are:

Facts

Order of piecewise polynomials

  • constant: 1
  • linear: 2
  • quadratic: 3
  • cubic: 4

Link to original

Natural Cubic Splines

Definition

Suppose that knots are . Then, the basis functions for natural cubic Spline are defined as: where

Link to original

Smoothing Splines

Definition

where is a smoothing parameter. If , the smoothing spline is any interpolating spline. If , the smoothing spline is Least Square line.

Smoothing splines is a Spline basis method that avoids the knot selection problem. Among all functions , find one that minimizes the penalized RSS.

Facts

The unique analytic solution of smoothing spline is a Natural Cubic Splines with knots at the unique values of the ‘s

Link to original

Classification

Introduction

Bayes Classifier

Definition

where is conditional PDF of given and is prior

Properties

Optimality of Bayes Rule

Bayes classifier minimizes the prediction risk over all classifiers

Facts

For a given , the Bayes classifier is deterministic, i.e. it does not depend on the training sample. It can not be used in practice since is unknown.

Bayes classifier maximizes the posterior pdf.

Link to original

Parametric Classifier

Linear Discriminant Analysis

Definition

where is the sample ratio, is the sample mean, and is the sample variance covariance matrix

LDA is a Bayes Classifier with an assumption that

Facts

LDA assume homogeneity of variance-covariance among classes. Quadratic Discriminant Analysis may be used when the covariances are not equal.

Link to original

Quadratic Discriminant Analysis

Definition

where is the sample ratio, is the sample mean, and is the sample variance covariance matrix

Bayes Classifier with an assumption that

Link to original

Logistic Regression

Definition

Logistic regression is a Generalized Linear Model with Logit link function. Logistic regression models the log-odds of an event as a linear combination of independent variables. The probability is calculated as

Link to original

Nonparametric Classifier

Kernel Density Estimation

Definition

where is the kernel, is the scaled kernel, and is a smoothing parameter (bandwidth, or window width)

KDE is a non-parametric method to estimate PDF of a random variable based on kernel and wrights.

Multidimensional KDE

where is the kernel, is the scaled kernel, and is a positive-definite bandwidth matrix.

Facts

Widely used kernels

Link to original

Kernel Based Classifier

Definition

where and

Bayes Classifier with an assumption that is the Kernel Density Estimation.

Link to original

K-Nearest Neighbors Algorithm

Definition

Suppose is the distance between the training sample and the given point . Let

The k-NN determines a sample’s class using the -nearest training datum.

Facts

and are hyperparameters.

Link to original

Decision Tree

Definition

Decision tree consists of Classification Tree and Regression Tree.

Link to original

Regression Tree

Definition

Decision tree algorithm consists of growing and pruning step. Suppose

  1. growing: seek the splitting variable and split point that solves
  2. pruning: Suppose the number of terminal nodes in a tree is and is the split regions corresponding to the terminal node. For a tree , define the cost complexity function. where is a tuning parameter regularizing complexity.

Regression function is estimated by A non-parametric regression that partitions explanatory variables into a set of rectangles and takes the average of the response variables in each rectangle as an estimate of the regression function in that region.

Link to original

Classification Tree

Definition

Suppose , and Define a variable representing the proportion of class observations in a node We classify the observations in a node to class Suppose the number of terminal nodes in a tree is and is the split regions corresponding to the terminal node. For a tree , define the cost complexity function. where is an impunity function, is a tuning parameter regularizing complexity.

The tree minimizes is the selected as optimal.

Impunity Functions

  • Misclassification error:
  • Gini index:
  • Cross entropy on deviance:
Link to original

Information Gain

Definition

The information gain of a node of a decision tree is defined as where is the impunity of node , is the number of samples used in , are children nodes of the .

Link to original

Feature Importance

Definition

The feature importance of a feature used for Decision Tree is defined as where is the Information Gain of a node , and is a set of nodes using the feature

Link to original

Confusion Matrix

Definition

Predicted Positive (PP)Predicted Negative (PN)
Actual PositiveTrue Positive (TP)False Negative (FN)
Actual NegativeFalse Positive (FP)True Negative (TN)

Metrics

Accuracy

Definition

Link to original

Recall

Definition

Recall, Sensitivity, or True positive rate means that the rate of correctly predicted cases out of all the actual positive cases..

Link to original

Precision

Definition

Precision means that the rate of actually positive cases out of cases predicted as positive.

Link to original

Specificity

Definition

Specificity means that the rate of correctly predicted cases out of all the actual negative cases.

Link to original

Type 1 Error

Definition

Link to original

Type 2 Error

Definition

Link to original

F-Score

Definition

F1 Score

The harmonic mean of Precision and Recall.

F-beta score

where

Recall is considered times as important as Precision.

Link to original

Positive Predictive Value

Definition

where

Positive predictive value (in medical statistics and epidemiology) means that the rate of actually positive cases out of cases predicted as positive.

Link to original

Link to original

Receiver Operating Characteristic Curve

Definition

A receiver operating characteristic curve (ROC curve) is a plot of the True Positive Rate and False Negative Rate at each threshold setting.

AUC

The area under the ROC curve is called AUC (Area under curve)

Link to original

Support Vector Machines

Hyperplane

Definition

Vector Equation for a hyperplane

Facts

Suppose the dimension of is

  • : point in -dimensional space.
  • : line in -dimensional space.
  • : plane in -dimensional space.
  • : hyperplane in -dimensional space.

is perpendicular to the surface

where is a signed distance, , and

Link to original

Kernel Function

Definition

where

The kernel function returns the result of Inner product in by using only the original input data.

Examples

Some common kernel functions are as follows:

  • Linear:
  • Polynomial:
  • Gaussian radial basis function:
  • Sigmoid:
Link to original

Support Vector Machine

Definition

Hard-Margin Support Vector Machine

Assume that we have a learning set Suppose given two classes of data can separate by a Hyperplane without error. The hyperplane is called a separating hyperplane (SH)

Define , , and the margin of SH A SH which maximizes its margin () is called an optimal separating hyperplane (OSH). To find OSH, set linear constraints and satisfy where the minimum distance is arbitrary and may differ.

Let and . Then the points lying either on or are called support vectors. If and , then and

Therefore, the OHS can be obtained by a Convex Optimization problem. It can be solved by Method of Lagrange Multipliers with

By Duality of optimization problem, we have the dual optimization problem is defined as where

And the dual Lagrangian function is defined as

The primal optimization problem is convex and satisfies KKT Conditions. Thus, holds Strong Duality and the solution of the dual problem is the same as the primal problem.

The optimal parameter is obtained as where is an index set of support vectors

The optimal hyperplane can be written as and the classification rule is given by

Soft-Margin Support Vector Machine

where is the regularization parameter, and is called a slack variable. If , the point is out of margin. On the other hand, if , then the point is within the margin.

The Lagrangian primal function is defined as with and .

And the dual function is defined as with and where

and the dual optimization problem is defined as The primal optimization problem is convex and satisfies KKT Conditions. Thus, holds Strong Duality and the solution of the dual problem is the same as the primal problem.

The optimal parameter is obtained as where is an index set of support vectors.

Non-Linear Support Vector Machine

Non-linear SVM finds an optimal separating Hyperplane in high-dimensional feature space . It accomplished by the kernel trick. The kernel trick is that instead of computing inner products in , compute them using a non-linear Kernel Function in input space.

Hard-Margin Non-Linear Support Vector Machine

If the data can be separated in , then the dual optimization problem is defined as where

The optimal separating Hyperplane in the is

and the decision rule is defined as

Soft-Margin Non-Linear Support Vector Machine

In the non-separable case, the dual problem is defined as where

The optimal separating Hyperplane in the is

and the decision rule is defined as

Link to original

Support Vector Regression

Definition

Support vector regression (SVR) depends only on a subset of the training data, because the loss function ignores any training data close to the model prediction, within a band or tube.

The primal optimization problem is defined as

\min &\left( \cfrac{1}{2}||\boldsymbol{\beta}||^{2} + C\sum\limits_{i=1}^{n}(\xi_{i} + \xi'_{i}) \right)\\ \text{subject to}\ &y_{i} - (\beta_{0} + \boldsymbol{\beta}^{\intercal}\mathbf{x}_{i}) \leq \epsilon + \xi'_{i},\\ &(\beta_{0} + \boldsymbol{\beta}^{\intercal}\mathbf{x}_{i}) - y_{i} \leq \epsilon + \xi_{i},\\ &\ \xi'_{i} \geq 0, \ \xi_{i} \geq 0\quad i=1,\dots,n \end{aligned}$$ The term $\cfrac{1}{2}||\boldsymbol{\beta}||^{2}$ in the objective function, which appears with the intention of maximizing the margin of SVM, acts as a regularization parameter in SVR. The Lagrangian primal function is defined as $$ \begin{aligned} L_{P} &= \frac{1}{2}||\boldsymbol{\beta}||^{2} + C\sum_{{i=1}}^{{n}} (\xi_{{i}} + \xi'_{{i}}) \\ &+ \sum_{{i=1}}^{{n}} \alpha'_{{i}} \left( y_{{i}} - (\beta_{0} + \boldsymbol{\beta}^{\intercal} \mathbf{x}_{{i}}) - \epsilon - \xi'_{{i}} \right) + \sum_{{i=1}}^{{n}} \alpha_{{i}} \left( (\beta_{0} + \boldsymbol{\beta}^{\intercal} \mathbf{x}_{{i}}) - y_{{i}} - \epsilon - \xi_{{i}} \right)\\ &- \sum_{{i=1}}^{{n}} \mu'_{{i}} \xi'_{{i}} - \sum_{{i=1}}^{{n}} \mu_{{i}} \xi_{{i}} \end{aligned} $$ with $\mathbb{\alpha}, \mathbb{\alpha}', \boldsymbol{\mu}, \boldsymbol{\mu}' \geq \mathbf{0}$ The dual optimization is defined as $$ \begin{aligned} \max\ &\left( \mathbf{y}^{\intercal} (\mathbf{\alpha} - \mathbf{\alpha'}) - \epsilon \mathbf{1}^{\intercal} (\mathbf{\alpha} + \mathbf{\alpha'}) - \frac{1}{2} (\mathbf{\alpha} - \mathbf{\alpha'})^{\intercal} \mathbf{K} (\mathbf{\alpha} - \mathbf{\alpha'}) \right) \\ \text{subject to}\ &\mathbf{1}_{n}^{\intercal} (\mathbf{\alpha} - \mathbf{\alpha'}) = 0, \mathbf{0} \leq \mathbf{a}, \mathbf{a}' \leq C\mathbf{1}_{n}\quad i=1,\dots,n \end{aligned}$$ And the dual Lagrangian function is defined as $$\begin{aligned} L_{D} = &\ \mathbf{y}^{\intercal} (\mathbf{\alpha} - \mathbf{\alpha'}) - \epsilon \mathbf{1}^{\intercal} (\mathbf{\alpha} + \mathbf{\alpha'}) - \frac{1}{2} (\mathbf{\alpha} - \mathbf{\alpha'})^{\intercal} \mathbf{K} (\mathbf{\alpha} - \mathbf{\alpha'}) \\ &+ \lambda \mathbf{1}_{n}^{\intercal} (\mathbf{\alpha} - \mathbf{\alpha'})\\ &- \boldsymbol{\mu}^{\intercal} \mathbf{\alpha} - \boldsymbol{\mu}'^{\intercal} \mathbf{\alpha'}+ \boldsymbol{\mu}_{C}^{\intercal} (C\mathbf{1}_{n} - \mathbf{\alpha}) + \boldsymbol{\mu}_{C}'^{\intercal} (C\mathbf{1}_{n} - \mathbf{\alpha'}) \end{aligned}$$ with $\boldsymbol{\mu}, \boldsymbol{\mu}', \boldsymbol{\mu}_{C}, \boldsymbol{\mu}_{C}' \geq \mathbf{0}$ where $\mathbf{K}$ is a [[Kernel Function]].Link to original

Ensemble Learning

Bootstrap Aggregating

Definition

Regression Setting

Given a training set , draw a bootstrap sample from training data. Bagging estimator of at is defined as where is the prediction at based on a new training set

Classification Setting

Given a training set , draw a bootstrap sample from training data. Bagging estimator of at is defined as where is the prediction at based on a new training set

Facts

Bagging in a classification setting with zero-one loss often fails. Bagging a good classifier can make it better, but bagging a bad classifier can make it worse.

Bagging reduces the variance of an estimated prediction function. It seems to work especially well for high-variance, low-bias procedures, such as trees.

Link to original

Boosting

Definition

Boosting builds a strong learner by combining many weak learners whose accuracies are slightly better than a random guess.

Examples

AdaBoost

Link to original

AdaBoost

Definition

The AdaBoost (Adaptive Boosting) is the weighted sum of weak learners that are robust to outliers and noise.

Algorithm

Consider a 2-class problem with

  1. Initialize the weights
  2. Repeat for :
    1. Fit a weak classifier that minimizes the weighted error rate and call the fitted classifier , and its corresponding error rate
    2. Compute
    3. Update the weights by
  3. Output the classifier

Facts

AdaBoost is equivalent to Gradient Boosting using the exponential loss

Link to original

Gradient Boosting

Definition

The gradient boosting is a type of Forward Stagewise Additive Modeling that uses gradients to calculate the residuals.

Algorithm

  1. Initialize the model
  2. Repeat for :
    1. Calculate the residuals for each data point where is a Loss Function
    2. Fit weak learners to the residuals
    3. Calculate a learning rate
    4. Update model using the fitted weak learners
  3. Output the final model
Link to original

Forward Stagewise Additive Modeling

Definition

The forward stagewise additive modeling sequentially adds new basis functions to the model of the previous step without adjusting the parameters and coefficients of those that have already been added.

Algorithm

  1. Initialize
  2. For
    1. Compute
    2. Set where is a basis function and is a Loss Function
Link to original

Random Forest

Definition

Random forest constructs a multitude of de-correlated decision trees at training time and make a prediction through Bagging.

Algorithm

  1. For :
    1. Draw a size bootstrap sample from training data.
    2. Grow a random-forest tree to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size is reached.
      1. Select variables at random.
      2. Pick the best variable and split point among the selected variables.
      3. Split the node into two children nodes.
  2. Output the ensemble of trees to make a prediction at a new point
    • Regression:
    • Classification: where is the class prediction of the th random-forest tree.

Facts

Selection of hyperparameter is recommended as

  • For classification:
  • For regression:
Link to original

Neural Networks

Projection Pursuit Regression

Definition

where are unspecified function

Facts

PPR is a single-layered Neural Network where each neuron uses a different activation function and has no bias term.

Link to original

Activation Function

Definition

The activation function of a node in an artificial neural network is a function that calculates the output of the node based on the linear combination of its inputs. It is used to add a non-linearity to the model.

Examples

Logistic Function

Definition

The logistic function is inverse function of Logit.

Facts

Sigmoid activation function is vulnerable to vanishing gradient problem. The image of the derivative of the sigmoid function is . For this reason, after passing node with sigmoid Activation Function, the gradient is decreased

Also, with the sigmoid Activation Function, if all the inputs are positive, then all the gradients also positive.

Link to original

Hyperbolic Tangent Function

Definition

Link to original

Rectified Linear Unit Function

Definition

Facts

If an initial value is negative, it is never updated.

Link to original

ReLU6

Definition

Link to original

Gaussian-Error Linear Unit

Definition

GELU is a smooth approximation of ReLU.

where is the CDF of the standard normal distribution.

Link to original

Parametric ReLU

Definition

where is a hyperparameter

Facts

If , it is called a Leaky ReLU

Link to original

Exponential Linear Unit

Definition

where is a hyperparameter

Link to original

Swish Function

Definition

where is Sigmoid Function, and is a hyperparameter

When , the function is called a sigmoid liniear unit (SiLU).

Link to original

Link to original

Sum of Squared Errors Loss

Definition

Suppose the number of data is , then the sum of squared error loss (SSE) is defined as the sum of Squared Error Loss of all data.

Link to original

Cross-Entropy Loss

Definition

Suppose the number of data is and the number of classes is , then the cross entropy loss is defined as

Link to original

Neural Network

Definition

Neural network can be thought as a non-linear generalization of linear model.

The derived features are constructed by an Activation Function and linear combinations of the inputs. where is an Activation Function

Output nodes are the linear combinations of And the output is modeled by a function of a linear combinations of where is called an output function.

Facts

The output function varies by the problem. For regression is Identity Function, and for -class classification Softmax Function is used as the .

For regression problem, Sum of Squared Errors Loss is used as Loss Function. For classification problem, we use Cross-Entropy Loss

With the softmax activation function and the Cross-Entropy Loss, the neural network model is exactly a linear Logistic Regression model in the hidden units.

The parameters of a neural network are estimated by Backpropagation.

Neural network is especially effective in problem with a high signal-to-noise ratio.

Link to original

Dropout

Definition

Dropout is a regularization technique used for Neural Network. Dropout randomly dropping out or omitting units during training process of a Neural Network.

Link to original

Clustering Analysis

K-Means Clustering

Definition

The k-means clustering algorithm solves the following optimization problem where is the number of clusters, a Partition is a set of clusters, and The k-means clustering algorithm aims to partition observations into clusters in which each observation belongs to the cluster with the nearest mean.

Algorithm

  1. Initialize .

  2. Assignment step: Assign each observation to the cluster with the nearest mean

    where each is assigned to exactly one

  3. Update step: Recalculate means for observations assigned to each cluster.

  4. Iterate step 2 and 3 until the within cluster variation converges.

Facts

The k-means clustering uses spherical metric (Euclidean distance) to group data, so that it is useful only when clusters are convex sets.

Link to original

Self-Organizing Map

Definition

A self-organizing map (SOM) is a clustering method that produces a low-dimensional representation of a higher-dimensional data set while preserving the Topological Manifold structure of the data.

The algorithm fits a grid to high-dimensional data and assigns the data to the fitted nodes (prototypes) of the grid.

Link to original

Spectral Clustering

Definition

Spectral clustering technique make use of spectrum of the similarity matrix of the data to perform dimensionality reduction before clustering.

Algorithms

Apply Laplacian Eigenmap to the given data .

Form an sub matrix using the first -columns of the embedded matrix , and normalize each row to norm .

Apply clustering algorithm (e.g. K-Means Clustering) to , where is the -th row of .

Mathematical Background

Consider a cluster assignment vector for each data point. Then, clustering observations is corresponding to estimating ‘s The entries of the Adjacency Matrix represents the similarity between points and .

Under the assumption that the close data points (large ) have a similar label (), the optimization problem is set up that minimizing the difference between assignments for similar points. where the constraint is imposed to avoid the trivial solution .

In a matrix notation, the problem is

When , is large and when , is small. So, works as the weight for each pair in the optimization.

The Lagrangian function is defined as And the solution of the problem is given by an Eigenvalue problem. The solution is an matrix of eigenvectors sorted in ascending by their corresponding eigenvalues.

Link to original

Procrustes Analysis

Definition

Procrustes Transformation without Scaling

Consider two matrices and a target . What we want is to find a Procrustes transformation of whose result is closest to ,

The optimization problem is defined as where is a Orthonormal Matrix (perform rotation and reflection), act as location parameter, and is a Frobenius Norm.

Let and be the columnwise mean vectors of the matrices, and and be the demeaned matrices. Consider the SVD . Then, the solution of the optimization problem is given by and the minimal distance is referred to as the Procrustes distance.

Procrustes Transformation with Scaling

Consider demeaned matrices and . The Procrustes distance with scaling is obtained from more general optimization problem. where is a positive scalar.

The solution for is as before (), with

Procrustes Average

Consider demeaned and scaled matrices . Procrustes average problem finds the shape closest in average squared Procrustes distance to all the given shapes. This is solved by a simple algorithm

  1. Initialize
  2. Solve the Procrustes rotation problems with fixed, yielding
  3. Let
  4. Iterate step 1 and 2 until it converges.
Link to original

Kernel Principal Component Analysis

Definition

Consider an demeaned matrix By SVD , where are orthonormal matrices and is a Diagonal Matrix. Then, By PCA . For a linear kernel , . is an Orthonormal Matrix, The kernel principal components are given by solution of the optimization problem. where

Link to original