Loss Functions and Optimization
Loss Functions
Cross-Entropy Loss
Definition
Suppose the number of data points is $N$ and the number of classes is $C$; then the cross-entropy loss is defined as $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\ln \hat{y}_{ic}$$ where $y_{ic}$ is the one-hot label and $\hat{y}_{ic}$ is the predicted probability of class $c$ for example $i$.
Kullback-Leibler Divergence
Definition
Assume that two Probability Distributions $P$ and $Q$ are given. Then the Kullback-Leibler divergence between $P$ and $Q$ is defined as $$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\ln\frac{P(x)}{Q(x)}$$
Kullback-Leibler divergence measures how different two distributions are.
It can also be expressed as the difference between the cross entropy $H(P, Q)$ (difference between distributions $P$ and $Q$) and the entropy $H(P)$ (inherent uncertainty of $P$): $D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$.
Facts
Let $(P_n)$ be a sequence of distributions. Then:
- The convergence of the KL-Divergence to zero implies that the JS-Divergence also converges to zero.
- The convergence of the JS-Divergence to zero is equivalent to the convergence of the Total Variation Distance to zero.
- The convergence of the Total Variation Distance to zero implies that the Wasserstein Distance also converges to zero.
- The convergence of the Wasserstein Distance to zero is equivalent to the Convergence in Distribution of the sequence.
Optimization
Gradient Descent
Definition
An iterative optimization algorithm for finding a local minimum of a differentiable function:
$$\theta_{t+1} = \theta_{t} - \eta\,\nabla f(\theta_{t})$$
where $\eta$ is a learning rate
Examples
Solution of a linear system
Solve $Ax = b$ with an MSE loss.
The cost function is $J(x) = \frac{1}{2}\lVert Ax - b \rVert^{2}$ and its gradient is $\nabla J(x) = A^{\top}(Ax - b)$.
Then, the update is $x_{t+1} = x_{t} - \eta\,A^{\top}(Ax_{t} - b)$, which converges to the least-squares solution $x^{*} = (A^{\top}A)^{-1}A^{\top}b$.
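A minimal numpy sketch of this example, using a small hypothetical system and an assumed learning rate:

```python
import numpy as np

# Hypothetical 2x2 system Ax = b; values chosen only for illustration.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.zeros(2)    # initial guess
eta = 0.05         # learning rate (assumed)
for _ in range(1000):
    grad = A.T @ (A @ x - b)   # gradient of (1/2) * ||Ax - b||^2
    x -= eta * grad

print(x)                       # approaches the least-squares solution
print(np.linalg.solve(A, b))   # closed-form solution for comparison
```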
Cross Validation
Definition
Partition a set of $N$ data points into $K$ sets and denote a function $\kappa: \{1, \dots, N\} \to \{1, \dots, K\}$ such that $\kappa(i)$ is the partition containing observation $i$. For the dataset, let $\hat{f}^{-k}$ be the estimator based on all observations except those in partition $k$; then the cross validation estimate of the prediction error is defined as $$CV(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N} L\!\left(y_{i},\, \hat{f}^{-\kappa(i)}(x_{i})\right)$$ where $L$ is a loss function.
Facts
If the data is partitioned into $K$ groups of equal size, it is called a $K$-fold cross validation.
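A minimal sketch of this estimate, assuming a hypothetical least-squares fit and squared-error loss:

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, loss, K=5, seed=0):
    """K-fold cross-validation estimate of the prediction error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])      # fit on all observations except fold k
        errors.append(loss(y[test], predict(model, X[test])))
    return float(np.mean(errors))

# Assumed helpers: ordinary least squares and squared-error loss.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
mse = lambda y, yhat: np.mean((y - yhat) ** 2)
```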
Neural Networks
Neural Networks and Backpropagation
Neural Network
Definition
A neural network can be thought of as a non-linear generalization of a linear model.
The derived features $Z_{m}$ are constructed from linear combinations of the inputs passed through an Activation Function: $Z_{m} = \sigma(\alpha_{0m} + \alpha_{m}^{\top}X)$, where $\sigma$ is an Activation Function.
The output nodes are linear combinations of the derived features, $T_{k} = \beta_{0k} + \beta_{k}^{\top}Z$, and the output is modeled by a function of these linear combinations, $f_{k}(X) = g_{k}(T)$, where $g_{k}$ is called an output function.
Facts
The output function varies by problem. For regression, $g_{k}$ is the Identity Function, and for $K$-class classification the Softmax Function is used as $g_{k}$.
For regression problems, the Sum of Squared Errors Loss is used as the Loss Function. For classification problems, we use the Cross-Entropy Loss.
With the softmax activation function and the Cross-Entropy Loss, the neural network model is exactly a linear Logistic Regression model in the hidden units.
The parameters of a neural network are estimated by Backpropagation.
Neural networks are especially effective in problems with a high signal-to-noise ratio.
Backpropagation
Definition
Backpropagation is a gradient estimation method used for training neural networks. The gradient of a loss function with respect to the weights is computed iteratively from the last layer to the input.
Without backpropagation, we would need to calculate the gradient of the weights of each layer independently. With backpropagation, however, we can reuse previously computed gradients and avoid redundant calculations. Also, from a computational standpoint, the gradient calculations within the same layer can be parallelized.
Patterns in Gradient Flow
Algorithm
- Make a prediction and calculate the loss using the data (feedforward step)
- Compute gradients using the chain rule and the results obtained from the feedforward step, then update the weights (backpropagation)
The downstream gradients of a node are calculated by the product of the upstream gradient and the local gradient.
Scalar Case
Vector Case
Examples
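A scalar toy example of the feedforward and backpropagation steps for a single sigmoid unit with squared-error loss (values are arbitrary):

```python
import numpy as np

x, y = 2.0, 1.0            # input and target
w, b = 0.5, -1.0           # parameters

# Feedforward: compute the prediction and the loss.
z = w * x + b
a = 1 / (1 + np.exp(-z))   # sigmoid activation
loss = (a - y) ** 2

# Backpropagation: downstream gradient = upstream gradient * local gradient.
dL_da = 2 * (a - y)        # upstream gradient at the output
da_dz = a * (1 - a)        # local gradient of the sigmoid
dL_dz = dL_da * da_dz      # reused for both parameters below
dL_dw = dL_dz * x          # local gradient of z with respect to w is x
dL_db = dL_dz * 1.0        # local gradient of z with respect to b is 1
```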
Activation Function
Definition
The activation function of a node in an artificial neural network is a function that calculates the output of the node based on the linear combination of its inputs. It is used to add a non-linearity to the model.
Examples
Logistic Function
Definition
The logistic function is the inverse of the Logit function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.
Facts
The sigmoid activation function is vulnerable to the vanishing gradient problem. The image of the derivative of the sigmoid function is $(0, 0.25]$; for this reason, the gradient is decreased after passing through a node with a sigmoid Activation Function.
Also, with the sigmoid Activation Function, if all the inputs are positive, then all the gradients are also positive.
Hyperbolic Tangent Function
Definition
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Rectified Linear Unit Function
Definition
$$\mathrm{ReLU}(x) = \max(0, x)$$
Facts
If an initial pre-activation value is negative, the unit's gradient is zero, so it is never updated (the dying ReLU problem).
ReLU6
Definition
$$\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$$
Gaussian-Error Linear Unit
Definition
GELU is a smooth approximation of ReLU:
$$\mathrm{GELU}(x) = x\,\Phi(x)$$
where $\Phi$ is the CDF of the standard normal distribution.
Parametric ReLU
Definition
$$\mathrm{PReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$ where $\alpha$ is a hyperparameter
Facts
If $\alpha$ is fixed to a small constant such as $0.01$, it is called a Leaky ReLU
Exponential Linear Unit
Definition
$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^{x} - 1) & x \leq 0 \end{cases}$$ where $\alpha$ is a hyperparameter
Swish Function
Definition
$$\mathrm{Swish}(x) = x\,\sigma(\beta x)$$ where $\sigma$ is the Sigmoid Function, and $\beta$ is a hyperparameter
When $\beta = 1$, the function is called the sigmoid linear unit (SiLU).
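The activation functions above, collected in a small numpy sketch (GELU uses its exact form via scipy's error function; the hyperparameter defaults shown are assumed values):

```python
import numpy as np
from scipy.special import erf

def sigmoid(x):            return 1 / (1 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def relu6(x):              return np.minimum(np.maximum(0.0, x), 6.0)
def gelu(x):               return x * 0.5 * (1 + erf(x / np.sqrt(2)))   # x * Phi(x)
def prelu(x, alpha=0.01):  return np.where(x > 0, x, alpha * x)         # alpha = 0.01 -> Leaky ReLU
def elu(x, alpha=1.0):     return np.where(x > 0, x, alpha * (np.exp(x) - 1))
def swish(x, beta=1.0):    return x * sigmoid(beta * x)                 # beta = 1 -> SiLU
```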
Weight Initialization
LeCun Initialization
Definition
LeCun initialization is designed for Neural Network using the Tanh activation function.
For a Normal Distribution: $W \sim \mathcal{N}\!\left(0, \frac{1}{n_{in}}\right)$. For a Uniform Distribution: $W \sim U\!\left(-\sqrt{\frac{3}{n_{in}}}, \sqrt{\frac{3}{n_{in}}}\right)$, where $n_{in}$ is the number of input nodes.
Xavier Initialization
Definition
Xavier initialization is designed for Neural Network using the sigmoid or Tanh activation function.
For a Normal Distribution: $W \sim \mathcal{N}\!\left(0, \frac{2}{n_{in} + n_{out}}\right)$. For a Uniform Distribution: $W \sim U\!\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$, where $n_{in}$ is the number of input nodes, and $n_{out}$ is the number of output nodes.
He Initialization
Definition
He initialization is designed for Neural Network using the ReLU activation function.
For a Normal Distribution: $W \sim \mathcal{N}\!\left(0, \frac{2}{n_{in}}\right)$. For a Uniform Distribution: $W \sim U\!\left(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\right)$, where $n_{in}$ is the number of input nodes.
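A numpy sketch of the three initialization schemes for a dense layer of shape (n_in, n_out), following the variances given above:

```python
import numpy as np

def lecun_normal(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def xavier_uniform(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
```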
Convolutional Neural Networks
Convolutional Neural Network
Definition
A convolutional neural network (CNN) is a regularized type of fully connected Neural Network that learns features by itself via filter optimization. It consists of convolution layers.
Convolutional Layer
The layer’s parameters consist of a set of learnable filters that slide over the input image. Each filter performs a Convolution operation, computing the Dot Product between the filter values and the input values at each position. The output of the convolution operation is a feature map.
The output size is determined by the input size, filter size, padding, and stride: $$O = \left\lfloor \frac{I - F + 2P}{S} \right\rfloor + 1$$ where $I$ is the input size, $F$ is the filter size, $P$ is the padding, and $S$ is the stride.
Stride
Stride refers to the step size the convolution filter moves each time it slides over the input.
Padding
Padding adds extra border pixels around the input images. It preserves the spatial dimensions of the feature map and retains information at the borders.
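A small helper that applies the output-size formula above; the example values are illustrative only:

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """O = floor((I - F + 2P) / S) + 1 for one spatial dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(224, 3, 1, 1))   # 224: 3x3 filter, stride 1, padding 1 preserves size
print(conv_output_size(224, 7, 3, 2))   # 112: 7x7 filter, stride 2, padding 3 halves it
```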
VGG Net
Definition
VGG model is a deep Convolutional Neural Network architecture. The VGG model is characterized by its depth and uniformity. It consists of a series of convolutional layers followed by fully connected layers.
Inception Net
Definition
Inception Net model is a deep Convolutional Neural Network architecture using the inception module.
Architecture
Inception Net V1
Inception Module
The inception module is the building block of the Inception Net. It uses multiple filter sizes ($1 \times 1$, $3 \times 3$, and $5 \times 5$) and pooling operations in parallel, allowing the network to capture features at different scales simultaneously.
The $1 \times 1$ convolutions are used for dimensionality reduction, helping to reduce computational complexity.
Inception Net V2, V3
Factorized Convolution
The large convolutions (e.g. $5 \times 5$) in the inception module were replaced with multiple smaller convolutions (e.g. two stacked $3 \times 3$), reducing parameters and computational cost.
Asymmetric Convolution
An $n \times n$ convolution is decomposed into a $1 \times n$ and an $n \times 1$ convolution.
Label Smoothing
The model is prevented from becoming overconfident by applying label smoothing to the labels: $$y' = (1 - \epsilon)\,y + \frac{\epsilon}{K}$$ where $y'$ is the smoothed label, $y$ is the original one-hot encoded label, $\epsilon$ is the smoothing parameter, and $K$ is the number of classes.
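A minimal sketch of label smoothing applied to a one-hot vector, using the formula above:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """y' = (1 - eps) * y + eps / K, where K is the number of classes."""
    K = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot label, K = 4
print(smooth_labels(y))              # [0.025 0.025 0.925 0.025]
```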
ResNet
Definition
ResNet is a deep Convolutional Neural Network architecture. It was designed to address the degradation problem in very deep neural networks.
Architecture
Skip Connection
The core innovation of ResNet is the introduction of skip connections (shortcut connections or residual connections). These connections allow the network to bypass one or more layers, creating a direct path for information flow. It performs identity mapping, allowing the network to easily learn the identity function if needed.
The residual block is represented as $y = F(x) + x$, where $x$ is the input to the block, $F$ is the learnable residual mapping typically including multiple layers, and $y$ is the output of the block.
Skip connections create a mixture of deep and shallow models. With $n$ skip connections, there are $2^{n}$ possible paths, where each path can contain up to $n$ modules.
Bottleneck Block
The bottleneck architecture is used in deeper versions of ResNet to improve computational efficiency while maintaining or increasing the network's representational power. The bottleneck block consists of three layers in sequence: $1 \times 1$, $3 \times 3$, and $1 \times 1$ convolutions. The first $1 \times 1$ convolution reduces the number of channels, the $3 \times 3$ convolution operates on the reduced representation, and the second $1 \times 1$ convolution increases the number of channels back to the original.
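A minimal PyTorch sketch of such a bottleneck residual block, assuming no downsampling and assumed channel sizes (not the exact ResNet configuration):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels, reduced):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, reduced, 1, bias=False),             # 1x1: reduce channels
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, 3, padding=1, bias=False),   # 3x3 on the reduced width
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1, bias=False),             # 1x1: restore channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # skip connection: y = F(x) + x
```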
Architecture Variants
Full Pre-Activation
Full pre-activation is an improvement to the original ResNet architecture. This modification aims to improve the flow of information through the network and make training easier. In full pre-activation, the order of operations in each residual block is changed to move the batch normalization and activation functions before the convolutions.
WideResNet
WideResNet increases the number of channels in the residual blocks rather than increasing the network's depth.
ResNeXt
The ResNeXt model substitutes the $3 \times 3$ convolution of the residual block of ResNet with a Grouped Convolution. ResNeXt achieves better performance than ResNet with the same number of parameters, thanks to its more efficient use of model capacity through the grouped convolution.
Regularization for Neural Networks
Ridge Regression
Definition
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\left\{ \sum_{i=1}^{N}\left(y_{i} - x_{i}^{\top}\beta\right)^{2} + \lambda\sum_{j=1}^{p}\beta_{j}^{2} \right\}$$ where $\lambda$ is a complexity parameter that controls the amount of shrinkage.
Ridge regression is particularly useful to mitigate the problem of Multicollinearity in linear regression
Facts
$$X\hat{\beta}^{\text{ridge}} = \sum_{j=1}^{p} u_{j}\,\frac{d_{j}^{2}}{d_{j}^{2} + \lambda}\,u_{j}^{\top}y$$ where $X = UDV^{\top}$ by Singular Value Decomposition
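A numpy sketch of the closed-form ridge solution $\hat\beta = (X^{\top}X + \lambda I)^{-1}X^{\top}y$ (no intercept handling; illustrative only):

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks the coefficients toward zero.
```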
Lasso Regression
Definition
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\left\{ \sum_{i=1}^{N}\left(y_{i} - x_{i}^{\top}\beta\right)^{2} + \lambda\sum_{j=1}^{p}\left|\beta_{j}\right| \right\}$$ The lasso model assumes that the coefficients of the model are sparse.
Dropout
Definition
Dropout is a regularization technique used for Neural Networks. Dropout randomly drops out (omits) units during the training process of a Neural Network.
Optimization
Stochastic Gradient Descent
Definition
Stochastic gradient descent (SGD) is a stochastic approximation of Gradient Descent. It replaces the actual gradient (calculated from the entire dataset) with an estimation of it by randomly selecting a subset of the data.
$$\theta_{t+1} = \theta_{t} - \eta\,\nabla_{\theta}L_{i}(\theta_{t})$$ where $\eta$ is the learning rate and $L_{i}$ is the loss computed on a randomly selected mini-batch.
Optimizers
Definition
Optimizers are algorithms used to adjust the parameters of a model to minimize the loss function. The optimizers aim to improve the convergence speed and stability of the training process compared to standard Stochastic Gradient Descent.
Examples
Momentum Optimizer
Definition
The momentum optimizer remembers the update at each iteration, and determines the next update as a linear combination of the gradient and the previous update:
$$v_{t+1} = \gamma v_{t} + \eta\,\nabla_{\theta}L(\theta_{t}), \qquad \theta_{t+1} = \theta_{t} - v_{t+1}$$
where $\gamma$ is the momentum coefficient.
AdaGrad Optimizer
Definition
Adaptive gradient descent (AdaGrad) is Gradient Descent with a parameter-wise learning rate:
$$G_{t} = G_{t-1} + g_{t}^{2}, \qquad \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_{t}} + \epsilon}\,g_{t}$$
where $G_{t}$ is the sum of squares of past gradients, and $\epsilon$ is a small constant to prevent division by zero.
RMSProp Optimizer
Definition
The RMSProp optimizer resolves AdaGrad Optimizer's rapidly diminishing learning rates and relative magnitude differences by taking an exponential moving average of the gradient history:
$$G_{t} = \rho G_{t-1} + (1 - \rho)\,g_{t}^{2}, \qquad \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_{t}} + \epsilon}\,g_{t}$$
where $\rho$ is the decay rate.
Adam Optimizer
Definition
Adaptive momentum estimation (Adam) combines the ideas of the momentum and RMSProp optimizers (see the sketch after this list):
$$m_{t} = \beta_{1}m_{t-1} + (1 - \beta_{1})\,g_{t}, \qquad v_{t} = \beta_{2}v_{t-1} + (1 - \beta_{2})\,g_{t}^{2}$$
$$\hat{m}_{t} = \frac{m_{t}}{1 - \beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1 - \beta_{2}^{t}}, \qquad \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}\,\hat{m}_{t}$$
Where:
- $m_{t}$ is the estimate of the first moment (mean) of the gradients
- $v_{t}$ is the estimate of the second moment (un-centered variance) of the gradients
- $\beta_{1}$ and $\beta_{2}$ are decay rates for the moment estimates
- $\hat{m}_{t}$ and $\hat{v}_{t}$ are bias-corrected estimates
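A numpy sketch of the four update rules above, written as single-step functions over a parameter vector (the default hyperparameters are common assumed values):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad                 # accumulate velocity
    return w - v, v

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    G = G + grad ** 2                         # accumulated squared gradients
    return w - lr * grad / (np.sqrt(G) + eps), G

def rmsprop_step(w, G, grad, lr=0.01, rho=0.9, eps=1e-8):
    G = rho * G + (1 - rho) * grad ** 2       # exponential moving average
    return w - lr * grad / (np.sqrt(G) + eps), G

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad              # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2         # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```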
Normalization
Batch Normalization
Definition
Batch normalization (batch norm) makes training of a neural network faster through normalization of each layer's input by re-centering and re-scaling:
$$\mu_{B} = \frac{1}{m}\sum_{i=1}^{m}x_{i}, \qquad \sigma_{B}^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x_{i} - \mu_{B}\right)^{2}, \qquad \hat{x}_{i} = \frac{x_{i} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}, \qquad y_{i} = \gamma\hat{x}_{i} + \beta$$
where:
- $\mu_{B}$ is the mean vector of the mini-batch
- $\sigma_{B}^{2}$ is the variance vector of the mini-batch
- $m$ is the batch size
- $\gamma$, $\beta$ are the learnable scaling and shifting parameters
- $\epsilon$ is a small constant to prevent division by zero
In the test stage, the training stage's moving averages of $\mu_{B}$ and $\sigma_{B}^{2}$ are used.
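A training-time sketch of this normalization over a mini-batch of shape (batch, features); the running averages used at test time are omitted:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # re-scale and re-shift (learnable)
```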
Layer Normalization
Definition
Layer normalization (layer norm) normalizes the inputs across the features for each sample. In a fully connected layer, layer norm is applied across all the neurons in that layer for each input. In a convolutional layer, layer norm is applied across the channel dimension for each spatial position.
Action Recognition Models
Two-Stream Models
Two-Stream Network
Definition
The two-stream network explicitly separates appearance (spatial) and motion (temporal) during training; the features are combined at the scoring stage. From the input video, the model takes a single frame and feeds it to the spatial stream, which has a CNN structure. The motion stream is fed pre-computed optical flows, stacked as channels, to learn temporal dynamics independent of appearance.
Two-Stream Network Fusion
Definition
An improved version of Two-Stream Network. Instead of fusing at the scoring step, two streams are fused in the middle.
Temporal Segment Network
Definition
Temporal Segment Network (TSN) samples more than one frame for better long-range temporal modeling. Also, Batch Norm and Dropout are utilized.
Hidden Two-Stream Network
Definition
The hidden two-stream network substitutes the optical flow used for the Two-Stream Network with a motion net that estimates the optical flow.
3D Convolutional Models
3D Convolution
Definition
3D convolution utilizes a 3D convolutional filter for the 3D input (depth, height, width). Similar to 2D convolution, the kernel slides across the input volume in all three dimensions, computing the Dot Product between the filter values and the input values at each position. The output of the convolution operation is a 3D feature map.
3D Convolutional Network
Definition
3D Convolutional Network (C3D) model applies 3D Convolution on video volume.
3D Residual Networks
Definition
The 3D Residual Networks (R3D) model applies the ResNet structure to the 3D Convolutional Network model.
(2+1)D Residual Networks
Definition
The (2+1)D Residual Networks (R(2+1)D) model decomposes the 3D convolution used for R3D into spatial and temporal axes.
Temporal 3D ConvNet
Definition
Temporal 3D ConvNet (T3D) applies the DenseNet structure to the 3D Convolutional Network model. The 3D Temporal Transition Layer (similar to the inception module of GoogLeNet) is stacked after each DenseBlock to capture different temporal lengths.
Also, the model utilizes a pre-trained 2D ConvNet as a teacher to make the 3D ConvNet learn mid-level feature representations through an image-video correspondence task. During this training, the parameters of the 2D ConvNet are frozen.
Two-Stream Inflated 3D ConvNet
Definition
Two-Stream Inflated 3D ConvNet (I3D) combines the ideas of the Two-Stream Network and the 3D Convolutional Network. Both the spatial and temporal streams have a 3D Convolutional Network structure with an InceptionNet backbone.
Separable 3D CNN
Definition
The Separable 3D CNN (S3D) model decomposes the 3D inception block used for the Two-Stream Inflated 3D ConvNet into spatial and temporal axes.
SlowFast Network
Definition
SlowFast Network has a two-stream structure: a slow pathway and a fast pathway, where both streams have the 3D Residual Networks structure. The slow pathway has a low frame rate and a large channel size, and the fast pathway has a high frame rate and a small channel size. The connections across pathways allow each pathway to be aware of the representation learned by the other pathway.
Expand 3D CNN
Definition
The Expand 3D CNN (X3D) model tries to find the optimal parameters for 3D Residual Networks using an AutoML method. It tunes six parameters:
- X-Fast (): the input frame rate (temporal resolution)
- X-Temporal (): the number of frames in the input
- X-Spatial (): the spatial resolution
- X-Depth (): the depth of the network
- X-Width (): the number of channels for all layers
- X-Bottleneck (): the inner channel width of the center convolutional filter in each residual block
Recurrent Neural Networks
Recurrent Neural Network
Definition
Recurrent neural network (RNN) is a class of Neural Network for sequential data processing. RNNs maintain an internal state, representing the semantics of the input sequence processed so far, which is updated at each step based on the current input and the previous hidden state. RNNs can process input sequences of any length, and the model size doesn’t increase for longer input. However, due to the sequential structure, the computation is slow, and it suffers from the vanishing gradient problem in training.
Applications
One-to-Many
Many-to-One
Many-to-Many
Sequence-to-Sequence
Long Short-Term Memory
Definition
Long short-term memory (LSTM) is a type of RNN aimed at dealing with the vanishing gradient problem. The model is composed of a cell, an input gate, an output gate, and a forget gate. The three gates regulate the flow of information into and out of the cell.
Architecture
where:
Gated Recurrent Unit
Definition
Gated recurrent unit (GRU) is like a Long Short-Term Memory with a gating mechanism to input or forget certain features, but lacks a context vector or output gate, resulting in fewer parameters than LSTM.
Architecture
where:
RNN-based Video Models
Long-Term Recurrent Convolutional Network
Definition
Long-term recurrent convolutional network (LRCN) is a direct application of LSTM idea to action recognition. CNN features are fed into the LSTM cell as input.
Beyond Short Snippets
Definition
Beyond Short Snippets (BSS) utilizes optical flow in addition to the CNN features to better understand longer videos. For frame aggregation, both pooling and LSTM are considered.
Poolings
Fully-Connected LSTM
Definition
Fully-connected LSTM introduced additional connections from the cell state to the FC layers for each gate in LSTM cell.
Architecture
where:
Convolutional LSTM
Definition
The input and hidden state of convolutional LSTM have a matrix form, and the fully connected layer in the LSTM cell is substituted with convolutional layer.
- where are convolution filters.
Convolutional GRU
Definition
Convolutional GRU applied the idea of Convolutional LSTM to Gated Recurrent Unit. When stacking multiple ConvGRU layers, each layer takes previous layer’s hidden state as an extra input.
- where are convolution filters.
Transformers
Attention
Attention
Definition
Attention is a method that determines the relative importance of each component in a sequence relative to the other components in that sequence.
The attention function is formulated as $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V$$ where Q (query) represents the current context, K (key) and V (value) represent the references, and $d_{k}$ is the dimension of the keys.
The attention value is the convex combination of values, where each weight is proportional to the relevance between the query and the corresponding key.
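A numpy sketch of scaled dot-product attention for single (unbatched) Q, K, V matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # relevance between each query and each key
    weights = softmax(scores, axis=-1)    # convex-combination weights over the values
    return weights @ V
```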
Examples
Seq-to-Seq with RNN
MultiLSTM
Definition
MultiLSTM is similar to the LRCN. The model applies Attention to recent N-input features instead of simply taking a single feature. The previous hidden state of LSTM is used as a query, and the recent N-input features are used as a key and value.
Attention LSTM
Definition
Attention LSTM is a variant of LSTM architecture incorporating Attention mechanism. In a sequence-to-sequence setting, the model uses Attention in the decoding stage. The previous hidden state of LSTM cell is used as the query, and the hidden states of LSTM cell of the encoder are used as the key and value.
Visual Attention
Definition
The Visual Attention model applies Attention spatially: the hidden state of the LSTM ($1 \times 1 \times$ channels) is used as the query, and the CNN feature map (height $\times$ width $\times$ channels) is used as the key and value. The result of the attention ($1 \times 1 \times$ channels) is used as the hidden state of the next step.
Word Embeddings
Word2Vec
Definition
The Word2Vec model learns vector representations of words that effectively capture the semantic relationships between them, using a large corpus of text.
Continuous Bag-Of-Words
Continuous bag-of-words (CBOW) predicts a target word given its context words. The model takes a window of context words around a target word. The context words are fed into the network, and it tries to predict the target word. The model is trained to maximize the probability of the target word given the context words.
Algorithm
- Generate one-hot encodings of the context words of window size .
- Get embedded words vectors using linear transformation, and take average of them where the weight is shared
- Generate a score vector and turn the scores into probability with Softmax Function
- Adjust the weights and to match the result to the one-hot encoding of actual output word by minimizing the loss where is the dimension of the vectors, and is the -th element of the one-hot encoded output word vector.
Skip-Gram
The Skip-gram model does the opposite of CBOW. It predicts the context words given a target word The model takes a single target word as input. The target word is fed into the network, and it tries to predict the surrounding context words. The model is trained to maximize the probability of each context word given the target word.
Algorithm
- Generate one-hot encoding of the input word
- Get an embedded word vector using linear transformation
- Generate a score vector and turn the scores into probability with Softmax Function
- Adjust the weights and to match the result to the many one-hot encodings of the actual output word by minimizing the loss where is the dimension of the vectors, and is the -th element of the one-hot encoded output -th context word vector.
Noise Contrastive Estimation
Definition
Noise contrastive estimation (NCE) transforms the problem of density estimation into binary classification between data samples and noise samples. Given a sample of data points that follow an unknown probability distribution parametrized by $\theta$, NCE is used to find an estimator that best approximates the true parameter. Although MLE has good properties, it requires the parametric family to be normalized during calculation. NCE finds the estimator by maximizing an objective function (as in MLE) but without the need for normalization, treating the normalization coefficient as another estimation parameter. This is a desirable property, since the normalization constant may be difficult or expensive to calculate.
The idea is to pollute sample with noise, data points that come from a known distribution, and perform nonlinear logistic regression to discriminate between data and noise.
Consider a data sample and noise sample and the union of the two samples . A binary class label is assigned to each , where .
By the definition of the function, and . By the Bayes Theorem, and , where is the sample-noise ratio. The log-likelihood function of the binary classification problem is derived as The NCE estimator is obtained by maximizing the objective function (log-likelihood) with respect to
Examples
Word Embedding (Skip-Gram Model)
In the context of word embeddings, particularly the Skip-gram model, the unknown distribution becomes , the probability of a context word given an input word . The represents the parameters of the word embedding model.
The noise distribution $q(w) \propto \mathrm{count}(w)^{3/4}$, where $\mathrm{count}(w)$ is the count of word $w$ in the corpus, is the unigram distribution of contexts raised to the power of $3/4$ to balance frequent and rare words.
For each word-context pair from the true data, we sample negative examples from . The NCE objective function is derived as: where is the set of observed word-context pairs, and is the number of noise samples per data sample.
In practice, is often modeled as , without explicit normalization. By maximizing this objective, we obtain word embeddings (vectors and ) that can discriminate between true context words and random noise words, effectively capturing semantic relationships in the vector space.
Negative Sampling
Definition
Negative sampling (NS) is a simplified version of NCE. It shares the core idea of transforming the problem of density estimation into binary classification between true data samples and noise samples.
In the context of word embeddings, particularly the Skip-gram model, NS aims to learn good word representations without the need to estimate a full probability distribution. The goal is to find parameters that best represent the relationship between words and their contexts. Given a word-context pair from the true data distribution, NS samples negative examples (noise) from a noise distribution . The objective is to maximize the probability of the true pair while minimizing the probability of the noise pairs.
NS models the probability that a pair is a true (word, context) pair with the Sigmoid Function: $P(D = 1 \mid w, c) = \sigma(v_{c}^{\top}v_{w})$.
The NS objective function is derived as: where is the set of negative samples drawn from the noise distribution.
The NS objective function is similar to the Skip-gram objective but replaces the expensive Softmax Function with a simpler binary classification task between true and noise samples.
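A sketch of the negative-sampling objective for a single (word, context) pair with k sampled noise words, assuming the embeddings are plain numpy vectors:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def ns_loss(v_word, v_context, v_negatives):
    """Negative of: log sigma(v_c . v_w) + sum_n log sigma(-v_n . v_w)."""
    pos = np.log(sigmoid(v_context @ v_word))
    neg = np.sum(np.log(sigmoid(-(v_negatives @ v_word))))   # v_negatives has shape (k, d)
    return -(pos + neg)
```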
Global Vectors for Word Representation
Definition
Global vectors for word representation (GloVe) model learns word embedding vectors such that their Dot Product is proportional to the logarithm of the words’ probability of co-occurrence.
$$w_{i}^{\top}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} \approx \log X_{ij}$$ where $w_{i}$ and $\tilde{w}_{j}$ are word vectors, and $X_{ij}$ is the co-occurrence count between words $i$ and $j$.
Transformer Models
Transformer
Definition
The Transformer model uses self-attention, Attention in which the Q, K, and V are derived from the same source, for sentences. The result vector of the self-attention reflects its context. Usually, self-attention is repeated multiple times to further contextualize.
Architecture
Self-Attention
The initial query (Q), key (K), and value (V) matrices are the result of linear transformations of the input sequence.
Multi-head Self-Attention
The Transformer uses multiple attention heads in parallel like the channel in CNN, allowing it to focus on different aspects of the input simultaneously. The output of multi-head attention is a concatenation of the outputs from individual attention heads, followed by a linear transformation.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h})\,W^{O}$$ where $\mathrm{head}_{i} = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})$.
Feed-Forward Layer
Each layer in the Transformer also contains a feed-forward layer applied to each position separately, i.e. there is no cross-token dependency. The linear transformations are the same across different positions in the same layer, but different in other layers.
Positional Encoding
Since the Transformer doesn’t inherently capture sequence order, positional encodings are added to the input embeddings. These are typically sine and cosine functions of different frequencies:
$$\begin{aligned}
PE_{(\text{pos},2i)} &= \sin(\text{pos} / 10000^{2i/d_{\text{model}}})\\
PE_{(\text{pos},2i+1)} &= \cos(\text{pos} / 10000^{2i/d_{\text{model}}})
\end{aligned}$$
Masked Multi-Head Self-Attention
In the decoder, the self-attention layer is modified to prevent attending to later positions. This is achieved by masking future positions with negative infinity before the Softmax Function step.
Encoder-Decoder Attention
The decoder has an additional attention layer that performs multi-head attention over the output of the encoder, where the query (Q) comes from the previous layer in the decoder, and the key (K) and value (V) come from the output of the encoder.
BERT
Definition
The BERT model appends a CLS token to the input and uses it as the aggregated embedding. The model learns word embeddings by solving the masked token prediction and next sentence prediction problems.
Tasks
Masked Language Modeling
Figuring out the hidden words using the context.
Next Sentence Prediction
A binary classification problem, predicting if the two sentences in the input are consecutive or not.
Image Transformer Models
Vision Transformer
Definition
Vision transformer (ViT) applies Transformer architecture to the vision tasks. The model considers an image as a sequence of patches.
Architecture
Positional Embedding
ViT does not use a pre-designed positional encoding; it leaves it as a learnable parameter. By doing so, ViT does not impose any inductive bias, unlike a CNN.
Data-Efficient Image Transformer
Definition
Data-efficient image transformer (DeiT) appends a distillation token to a ViT model and compares the token's output to the prediction of a pre-trained teacher model (CNN).
The teacher model and the distillation process in DeiT help the student model (ViT) to generalize better by providing additional supervision from the teacher model. The teacher model’s predictions implicitly contain information about the model’s uncertainties and the relationships between classes.
Architecture
Soft-Label Distillation
Minimizes the KL-Divergence between the softmax label of the teacher model and that of the student.
Hard-Label Distillation
Take the hard decision of the teacher as a true label.
Swin Transformer
Definition
Swin transformer reintroduced the inductive bias by using the local self-attention. The problems of the local self-attention are solved with the hierarchical structure, shifted window partitioning, and relative position bias.
Architecture
Local Self-Attention
The self-attention is computed within local windows instead of globally, reducing computational cost.
Hierarchical Structure
Swin Transformer starts with small-sized patches and gradually merges neighboring patches in deeper layers, similar to CNN. This allows the model to capture both fine and coarse features effectively.
Patch Merging:
To reduce the spatial resolution and increase the channel dimension, Swin Transformer uses a patch merging layer. The layer concatenates the features of each group of $2 \times 2$ neighboring patches and applies a linear transformation. The output feature represents the merged patch, which has double the patch size.
Shifted Window Partitioning
The image is divided into non-overlapping windows, these windows are shifted in alternate layers for cross-window information exchange.
Relative Position Bias
Instead of using absolute position embeddings like in the original Transformer, Swin Transformer uses a relative position bias. This is a learnable parameter added to the attention scores before the softmax operation.
Convolutional Vision Transformer
Definition
Convolutional vision transformer (CvT) uses convolutional layers instead of the fully connected layers to improve its performance and efficiency.
Architecture
Convolutional Token Embedding
CvT replaced the patch embedding used in ViT with a convolutional layer.
Convolutional Projection
Instead of using linear transformation to project the input into query (Q), key (K), and value (V), convolutional layers are used
Squeezed Convolutional Projection
The squeezed convolutional projection is utilized to reduce computational complexity or to downsample the feature map.
Video Transformer Models
Video Vision Transformer
Definition
Video vision transformer (ViViT) extended the idea of ViT to video classification task.
Architecture
Uniform Frame Sampling
The fixed number of frames of the input video are uniformly sampled to handle videos of varying lengths and to reduce computational complexity.
Tubelet Embedding
The input video is divided into a sequence of non-overlapping tubelet tokens. Each tubelet is a 3D patch of the video with dimensions (frames, height, width). The tubelet tokens are linearly projected to obtain embedding vectors. Positional embeddings are added to provide spatial and temporal information.
Model Variants
ViViT paper proposed four different model variants.
Spatio-Temporal Attention
This is the most direct extension of the ViT to video. It treats the entire video as a single stream of tokens, applying self-attention across both spatial and temporal dimensions simultaneously.
Factorized Encoder
This variant uses two separate transformer encoders: one for spatial, and another for temporal. Each frame is fed into the spatial encoder (ViT), the sequence of the output CLS token is used as an input of the temporal encoder (Transformer).
Factorized Self-Attention
This variant factorizes the self-attention operation within each transformer layer into spatial and temporal attention operations. The spatial attention is applied to tokens within the same frame, and the temporal attention is applied to tokens at the same spatial location across frames.
Factorized Dot-Product Attention
This variant factorizes the multi-head dot-product attention. The half of the heads only compute the spatial attention, and the other half only compute the temporal attention.
Multiscale Vision Transformer
Definition
Multiscale vision transformer (MViT) applied the idea of CvT to the video inputs. Additionally, the multi head pooling attention is used for hierarchical structure.
Architecture
Pooling Attention
Pooling attention decreases the spatial dimension and increases the channel dimension of the Q, K, and V before applying self-attention. It is processed by the convolutional layer with a stride greater than 1.
Object Detection
Proposal-Based Approaches
R-CNN
Definition
Regions with CNN features (RCNN) model performs object detection in two stages: region proposal and object recognition.
Region Proposal
The region proposal is performed by an off-the-shelf method (selective search). Around 2,000 regions are proposed.
Object Recognition
Extract CNN feature of the proposed image patch and map the features to labels using classifier (SVM).
Bounding-Box Regression
After the classifier, the bounding-box regressor is applied. The regressor takes the CNN features of the proposed region and predicts a refined bounding box. The refinement aims to adjust the original region proposal to better fit the actual object boundaries. A separate bounding-box regressor is trained for each object class.
Fast R-CNN
Definition
Fast R-CNN improved the speed of object detection by extracting the CNN feature of the entire image only once and using a fraction of the feature map corresponding to the proposed bounding box for the object recognition stage. The fractions of the feature map are resized into the same size using ROI pooling.
Architecture
RoI Pooling
RoI pooling is a type of max pooling that transforms the features inside a region of interest into a small feature map with a fixed size (e.g. $H \times W$). The RoI is divided into an $H \times W$ grid of sub-windows, and max pooling is performed within each sub-window.
Faster R-CNN
Definition
Faster R-CNN substituted the off-the-shelf region proposal of R-CNN with the region proposal network (RPN)
Architecture
Region Proposal Network
Region proposal network (RPN) is a CNN that operates on the feature maps produced by the backbone CNN. The RPN slides over the CNN feature map; at each location, it makes multiple region proposals (anchors) with various sizes and aspect ratios. For each proposal, the RPN predicts an objectness score and a bounding-box refinement.
Proposal-Free Approaches
You Only Look Once
Definition
You Only Look Once (YOLO) is a proposal-free object detection model that predicts bounding boxes and class probabilities from an image in a single evaluation.
Architecture
The input image is divided into an $S \times S$ grid; each grid cell produces $B$ bounding boxes centered within it and predicts class probabilities. Low-probability bounding boxes are then dropped to produce the final result.
The final output is a tensor of shape $S \times S \times (B \cdot 5 + C)$, where each bounding box is represented by (x, y, w, h, confidence) and $C$ is the number of classes.
Single Shot MultiBox Detector
Definition
Single shot multi-box detector (SSD) model is a proposal-free object detection model that uses multiple feature maps from different layers to detect various-sized objects.
Architecture
The number of detections:
SSD uses a set of default boxes with different scales and aspect ratios at each location in the feature maps. CNN (Convolutional predictors) are applied to feature maps to produce class scores and box offsets for each default box.
Convolutional predictors
The convolutional predictors in SSD are designed to predict class scores and the bounding box offsets. The classification filter has the shape $3 \times 3 \times C_{\text{in}} \times (B \cdot C)$, where $C_{\text{in}}$ is the number of input channels, $C$ is the number of classes, and $B$ is the number of default boxes per location.
Detection Transformer
Definition
Detection Transformer (DETR) uses Transformer encoder-decoder architecture for the object detection problem.
Architecture
DETR extracts CNN features from the input image using a backbone CNN network. The feature map is flattened and fed into the encoder after adding positional encodings. The decoder takes the encoded image features and a set of learnable object queries as inputs. The decoder output embeddings correspond to the objects to be detected in the image. The output of the decoder is passed through two prediction heads: classification head, box head.
Segmentation
Semantic Segmentation
Deconvolutional Network
Definition
The Deconvolutional Network uses an auto-encoder-like CNN structure. In the upsampling stage, the model uses deconvolution layers.
Deconvolutional Layer
Deconvolutional layers are used to increase the spatial dimensions of input features. The values of the input feature map are used as scalars multiplied by the learnable deconvolutional filters. With strides greater than 1, the output of the deconvolutional layer is enlarged and densified.
U-Net
Definition
U-Net is an encoder-decoder structured CNN used for image segmentation.
Architecture
Encoder-Decoder
The encoder extracts spatial patterns using regular convolutional layers. Each downsampling step doubles the number of channels. The decoder upsamples the input feature map with deconvolution that halves the number of channels.
Skip Connections
The feature maps from the layers of the encoder are cropped and concatenated with the upsampled feature maps in the layers of the decoder. This allows the network to combine low-level features with high-level features, preserving detail.
Instance Segmentation
Mask R-CNN
Definition
Mask R-CNN extends Faster R-CNN by adding mask prediction branch for instance segmentation problems.
Architecture
RoI Align
Mask R-CNN replaces the RoI pooling used in Faster R-CNN with RoI align. RoI align is performed to accurately align the extracted features with the input image. It computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map.
Mask Head
Mask prediction head generates a binary mask for each RoI using the aligned feature.
The mask head is composed of convolutional layers and deconvolutional layers.
Transformer-Based Segmentation Models
Segmenter
Definition
Segmenter is a Transformer-based semantic segmentation model.
Architecture
The encoder of Segmenter has a ViT architecture.
The decoder takes the output of the encoder and the class tokens as inputs. The embedded patch-token matrix ($N \times D$) and the class embedding matrix ($K \times D$) are multiplied, and the Softmax Function is applied on the class dimension to obtain the class scores, where $N$ is the number of patches, $D$ is the channel dimension, and $K$ is the number of classes.
Dense Prediction Transformer
Definition
Dense prediction transformer (DPT) reassembles features at multiple resolutions unlike the Segmenter which only uses fixed-size patches.
Architecture
Reassemble
The reassemble operation reshapes the flattened feature vectors back to image-like dimensions and changes the spatial and feature sizes using a $1 \times 1$ convolutional layer followed by a convolutional or deconvolutional layer. In the process, the CLS token is specially processed (ignored, added, or concatenated and projected).
Residual Convolutional Unit
The output of the transformer is concatenated to the fusion module after passing through the residual convolutional unit.
Fusion
The fusion module concatenates the features of the previous layer and the residual convolutional unit, doubles the size of the feature map, and brings the channel size back to the input size.
Metric Learning
Discounted Cumulative Gain
Definition
Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval.
$$\mathrm{DCG}_{p} = \sum_{i=1}^{p} \frac{rel_{i}}{\log_{2}(i + 1)}$$ where $rel_{i}$ is the graded relevance of the result at position $i$.
Normalized DCG
DCG is often normalized so that it is comparable across queries, giving Normalized DCG (NDCG). NDCG is DCG normalized by the maximum possible DCG of the result set.
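A sketch of DCG and NDCG using one common formulation ($rel_i / \log_2(i+1)$); some definitions instead use $2^{rel_i} - 1$ in the numerator:

```python
import numpy as np

def dcg(relevances):
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    return float(np.sum(rel / np.log2(positions + 1)))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))   # maximum possible DCG for this result set
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))
```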
Triplet Loss
Definition
Triplet loss is a Loss Function where a reference input (anchor) is compared to a matching input (positive) and a non-matching input (negative). The distance from the anchor to the positive is minimized, and the distance from the anchor to the negative input is maximized.
$$\mathcal{L}(a, p, n) = \max\!\left(d(a, p) - d(a, n) + \alpha,\ 0\right)$$ where $\alpha$ is a margin between positive and negative pairs
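A sketch of the triplet loss for a single (anchor, positive, negative) triple, using squared Euclidean distance and an assumed margin:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.sum((anchor - positive) ** 2)   # distance to the matching sample
    d_neg = np.sum((anchor - negative) ** 2)   # distance to the non-matching sample
    return max(d_pos - d_neg + margin, 0.0)
```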
Negative Mining
Definition
In the context of metric learning, negative mining involves selecting the most informative negative examples to train the model well. Instead of randomly sampling negative examples, we focus on hard negatives, those that are closest to the anchor in the embedding space but belong to a different class. For each anchor, compute distances to all or a subset of negative examples, and select the negative examples that are closest to the anchor.
Online Negative Mining
Online negative mining looks for negatives from the current batch, instead of using one initially assigned.
Semi-Hard Negative Mining
Semi-hard negative mining selects negative examples that are farther from the anchor than the positive example, but still within a certain margin.
If $d(a, p) < d(a, n) < d(a, p) + \alpha$, then $n$ is considered a semi-hard negative.
Collaborative Deep Metric Learning
Definition
Collaborative Deep Metric Learning (CDML) constructs a simple graph whose nodes are content items and whose edges represent the existence of a relationship (e.g. co-watched videos). The model trains the embedding by minimizing the Triplet Loss, where the negative samples are randomly selected within the batch.
Pairwise Loss
Definition
The pairwise loss compares pairs of items and determines their relative order, instead of predicting absolute errors for individual items.
$$\mathcal{L}(x_{1}, x_{2}, y) = y\, d_{\theta}(x_{1}, x_{2})^{2} + (1 - y)\max\!\left(0,\ m - d_{\theta}(x_{1}, x_{2})\right)^{2}$$ where $y$ is a binary label ($y = 1$ if a pair is similar, $y = 0$ if dissimilar), and $d_{\theta}$ is a distance metric between the pair, parametrized by $\theta$.
Contrastive Learning
Definition
Contrastive learning is a kind of metric learning that learns representations of data by comparing similar and dissimilar samples. The model is trained to recognize that certain data points are related (positive pairs) or not (negative pairs). It helps the model learn meaningful features and representations without requiring explicit labels for every data point.
SimCLR
Definition
SimCLR is a self-supervised visual embedding model trained with a contrastive loss. For each image, data augmentation is applied twice, creating two augmented images. These pairs are used as positive pairs, and the other pairs in the minibatch are used as negative pairs.
Architecture
Data Augmentation Operators
Contrastive Loss
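A numpy sketch of the NT-Xent contrastive loss used by SimCLR, assuming a batch of 2N embeddings in which rows 2k and 2k+1 are the two augmentations of the same image:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    n = z.shape[0]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize embeddings
    sim = z @ z.T / tau                                # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    pos = np.arange(n) ^ 1                             # index of each sample's positive partner
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())
```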
Multimodal Learning
Image Captioning Task
Show, Attend, and Tell
Definition
Show, attend, and tell (SAT) is an Attention-based image captioning model.
Architecture
The model has LSTM structure. The feature map is extracted from the input image using CNN and it is fed into the Attention LSTM, where the query (Q) is the previous hidden state, and key (K) and value (V) are the vectors of the feature map.
Video Captioning Task
Describing Videos by Exploiting Temporal Structure
Definition
The model uses an LSTM architecture with Attention for video captioning task.
Architecture
The model has LSTM structure. The feature maps are extracted from the input frames using CNN and it is fed into the Attention LSTM, where the query (Q) is the previous hidden state, and key (K) and value (V) are the extracted feature maps.
Transformer-Based Image-Text Models
Visual-Linguistic-BERT
Definition
Visual-Linguistic-BERT (VL-BERT) extends the BERT architecture to joint visual-linguistic tasks.
Architecture
The model appended visual feature embedding to the BERT architecture. The text part is trained in the same way as the BERT’s masked language modeling, and the image part is trained to estimate the label of the masked visual feature.
Visual Feature Embedding
The visual feature embeddings of the text tokens are obtained from the entire input image, and those of the image tokens are obtained from the object regions estimated by an object detection model (Faster R-CNN).
Vision-and-Language BERT
Definition
Vision-and-Language BERT (ViLBERT) extends the BERT architecture to handle both visual and textual inputs.
Architecture
The ViLBERT model consists of two parallel BERT-like streams: a visual stream and a textual stream. The two streams are connected through co-attention transformer layers. The tokens of the visual stream are objects estimated by an object detection model (Faster R-CNN). ViLBERT is pre-trained on image-caption pairs using two main tasks: masked multimodal learning and multimodal alignment prediction.
Co-Attention Transformer Layer
The co-attention transformer layer allows for bidirectional interaction between the visual and textual streams. In the layer, each stream uses the features of the other stream as the key (K) and value (V).
Masked Multimodal Learning
The text stream is trained in the same way as the BERT’s masked language modeling, and the image part is trained to estimate the label of the masked visual feature.
Multimodal Alignment Prediction
The model takes image-text pairs as input and determines if the image and text pair match.
Transformer-based Video-Text Models
VideoBERT
Definition
VideoBERT extends the BERT architecture to handle both video and text data.
Architecture
The frames are sampled from an input video and the CNN (S3D) features of them are extracted and used as the visual tokens.
Linguistic-Visual Alignment Task
The model takes video segment-text pairs as input and determines if the pairs match.
Masked Language Modeling (MLM)
The textual part is trained in the same way as the BERT’s masked language modeling
Masked Frame Modeling (MFM)
The visual part is trained to estimate the cluster of the visual tokens, assigned by K-Means Clustering.
Contrastive Bidirectional Transformer
Definition
Contrastive Bidirectional Transformer (CBT) applies Contrastive Learning to the training of BERT for the video captioning task.
Architecture
CBT consists of two streams: a video stream and a textual stream. The two streams extract frame and word features respectively; those features are concatenated and passed through the cross-modal transformer to make an estimation. The text stream uses a pre-trained BERT, and the video part is trained through Contrastive Learning: if two frame features originate from the same video they form a positive pair, otherwise a negative pair.
Cross-Modal Transformer
The cross-modal transformer takes video segment-text pairs as input and determines if the pairs match.
Hierarchical Multi-Modal Encoder
Definition
Hierarchical multi-modal encoder (HAMMER) is a model for moment localization tasks (identifying a short segment in a long video corpus that semantically matches a given text query). It simplified the task with Conditional Probability.
Architecture
Two-Stage MLVC (Moment Localization in Video Corpus)
The original problem is approximated to the simpler problem using the definition of Conditional Probability
Video Retrieval Task
The video retrieval task is trained through Contrastive Learning by promoting the correct video-query pairs.
Moment Localization in Single Video
The temporal localization task is trained as a classification problem with the three labels: begin, end, and other.
Hierarchical Visual Encoders
The frame encoder takes the frame sequence of a video clip and the query as input, and outputs the contextualized visual frame features. The CLS tokens of the frame encoder are fed into the clip encoder, yielding a video representation.
Audio Modeling
Audio Spectrogram Transformer
Definition
Audio spectrogram transformer (AST) applied ViT architecture to the audio input represented as spectrogram.
Visual-Audio-Text Model
Video-Audio-Text Transformer
Definition
Video-audio-text transformer (VATT) is designed to process and understand information from video, audio, and text simultaneously.
Architecture
VATT is based on the Transformer architecture. It consists of three separate encoders for video, audio, and text inputs, followed by a joint multimodal encoder. Each modality feature is embedded to a same-sized vector, and fed into a transformer encoder. The model is trained through multimodal Contrastive Learning.
Multimodal Contrastive Learning
The loss of the model consists of the sum of the loss of video and audio, and the loss of text and video. For the text modality, all word embeddings in the sentence are summed before the comparison.
Multimodal Metric Learning
CLIP
Definition
Contrastive language-image pre-training (CLIP) model is designed to learn visual concepts from natural language supervision through Contrastive Learning.
Zero Shot Prediction
CLIP can classify images into categories that it has not explicitly been trained on, simply by comparing image embeddings with the text embeddings of category names.
Architecture
CLIP consists of two main components: a vision encoder(ViT or CNN) and a text encoder (Transformer-based model)
Contrastive Pre-Training
The input image-text pairs are encoded by the corresponding encoders, and the encoders are updated to maximize the similarity between matching pairs and minimize the similarity of non-matching pairs. The similarity is measured by the cosine similarity of the two encoded vectors.
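A numpy sketch of this symmetric contrastive objective, assuming a batch of matching image and text embeddings in the same row order and an assumed temperature:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature         # cosine similarities of all pairs
    labels = np.arange(logits.shape[0])        # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```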
MuLan
Definition
MuLan applied the architecture of CLIP to audio clip (represented as a spectrogram)-caption pairs
Generative Models
Likelihood-Based Sequential Approaches
PixelRNN
Definition
PixelRNN assumes that images are sampled from an unknown distribution. The model generates images pixel by pixel, treating the image as a sequence. Each pixel is conditioned on all previously generated pixels. The model is trained to maximize the log-likelihood of the training data.
Architecture
The model uses a multi-layer ConvLSTM
PixelCNN
Definition
PixelCNN model substitutes the LSTM used for PixelRNN with a convolutional layer. The model generates the target pixel using previously generated nearby pixels.
Pixel Recursive Super Resolution
Definition
The distribution of each pixel is estimated sequentially given previous pixels with PixelCNN and globally given a low-resolution image, and the two estimations are aggregated for the final distribution.
Autoencoder-Based Models
Autoencoder
Definition
Autoencoder is a type of Neural Network used for unsupervised dimensionality reduction or feature extraction. It consists of two main parts: encoder and decoder. Encoder compresses the input data into a lower-dimensional representation, and decoder attempts to reconstruct the original input from the compressed representation. Once autoencoder is trained, decoder is no longer used.
Architecture
The model is trained by minimizing the reconstruction error, typically using mean squared error
Facts
The extracted features may be used to train other supervised models.
An autoencoder can be used for anomaly detection tasks. Typical examples have low reconstruction error, whereas outliers should have high reconstruction error.
Denoising Autoencoder
Definition
Denoising autoencoder (DAE) is a type of Autoencoder designed to learn robust representations of data by reconstructing clean inputs from noisy versions: $\hat{x} = g(f(\tilde{x}))$ with $\tilde{x} = x + \epsilon$, where $\epsilon$ is random noise, $f$ is an encoder, and $g$ is a decoder.
Architecture
The model takes a clean input $x$ and adds noise to create a corrupted input $\tilde{x}$. The noisy input is fed into the model, and the model attempts to reconstruct the original clean input by minimizing the MSE.
In a Gaussian noise setting, the estimator minimizing the MSE is the mean of the posterior distribution $p(x \mid \tilde{x})$, so the DAE learns a function that approximates the posterior mean $\mathbb{E}[x \mid \tilde{x}]$. Therefore, using a DAE with Tweedie's Formula, we can estimate the Score Function of the sample space.
Variational Autoencoder
Definition
Variational autoencoder (VAE) utilizes the decoder of an Autoencoder as generator. Unlike traditional autoencoders, VAE models the latent space as a probability distribution, typically a Multivariate Normal Distribution.
Architecture
Instead of outputting a single point in latent space, the encoder of VAE produces parameters of a probability distribution on the latent space. The latent vector is sampled from the distribution and is reconstructed by the decoder.
Loss Function (ELBO)
The loss of the VAE (also called the evidence lower bound, or ELBO) consists of a reconstruction loss and a KL-Divergence term which ensures that the latent distribution is close to a normal distribution.
$$\begin{aligned}
\ln p_{\theta}(x) &= \ln\sum\limits_{z}p_{\theta}(x|z)p_{\theta}(z) \\
&= \ln\sum\limits_{z}p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\theta}(z|x)}q_{\theta}(z|x) \\
&= \ln\mathbb{E}_{z \sim q_{\theta}(z|x)}\left[ p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\theta}(z|x)} \right]\\
&\geq \mathbb{E}_{z \sim q_{\theta}(z|x)}\ln\left[ p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\theta}(z|x)} \right] =: ELBO\\
&= \underbrace{\mathbb{E}_{z\sim q_{\theta}(z|x)}[\ln p_\theta(x|z)]}_{\text{reconstruction loss}} - \underbrace{D_{KL}(q_\theta(z|x) \| p_\theta(z))}_{\text{regularization loss}}
\end{aligned}$$
where $q_{\theta}(z|x)$ is the encoder distribution, $p_{\theta}(x|z)$ is the decoder distribution, and $p_{\theta}(z)$ is the prior distribution (usually $\mathcal{N}(0, I)$). The regularization term ensures the continuity and completeness of the latent space.
In terms of variational inference, we can see $p_{\theta}(z)$ as a latent distribution, and $q_{\theta}(z|x)$ as a proposal distribution used for Importance Sampling, i.e. $q_{\theta}(z|x)$ predicts the probable region in the latent space which is likely to have generated the observation $x$.
Reparameterization Trick
We cannot compute gradients for operations containing a random variable. So, instead of directly sampling from the distribution $N(\mu, \sigma)$, we randomly sample $\epsilon \sim N(0, 1)$ and make the latent vector $z = \mu + \sigma \epsilon$. When calculating the gradient in backpropagation, the sampled $\epsilon$ is considered a constant ($\cfrac{dz}{d\mu} = 1$ and $\cfrac{dz}{d\sigma} = \epsilon$).
Facts
The data encoded by a VAE is semantically well-distinguished in a latent low-dimensional space.
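A numpy sketch of the reparameterization trick and the closed-form KL term for a diagonal Gaussian encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, eps ~ N(0, I); eps is treated as a constant during backprop."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
```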
GAN-Based Models
Generative Adversarial Networks
Definition
Generative adversarial network (GAN) is a type of Neural Network model generating data that resembles training data.
Architecture
Training
In the training process of GAN, the two neural networks compete against each other. The generator (G) generates fake data and the discriminator (D) tries to distinguish between real and fake data.
- The generator creates fake data from random noise.
- The discriminator is shown both real data from the training set and fake data from the generator. It tries to classify which is which.
- Based on how well the discriminator performs, both networks receive feedback
- The generator aims to improve its fake data to fool the discriminator.
- The discriminator aims to get better at distinguishing real from fake.
- This process continues iteratively, with both networks improving over time.
Objective Function
The objective function of GAN is defined as follows (a standard form is sketched after the notation list), where:
- is the generator
- is the discriminator
- is the distribution of the input data
- is the distribution of noise
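The formula itself is not reproduced in this copy, so here is a sketch of the standard minimax form from the original GAN paper, with $p_{data}$ the data distribution and $p_z$ the noise distribution:
$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\ln D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\ln(1 - D(G(z)))]$$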
Facts
When the discriminator is optimal, minimizing the objective function with respect to the generator is equivalent to minimizing the JS-Divergence between the real data distribution and the generated distribution .
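To make this fact concrete (a sketch; $p_{data}$ and $p_g$ denote the real and generated distributions), the optimal discriminator for a fixed generator, and the value of the objective at that optimum, are:
$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}, \qquad V(G, D^{*}) = 2\,D_{JS}(p_{data} \| p_{g}) - 2\ln 2$$
so minimizing over the generator minimizes the JS-Divergence up to a constant.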
Link to original
Deep Convolutional GAN
Definition
DCGAN is a specific architecture of GAN. It was designed to improve the stability of GAN training and the quality of generated images. DCGANs use convolutional and deconvolutional layers in the discriminator and generator, respectively.
Link to original
Progressive Growing GAN
Definition
Progressive growing GAN (PGGAN) is a specific architecture of GAN. The model addresses the challenges of traditional GANs by growing both the generator and the discriminator progressively, from low resolution to high resolution, throughout the training process.
Architecture
The generator and discriminator are symmetric. New layers are added symmetrically to both networks as resolution increases. Convolutional layers are used throughout the network.
Training
Training starts with both D and G at a very low resolution. During training, new layers are gradually added to both D and G to increase the resolution.
Link to original
Conditional GAN
Definition
Conditional GAN (cGAN) is an extension of GAN that allows for generating data with specific attributes or conditions. In a standard GAN, the generator only takes random noise, while in a cGAN, the generator receives additional information to guide the generation process.
Architecture
The generator learns to create samples that match the given condition, and the discriminator learns to distinguish between real and fake samples, considering the condition.
Objective Function
The objective function of cGAN is defined as follows (a standard form is sketched after the notation list), where:
Link to original
- is the generator’s output given noise and condition
- is the discriminator’s output given input and condition
- is the condition
- is the distribution of the input data
- is the distribution of noise
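A sketch of the standard conditional objective (the conditioning variable is written as $y$ here, an assumed notation):
$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x, y}[\ln D(x \mid y)] + \mathbb{E}_{z, y}[\ln(1 - D(G(z \mid y) \mid y))]$$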
InfoGAN
Definition
Information maximizing GAN (InfoGAN) is an extension of the GAN architecture that aims to learn disentangled representations in an unsupervised manner. InfoGAN introduces a latent code in addition to the noise vector . This latent code is designed to capture interpretable and meaningful features of the generated data.
Architecture
The core idea of InfoGAN is to maximize the Mutual Information between the latent code and the generated samples . This encourages the model to learn representations corresponding to semantic features of the generated output. InfoGAN introduces an auxiliary network that attempts to recover the latent code from the generated samples.
Objective Function
GAN loss: Mutual Information and variational lower bound : where:
- is the generator
- is the discriminator
- is the auxiliary network
- is the distribution of the input data
- is the distribution of noise
- is the Entropy of the latent code distribution
The full objective function for InfoGAN is defined as follows (a standard form is sketched after the notation list), where:
Link to original
- is the standard GAN objective
- is the approximation of the mutual information between and
- is a hyperparameter controlling the importance of the mutual information term
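A sketch of how these pieces are usually combined, following the InfoGAN paper ($c$ is the latent code, $Q$ the auxiliary network, and $L_I$ the variational lower bound on the mutual information):
$$\min_{G, Q}\max_{D} V_{\text{InfoGAN}}(D, G, Q) = V_{\text{GAN}}(D, G) - \lambda L_{I}(G, Q), \qquad L_{I}(G, Q) = \mathbb{E}_{c \sim p(c),\, x \sim G(z, c)}[\ln Q(c \mid x)] + H(c)$$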
Semi-Supervised GAN
Definition
Semi-supervised GAN (SGAN) is a variation of GAN that uses both labeled and unlabeled data in the training process. The model aims to build a data-efficient classifier and to improve the generation quality.
Architecture
Objective Function
The objective function of SGAN consists of four losses: a supervised loss, an unsupervised loss, a generator loss, and a feature matching loss (standard forms are sketched at the end of this subsection). The supervised loss is calculated only for labeled samples and aims to classify the real samples into their correct classes. The unsupervised loss is calculated for both unlabeled real samples and generated samples, and encourages the discriminator to distinguish between real and fake samples. The generator loss is used to ensure the generated images are realistic. The feature matching loss encourages the generator to produce samples whose feature representations in the discriminator's intermediate layer are similar to those of real data.
Supervised loss: Unsupervised loss: Generator loss: Feature matching loss where:
- is the generator
- is the discriminator
- is the model’s predicted probability for the correct class.
- is the model’s predicted probability for the fake class.
- is the distribution of the labeled data
- is the distribution of the data (both labeled and unlabeled)
- is the distribution of noise
- is the intermediate layer of the discriminator
The full objective function for SGAN is defined as: where:
Link to original
- and are the parameters of the generator and discriminator, respectively
- is a hyperparameter that controls the weight of the feature matching loss.
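A sketch of what the supervised, unsupervised, and feature matching terms typically look like, following the common formulation with an extra "fake" class ($K$ is the number of real classes and $f(\cdot)$ the discriminator's intermediate features, both assumed notation):
$$\begin{aligned} L_{\text{sup}} &= -\mathbb{E}_{(x, y) \sim p_{\text{labeled}}}\,\ln p_{\text{model}}(y \mid x,\, y \le K) \\ L_{\text{unsup}} &= -\mathbb{E}_{x \sim p_{\text{data}}}\,\ln\big[1 - p_{\text{model}}(y = K{+}1 \mid x)\big] - \mathbb{E}_{z \sim p_{z}}\,\ln p_{\text{model}}(y = K{+}1 \mid G(z)) \\ L_{\text{FM}} &= \big\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{z \sim p_{z}} f(G(z)) \big\|_{2}^{2} \end{aligned}$$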
Wasserstein GAN
Definition
Wasserstein GAN (WGAN) is a variant of GAN that uses the Wasserstein Distance instead of the Jensen-Shannon Divergence used in traditional GAN. The Wasserstein distance provides smoother gradients everywhere.
Architecture
In WGAN, the discriminator of traditional GAN is replaced by a critic that is trained to approximate the Wasserstein Distance. The critic outputs a real number instead of a probability.
Since the Wasserstein Distance is highly intractable, the cost function is simplified using the Kantorovich-Rubinstein Duality, which requires the critic to be 1-Lipschitz continuous. To satisfy this condition, the weights of the critic are clipped.
Objective Function
The objective function of WGAN is defined as follows (a standard form is sketched after the notation list), where:
- is the generator
- is the critic
- is the distribution of the input data
- is the distribution of noise
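A sketch of the objective in its usual Kantorovich-Rubinstein dual form, with the maximization taken over 1-Lipschitz critics:
$$\min_{G}\max_{\|D\|_{L} \le 1} \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_{z}}[D(G(z))]$$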
WGAN-GP
Instead of clipping the weights, WGAN-GP penalizes the critic if its gradient norm moves away from the target norm value.
The additional gradient penalty term is added to the critic loss (a standard form is sketched below).
This enforces the Lipschitz constraint more effectively than weight clipping.
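A sketch of the penalty as it usually appears (in the WGAN-GP paper the target norm is 1; $\hat{x}$ denotes points interpolated between real and generated samples and $\lambda$ the penalty weight):
$$L_{\text{GP}} = \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_{2} - 1)^{2}\big]$$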
Algorithm
: the learning rate, : the clipping parameter, : the batch size, : the number of iterations of the critic per generator iteration. : the initial critic parameters. : the initial generator’s parameters.
While the parameters have not converged (a Python sketch of this loop follows the step list):
Link to original
- for :
- Sample a batch from the real data.
- Sample a batch from the noise distribution.
- Sample
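The step list above is truncated in this copy, so here is a minimal sketch of the original WGAN training loop with weight clipping (names such as `critic`, `gen`, the helper signature, and the assumption that the loader yields plain data tensors are illustrative; the original paper uses RMSprop for the passed-in optimizers):

```python
import torch

def wgan_train_step(critic, gen, real_loader, opt_c, opt_g,
                    clip=0.01, n_critic=5, z_dim=100, device="cpu"):
    # Critic updates: maximize E[D(x)] - E[D(G(z))], then clip weights to [-clip, clip].
    for _ in range(n_critic):
        x = next(iter(real_loader)).to(device)                # batch of real data
        z = torch.randn(x.size(0), z_dim, device=device)      # batch of noise
        loss_c = -(critic(x).mean() - critic(gen(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():                         # weight clipping (1-Lipschitz proxy)
            p.data.clamp_(-clip, clip)

    # Generator update: maximize E[D(G(z))], i.e. minimize -E[D(G(z))].
    z = torch.randn(x.size(0), z_dim, device=device)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_c.item(), loss_g.item()
```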
GAN-Based Image-to-Image Translation Models
Pix2pix
Definition
Pix2pix model is a Conditional GAN model designed for image-to-image translation tasks.
Architecture
The generator of the model has a U-Net architecture. The pix2pix model takes an input image and learns to generate a corresponding output image. The training set consists of paired images (an input image and its corresponding output image).
Objective Function
The objective function of the Pix2pix model is defined as follows (a standard form is sketched below).
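A sketch of the usual pix2pix objective, which combines the cGAN loss with an L1 reconstruction term ($\lambda$ weights the L1 term; $x$, $y$, $z$ denote input image, target image, and noise):
$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}\big[\| y - G(x, z) \|_{1}\big]$$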
Link to original
CycleGAN
Definition
CycleGAN is an unsupervised image-to-image translation model. The model can be trained without paired training examples.
Architecture
CycleGAN consists of two GANs, one for each translation direction, and is trained by minimizing the cycle consistency loss for images from both domains.
Objective Function
The objective function of CycleGAN consists of three losses: two adversarial losses (one per translation direction) and a cycle-consistency loss.
Adversarial loss (): Adversarial loss (): Cycle-consistency loss: where:
- is the generator
- is the discriminator
- is the distribution of the data from domain
- is the distribution of the data from domain
The full objective function for CycleGAN can be written as:
where is the weight for the cycle consistency loss.
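A sketch of these terms in their usual form (with $G: X \to Y$, $F: Y \to X$ and discriminators $D_X$, $D_Y$ as assumed notation):
$$\begin{aligned} \mathcal{L}_{cyc}(G, F) &= \mathbb{E}_{x}\big[\| F(G(x)) - x \|_{1}\big] + \mathbb{E}_{y}\big[\| G(F(y)) - y \|_{1}\big] \\ \mathcal{L}(G, F, D_X, D_Y) &= \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{cyc}(G, F) \end{aligned}$$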
Link to original
StarGAN
Definition
StarGAN is a GAN model designed for multi-domain image-to-image translation tasks. Unlike previous models that required separate networks for each domain pair, StarGAN can perform image translations across multiple domains using a single generator network.
Architecture
The generator takes an input image and a target domain label to produce the translated image, and the discriminator not only distinguishes between real and fake images but also classifies the domain of the input image.
Mask Vector
StarGAN introduces a mask vector to handle datasets with partial domain labels, allowing it to ignore unspecified labels during training.
Objective Function
The objective function of StarGAN consists of three losses: adversarial loss, domain classification loss, and reconstruction loss. The adversarial loss ensures the generated images are realistic, the domain classification loss helps the model learn domain-specific features, and the reconstruction loss ensures the original image can be reconstructed when translating back to the source domain.
Adversarial loss: Domain classification loss: Reconstruction loss: where:
- is the generator
- is the discriminator
- is the discriminator’s source prediction
- is the discriminator’s domain classification
- is the distribution of the input data
- is the target domain label.
- is the original domain label.
The full objective function for StarGAN can be written as follows (a standard form is sketched after the notation list), where:
Link to original
- and are the parameters of the generator and discriminator, respectively
- and are hyperparameters controlling the importance of each loss term
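A sketch of how these losses are usually combined into the discriminator and generator objectives, following the StarGAN paper's formulation ($\lambda_{cls}$ and $\lambda_{rec}$ are the hyperparameters mentioned above; the superscripts $r$ and $f$ mark the classification loss on real and fake images):
$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{r}_{cls}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{f}_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}$$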
StyleGAN
Definition
StyleGAN is a GAN architecture with a newly designed generator. The generator enables unsupervised separation of high-level attributes (e.g. pose) from stochastic variation in the generated images.
Architecture
Mapping Network
Instead of feeding random noise directly into the generator, StyleGAN uses a mapping network to map the input latent vector into an intermediate latent space.
Adaptive Instance Normalization (AdaIN)
The AdaIN operation is defined as where is the -th feature map, and are style scale and bias, and and are the mean and standard deviation of .
The style scale and bias are generated by a learned affine transformation of the intermediate latent vector, whose weight and bias are trained parameters.
It is used to inject the style information at each layer of the generator.
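A sketch of the operation as it is usually written (with $x_i$ the $i$-th feature map and $(y_{s,i}, y_{b,i})$ the style scale and bias, matching the description above):
$$\text{AdaIN}(x_{i}, y) = y_{s,i}\,\frac{x_{i} - \mu(x_{i})}{\sigma(x_{i})} + y_{b,i}$$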
Stochastic variation
The noise image () is added to each layer of the synthesis network. It provides the stochastic variation in the generated images.
Progressive Growing
The network starts by generating low-resolution images and progressively increases the resolution. We can control the rate of style change by controlling the stage at which AdaIN is applied.
Link to original
Neural Radiance Fields
Neural Radiance Fields
Definition
Neural Radiance Fields (NeRF) is a 3D scene representation and rendering model.
Architecture
NeRF represents a 3D scene as a continuous volumetric function, learned by a Neural Network. The function maps 3D coordinates and 2D viewing directions to color and density values. where is the 3D location, is the viewing direction, is the color, and is the volume density.
Volume Rendering
The color of a camera ray is computed by volume rendering (a standard form is sketched after the notation list), where:
- and are near and far bounds.
- is the camera ray through a pixel.
- denotes the accumulated transmittance along the ray from to .
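A sketch of the volume rendering integral as it usually appears (notation assumed from the NeRF paper, with $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ the camera ray, $\mathbf{c}$ the color, and $\sigma$ the volume density):
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$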
Loss Function
The loss function of the NeRF model is defined as follows (a sketch is given below), where is the set of rays in the training images, is the predicted color for points on the ray , and is the ground truth color for ray .
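A sketch of that loss in its simplest form (omitting the coarse/fine split used in the full NeRF model; $\mathcal{R}$, $\hat{C}(\mathbf{r})$, and $C(\mathbf{r})$ denote the set of rays, the predicted color, and the ground-truth color):
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_{2}^{2}$$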
Link to original