VGG Net
Definition
The VGG model is a deep Convolutional Neural Network architecture. It is characterized by its depth and uniformity, using small $3\times3$ convolution filters throughout. It consists of a series of convolutional layers followed by fully connected layers.
Inception Net
Definition
The Inception Net model is a deep Convolutional Neural Network architecture built around the inception module.
Architecture
Inception Net V1
Inception Module
The inception module is the building block of the Inception Net. It uses multiple filter sizes ($1\times1$, $3\times3$, and $5\times5$) and pooling operations in parallel, allowing the network to capture features at different scales simultaneously.
The $1\times1$ convolutions are used for dimensionality reduction, helping to reduce computational complexity.
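A minimal sketch of such a module, assuming PyTorch; the branch channel counts here are illustrative placeholders, not a prescribed configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception-v1 style module: parallel branches with 1x1 reductions."""
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 branch
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        # 1x1 reduction followed by 3x3
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, kernel_size=1), nn.ReLU(inplace=True),
                                nn.Conv2d(96, 128, kernel_size=3, padding=1))
        # 1x1 reduction followed by 5x5
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(inplace=True),
                                nn.Conv2d(16, 32, kernel_size=5, padding=2))
        # 3x3 max pooling followed by 1x1 projection
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # Run all branches in parallel and concatenate along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```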
Inception Net V2, V3
Factorized Convolution
The large $5\times5$ convolutions in the inception module were replaced with two stacked $3\times3$ convolutions, reducing parameters and computational cost.
Asymmetric Convolution
The $n\times n$ convolutions are decomposed into a $1\times n$ convolution followed by an $n\times1$ convolution.
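A sketch of both factorizations, assuming PyTorch; channel counts are placeholders.

```python
import torch.nn as nn

in_ch, out_ch = 64, 64  # placeholder channel counts

# Factorized convolution: one 5x5 replaced by two stacked 3x3 convolutions
factorized_5x5 = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
)

# Asymmetric convolution: one 7x7 decomposed into 1x7 followed by 7x1
asymmetric_7x7 = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=(7, 1), padding=(3, 0)),
)
```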
Label Smoothing
Label smoothing is applied to the labels to prevent the model from becoming overconfident: $y' = (1-\epsilon)\,y + \epsilon/K$, where $y'$ is the smoothed label, $y$ is the original one-hot encoded label, $\epsilon$ is the smoothing parameter, and $K$ is the number of classes.
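A small sketch of the formula above in PyTorch; the smoothing value $\epsilon = 0.1$ is only an example.

```python
import torch

def smooth_labels(y_onehot: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Apply label smoothing: y' = (1 - eps) * y + eps / K."""
    num_classes = y_onehot.size(-1)
    return (1.0 - eps) * y_onehot + eps / num_classes

# Example: 3 classes, true class is index 1
y = torch.tensor([0.0, 1.0, 0.0])
print(smooth_labels(y))  # tensor([0.0333, 0.9333, 0.0333])
```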
ResNet
Definition
ResNet is a deep Convolutional Neural Network architecture. It was designed to address the degradation problem in very deep neural networks.
Architecture
Skip Connection
The core innovation of ResNet is the introduction of skip connections (also called shortcut or residual connections). These connections allow the network to bypass one or more layers, creating a direct path for information flow. The shortcut performs an identity mapping, allowing the network to easily learn the identity function if needed.
The residual block is represented as $y = F(x) + x$, where $x$ is the input to the block, $F(x)$ is the learnable residual mapping (typically consisting of multiple layers), and $y$ is the output of the block.
Skip connections create an implicit mixture of deep and shallow models: $n$ skip connections make $2^n$ possible paths, where each path could have up to $n$ modules.
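A minimal sketch of a basic residual block in PyTorch, assuming the identity-shortcut case where input and output shapes match.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, with F = Conv-BN-ReLU-Conv-BN and an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection: add the input back to the learned residual F(x)
        return self.relu(self.residual(x) + x)
```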
Bottleneck Block
The bottleneck architecture is used in deeper versions of ResNet to improve computational efficiency while maintaining or increasing the network’s representational power. The bottleneck block consists of three convolutions in sequence: $1\times1$, $3\times3$, and $1\times1$. The first $1\times1$ convolution reduces the number of channels, the $3\times3$ convolution operates on the reduced representation, and the second $1\times1$ convolution increases the number of channels back to the original.
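A sketch of the bottleneck block in PyTorch, assuming a channel reduction factor of 4 as in the original ResNet.

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip connection."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction  # reduced channel count
        self.residual = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),              # reduce channels
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),        # spatial convolution
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),              # restore channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + x)
```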
Architecture Variants
Full Pre-Activation
Full pre-activation is an improvement to the original ResNet architecture. This modification aims to improve the flow of information through the network and make training easier. In full pre-activation, the order of operations in each residual block is changed to move the batch normalization and activation functions before the convolutions.
WideResNet
WideResNet increases the number of channels in the residual blocks rather than increasing the network’s depth.
ResNeXt
The ResNeXt model substitutes the $3\times3$ convolution of the ResNet residual block with a grouped convolution. ResNeXt achieves better performance than ResNet with the same number of parameters, thanks to its more efficient use of model capacity through grouped convolutions.
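In PyTorch, a grouped convolution is just the `groups` argument of `nn.Conv2d`; the sketch below swaps the bottleneck's $3\times3$ convolution for a grouped one. The channel counts and cardinality of 32 are illustrative values, not a fixed specification.

```python
import torch.nn as nn

channels, mid, cardinality = 256, 128, 32  # illustrative values

resnext_branch = nn.Sequential(
    nn.Conv2d(channels, mid, kernel_size=1, bias=False),
    nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
    # Grouped 3x3 convolution: each of the 32 groups convolves mid/32 channels
    nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=cardinality, bias=False),
    nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
    nn.Conv2d(mid, channels, kernel_size=1, bias=False),
    nn.BatchNorm2d(channels),
)
```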
DenseNet
Definition
DenseNet is a deep Convolutional Neural Network architecture.
Architecture
In DenseNet, each layer’s input consists of the feature maps from all preceding layers, not just the immediately previous layer, allowing significant feature reuse and more compact representations. The dense connections also facilitate better gradient flow during backpropagation, making it easier to train deeper networks. The dense layers adopt the full pre-activation structure.
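A sketch of a dense block in PyTorch; each layer receives the concatenation of all earlier feature maps and contributes `growth_rate` new channels. The growth rate and layer count here are illustrative.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees all preceding feature maps (full pre-activation: BN-ReLU-Conv)."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate  # input width grows as features are concatenated
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate every preceding feature map along the channel dimension
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```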
SENet
Definition
SENet is a deep Convolutional Neural Network architecture that explicitly models interdependencies between channels.
Architecture
SE Block
The SE block consists of two operations: squeeze and excitation. The original feature maps are rescaled using the channel-wise weights produced by the excitation operation. SE blocks can be integrated into various existing architectures rather than forming a standalone network.
Squeeze Operation
The squeeze operation aggregates feature maps across spatial dimensions to produce a channel descriptor. It’s typically done using Global Average Pooling. For a given feature map of size $H \times W \times C$, the squeeze operation produces a vector of size $C$.
Excitation Operation
The excitation operation takes the output of the squeeze operation and produces a set of per-channel modulation weights. It’s typically implemented using two fully connected layers with a non-linearity in between: $s = \sigma(W_2\,\delta(W_1 z))$, where $\sigma$ is a Sigmoid Function, $\delta$ is the non-linearity (ReLU), and $W_1$ and $W_2$ are learnable parameters.
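A sketch of the SE block in PyTorch, assuming a reduction ratio of 16 for the excitation bottleneck, which is a commonly used value.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global average pooling), excitation (two FC layers), then rescale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # H x W x C -> 1 x 1 x C
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)           # channel descriptor
        s = self.excitation(z).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * s                             # rescale the original feature maps
```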
MobileNet
Definition
MobileNet is a lightweight deep Convolutional Neural Network architecture.
Architecture
MobileNet V1
MobileNet v1 introduced depthwise separable convolutions, which consist of two operations: a depthwise convolution and a pointwise convolution. This significantly reduces computational cost and model size while maintaining model performance.
Inspired by the grouped convolutions of ResNeXt, the depthwise convolution applies a single filter to each input channel, aggregating spatial information only. The pointwise convolution uses $1\times1$ convolutions to combine the outputs from the depthwise step, mixing information channel-wise.
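A sketch of a depthwise separable convolution in PyTorch; setting `groups=in_ch` gives one filter per input channel.

```python
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch):
    """Depthwise 3x3 (spatial mixing per channel) followed by pointwise 1x1 (channel mixing)."""
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups == in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution combines information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```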
MobileNet V2
MobileNet v2 introduced the inverted residual block, which consists of three layers: an expansion layer, a depthwise convolution, and a projection layer. The expansion layer expands the input to a higher-dimensional space with $t$ times as many channels, and the projection layer reduces the channel size back down, where $t$ is the expansion factor. Some of the ReLU activation functions in the narrow layers are replaced with other activations (ReLU6 or linear) to prevent information loss.
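A sketch of the inverted residual block in PyTorch, assuming an expansion factor of 6 and a linear projection (no activation after the last $1\times1$), with a skip connection only when shapes match.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand (1x1) -> depthwise 3x3 -> project (1x1, linear), with an optional skip."""
    def __init__(self, in_ch, out_ch, expansion=6, stride=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),              # expansion layer
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride, padding=1,
                      groups=hidden, bias=False),                             # depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),             # projection layer (linear)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```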
MobileNet V3
MobileNet v3 inserts an SE block inside the inverted residual block.
The sigmoid functions used for the SE block are substituted with the computationally lighter hard sigmoid function, and the ReLU used in MobileNet v2 is replaced with the hard swish activation function.
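A small sketch of these piecewise-linear approximations; PyTorch also provides them directly as `nn.Hardsigmoid` and `nn.Hardswish`.

```python
import torch.nn.functional as F

def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid: ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # h-swish(x) = x * hard_sigmoid(x)
    return x * hard_sigmoid(x)
```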
The model architecture is optimized using the AutoML technique of neural architecture search (NAS).
EfficientNet
Definition
EfficientNet is a deep Convolutional Neural Network architecture. It suggests a method for scaling CNNs to achieve better performance while maintaining efficiency.
Architecture
As baseline models for EfficientNet, ResNet and MobileNet are considered.
Compound Scaling
EfficientNet introduces a compound scaling method that uniformly scales network width, depth, and resolution using a compound coefficient $\phi$.
- Perform a grid search over the depth $\alpha$, the width $\beta$, and the resolution $\gamma$ for a fixed $\phi = 1$.
- Control the model size by changing $\phi$, with $\alpha$, $\beta$, and $\gamma$ fixed.
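A small sketch of the scaling rule described above, using the grid-search constants reported in the EfficientNet paper ($\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$) as example values.

```python
# Compound scaling: depth = alpha**phi, width = beta**phi, resolution = gamma**phi,
# subject to alpha * beta**2 * gamma**2 ~= 2 (found by grid search at phi = 1).
alpha, beta, gamma = 1.2, 1.1, 1.15  # example constants from the grid search

def compound_scale(phi):
    depth_mult = alpha ** phi        # scale the number of layers
    width_mult = beta ** phi         # scale the number of channels
    resolution_mult = gamma ** phi   # scale the input image resolution
    return depth_mult, width_mult, resolution_mult

print(compound_scale(1))  # (1.2, 1.1, 1.15)
```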
ConvNeXt
Definition
ConvNeXt is a deep Convolutional Neural Network architecture. It was designed to bridge the gap between CNNs and Vision Transformers by incorporating some of the design principles from transformers into a purely convolutional model.
Architecture
Macro Design
The overall structure is similar to ResNet, with four stages of processing. Each stage consists of multiple blocks, and the spatial resolution is downsampled between the stages.
Stem Layer
Instead of the traditional $7\times7$ convolution with stride $2$, ConvNeXt uses a $4\times4$ convolution with stride $4$ (non-overlapping) for initial downsampling. It mimics the patchify stem used in ViT.
ConvNeXt Block
ConvNeXt block is the basic building block of the network. It is a modernized version of the ResNet block. It consists of the following layers in sequence:
- $7\times7$ Depthwise Convolution inspired by ResNeXt, increasing the receptive field and performance
- Layer Norm
- $1\times1$ Pointwise convolution to increase the channel dimension (forming a depthwise separable convolution)
- GELU activation function
- $1\times1$ Pointwise convolution to reduce the channel dimension back
- Skip connection
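A sketch of this block in PyTorch, assuming the channel expansion ratio of 4 used in the paper; the pointwise convolutions are written as linear layers on a channels-last tensor, which is equivalent to $1\times1$ convolutions.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 -> LayerNorm -> 1x1 expand -> GELU -> 1x1 reduce -> skip connection."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                    # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, expansion * dim)   # 1x1 conv as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):                  # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)          # to channels-last for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)          # back to channels-first
        return shortcut + x                # skip connection
```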





















