Definition

ConvNeXt is a deep convolutional neural network (CNN) architecture. It was designed to close the gap between CNNs and Vision Transformers (ViTs) by incorporating several transformer design principles into a purely convolutional model.

Architecture

Macro Design

The overall structure is similar to ResNet, with four stages of processing. Each stage consists of multiple blocks, and the spatial resolution is downsampled between the stages.
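The four-stage layout can be made concrete with a little shape arithmetic. The following is a minimal sketch, not a reference implementation: the block counts and channel widths are assumed values (they mirror the commonly cited "tiny" variant and do not come from this note), and a stem stride of $4$ followed by a factor of $2$ between stages is also an assumption.

```python
# Assumed configuration for illustration only (not taken from this note).
STAGE_DEPTHS = (3, 3, 9, 3)            # blocks per stage
STAGE_CHANNELS = (96, 192, 384, 768)   # channel width per stage
STEM_STRIDE = 4                        # downsampling done by the stem
INTER_STAGE_STRIDE = 2                 # assumed downsampling between stages


def stage_shapes(image_size, depths=STAGE_DEPTHS, channels=STAGE_CHANNELS):
    """Return (stage index, block count, channels, spatial size) for each stage."""
    size = image_size // STEM_STRIDE
    shapes = []
    for i, (depth, ch) in enumerate(zip(depths, channels)):
        if i > 0:
            # Resolution is reduced between stages, not inside them.
            size //= INTER_STAGE_STRIDE
        shapes.append((i + 1, depth, ch, size))
    return shapes


for stage, depth, ch, size in stage_shapes(224):
    print(f"stage {stage}: {depth} blocks, {ch} channels, {size}x{size}")
```

Under these assumptions a $224 \times 224$ input yields feature maps of side $56$, $28$, $14$ and $7$ across the four stages.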

Stem Layer

Instead of the traditional $7 \times 7$ convolution with stride $2$, ConvNeXt uses a $4 \times 4$ convolution with stride $4$ for initial downsampling. Because the stride equals the kernel size, the convolution windows do not overlap, which mimics the patchify stem used in ViT.
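The difference between the two stems can be checked with the standard convolution output-size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below is only illustrative: the helper names are hypothetical, and the padding of $3$ for the $7 \times 7$ stem is an assumption (it is the usual ResNet choice, but this note does not state it).

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1


def windows_overlap(kernel, stride):
    """Neighbouring windows share pixels only when the kernel exceeds the stride."""
    return kernel > stride


# Traditional stem: 7x7 kernel, stride 2 (padding 3 is an assumption).
classic = conv_output_size(224, kernel=7, stride=2, padding=3)

# Patchify-style stem: 4x4 kernel, stride 4, no padding.
patchify = conv_output_size(224, kernel=4, stride=4)

print(classic, windows_overlap(7, 2))    # 112 True
print(patchify, windows_overlap(4, 4))   # 56 False
```

With a $224 \times 224$ input, the patchify-style stem reduces the resolution to $56 \times 56$ in a single step, and every input pixel belongs to exactly one $4 \times 4$ patch.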

ConvNeXt Block

The ConvNeXt block is the basic building block of the network and a modernized version of the ResNet block. It consists of the following layers in sequence:

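$7 \times 7$ depthwise convolution, inspired by ResNeXt, whose large kernel increases the receptive field and improves performance
Layer normalization (LayerNorm)
$1 \times 1$ pointwise convolution to increase the channel dimension (together with the depthwise convolution above, this forms a depthwise separable convolution)
GELU activation function
$1 \times 1$ pointwise convolution to reduce the channel dimension back to its original size
Skip connection that adds the block input to the result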
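The layer sequence above can be sketched as a forward pass in NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: the expansion factor of $4$, the tanh approximation of GELU, the random initialization, and all function and parameter names are assumptions, and anything not listed above is left out.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (an assumption; the exact form uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, gamma, beta, eps=1e-6):
    # x has shape (C, H, W); normalize over channels at each spatial position
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma[:, None, None] * (x - mean) / np.sqrt(var + eps) + beta[:, None, None]


def depthwise_conv7x7(x, weight, bias):
    # weight has shape (C, 7, 7): one 7x7 filter per channel, no channel mixing
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (3, 3), (3, 3)))  # zero padding keeps H and W
    out = np.zeros_like(x)
    for di in range(7):
        for dj in range(7):
            out += weight[:, di, dj, None, None] * padded[:, di:di + h, dj:dj + w]
    return out + bias[:, None, None]


def pointwise_conv(x, weight, bias):
    # A 1x1 convolution is a per-pixel linear map; weight has shape (C_out, C_in)
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def convnext_block(x, p):
    y = depthwise_conv7x7(x, p["dw_w"], p["dw_b"])   # spatial mixing
    y = layer_norm(y, p["ln_g"], p["ln_b"])
    y = pointwise_conv(y, p["pw1_w"], p["pw1_b"])    # expand channels
    y = gelu(y)
    y = pointwise_conv(y, p["pw2_w"], p["pw2_b"])    # reduce channels
    return x + y                                     # skip connection


def init_params(channels, expansion=4, seed=0):
    # expansion=4 is an assumed ratio for the hidden width
    rng = np.random.default_rng(seed)
    hidden = channels * expansion
    return {
        "dw_w": rng.standard_normal((channels, 7, 7)) * 0.02,
        "dw_b": np.zeros(channels),
        "ln_g": np.ones(channels),
        "ln_b": np.zeros(channels),
        "pw1_w": rng.standard_normal((hidden, channels)) * 0.02,
        "pw1_b": np.zeros(hidden),
        "pw2_w": rng.standard_normal((channels, hidden)) * 0.02,
        "pw2_b": np.zeros(channels),
    }


x = np.random.default_rng(1).standard_normal((8, 14, 14))
out = convnext_block(x, init_params(channels=8))
print(out.shape)  # same shape as the input: (8, 14, 14)
```

Because of the skip connection, the block preserves the input shape, so blocks can be stacked freely within a stage.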
