Definition

ConvNeXt is a deep Convolutional Neural Network architecture. It was designed to bridge the gap between CNN and Vision Transformer by incorporating some of the design principles from transformers into a purely convolutional model.

Architecture

Macro Design

The overall structure is similar to ResNet, with four stages of processing. Each stage consists of multiple blocks, and the spatial resolution is downsampled between the stages.

Stem Layer

Instead of the traditional convolution with stride , ConvNext uses a convolution with stride (non-overlapping) for initial downsampling. It mimics the patchify stem used in ViT.

ConvNeXt Block

ConvNeXt block is the basic building block of the network. It is a modernized version of the ResNet block. It consists of the following layers in sequence: