Definition

ConvNeXt is a deep convolutional neural network (CNN) architecture. It was designed to bridge the gap between CNNs and Vision Transformers (ViTs) by incorporating design principles from transformers into a purely convolutional model.
Architecture

Macro Design
The overall structure is similar to ResNet, with four stages of processing. Each stage consists of multiple blocks, and the spatial resolution is downsampled between the stages.
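The four-stage layout can be traced with a little arithmetic. The sketch below is illustrative only: the text above does not give the downsampling factors or stage sizes, so the 4× stem stride, the 2× downsampling between stages, and the ConvNeXt-T style depths and widths are assumptions, and `stage_resolutions` is a hypothetical helper.

```python
def stage_resolutions(input_size, stem_stride=4, stage_stride=2, num_stages=4):
    """Spatial size (height = width) of the feature map in each stage.

    Assumed factors: the stem downsamples by `stem_stride`, and every later
    stage is preceded by a downsampling step of `stage_stride`.
    """
    sizes = []
    size = input_size // stem_stride
    for stage in range(num_stages):
        if stage > 0:
            size //= stage_stride  # downsample between stages
        sizes.append(size)
    return sizes


# Assumed ConvNeXt-T style configuration (not stated in the text above).
depths = (3, 3, 9, 3)          # number of blocks per stage
widths = (96, 192, 384, 768)   # channels per stage

resolutions = stage_resolutions(224)  # → [56, 28, 14, 7]
for stage, (d, c, r) in enumerate(zip(depths, widths, resolutions), start=1):
    print(f"stage {stage}: {d} blocks, {c} channels, {r}x{r} feature map")
```

Under these assumptions, each stage halves the resolution and doubles the channel width, the same pyramid shape as ResNet.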
Stem Layer
Instead of the traditional 7×7 convolution with stride 2, ConvNeXt uses a 4×4 convolution with stride 4 (non-overlapping) for initial downsampling. This mimics the patchify stem used in ViT.
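To make the difference between the two stems concrete, the sketch below applies the standard convolution output-size formula. The settings are assumptions based on the commonly cited configurations: a ResNet-style stem (7×7 convolution, stride 2, padding 3, followed by a 3×3 max pool with stride 2) and a patchify stem (4×4 convolution, stride 4, no padding). The helper names are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output length along one spatial axis of a convolution or pooling op."""
    return (size + 2 * padding - kernel) // stride + 1


def windows_overlap(kernel, stride):
    """Neighbouring windows share input pixels only if the kernel exceeds the stride."""
    return kernel > stride


image = 224

# Assumed ResNet-style stem: 7x7 conv (stride 2), then 3x3 max pool (stride 2).
resnet_conv = conv_out_size(image, kernel=7, stride=2, padding=3)        # 112
resnet_stem = conv_out_size(resnet_conv, kernel=3, stride=2, padding=1)  # 56

# Assumed patchify stem: a single 4x4 conv with stride 4.
patchify = conv_out_size(image, kernel=4, stride=4)                      # 56

print(resnet_stem, patchify)     # both stems downsample by 4x overall
print(windows_overlap(7, 2))     # True: sliding windows overlap
print(windows_overlap(4, 4))     # False: each 4x4 patch is seen exactly once
```

Both stems reach the same resolution, but the patchify stem does so in one step, with every input pixel belonging to exactly one patch.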
ConvNeXt Block

The ConvNeXt block is the basic building block of the network and a modernized version of the ResNet block. It consists of the following layers in sequence:
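- Depthwise convolution, inspired by ResNeXt, with a large (7×7) kernel that increases the receptive field and improves performance
- Layer normalization (LayerNorm)
- Pointwise (1×1) convolution that expands the channel dimension; together with the depthwise convolution, it forms a depthwise separable convolution
- GELU activation function
- Pointwise (1×1) convolution that reduces the channel dimension back to the block's input width
- Skip connection that adds the block's input to its output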
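
The layer sequence above can be sketched as a forward pass in plain Python on nested lists. This is a minimal illustration, not the reference implementation. The 4× expansion ratio and 7×7 kernel in the usage example are assumptions. Biases, the learnable LayerNorm scale and shift, and any extras used in practice (such as layer scale or stochastic depth) are omitted. All function names are hypothetical.

```python
import math
import random


def gelu(v):
    """Exact GELU: v * Phi(v), where Phi is the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def depthwise_conv(x, kernels):
    """x: [C][H][W], kernels: [C][k][k] with odd k. Each channel is filtered
    by its own kernel; zero padding k // 2 and stride 1 preserve H and W."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    p = k // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - p, j + dj - p
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += x[c][ii][jj] * kernels[c][di][dj]
                out[c][i][j] = acc
    return out


def layer_norm(v, eps=1e-6):
    """Normalize one pixel's channel vector (no learnable scale or shift here)."""
    mean = sum(v) / len(v)
    var = sum((t - mean) ** 2 for t in v) / len(v)
    return [(t - mean) / math.sqrt(var + eps) for t in v]


def pointwise(v, weight):
    """1x1 convolution at a single pixel: weight is [out_channels][in_channels]."""
    return [sum(w * t for w, t in zip(row, v)) for row in weight]


def convnext_block(x, dw_kernels, w_expand, w_reduce):
    C, H, W = len(x), len(x[0]), len(x[0][0])
    y = depthwise_conv(x, dw_kernels)            # 1. depthwise convolution
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            v = [y[c][i][j] for c in range(C)]   # channel vector at pixel (i, j)
            v = layer_norm(v)                    # 2. layer norm over channels
            v = pointwise(v, w_expand)           # 3. expand channels
            v = [gelu(t) for t in v]             # 4. GELU
            v = pointwise(v, w_reduce)           # 5. reduce channels
            for c in range(C):
                out[c][i][j] = x[c][i][j] + v[c]  # 6. skip connection
    return out


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Usage with small, assumed sizes: 4 channels, 5x5 feature map, 7x7 kernel, 4x expansion.
random.seed(0)
C, H, W, K, EXPANSION = 4, 5, 5, 7, 4
x = [rand_matrix(H, W) for _ in range(C)]
dw_kernels = [rand_matrix(K, K) for _ in range(C)]
w_expand = rand_matrix(EXPANSION * C, C)
w_reduce = rand_matrix(C, EXPANSION * C)

y = convnext_block(x, dw_kernels, w_expand, w_reduce)
print(len(y), len(y[0]), len(y[0][0]))  # 4 5 5: the block preserves the input shape
```

Only the depthwise convolution mixes information across spatial positions; the two pointwise convolutions mix information across channels at each pixel independently. Because the block preserves its input shape, blocks can be stacked freely within a stage.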