Definition

Convolutional vision transformer (CvT) uses convolutional layers instead of the fully connected layers to improve its performance and efficiency.

Architecture

Convolutional Token Embedding

CvT replaced the patch embedding used in ViT with a convolutional layer.

Convolutional Projection

Instead of using linear transformation to project the input into query (Q), key (K), and value (V), convolutional layers are used

Squeezed Convolutional Projection

The squeezed convolutional projection is utilized to reduce computational complexity or to downsample the feature map.