Definition

Convolutional vision transformer (CvT) uses convolutional layers instead of the fully connected layers to improve its performance and efficiency.

Architecture

CvT replaced the patch embedding used in ViT with a convolutional layer.

Instead of using linear transformation to project the input into query (Q), key (K), and value (V), convolutional layers are used

The squeezed convolutional projection is utilized to reduce computational complexity or to downsample the feature map.