Definition

Convolutional vision transformer (CvT) uses convolutional layers instead of the fully connected layers to improve its performance and efficiency.
Architecture
Convolutional Token Embedding
CvT replaced the patch embedding used in ViT with a convolutional layer.
Convolutional Projection

Instead of using linear transformation to project the input into query (Q), key (K), and value (V), convolutional layers are used
Squeezed Convolutional Projection

The squeezed convolutional projection is utilized to reduce computational complexity or to downsample the feature map.