Definition

The Multiscale Vision Transformer (MViT) applies the idea of CvT to video inputs. In addition, it uses multi-head pooling attention to build a hierarchical structure, in which the resolution decreases and the channel width increases from stage to stage.
Architecture
Pooling Attention

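Pooling attention reduces the spatial resolution (for video, the spatiotemporal resolution) of Q, K, and V before self-attention is applied, and it is paired with an increase in the channel dimension. The reduction is performed by a convolutional layer with a stride greater than 1, so each pooled token summarizes a local neighborhood of input tokens and the attention matrix becomes smaller.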
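
The following is a minimal single-head sketch of this idea in NumPy, not the reference MViT implementation. It assumes a 2D token grid of shape (H, W, C) rather than a video volume, uses a depthwise strided convolution as the pooling operator, shares one pooling kernel between K and V, and places the channel increase in the linear projections. The names conv_pool and pooling_attention, and all shapes and strides, are hypothetical choices made for illustration. Multiple heads, the class token, and residual connections are omitted.

```python
import numpy as np


def conv_pool(x, weight, stride):
    """Depthwise strided convolution over an (H, W, C) token grid.

    weight has shape (k, k, C); zero padding of k // 2 is applied on each side.
    """
    k = weight.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, C = x.shape
    out_h = (H + 2 * pad - k) // stride + 1
    out_w = (W + 2 * pad - k) // stride + 1
    out = np.empty((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Each channel is filtered independently (depthwise).
            out[i, j] = (patch * weight).sum(axis=(0, 1))
    return out


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def pooling_attention(x, w_q, w_k, w_v, pool_q, pool_kv, stride_q, stride_kv):
    """Single-head pooling attention on an (H, W, C) grid (illustrative only)."""
    # Linear projections C -> D; choosing D > C is where this sketch widens channels.
    q = conv_pool(x @ w_q, pool_q, stride_q)
    k = conv_pool(x @ w_k, pool_kv, stride_kv)
    v = conv_pool(x @ w_v, pool_kv, stride_kv)  # assumption: K and V share a kernel
    out_hw = q.shape[:2]
    d = q.shape[-1]
    # Flatten each pooled grid into a token sequence.
    q, k, v = (t.reshape(-1, d) for t in (q, k, v))
    attn = softmax(q @ k.T / np.sqrt(d))  # shape: (pooled Q tokens, pooled K tokens)
    return (attn @ v).reshape(*out_hw, d)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))  # 8x8 grid, 16 channels
w_q, w_k, w_v = (rng.standard_normal((16, 32)) * 0.1 for _ in range(3))
pool_q = rng.standard_normal((3, 3, 32)) * 0.1
pool_kv = rng.standard_normal((3, 3, 32)) * 0.1

y = pooling_attention(x, w_q, w_k, w_v, pool_q, pool_kv, stride_q=2, stride_kv=4)
print(y.shape)  # (4, 4, 32)
```

In this sketch the query stride determines the output resolution (8x8 becomes 4x4), while the key/value stride only shrinks the attention matrix (16 query tokens attend to 4 key tokens here), which is why the two strides can be chosen independently.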