Definition

The Multiscale Vision Transformer (MViT) applies the idea of CvT to video inputs. It also uses multi-head pooling attention to build a hierarchical structure.

Architecture

Pooling Attention

Pooling attention reduces the spatial resolution of Q, K, and V before self-attention is applied, while the channel dimension grows as the resolution shrinks. The reduction is performed by a convolutional layer with a stride greater than 1.
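
To make the effect of the stride concrete, here is a minimal sketch in plain Python. It only illustrates the resolution reduction: the learned linear projections, the multiple heads, and the channel increase are left out. The function names (`strided_depthwise_conv`, `pooling_attention`), the fixed averaging kernel, and the toy 4 x 4 grid are assumptions made for illustration, not the MViT implementation, which would use a tensor library and learned kernels.

```python
import math

def strided_depthwise_conv(grid, kernel, stride):
    """Apply one k x k kernel to every channel of an H x W x C grid (no padding)."""
    k = len(kernel)
    height, width, channels = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for top in range(0, height - k + 1, stride):
        row = []
        for left in range(0, width - k + 1, stride):
            token = [
                sum(kernel[i][j] * grid[top + i][left + j][c]
                    for i in range(k) for j in range(k))
                for c in range(channels)
            ]
            row.append(token)
        out.append(row)
    return out

def flatten(grid):
    """Turn an H x W x C grid into a list of H*W tokens."""
    return [token for row in grid for token in row]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooling_attention(grid, kernel, stride_q, stride_kv):
    """Pool Q, K, and V with a strided convolution, then run self-attention."""
    # Assumption: Q, K, and V share the input and the kernel (no projections).
    q = flatten(strided_depthwise_conv(grid, kernel, stride_q))
    k = flatten(strided_depthwise_conv(grid, kernel, stride_kv))
    v = flatten(strided_depthwise_conv(grid, kernel, stride_kv))
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_tok in q:
        scores = [scale * sum(a * b for a, b in zip(q_tok, k_tok)) for k_tok in k]
        weights = softmax(scores)
        out.append([sum(w * v_tok[c] for w, v_tok in zip(weights, v))
                    for c in range(len(q_tok))])
    return out

# Toy input: a 4 x 4 grid of tokens with 2 channels each (hypothetical values).
grid = [[[float(r + c), float(r * c)] for c in range(4)] for r in range(4)]
kernel = [[0.25, 0.25], [0.25, 0.25]]  # fixed averaging kernel, not learned

out = pooling_attention(grid, kernel, stride_q=2, stride_kv=2)
print(len(flatten(grid)), "tokens in,", len(out), "tokens out")
```

With a stride of 2 on Q, the 16 input tokens become 4 output tokens, because the output length of attention follows the number of queries. Pooling only K and V (for example `stride_q=1, stride_kv=2`) keeps more output tokens while still shrinking the attention matrix.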
