Definition

The Vision Transformer (ViT) applies the Transformer architecture to vision tasks. The model treats an image as a sequence of patches.
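As a rough illustration of what "a sequence of patches" means, the sketch below splits a small single-channel image into non-overlapping square patches and flattens each one into a vector. The function name `image_to_patches` and the nested-list image representation are assumptions made for this example. A real ViT implementation would typically work on tensors and follow this step with a learned linear projection.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Hypothetical helper for illustration only; patches are returned in
    row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image whose pixel values are 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
sequence = image_to_patches(toy_image, patch_size=2)
print(len(sequence), len(sequence[0]))
```

With a 4x4 image and 2x2 patches, the sequence has 4 elements of length 4 each. The sequence length grows with the image area divided by the patch area.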

Architecture

Positional Embedding

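ViT does not use a pre-designed positional encoding; instead, it leaves the positional embedding as a learnable parameter. By doing so, ViT does not build a spatial inductive bias into its position information, unlike a CNN, whose convolutions assume locality and translation equivariance.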
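The learnable positional embedding described above can be sketched as a table with one vector per patch position. The table is added to the patch embeddings and then updated by gradient descent like any other weight. Everything named here (`init_position_embedding`, `add_position_embedding`, `sgd_step`, the 0.02 initialization scale, and plain SGD as the optimizer) is an assumption chosen for illustration, and the class token is left out for brevity.

```python
import random


def init_position_embedding(num_patches, dim, seed=0):
    """Create a randomly initialized table: one dim-sized vector per position.

    The small Gaussian scale (0.02) is an assumed, illustrative choice.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_patches)]


def add_position_embedding(patch_embeddings, position_embedding):
    """Add the position vector for slot i to the patch embedding in slot i."""
    if len(patch_embeddings) != len(position_embedding):
        raise ValueError("need exactly one position vector per patch")
    return [
        [x + p for x, p in zip(token, position)]
        for token, position in zip(patch_embeddings, position_embedding)
    ]


def sgd_step(position_embedding, grad, learning_rate):
    """Update the table with one plain gradient-descent step.

    Because the output is (patch + position), the gradient with respect to
    the table equals the gradient with respect to the summed output.
    """
    return [
        [p - learning_rate * g for p, g in zip(position, grad_row)]
        for position, grad_row in zip(position_embedding, grad)
    ]


# Four patches, each already projected to a 3-dimensional embedding.
patch_embeddings = [[1.0, 1.0, 1.0] for _ in range(4)]
position_embedding = init_position_embedding(num_patches=4, dim=3)
tokens = add_position_embedding(patch_embeddings, position_embedding)
print(len(tokens), len(tokens[0]))
```

Nothing in the table encodes that position 0 is next to position 1. Any notion of spatial layout has to emerge from training, which is the contrast with a fixed, hand-designed encoding.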