Definition

The Vision Transformer (ViT) applies the Transformer architecture to vision tasks. The model treats an image as a sequence of patches.
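
To make the "sequence of patches" idea concrete, here is a minimal sketch in plain Python. The function name `patchify`, the nested-list image layout (height x width x channels), and the toy 4x4 image are assumptions for illustration only; a real implementation would normally use an array library and follow the split with a linear projection.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into a list of flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append every channel of this pixel
            patches.append(patch)
    return patches


# Toy 4x4 image with 3 channels; each pixel stores [row, col, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each of length 2 * 2 * 3 = 12
```

Each flattened patch plays the role that a word token plays in a text Transformer.
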
Architecture
Positional Embedding

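ViT does not use a pre-designed (fixed) positional encoding; instead, it leaves the positional embedding as a learnable parameter. As a result, ViT builds in no prior assumption about how patch positions relate to one another, unlike a CNN, whose convolutions hard-code locality and translation equivariance. Any spatial structure has to be learned from data.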
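
The sketch below illustrates what "learnable" means here: the positional embedding is just a table with one vector per position, added to the patch tokens and updated by gradient descent like any other weight. All names (`init_position_embedding`, `add_position_embedding`, `sgd_step`), the initialisation scale `std=0.02`, the learning rate, and the placeholder gradients are assumptions for illustration, not values taken from the text above.

```python
import random


def init_position_embedding(num_positions, dim, std=0.02, seed=0):
    """Create one trainable vector per position (random init; std is an assumed value)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_positions)]


def add_position_embedding(tokens, pos_embedding):
    """Add the position vector for index i to the i-th patch token."""
    if len(tokens) != len(pos_embedding):
        raise ValueError("need exactly one position vector per token")
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(tokens, pos_embedding)]


def sgd_step(pos_embedding, grads, lr=0.1):
    """Update the table like any other weight: p <- p - lr * grad."""
    return [[p - lr * g for p, g in zip(pos, grad)]
            for pos, grad in zip(pos_embedding, grads)]


num_patches, dim = 4, 3
tokens = [[0.0] * dim for _ in range(num_patches)]  # stand-in patch embeddings
pos_embedding = init_position_embedding(num_patches, dim)
embedded = add_position_embedding(tokens, pos_embedding)

# Placeholder gradients; a real model would obtain these from backpropagation.
grads = [[1.0] * dim for _ in range(num_patches)]
updated = sgd_step(pos_embedding, grads)
```

Nothing in the table encodes that position 1 is next to position 2; the vectors start random, and any notion of neighbourhood has to emerge during training.
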