A neural network architecture that applies the transformer's self-attention mechanism directly to image patches, offering an alternative to convolutional neural networks for visual understanding. Vision transformers power the visual encoders in modern multimodal AI systems.
A Vision Transformer divides an input image into fixed-size patches (typically 16×16 or 14×14 pixels), linearly projects each flattened patch into an embedding, adds positional embeddings, and processes the resulting sequence of patch tokens through standard transformer encoder layers. A learnable classification token ([CLS]) prepended to the sequence aggregates global information. Self-attention enables each patch to attend to all other patches, capturing both local and global relationships. A minimal sketch of this pipeline follows.
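The PyTorch sketch below illustrates the patchify → embed → add positions → encode → [CLS] pipeline just described. The `MiniViT` class name, the choice of a strided convolution for patch projection, and the default hyperparameters are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal Vision Transformer: patchify, embed, add positions, encode, classify from [CLS]."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Linear projection of non-overlapping patches, implemented as a strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)            # prepend the [CLS] token
        x = x + self.pos_embed                    # add learned positional embeddings
        x = self.encoder(x)                       # self-attention over all patch tokens
        return self.head(x[:, 0])                 # classify from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 1000)
```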
ViT-Base uses 12 layers, 768 dimensions, and 12 attention heads (86M parameters); ViT-Large uses 24 layers, 1024 dimensions, and 16 heads (307M parameters). Variants include DeiT (data-efficient training via distillation), Swin Transformer (shifted-window attention for efficiency), and EVA (large-scale masked visual representation learning). Models are typically pretrained on ImageNet-21K or larger datasets. ViT serves as the visual backbone in CLIP, BLIP-2, and other vision-language models.
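For illustration, the Base and Large hyperparameters above map onto the sketch from the previous paragraph as follows; the `VIT_CONFIGS` dictionary and its key names are assumptions made for this example.

```python
# Hypothetical config table mirroring the ViT-Base and ViT-Large hyperparameters above.
VIT_CONFIGS = {
    "ViT-Base/16":  dict(patch_size=16, dim=768,  depth=12, heads=12),   # ~86M parameters
    "ViT-Large/16": dict(patch_size=16, dim=1024, depth=24, heads=16),   # ~307M parameters
}

vit_large = MiniViT(**VIT_CONFIGS["ViT-Large/16"])
```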