A neural network architecture that applies the transformer's self-attention mechanism directly to sequences of image patches, offering an alternative to convolutional neural networks for visual understanding. Vision transformers power the visual encoders in many modern multimodal AI systems.
A Vision Transformer (ViT) divides an input image into fixed-size, non-overlapping patches (typically 16x16 or 14x14 pixels), flattens and linearly projects each patch into an embedding, adds positional embeddings so the model knows where each patch came from, and processes the resulting sequence of patch tokens through standard transformer encoder layers. A learnable classification token ([CLS]) is prepended to the sequence and aggregates global information for the final prediction. Self-attention lets each patch attend to every other patch in every layer, capturing both local and global relationships.
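The patch-to-token steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a reference implementation: the helper names (`sequence_length`, `patchify`, `embed`) are hypothetical, the image is a small nested list, and the projection weights, [CLS] token, and positional embeddings are random or zero stand-ins for parameters that a real model learns.

```python
import random


def sequence_length(image_size, patch):
    """Number of tokens for a square image: one per patch, plus [CLS]."""
    return (image_size // patch) ** 2 + 1


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def embed(patches, weight, bias, cls_token, pos_embed):
    """Linearly project each patch, prepend [CLS], then add positional embeddings."""
    tokens = [cls_token[:]]
    for flat in patches:
        # weight holds one row of length patch_dim per output dimension
        tokens.append([
            sum(x * w for x, w in zip(flat, row)) + b
            for row, b in zip(weight, bias)
        ])
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, pos_embed)
    ]


# Toy sizes (assumptions for illustration): 8x8 RGB image, 4x4 patches, 6-dim embeddings.
rng = random.Random(0)
H = W = 8
C, P, D = 3, 4, 6
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = patchify(image, P)
patch_dim = P * P * C
weight = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(D)]
bias = [0.0] * D
cls_token = [0.0] * D  # learned in a real model
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(len(patches) + 1)]

tokens = embed(patches, weight, bias, cls_token, pos_embed)
print(len(patches), len(tokens), len(tokens[0]))  # 4 5 6
print(sequence_length(224, 16))  # 197
```

At a 224x224 input with 16x16 patches, the same arithmetic gives 196 patch tokens plus [CLS], so the encoder sees a sequence of 197 tokens. Because self-attention compares every token with every other token, its cost grows quadratically with that sequence length.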
ViT-Base uses 12 layers, a hidden size of 768, and 12 attention heads (about 86M parameters). ViT-Large uses 24 layers, a hidden size of 1024, and 16 attention heads (about 307M parameters). Variants include DeiT (data-efficient training), Swin Transformer (shifted window attention for efficiency), and EVA (ViTs pretrained at scale with masked image modeling). These models are typically pretrained on ImageNet-21K or larger datasets. ViT serves as the visual backbone in CLIP, BLIP-2, and other vision-language models.
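The parameter figures above can be roughly reproduced from the layer count and hidden size. The sketch below uses a hypothetical helper, `vit_param_count`, and assumes a common configuration that the text does not specify: 16x16 patches, 224x224 RGB input, an MLP hidden size of 4x the model width, biases on every linear layer, and a 1000-class linear head. Exact totals vary with those choices.

```python
def vit_param_count(layers, dim, mlp_ratio=4, patch=16, image_size=224,
                    channels=3, num_classes=1000):
    """Approximate parameter count for a plain ViT encoder with a linear head."""
    tokens = (image_size // patch) ** 2 + 1            # patch tokens + [CLS]
    patch_embed = patch * patch * channels * dim + dim  # linear projection + bias
    cls_and_pos = dim + tokens * dim                    # [CLS] token + positional table

    # Per-layer blocks. The number of heads does not change the count,
    # since heads only split the same dim-wide projections.
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)  # QKV + output projection
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    layer_norms = 2 * (2 * dim)                          # two LayerNorms, scale + shift

    encoder = layers * (attention + mlp + layer_norms)
    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return patch_embed + cls_and_pos + encoder + final_norm + head


base = vit_param_count(layers=12, dim=768)
large = vit_param_count(layers=24, dim=1024)
print(f"ViT-Base:  {base / 1e6:.1f}M")
print(f"ViT-Large: {large / 1e6:.1f}M")
```

Under these assumptions the Base estimate lands at roughly 86M and the Large estimate within a few million of the quoted figure. The remaining gap depends on details such as head size and input resolution.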