A neural network architecture based entirely on attention mechanisms, replacing recurrence and convolutions for sequence processing. Transformers are the foundation of virtually all modern language models, vision models, and multimodal AI systems.
The transformer consists of encoder and decoder stacks, each containing layers of multi-head self-attention and feed-forward networks with residual connections and layer normalization. The encoder processes the full input in parallel using bidirectional self-attention, while the decoder generates output tokens autoregressively using masked self-attention and cross-attention to the encoder output.
The original transformer uses learned positional embeddings, 6 encoder/decoder layers, 8 attention heads, and 512-dimensional representations. Modern variants include encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures scaled to billions of parameters. Key innovations include pre-norm layer ordering, RMSNorm, grouped query attention, and flash attention for efficient computation on modern hardware.
Connect a bucket and Mixpeek runs the whole multimodal search pipeline for you: extraction, indexing, and search over your own objects. No models to wire up, nothing to host.
Start with ManagedKeep your embeddings on your own cloud and run dense, sparse, and BM25 search directly on object storage. First 1M vectors free.
Start with MVS