A neural network architecture based entirely on attention mechanisms, replacing recurrence and convolutions for sequence processing. Transformers are the foundation of virtually all modern language models, vision models, and multimodal AI systems.
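The core operation is scaled dot-product attention, which computes softmax(QKᵀ/√d_k)·V over query, key, and value matrices. Below is a minimal sketch in PyTorch; the function name and the mask convention (0 = blocked position) are illustrative choices, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k); mask: broadcastable, 0 = blocked position.
    d_k = q.size(-1)
    # Compare every query against every key, scaling by sqrt(d_k) to keep
    # the softmax in a well-conditioned range as dimensionality grows.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        # Blocked positions (e.g. future tokens under a causal mask)
        # receive -inf, so their attention weight is exactly zero.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # weighted sum of value vectors
```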
The transformer consists of encoder and decoder stacks, each built from identical layers that pair a multi-head self-attention sublayer with a position-wise feed-forward network, wrapped in residual connections and layer normalization (a single encoder layer is sketched below). The encoder processes the full input in parallel using bidirectional self-attention, while the decoder generates output tokens autoregressively using masked self-attention and cross-attention over the encoder output.
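A minimal sketch of one encoder layer in PyTorch, showing the sublayer pattern just described. The dimensions (d_model=512, 8 heads, d_ff=2048) follow the original paper; the class and parameter names are illustrative, and nn.MultiheadAttention stands in for a hand-rolled attention implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head self-attention sublayer.
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        # Position-wise feed-forward sublayer.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection around attention, then layer normalization
        # (post-norm ordering, as in the original paper).
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection around the feed-forward network.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 16, 512)        # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)     # torch.Size([2, 16, 512])
```

A decoder layer follows the same pattern but applies a causal mask in its self-attention and inserts a cross-attention sublayer over the encoder output.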
The original transformer uses sinusoidal positional encodings, 6 encoder and 6 decoder layers, 8 attention heads, and 512-dimensional representations. Modern variants include encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures scaled to billions of parameters. Later refinements include pre-norm layer ordering, RMSNorm, grouped-query attention, and FlashAttention for efficient computation on modern hardware.
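Of these refinements, RMSNorm is compact enough to sketch: unlike LayerNorm it skips mean-centering and the bias term, normalizing by the root-mean-square of the activations alone. The class name and epsilon value below are illustrative choices, not taken from a specific library.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # learned gain
        self.eps = eps

    def forward(self, x):
        # Normalize each vector by its root-mean-square, then rescale.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```

In pre-norm ordering, normalization is applied before each sublayer (x = x + sublayer(norm(x))) rather than after it, which tends to stabilize training in deep stacks.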