A neural network mechanism that dynamically computes relevance weights between input elements, allowing models to focus on the most informative parts. Attention is the core building block of transformer models that power modern multimodal AI.
Attention computes a weighted sum of value vectors, where the weights are determined by the compatibility between a query vector and a set of key vectors. For each query, the mechanism scores every key to determine how much attention to pay to each corresponding value. This lets the model dynamically focus on different parts of the input depending on context, rather than processing all information equally.
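To make the query/key/value picture concrete, here is a minimal NumPy sketch of a single query attending over a handful of positions; the array names and sizes are purely illustrative, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: one query vector and six key/value pairs.
d_k = 4                              # dimensionality of queries and keys
query = np.random.randn(d_k)         # (d_k,)
keys = np.random.randn(6, d_k)       # (6 positions, d_k)
values = np.random.randn(6, 8)       # (6 positions, value dim 8)

scores = keys @ query                # compatibility of the query with each key
weights = softmax(scores)            # normalized attention weights, sum to 1
output = weights @ values            # weighted sum of the value vectors
```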
Scaled dot-product attention computes scores as Q*K^T/sqrt(d_k), applies softmax normalization, and uses the resulting weights to combine V. Multi-head attention runs several attention operations in parallel with different learned projections, letting each head capture a different type of relationship. Self-attention derives queries, keys, and values from the same input sequence. Cross-attention takes queries from one source and keys/values from another, enabling multimodal fusion (for example, text tokens attending to image features).
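The sketch below, again using illustrative NumPy arrays and hypothetical weight-matrix names rather than a real library API, wires these pieces together: scaled dot-product attention, a simple multi-head wrapper, and that wrapper used once as self-attention (text attending to text) and once as cross-attention (text queries over image features).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_kv) compatibility scores
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V                   # (n_q, d_v) weighted sums of values

def multi_head_attention(x_q, x_kv, num_heads, W_q, W_k, W_v, W_o):
    # W_q/W_k/W_v are lists of per-head learned projections, W_o the output
    # projection (hypothetical names for this sketch, not a real API).
    heads = []
    for h in range(num_heads):
        Q = x_q @ W_q[h]
        K = x_kv @ W_k[h]
        V = x_kv @ W_v[h]
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_o  # concatenate heads, project back

d_model, d_head, num_heads = 16, 8, 2
rng = np.random.default_rng(0)
text = rng.standard_normal((5, d_model))   # 5 text token embeddings
W_q = [rng.standard_normal((d_model, d_head)) for _ in range(num_heads)]
W_k = [rng.standard_normal((d_model, d_head)) for _ in range(num_heads)]
W_v = [rng.standard_normal((d_model, d_head)) for _ in range(num_heads)]
W_o = rng.standard_normal((num_heads * d_head, d_model))

# Self-attention: queries, keys, and values all come from the same sequence.
self_attn = multi_head_attention(text, text, num_heads, W_q, W_k, W_v, W_o)

# Cross-attention: queries from text, keys/values from another modality.
image = rng.standard_normal((9, d_model))  # e.g. 9 image patch features
cross_attn = multi_head_attention(text, image, num_heads, W_q, W_k, W_v, W_o)
```

Production implementations add masking, dropout, and batching and fuse these steps for efficiency, but the data flow is the same as in this sketch.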