An attention mechanism where queries come from one input and keys/values come from another, enabling information exchange between different sequences or modalities. Cross-attention is one of the primary mechanisms for fusing information in multimodal AI models.
Cross-attention computes attention between two different inputs: the query input attends to the key/value input. For example, in a multimodal model, text tokens (queries) attend to image patch features (keys/values) to incorporate visual information into text processing. This lets each modality selectively extract relevant information from the other based on learned compatibility scores.
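A minimal single-head sketch in PyTorch illustrates the asymmetry: the query projection reads from one sequence while the key and value projections read from the other. The tensor names (text_hidden, image_feats) and shapes are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def cross_attention(text_hidden, image_feats, w_q, w_k, w_v):
    """Minimal single-head cross-attention sketch.

    text_hidden: (n, d_model) query-side sequence (e.g. text tokens)
    image_feats: (m, d_model) key/value-side sequence (e.g. image patches)
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = text_hidden @ w_q   # queries come from the text side
    k = image_feats @ w_k   # keys come from the image side
    v = image_feats @ w_v   # values come from the image side
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (n, m) compatibility scores
    weights = F.softmax(scores, dim=-1)            # each text token's distribution over patches
    return weights @ v                             # (n, d_k) text representation fused with visual info

# Illustrative shapes: 12 text tokens, 196 image patches, d_model = d_k = 64
text_hidden = torch.randn(12, 64)
image_feats = torch.randn(196, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = cross_attention(text_hidden, image_feats, w_q, w_k, w_v)
print(out.shape)  # torch.Size([12, 64])
```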
Cross-attention uses the standard scaled dot-product attention formula, softmax(QK^T / sqrt(d_k)) V, except that Q is projected from one input while K and V are projected from the other. In vision-language models like Flamingo and BLIP-2, cross-attention layers are interleaved with self-attention layers. The Q-Former (BLIP-2) uses learnable query tokens that cross-attend to frozen image features. Cross-attention cost scales as O(n*m), where n and m are the lengths of the query and key/value sequences, respectively.
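The learnable-query pattern can be sketched as follows. This is a rough illustration of the Q-Former idea, not BLIP-2's actual architecture; the class name, hyperparameters, and use of nn.MultiheadAttention are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class LearnableQueryCrossAttention(nn.Module):
    """Sketch of the Q-Former idea: a fixed set of learnable query tokens
    cross-attends to frozen image features, compressing a variable-length
    patch sequence into a fixed-length summary."""

    def __init__(self, num_queries=32, d_model=768, num_heads=8):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (batch, m, d_model) from a frozen vision encoder
        batch = image_feats.shape[0]
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Q = learnable tokens, K = V = frozen image features
        out, _ = self.cross_attn(queries, image_feats, image_feats)
        return out  # (batch, num_queries, d_model), independent of m

frozen_feats = torch.randn(2, 196, 768)  # e.g. 196 patch embeddings per image
summary = LearnableQueryCrossAttention()(frozen_feats)
print(summary.shape)  # torch.Size([2, 32, 768])
```

Because the number of query tokens is fixed, the output length stays constant regardless of how many image patches the vision encoder produces, which keeps the downstream language model's input size bounded.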