An attention mechanism where queries come from one input and keys/values come from another, enabling information exchange between different sequences or modalities. Cross-attention is one of the primary mechanisms for fusing information in multimodal AI models.
Cross-attention computes attention between two different inputs: the query input attends to the key/value input. For example, in a multimodal model, text tokens (queries) attend to image patch features (keys/values) to incorporate visual information into text processing. This lets each modality selectively extract relevant information from the other based on learned compatibility scores.
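A minimal single-head sketch in PyTorch illustrates the asymmetry: the query projection reads from one sequence while the key and value projections read from the other. The tensor names (text_hidden, image_feats) and shapes are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def cross_attention(text_hidden, image_feats, w_q, w_k, w_v):
    """Minimal single-head cross-attention sketch.

    text_hidden: (n, d_model) query-side sequence (e.g. text tokens)
    image_feats: (m, d_model) key/value-side sequence (e.g. image patches)
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = text_hidden @ w_q   # queries come from the text side
    k = image_feats @ w_k   # keys come from the image side
    v = image_feats @ w_v   # values come from the image side
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (n, m) compatibility scores
    weights = F.softmax(scores, dim=-1)            # each text token's distribution over patches
    return weights @ v                             # (n, d_k) text representation fused with visual info

# Illustrative shapes: 12 text tokens, 196 image patches, d_model = d_k = 64
text_hidden = torch.randn(12, 64)
image_feats = torch.randn(196, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = cross_attention(text_hidden, image_feats, w_q, w_k, w_v)
print(out.shape)  # torch.Size([12, 64])
```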
Cross-attention uses the standard scaled dot-product attention formula, softmax(QK^T / sqrt(d_k)) V, except that Q is projected from one input while K and V are projected from the other. In vision-language models like Flamingo and BLIP-2, cross-attention layers are interleaved with self-attention layers. The Q-Former (BLIP-2) uses learnable query tokens that cross-attend to frozen image features. Cross-attention cost scales as O(n*m), where n and m are the lengths of the query and key/value sequences, respectively.
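The learnable-query pattern can be sketched as follows. This is a rough illustration of the Q-Former idea, not BLIP-2's actual architecture; the class name, hyperparameters, and use of nn.MultiheadAttention are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class LearnableQueryCrossAttention(nn.Module):
    """Sketch of the Q-Former idea: a fixed set of learnable query tokens
    cross-attends to frozen image features, compressing a variable-length
    patch sequence into a fixed-length summary."""

    def __init__(self, num_queries=32, d_model=768, num_heads=8):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (batch, m, d_model) from a frozen vision encoder
        batch = image_feats.shape[0]
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Q = learnable tokens, K = V = frozen image features
        out, _ = self.cross_attn(queries, image_feats, image_feats)
        return out  # (batch, num_queries, d_model), independent of m

frozen_feats = torch.randn(2, 196, 768)  # e.g. 196 patch embeddings per image
summary = LearnableQueryCrossAttention()(frozen_feats)
print(summary.shape)  # torch.Size([2, 32, 768])
```

Because the number of query tokens is fixed, the output length stays constant regardless of how many image patches the vision encoder produces, which keeps the downstream language model's input size bounded.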