An attention mechanism where queries come from one input and keys/values come from another, enabling information exchange between different sequences or modalities. Cross-attention is one of the main mechanisms for fusing information in multimodal AI models.
Cross-attention computes attention between two different inputs: the query input attends to the key/value input. For example, in a multimodal model, text tokens (queries) attend to image patch features (keys/values) to incorporate visual information into text processing. This lets the query modality selectively extract relevant information from the other modality based on learned compatibility patterns. The flow is one-directional within a single layer; for information to flow the other way, a second cross-attention layer with the roles swapped is needed.
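The text-attends-to-image example above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the implementation of any particular model: the function name `cross_attention`, the random projection matrices, and all sizes (4 text tokens, 9 image patches) are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_input, context_input, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper for illustration)."""
    q = query_input @ w_q        # (n, d_k): queries from the first input
    k = context_input @ w_k      # (m, d_k): keys from the second input
    v = context_input @ w_v      # (m, d_v): values from the second input
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, m) compatibility scores
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 16))     # 4 text tokens, width 16 (made up)
image_patches = rng.normal(size=(9, 32))   # 9 image patches, width 32 (made up)

# Random stand-ins for projection matrices that would normally be learned.
w_q = rng.normal(size=(16, 8))
w_k = rng.normal(size=(32, 8))
w_v = rng.normal(size=(32, 8))

output, weights = cross_attention(text_tokens, image_patches, w_q, w_k, w_v)
print(output.shape)   # (4, 8): one fused vector per text token
print(weights.shape)  # (4, 9): one weight per (text token, image patch) pair
```

Note that the two inputs may have different lengths and different feature widths; only the projected query and key dimensions have to match.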
Cross-attention uses the standard scaled dot-product attention formula, softmax(Q*K^T / sqrt(d_k)) * V, but Q is projected from one input while K and V are projected from another. In vision-language models like Flamingo and BLIP-2, cross-attention layers are interleaved with self-attention layers. The Q-Former (BLIP-2) uses learnable query tokens that cross-attend to frozen image features. Cross-attention cost is O(n*m), where n is the length of the query input and m is the length of the key/value input.
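The learnable-query idea and the O(n*m) cost can be made concrete with a small sketch. It is only loosely inspired by the Q-Former description above and is not BLIP-2's code: the sizes are made up, the "learned" queries and projection matrices are random stand-ins for trained parameters, and the "frozen image features" are random stand-ins for an encoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, num_patches, d_model = 8, 50, 16   # illustrative sizes only

# Stand-ins: in a real model the queries are trained parameters and the
# image features come from a frozen image encoder.
learned_queries = rng.normal(size=(num_queries, d_model))
frozen_image_features = rng.normal(size=(num_patches, d_model))

# Random stand-ins for the learned Q, K, V projection matrices.
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

q = learned_queries @ w_q            # (n, d_model)
k = frozen_image_features @ w_k      # (m, d_model)
v = frozen_image_features @ w_v      # (m, d_model)

# The score matrix has one entry per (query, key) pair, hence O(n*m).
scores = q @ k.T / np.sqrt(d_model)               # (n, m)
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the m keys

summary = weights @ v                # (n, d_model)

print(scores.shape, scores.size)     # (8, 50) 400
print(summary.shape)                 # (8, 16)
```

Because the number of query tokens is fixed, the output has the same size no matter how many image patches are attended to, and the score matrix grows only linearly with the number of patches.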