The Quadratic Wall
Give an agent an hour of video or a quarter of financial filings and the model has to hold all of it in attention at once to answer a question that spans the whole thing. That is where standard attention breaks. Self-attention compares every token with every other token, so its cost grows with the square of the sequence length: double the context and you roughly quadruple the compute and the memory. A page of text is fine. A two-hour video, where every sampled frame contributes dozens to hundreds of tokens, is not.
This is the single biggest reason "just make the context window bigger" was hard for so long, and why the 2025 to 2026 wave of million-token multimodal models is really a wave of attention engineering. This guide is about the mechanics: what attention actually costs, and the four families of tricks that get around it. The concepts are vendor-neutral. Mixpeek appears at the end as one place these models run in production.
What Attention Actually Costs
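For a sequence of n tokens with dimension d, attention computes three projections (queries Q, keys K, values V), then scores every query against every key: the matrix Q times K-transpose is n by n. Softmax that, multiply by V, done. Two costs matter: compute, which grows as O(n squared times d) because both the scoring step and the weighted sum touch every query-key pair, and memory, which grows as O(n squared) whenever the n by n score matrix is materialized.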
The KV cache, the keys and values you keep around for autoregressive decoding, grows only linearly with context, so people sometimes think long context is cheap. It is not: every new token still attends over all previous ones, so the work per decoded token grows linearly with context and the total across a generation is quadratic. Multimodal makes it worse, because images and video frames turn into many tokens fast. Cutting the attention cost is the whole game.
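To make the scaling concrete, here is a rough back-of-the-envelope sketch. The function names, the head dimension of 128, the 2-byte element size, and the frame and tokens-per-frame figures are all illustrative assumptions, not measurements from any particular model.

```python
def attention_flops(n: int, d: int) -> int:
    """Approximate multiply-adds for one dense attention head:
    n*n*d for Q times K-transpose, plus n*n*d for weights times V."""
    return 2 * n * n * d


def score_matrix_bytes(n: int, bytes_per_element: int = 2) -> int:
    """Memory for a fully materialized n-by-n score matrix."""
    return n * n * bytes_per_element


def kv_cache_bytes(n: int, d: int, bytes_per_element: int = 2) -> int:
    """Keys plus values for one head in one layer: linear in n."""
    return 2 * n * d * bytes_per_element


def video_tokens(frames: int, tokens_per_frame: int) -> int:
    """Sequence length contributed by sampled video frames."""
    return frames * tokens_per_frame


if __name__ == "__main__":
    d = 128  # assumed head dimension
    short = video_tokens(frames=60, tokens_per_frame=100)
    two_hours = video_tokens(frames=7200, tokens_per_frame=100)  # 1 frame/s
    for n in (short, 2 * short, two_hours):
        print(n, attention_flops(n, d), score_matrix_bytes(n), kv_cache_bytes(n, d))
```

Doubling n quadruples the first two quantities and only doubles the KV cache, which is the gap between "the cache is linear" and "attention is cheap."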
Family 1: Fixed Sparsity, Do Not Attend to Everything
The simplest idea: most tokens do not need to see most other tokens, so restrict the pattern.
Fixed sparsity is cheap and predictable, but the pattern is chosen by hand, not by the content.
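One widely used fixed pattern is a causal sliding window, where each token attends only to itself and the few tokens just before it. The sketch below is illustrative: the function names and window size are assumptions, and the mask is built explicitly only for clarity (an efficient kernel would skip the masked pairs rather than store them).

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Causal local pattern: token i sees itself and the previous
    `window - 1` tokens, regardless of what the tokens contain.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the query-key pairs the pattern actually scores."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, window = 16, 4
    mask = sliding_window_mask(n, window)
    dense_causal_pairs = n * (n + 1) // 2
    print(attended_pairs(mask), "pairs instead of", dense_causal_pairs)
```

The number of scored pairs grows roughly as n times the window size instead of n squared, and the mask is identical for every input, which is both the appeal and the limitation.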
Family 2: Learned / Dynamic Sparsity
Instead of a fixed mask, let the model decide which tokens are worth attending to per query. Approaches route each query to a small set of relevant keys (top-k selection, hashing into buckets, or a lightweight scorer that prunes the candidate set before the expensive step). The 2026 long-context multimodal models lean on this: MiniMax Sparse Attention, for example, reduces per-token attention compute to roughly one twentieth of dense attention and reports large prefill and decode speedups at a one-million-token context. Dynamic sparsity adapts to the input (a dense table and a blank margin get different budgets) at the cost of the routing logic itself and less predictable memory.
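A toy version of top-k selection for a single query can be sketched as follows. This is not any vendor's implementation: the function names are hypothetical, and for simplicity the selection step scores every key exactly, whereas a practical router would use something cheaper to choose candidates.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys.

    Returns the output vector and the sorted indices that were kept.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    # Indices of the k largest scores: the keys this query keeps.
    keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
    # Softmax runs over the kept keys only, so the rest get zero weight.
    weights = softmax([scores[j] for j in keep])
    dim = len(values[0])
    out = [sum(w * values[j][c] for w, j in zip(weights, keep)) for c in range(dim)]
    return out, sorted(keep)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 2.0]]
    print(topk_attention([1.0, 0.0], keys, values, k=2))
```

The kept set changes with the query, which is what distinguishes this from a fixed mask. It is also why memory access becomes harder to predict.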
Family 3: Linear Attention, Change the Math
Fixed and dynamic sparsity keep the softmax and prune the pairs. Linear attention removes the quadratic term algebraically. Replace the softmax with a kernel feature map phi so that attention is approximately phi(Q) times (phi(K)-transpose times V). Because matrix multiplication is associative, you compute phi(K)-transpose times V first, a d by d matrix (assuming phi keeps the dimension at d), then multiply phi(Q) by it. That reorders the cost from O(n squared times d) to O(n times d squared), which is linear in sequence length.
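The reordering can be checked numerically on small matrices. The sketch below makes two assumptions that are not in the text above: the feature map is elu(x) + 1, one common choice that keeps entries positive, and each output row is divided by a normalizer that stands in for the softmax denominator. It covers the non-causal case only.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def phi(m):
    """Assumed feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def kernel_attention_quadratic(q, k, v):
    """(phi(Q) phi(K)^T) V with row normalization: builds an n-by-n matrix."""
    fq, fk = phi(q), phi(k)
    weights = matmul(fq, transpose(fk))   # n x n
    out = matmul(weights, v)              # n x d
    return [[x / sum(w_row) for x in o_row] for o_row, w_row in zip(out, weights)]


def kernel_attention_linear(q, k, v):
    """phi(Q) (phi(K)^T V) with the same normalization: no n-by-n matrix."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)             # d x d summary of all keys/values
    k_sum = [sum(col) for col in zip(*fk)]    # length-d normalizer
    out = matmul(fq, kv)                      # n x d
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in o_row] for o_row, z in zip(out, norms)]


if __name__ == "__main__":
    q = [[0.5, -1.0], [1.0, 2.0], [-0.3, 0.2]]
    k = [[0.1, 0.4], [-0.7, 0.9], [1.5, -0.2]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(kernel_attention_quadratic(q, k, v))
    print(kernel_attention_linear(q, k, v))
```

Both functions return the same values up to floating-point error. Only the second avoids the n by n intermediate, and its key-value summary stays d by d no matter how long the sequence gets.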
The recurrent cousins push this further. State-space models (the Mamba line) carry a fixed-size hidden state that summarizes everything seen so far and update it token by token, so memory does not grow with context at all. The tradeoff across all of these is the same: exact, long-range, needle-in-a-haystack recall gets weaker, because you are compressing the past into a bounded representation instead of keeping every key. Linear attention is superb for streaming and throughput, shakier when the answer depends on one exact token far away.
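The fixed-size-state idea can be illustrated with the recurrent form of causal linear attention. To be clear about scope, this is not Mamba's selective state-space update. It is a simpler running-sum recurrence, with a hypothetical class name and an assumed elu(x) + 1 feature map, shown only to make "memory does not grow with context" concrete.

```python
import math


def phi(x):
    """Assumed feature map elu(x) + 1, applied to one vector."""
    return [t + 1.0 if t > 0 else math.exp(t) for t in x]


class RunningState:
    """Fixed-size summary of every (key, value) pair seen so far."""

    def __init__(self, d_k: int, d_v: int):
        self.s = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
        self.z = [0.0] * d_k                        # running sum of phi(k)

    def step(self, q, k, v):
        """Fold one token into the state, then read out for its query."""
        fq, fk = phi(q), phi(k)
        for a, fk_a in enumerate(fk):
            self.z[a] += fk_a
            for b, v_b in enumerate(v):
                self.s[a][b] += fk_a * v_b
        norm = sum(fq_a * z_a for fq_a, z_a in zip(fq, self.z))
        return [
            sum(fq[a] * self.s[a][b] for a in range(len(fq))) / norm
            for b in range(len(v))
        ]


if __name__ == "__main__":
    state = RunningState(d_k=2, d_v=2)
    tokens = [([0.1, 0.2], [0.3, -0.4], [1.0, 2.0]),
              ([0.5, 0.1], [-0.2, 0.6], [3.0, 0.0])]
    for q, k, v in tokens:
        print(state.step(q, k, v))
```

The state holds d_k times d_v numbers plus a length-d_k normalizer after one token or one million. That is also why an individual key can no longer be looked up exactly: every key has been summed into the same small matrix.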
Family 4: The Hybrid Stack (What Ships)
No single trick wins, so production long-context models mix them. A common recipe: most layers use sparse or linear attention for throughput, and a few layers keep full (or near-full) attention to preserve global, exact recall. Interleaving cheap-but-lossy layers with occasional expensive-but-precise ones gives you long context without paying quadratic cost everywhere, and without losing the ability to pull a specific fact from far away. When you read that a model "supports 1M context," it is almost always a hybrid like this under the hood, not dense attention over a million tokens.
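The layer-mixing idea can be sketched as a schedule plus a rough cost ratio. Everything here is hypothetical: the "every k-th layer is full" rule, the function names, and the use of query-key pair counts as a stand-in for real cost are assumptions chosen for illustration, not a description of any shipped model.

```python
def hybrid_schedule(num_layers: int, full_every: int) -> list[str]:
    """Mark every `full_every`-th layer as full attention, the rest as cheap."""
    return ["full" if (i + 1) % full_every == 0 else "cheap" for i in range(num_layers)]


def relative_attention_cost(schedule: list[str], n: int, cheap_budget: int) -> float:
    """Attention cost relative to an all-dense stack of the same depth.

    Per-layer cost is counted in query-key pairs: n * n for a full layer,
    n * cheap_budget for a layer where each token attends to (or is
    summarized through) roughly `cheap_budget` items.
    """
    dense = len(schedule) * n * n
    mixed = sum(
        n * n if kind == "full" else n * min(cheap_budget, n)
        for kind in schedule
    )
    return mixed / dense


if __name__ == "__main__":
    schedule = hybrid_schedule(num_layers=8, full_every=4)
    print(schedule)
    for n in (1_000, 100_000, 1_000_000):
        print(n, relative_attention_cost(schedule, n, cheap_budget=100))
```

As n grows, the few full layers dominate the total, so under these assumptions the ratio approaches the fraction of layers that are full. That is one reason the full layers are kept sparse in the stack.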
Why This Matters for Agent Perception
An agent that has to *understand* a whole thing, not just retrieve snippets from it, needs the context in one pass.
But sparsity is lossy, and this is the key caveat: the more you compress attention, the more you risk missing an exact long-range detail. That is exactly why long context does not kill retrieval.
Long Context vs Retrieval: Complements, Not Rivals
A recurring mistake is treating a bigger context window as a replacement for search. They solve different problems. Long context handles coherence within a bounded amount of content the model sees at once. Retrieval handles scale across a corpus far larger than any window, and it gives you exact, auditable recall of specific items. The robust pattern for agents is both: use retrieval to pull the right documents or clips from a large index, then use a long-context multimodal model to reason over that focused set in one pass. Efficient attention makes the second step affordable; retrieval makes the first step possible.
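The retrieve-then-read pattern reduces to a selection step in front of the model call. The sketch below assumes an upstream retrieval stage has already produced relevance scores and token counts. The function name, the tuple layout, and the greedy packing rule are illustrative choices, not a prescribed method.

```python
def retrieve_then_pack(scored_items, token_budget):
    """Pick the highest-scoring items that fit in the context window.

    scored_items: list of (item_id, relevance_score, token_count) tuples
    produced by whatever retrieval stage sits upstream.
    Returns the chosen ids in rank order and the tokens they consume.
    """
    chosen, used = [], 0
    ranked = sorted(scored_items, key=lambda item: item[1], reverse=True)
    for item_id, _score, tokens in ranked:
        # Greedy: skip anything that would overflow the budget.
        if used + tokens <= token_budget:
            chosen.append(item_id)
            used += tokens
    return chosen, used


if __name__ == "__main__":
    candidates = [("clip_a", 0.9, 600), ("clip_b", 0.8, 500), ("clip_c", 0.5, 300)]
    print(retrieve_then_pack(candidates, token_budget=1000))
```

The long-context model then reads only the packed set in one pass, so the index handles scale and the window handles coherence.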
Where This Lands in Practice: Mixpeek
Efficient attention is why "understand this whole video" is now a single model call instead of a chunking pipeline, and it is how Mixpeek's scene-understanding extractors handle long clips. Point a Managed collection at a bucket of video and Mixpeek can run a long-context multimodal model, for example MiniMax-M3 with its million-token sparse-attention context, to produce whole-clip descriptions and answers rather than isolated per-frame captions. Those descriptions and embeddings are then indexed, so retrieval scales the perception layer past any single model's context window, which is the complements-not-rivals point in production form.
If you are building the pieces around this, the companion guides on long-context video understanding, video RAG, and multi-stage retrieval cover the retrieval half. The short version: make attention cheap enough to read the whole thing, then index the result so you never have to read all of it at once.