Efficient Attention: How Models Read Hour-Long Video and Book-Length Documents

The Quadratic Wall

Give an agent an hour of video or a quarter of financial filings and the model has to hold all of it in attention at once to answer a question that spans the whole thing. That is where standard attention breaks. Self-attention compares every token with every other token, so its cost grows with the square of the sequence length: double the context and you roughly quadruple the compute and the memory. A page of text is fine. A two-hour video, where every sampled frame contributes dozens to hundreds of tokens, is not.

This is the single biggest reason "just make the context window bigger" was hard for so long, and why the 2025 to 2026 wave of million-token multimodal models is really a wave of attention engineering. This guide is about the mechanics: what attention actually costs, and the four families of tricks that get around it. The concepts are vendor-neutral. Mixpeek appears at the end as one place these models run in production.

What Attention Actually Costs

For a sequence of n tokens with dimension d, attention computes three projections (queries Q, keys K, values V), then scores every query against every key: the matrix Q times K-transpose is n by n. Softmax that, multiply by V, done. Two costs matter:

Compute: the Q times K-transpose step is O(n squared times d). At 100K tokens that matrix has 10 billion entries per head per layer.

Memory: you materialize (or at least stream) an n by n score matrix. FlashAttention avoids storing the full matrix by tiling, which saves memory bandwidth, but the *compute* is still quadratic.
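To put the two costs above into numbers, here is a minimal back-of-the-envelope sketch. The function name, the head dimension of 128, and the 2-byte score size are illustrative assumptions, not figures from any particular model.

```python
def attention_cost(n_tokens: int, head_dim: int, bytes_per_score: int = 2) -> dict:
    """Rough per-head, per-layer cost of dense attention (illustrative only)."""
    # Size of the Q times K-transpose score matrix.
    score_entries = n_tokens * n_tokens
    # One multiply-add per (query, key, feature) triple, counted as 2 FLOPs.
    qk_flops = 2 * n_tokens * n_tokens * head_dim
    # Bytes needed if the whole score matrix were materialized at once.
    score_bytes = score_entries * bytes_per_score
    return {
        "score_entries": score_entries,
        "qk_flops": qk_flops,
        "score_bytes": score_bytes,
    }


small = attention_cost(1_000, head_dim=128)
large = attention_cost(100_000, head_dim=128)

print(large["score_entries"])                    # 10000000000
print(large["qk_flops"] // small["qk_flops"])    # 10000
```

A context 100 times longer costs 10,000 times more in the score step, which is the quadratic wall in one line of arithmetic.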

The KV cache, the keys and values you keep around for autoregressive decoding, grows only linearly with context, so people sometimes think long context is cheap. It is not: each new token still attends over every previous one, so a single decode step costs time linear in the context, and generating a long sequence is quadratic in total. Multimodal input makes it worse, because images and video frames turn into many tokens fast. Cutting the attention cost is the whole game.
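The gap between linear cache growth and quadratic total decode work can be sketched as follows. The model shape used here (32 layers, 8 key-value heads, head dimension 128, 2 bytes per value) is a made-up configuration for illustration.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values; grows linearly with n_tokens."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def decode_key_reads(prompt_len, new_tokens):
    """Total keys visited while generating new_tokens after a prompt."""
    # Generated token t attends over the prompt, earlier outputs, and itself.
    return sum(prompt_len + t + 1 for t in range(new_tokens))


# Hypothetical model shape, chosen only to make the arithmetic concrete.
cache = kv_cache_bytes(100_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(cache)                          # 13107200000 bytes, about 13 GB

print(decode_key_reads(0, 1_000))     # 500500
print(decode_key_reads(0, 2_000))     # 2001000, roughly four times as many
```

Doubling the context doubles the cache but roughly quadruples the total number of keys read during generation.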

Family 1: Fixed Sparsity, Do Not Attend to Everything

The simplest idea: most tokens do not need to see most other tokens, so restrict the pattern.

Sliding-window (local) attention. Each token attends only to a fixed window of neighbors, say 512 on each side. Cost drops to O(n times w), linear in sequence length. Stacking layers lets information propagate beyond the window (like a CNN's growing receptive field). The cost: no direct long-range link in a single layer.
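As a rough illustration of the window pattern just described, the sketch below counts how many query-key pairs a sliding window touches and how far information can travel after stacking layers. Both helper names are hypothetical, and the 12-layer depth is an assumed example.

```python
def sliding_window_pairs(n, w):
    """Count (query, key) pairs when each token sees up to w neighbors per side."""
    total = 0
    for i in range(n):
        lo = max(0, i - w)          # clip the window at the start of the sequence
        hi = min(n - 1, i + w)      # clip the window at the end of the sequence
        total += hi - lo + 1
    return total


def receptive_field(w, n_layers):
    """Tokens reachable from one position after n_layers of window attention."""
    # Each layer extends the reach by w tokens on each side.
    return 2 * w * n_layers + 1


n = 10_000
print(sliding_window_pairs(n, 512))   # grows like n * (2w + 1), not n * n
print(n * n)                          # dense attention for comparison
print(receptive_field(512, 12))       # 12289
```

The pair count is linear in n for a fixed w, while the reach grows only by one window per layer, which is why a direct long-range link needs something extra.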

Dilated / strided attention. Skip tokens at fixed intervals so a small window still spans a long range, at lower resolution. Good for covering distance cheaply.

Block-sparse with global tokens (Longformer, BigBird). Combine local windows with a few global tokens that attend to and are attended by everyone (think a summary or a task token); BigBird also adds random links. The BigBird authors showed that local plus global plus random attention keeps the key theoretical properties of full attention (it remains a universal approximator of sequence functions and is Turing complete) while staying linear. The global tokens are the trick that preserves long-range recall a pure window loses.
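A toy version of the local-plus-global pattern might look like this. It is a sketch of the idea only: random links are left out, and the function name, sequence length, and choice of token 0 as the global token are assumptions for illustration, not the Longformer or BigBird implementation.

```python
def allowed_keys(i, n, window, global_ids):
    """Keys that query i may attend to under a local-plus-global pattern."""
    if i in global_ids:
        return set(range(n))                 # a global token attends to everyone
    local = set(range(max(0, i - window), min(n, i + window + 1)))
    return local | set(global_ids)           # everyone attends to the global tokens


n, window, global_ids = 16, 1, {0}
pattern = [allowed_keys(i, n, window, global_ids) for i in range(n)]

sparse_pairs = sum(len(keys) for keys in pattern)
print(sparse_pairs, n * n)                   # 74 256

# Tokens 3 and 12 have no direct link, but both connect through token 0,
# so information can pass between them in two layers.
print(12 in pattern[3], 0 in pattern[3], 12 in pattern[0])   # False True True
```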

Fixed sparsity is cheap and predictable, but the pattern is chosen by hand, not by the content.

Family 2: Learned / Dynamic Sparsity

Instead of a fixed mask, let the model decide which tokens are worth attending to per query. Approaches route each query to a small set of relevant keys (top-k selection, hashing into buckets, or a lightweight scorer that prunes the candidate set before the expensive step). The 2026 long-context multimodal models lean on this: MiniMax Sparse Attention, for example, reduces per-token attention compute to roughly one twentieth of dense attention and reports large prefill and decode speedups at a one-million-token context. Dynamic sparsity adapts to the input (a dense table and a blank margin get different budgets) at the cost of the routing logic itself and less predictable memory.
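Top-k selection, the simplest of the routing ideas above, can be sketched in a few lines. This is a generic illustration and not MiniMax's method. It also scores every key exactly before pruning, so it shows the selection logic rather than the speedup; a real system would replace that scoring pass with a cheaper approximate scorer. The usual scaling by the square root of the dimension is omitted for brevity.

```python
import math


def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest dot-product score."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Keep the indices of the k best-scoring keys for this query.
    keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
    # Softmax restricted to the kept keys (shifted by the max for stability).
    peak = max(scores[j] for j in keep)
    weights = {j: math.exp(scores[j] - peak) for j in keep}
    norm = sum(weights.values())
    dim = len(values[0])
    out = [sum(weights[j] * values[j][c] for j in keep) / norm for c in range(dim)]
    return out, sorted(keep)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [9.0, 9.0]]

out, kept = topk_attention(query, keys, values, k=2)
print(kept)   # [0, 2]
```

Only the values at the kept indices contribute to the output, so the pruned keys cost nothing in the softmax and value steps.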

Family 3: Linear Attention, Change the Math

Fixed and dynamic sparsity keep the softmax and prune the pairs. Linear attention removes the quadratic term algebraically. Replace the softmax with a kernel feature map phi so that attention is approximately phi(Q) times (phi(K)-transpose times V), with a per-row normalizer. Because matrix multiplication is associative, you compute phi(K)-transpose times V first, a d by d matrix, then multiply by phi(Q). That reorders the cost from O(n squared times d) to O(n times d squared), which is linear in sequence length.
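The reordering can be checked numerically. The sketch below computes kernelized attention in both orders and confirms they agree. The feature map elu(x) + 1 is an assumed choice (one common option, picked because it is always positive), the per-row normalizer is included, and the plain-Python matrix helpers are there only to keep the example dependency-free.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def phi(m):
    """Assumed feature map elu(x) + 1, applied elementwise."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def quadratic_order(q, k, v):
    scores = matmul(phi(q), transpose(phi(k)))        # n by n
    out = matmul(scores, v)                            # n by d
    return [[x / sum(s) for x in row] for row, s in zip(out, scores)]


def linear_order(q, k, v):
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)                      # d by d, independent of n
    k_sum = [sum(col) for col in zip(*fk)]             # length d, for the normalizer
    out = matmul(fq, kv)                               # n by d
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(out, norms)]


q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
k = [[1.0, 0.0], [-0.5, 2.0], [0.3, -0.7]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

a, b = quadratic_order(q, k, v), linear_order(q, k, v)
print(all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))  # True
```

The second version never builds an n by n matrix, which is where the linear scaling comes from.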

The recurrent cousins push this further. State-space models (the Mamba line) carry a fixed-size hidden state that summarizes everything seen so far and update it token by token, so memory does not grow with context at all. The tradeoff across all of these is the same: exact, long-range, needle-in-a-haystack recall gets weaker, because you are compressing the past into a bounded representation instead of keeping every key. Linear attention is superb for streaming and throughput, shakier when the answer depends on one exact token far away.
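The fixed-size-state idea can be illustrated with the recurrent form of causal linear attention. To be clear about scope: this is not Mamba's selective state-space update, only a simpler relative that shares the key property that the state stays the same size however many tokens have been seen. The class name and the elu(x) + 1 feature map are assumptions for the sketch, and values are assumed to have the same dimension as keys.

```python
import math


def phi(x):
    """Assumed feature map elu(x) + 1 on a single vector."""
    return [a + 1.0 if a > 0 else math.exp(a) for a in x]


class RecurrentLinearAttention:
    """Streams tokens one at a time while keeping a d by d state."""

    def __init__(self, dim):
        self.state = [[0.0] * dim for _ in range(dim)]   # running sum of phi(k) v^T
        self.norm = [0.0] * dim                           # running sum of phi(k)

    def step(self, q, k, v):
        fq, fk = phi(q), phi(k)
        # Fold the new key and value into the bounded summary of the past.
        for a in range(len(fk)):
            self.norm[a] += fk[a]
            for b in range(len(v)):
                self.state[a][b] += fk[a] * v[b]
        # Read out for the current query; cost does not depend on tokens seen.
        denom = sum(x * z for x, z in zip(fq, self.norm))
        return [
            sum(fq[a] * self.state[a][b] for a in range(len(fq))) / denom
            for b in range(len(v))
        ]


layer = RecurrentLinearAttention(dim=2)
first = layer.step([0.1, 0.2], [0.3, -0.4], [1.0, 2.0])
for t in range(1_000):
    layer.step([0.1, 0.2], [0.01 * t, -0.5], [float(t), 1.0])

print(len(layer.state), len(layer.state[0]))   # 2 2
```

After a thousand tokens the state is still 2 by 2. Every earlier key has been summed into it, which is also why recovering one exact token from far back becomes harder.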

Family 4: The Hybrid Stack (What Ships)

No single trick wins, so production long-context models mix them. A common recipe: most layers use sparse or linear attention for throughput, and a few layers keep full (or near-full) attention to preserve global, exact recall. Interleaving cheap-but-lossy layers with occasional expensive-but-precise ones gives you long context without paying quadratic cost everywhere, and without losing the ability to pull a specific fact from far away. When you read that a model "supports 1M context," it is almost always a hybrid like this under the hood, not dense attention over a million tokens.
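To get a feel for the trade, the sketch below compares a hypothetical hybrid schedule against a fully dense stack by counting query-key pairs. The 24-layer depth, the one-in-six ratio of full layers, and the 2048-token window are invented numbers, not the recipe of any shipped model.

```python
def layer_pairs(kind, n, window):
    """Approximate (query, key) pairs touched by one layer."""
    if kind == "full":
        return n * n
    if kind == "window":
        return n * min(n, 2 * window + 1)    # upper bound, ignores edge clipping
    raise ValueError(f"unknown layer kind: {kind}")


def hybrid_schedule(n_layers, full_every):
    """Every full_every-th layer keeps full attention; the rest use a window."""
    return ["full" if (i + 1) % full_every == 0 else "window" for i in range(n_layers)]


schedule = hybrid_schedule(n_layers=24, full_every=6)
n, window = 1_000_000, 2048

hybrid = sum(layer_pairs(kind, n, window) for kind in schedule)
dense = 24 * n * n

print(schedule.count("full"), round(hybrid / dense, 3))   # 4 0.17
```

Under these assumptions almost all of the remaining cost sits in the four full layers, so the number of full layers is the main knob in such a design, and real systems often make even those layers sparse or otherwise cheaper.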

Why This Matters for Agent Perception

An agent that has to *understand* a whole thing, not just retrieve snippets from it, needs the context in one pass:

Whole-clip video reasoning. "Did the same person appear in the intro and the outro" or "summarize how the argument evolves over the hour" are questions no single 8-second window can answer. Efficient attention is what lets a model hold the whole clip.

Cross-document coherence. Reasoning over a full contract or a quarter of filings needs links within and across documents, and chunking severs them.

Streaming perception. For live or very long input, linear and state-space attention give constant per-token cost, so an agent can keep watching without the cost blowing up.

But sparsity is lossy, and this is the key caveat: the more you compress attention, the more you risk missing an exact long-range detail. That is exactly why long context does not kill retrieval.

Long Context vs Retrieval: Complements, Not Rivals

A recurring mistake is treating a bigger context window as a replacement for search. They solve different problems. Long context handles coherence within a bounded amount of content the model sees at once. Retrieval handles scale across a corpus far larger than any window, and it gives you exact, auditable recall of specific items. The robust pattern for agents is both: use retrieval to pull the right documents or clips from a large index, then use a long-context multimodal model to reason over that focused set in one pass. Efficient attention makes the second step affordable; retrieval makes the first step possible.
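The retrieve-then-reason pattern can be sketched as a two-step helper: rank a corpus by embedding similarity, then pack the best items into a fixed token budget for a single long-context call. Everything here is hypothetical (the dictionary keys, the toy two-dimensional embeddings, the budget), embeddings are assumed to be non-zero vectors, and the model call itself is left out because it depends on the provider.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_then_pack(query_vec, corpus, top_k, token_budget):
    """Pick the top_k most similar items that fit within token_budget."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    picked, used = [], 0
    for item in ranked[:top_k]:
        # Skip anything that would overflow the context window.
        if used + item["n_tokens"] <= token_budget:
            picked.append(item["id"])
            used += item["n_tokens"]
    return picked, used


corpus = [
    {"id": "a", "vec": [1.0, 0.0], "n_tokens": 400},
    {"id": "b", "vec": [0.0, 1.0], "n_tokens": 300},
    {"id": "c", "vec": [0.9, 0.1], "n_tokens": 700},
    {"id": "d", "vec": [0.7, 0.7], "n_tokens": 200},
]

print(retrieve_then_pack([1.0, 0.0], corpus, top_k=3, token_budget=1_000))
# (['a', 'd'], 600)
```

Item "c" ranks second but does not fit next to "a" in the budget, so it is skipped. The selected items would then go to the long-context model in one pass.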

Where This Lands in Practice: Mixpeek

Efficient attention is why "understand this whole video" is now a single model call instead of a chunking pipeline, and it is how Mixpeek's scene-understanding extractors handle long clips. Point a Managed collection at a bucket of video and Mixpeek can run a long-context multimodal model, for example MiniMax-M3 with its million-token sparse-attention context, to produce whole-clip descriptions and answers rather than isolated per-frame captions. Those descriptions and embeddings are then indexed, so retrieval scales the perception layer past any single model's context window, which is the complements-not-rivals point in production form.

If you are building the pieces around this, the companion guides on long-context video understanding, video RAG, and multi-stage retrieval cover the retrieval half. The short version: make attention cheap enough to read the whole thing, then index the result so you never have to read all of it at once.
