Budget-Aware Multi-Vector Retrieval: How to Make Late Interaction Affordable at Scale

The Problem: Multi-Vector Retrieval Is Accurate and Expensive

Single-vector retrieval compresses an entire item into one point. A document becomes 1024 numbers, an image becomes 1024 numbers, and a video clip becomes 1024 numbers. Search is then one nearest-neighbor lookup per item. It is cheap, and for many queries it is good enough.

It breaks on compositional queries. Ask for "a blue chart on the second page next to a handwritten note" and a single vector for the page has already blurred those three conditions into one average. The information that would let you tell a matching page from a near-miss has been pooled away before the query ever arrives.

Multi-vector retrieval keeps the detail. Instead of one vector per item, it stores many: one per text token, one per image patch, one per video frame. Late interaction then scores a query against an item by matching each query vector to its best counterpart in the item and summing those matches. This is the family that includes ColBERT for text, ColPali and ColQwen for visual documents, and the multimodal encoders that embed text, image patches, and video frames into one shared space.

The accuracy gain is real and well established. The cost is the catch, and it shows up in three places at once:

Storage. A page rendered as an image can produce 1030 patch vectors. At 128 dimensions in float32 that is roughly 527 KB per page. One million pages is about 527 GB of vectors, before any index overhead. The same corpus in single-vector form is a few GB.

Memory and index size. Every one of those vectors lives in the nearest-neighbor index. The index does not grow with your number of items, it grows with your number of vectors, which is hundreds of times larger.

Scoring compute. Late interaction compares every query vector to every candidate vector. A 32-token query against a 1030-vector page is 32 times 1030 dot products for a single item. Multiply by your candidate set and per-query latency climbs fast.

The naive reaction is to give up multi-vector retrieval and go back to single vectors. The better reaction, and the dominant research thread of 2025 and 2026, is to treat the number of vectors as a knob you can turn per collection and per query. That is what this guide teaches.

A Refresher on Late Interaction Scoring

You need the scoring rule in your head before the cost optimizations make sense. Late interaction scores a query against an item with MaxSim:

score(query, item) =
  sum over each query_vector q of (
    max over each item_vector d of ( similarity(q, d) )
  )

In words: for every query vector, find its single best-matching vector in the item, take that similarity, and add up those best matches across all query vectors. A query vector that has no good match in the item contributes almost nothing, so an item is rewarded only when it can cover every part of the query somewhere.

Two facts about this rule drive every cost decision below:

1. The score is dominated by a small number of strong matches. Most item vectors never become the maximum for any query vector. They sit in the index, consume storage, and are scored, but they do not change the answer. 2. The cost is the product of query vector count and item vector count. Shrink either side and you shrink storage, index size, and scoring time together.

So the whole game is: keep the vectors that win MaxSim, drop the ones that never do, and only pay for high resolution when the query needs it.

Lever 1: Token Pooling, Fewer Vectors per Item

The first lever reduces vectors at index time. The observation is that neighboring patches and tokens are highly redundant. Adjacent patches of the same sky, or three subword tokens of one rare word, point in nearly the same direction. Storing all of them is paying full price for near-duplicates.

Token pooling clusters an item's vectors and replaces each cluster with one representative, usually the mean. A common recipe:

1. Take the item's N vectors (for example 1030 patch vectors).
2. Cluster them into K groups by cosine similarity (hierarchical or k-means).
3. Replace each group with its mean vector, L2-normalized.
4. Index the K pooled vectors instead of the original N.

The striking empirical result, reported across visual and text late-interaction work, is that pooling to roughly one half or one third of the vectors costs almost nothing in retrieval quality. MaxSim only ever used the best match in each region, and the pooled centroid of a redundant region is a faithful stand-in for that best match. You delete vectors that were never going to win.

This is training-free. You do not retrain the encoder. You run a clustering pass during ingestion and store fewer, denser vectors. It composes with everything below.

Where pooling can hurt: regions that are small but semantically critical, such as a single line of fine-print disclaimer text on an otherwise empty page, can get averaged into a neighbor and lose their distinct direction. The guard is to pool by similarity rather than by fixed spatial grid, so a visually distinct region forms its own cluster instead of being folded into a large bland one.

Lever 2: Matryoshka Multi-Vector Embeddings

The second lever is structural and it is the idea behind 2026's flexible late-interaction models such as MetaEmbed. Instead of producing a flat bag of equal vectors, the encoder is trained so that the vectors are ordered by importance, coarse to fine.

The training trick borrows from Matryoshka representation learning, which originally nested information inside a single vector so that the first 64 dimensions were a usable embedding on their own, the first 256 were better, and the full vector best of all. Matryoshka multi-vector retrieval lifts that idea from dimensions to whole vectors. A small set of learnable "meta tokens" is appended to the input, and contrastive training is applied in parallel over nested groups of those tokens: the first group must work alone, the first two groups must work together, and so on up to the full set.

The result is an encoder whose output vectors are ranked. The top few vectors carry the gist. Adding more vectors adds resolution. Because the ordering is learned, the first vector is genuinely the most useful one to keep, not an arbitrary token.

Item vectors after Matryoshka training (ordered by importance):
  v1  v2   |  v3 v4   |  v5 v6 v7 v8   |  v9 ... v32
  coarse      add         add               full detail
  (gist)      structure   nuance

Pick a prefix length at query time:
  budget = tight   -&gt; use v1..v2
  budget = normal  -&gt; use v1..v8
  budget = max      -&gt; use v1..v32

This converts the vector count from a fixed property baked in at index time into a dial you set at query time. You index the full set once. You decide per query how many of those ordered vectors to actually load and score.

Lever 3: Flexible Late Interaction at Query Time

With ordered vectors in hand, flexible late interaction is straightforward: run MaxSim using only the first k query vectors against the first m item vectors, where k and m are chosen from your budget rather than fixed.

flex_score(query, item, k, m) =
  sum over first k query_vectors q of (
    max over first m item_vectors d of ( similarity(q, d) )
  )

The accuracy-versus-cost curve is smooth. Small k and m give you most of the quality for a fraction of the compute, and you climb toward full late-interaction accuracy as you raise them. In practice you choose the operating point per workload:

Workload

Vectors used

Why

Autocomplete or typeahead	top 1 to 2	Latency budget is a few milliseconds; coarse match is fine
Interactive agent search	top 8 to 16	Balanced; covers most compositional queries
High-stakes review or recall sweep	full set	Maximum recall matters more than cost

The agent can even pick the budget from the query itself. A short factual lookup runs cheap. A long multi-condition query, or a query the agent flags as high-stakes, runs at full resolution. The retrieval cost now tracks the value of the question instead of being a flat tax on every call.

Lever 4: Two-Tier Retrieve Then Rerank

The three levers above shrink the per-item work. The fourth lever shrinks the candidate set, and it is the pattern you should reach for first because it gives the largest win for the least change.

Full multi-vector scoring over a whole corpus is wasteful because almost every item is obviously irrelevant. The fix is a cheap first stage that produces a short candidate list, followed by expensive late interaction applied only to those candidates.

Stage 1  (cheap, whole corpus):
  single-vector ANN search, OR
  pooled / coarse multi-vector search (top few vectors only)
  -&gt; return top 200 candidates

Stage 2  (expensive, candidates only):
  full or near-full multi-vector late interaction
  rerank the 200 -&gt; return top 10

The first stage can be a pooled single vector per item, a mean of the item's vectors, or a fixed-dimensional encoding that approximates the multi-vector set so a standard nearest-neighbor index can retrieve candidates. Approaches such as MUVERA build exactly such a fixed encoding so the candidate stage stays a normal vector search. The second stage then pays the full late-interaction price on 200 items instead of one million.

Two design rules keep this honest:

The first stage sets your recall ceiling. If a relevant item is not in the 200 candidates, no amount of careful reranking in stage two can recover it. Tune stage-one candidate count by measuring recall at that cutoff, not by guessing.

The second stage sets your precision. This is where the fine-grained MaxSim matching earns its cost, separating the truly matching item from the near-misses the cheap stage could not tell apart.

This two-tier shape is the same skeleton as multi-stage retrieval generally. The multi-vector specific point is that stages one and two can use the same ordered embedding at different prefix lengths: a tiny prefix to retrieve, the full set to rerank.

Putting the Levers Together

The four levers stack, and a sensible default stack looks like this:

1. At ingestion, pool each item's vectors down to roughly one half, using similarity-based clustering so distinct regions survive. This is a permanent storage and index-size win. 2. Train or choose a Matryoshka multi-vector encoder if you can, so the surviving vectors are ordered by importance and the query-time dial is meaningful. 3. At query time, run a cheap first stage over the whole corpus to get a few hundred candidates. 4. Rerank the candidates with flexible late interaction, choosing the prefix length from the query's budget.

The compounding effect is large. Halving vectors at ingestion, retrieving with a coarse prefix, and reranking only a few hundred candidates can drop end-to-end cost by an order of magnitude while keeping you within a point or two of full multi-vector accuracy. That is the difference between multi-vector retrieval being a benchmark result and being something you can afford to run on every agent query.

What to Measure

Do not trust any of these optimizations without a recall check, because every one of them trades quality for cost somewhere.

Recall at the stage-one cutoff. The single most important number. It caps everything downstream.

End-to-end recall and precision at k, with the full uncompressed multi-vector pipeline as your ground-truth reference. Each lever should be evaluated as a delta against that reference.

Cost per query, broken into storage footprint, candidate-stage latency, and rerank-stage latency, so you know which lever to turn when you need to go faster or cheaper.

Tail latency, not just the average. The high-resolution queries are the ones that matter most and also the slowest. Watch the p95.

Pick an operating point by walking the curve: start from full multi-vector accuracy, turn one lever at a time, and stop when the recall delta crosses the line your application can tolerate.

Doing This in Mixpeek

In Mixpeek, the vector budget lives in two places: the collection decides what gets embedded and pooled at ingestion, and the retriever decides how candidates are fetched and reranked at query time. You set the ingestion policy once and every document inherits it, then the retriever stages express the cheap-first, expensive-on-candidates pattern.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

# A collection that keeps multi-vector detail for late interaction,
# but pools vectors at ingestion so the index stays affordable.
collection = client.collections.create(
    namespace_id="ns_docs",
    collection_name="visual-contracts",
    source={"type": "documents"},
    feature_extractors=[
        {
            "feature_extractor_name": "visual_multivector_embedding",
            "parameters": {
                "embedding_model": "colqwen",   # multi-vector visual encoder
                "token_pooling": "similarity",  # cluster + mean redundant patches
                "pooling_factor": 0.5,          # keep about half the vectors
            },
        },
        {
            "feature_extractor_name": "text_chunk_embedding",
            "parameters": {
                "chunk_strategy": "semantic",
                "embedding_model": "dense-text",  # single vector, cheap first stage
            },
        },
    ],
)

The retriever then runs the two-tier plan: a cheap single-vector stage over the whole collection to gather candidates, then a multi-vector late-interaction rerank applied only to those candidates.

# Two-tier: cheap dense candidate fetch, then multi-vector late-interaction rerank.
retriever = client.retrievers.create(
    namespace_id="ns_docs",
    collection_ids=[collection.collection_id],
    retriever_name="visual-contracts-budgeted",
    stages=[
        # Stage 1: cheap, whole corpus. Sets the recall ceiling.
        {"stage_name": "knn_search", "parameters": {"top_k": 200}},
        # Stage 2: expensive late interaction, candidates only. Sets precision.
        {
            "stage_name": "late_interaction_rerank",
            "parameters": {
                "feature": "visual_multivector_embedding",
                "max_query_vectors": 16,  # flexible late interaction budget
                "top_k": 10,
            },
        },
        {"stage_name": "score_threshold", "parameters": {"min_score": 0.4}},
    ],
)

# Cheap queries can run a tighter budget; high-stakes queries raise it.
results = client.retrievers.execute(
    retriever_id=retriever.retriever_id,
    inputs={"text": "indemnification clause capping liability at $2M on page two"},
    top_k=10,
)

Treat the budget choices as versioned config, not as a tweak. Changing the pooling factor or the rerank vector count changes what is findable and what each query costs, so measure recall at the stage-one cutoff before and after, and roll the change like a migration. Browse available multi-vector and single-vector models on the Models page to pick the encoder pair that fits your accuracy and cost target.

The Problem: Multi-Vector Retrieval Is Accurate and Expensive

A Refresher on Late Interaction Scoring

Lever 1: Token Pooling, Fewer Vectors per Item

Lever 2: Matryoshka Multi-Vector Embeddings

Lever 3: Flexible Late Interaction at Query Time

Lever 4: Two-Tier Retrieve Then Rerank

Putting the Levers Together

What to Measure

Doing This in Mixpeek

Further Reading

Put multimodal search to work

Already have vectors?

Run this on your own data

Related guides

What Is MUVERA? Turning Multi-Vector Retrieval Into a Single-Vector Search

Late Interaction Retrieval: How ColBERT, ColPali, and ColQwen Search Without Losing Token-Level Detail

Semantic Caching: How Agents Skip Work They Have Already Done