The Problem: Multi-Vector Retrieval Is Accurate and Expensive
Single-vector retrieval compresses an entire item into one point. A document becomes 1024 numbers, an image becomes 1024 numbers, and a video clip becomes 1024 numbers. Search is then one nearest-neighbor lookup per item. It is cheap, and for many queries it is good enough.
It breaks on compositional queries. Ask for "a blue chart on the second page next to a handwritten note" and a single vector for the page has already blurred those three conditions into one average. The information that would let you tell a matching page from a near-miss has been pooled away before the query ever arrives.
Multi-vector retrieval keeps the detail. Instead of one vector per item, it stores many: one per text token, one per image patch, one per video frame. Late interaction then scores a query against an item by matching each query vector to its best counterpart in the item and summing those matches. This is the family that includes ColBERT for text, ColPali and ColQwen for visual documents, and the multimodal encoders that embed text, image patches, and video frames into one shared space.
The accuracy gain is real and well established. The cost is the catch, and it shows up in three places at once:
The naive reaction is to give up multi-vector retrieval and go back to single vectors. The better reaction, and the dominant research thread of 2025 and 2026, is to treat the number of vectors as a knob you can turn per collection and per query. That is what this guide teaches.
A Refresher on Late Interaction Scoring
You need the scoring rule in your head before the cost optimizations make sense. Late interaction scores a query against an item with MaxSim:
score(query, item) =
sum over each query_vector q of (
max over each item_vector d of ( similarity(q, d) )
)
In words: for every query vector, find its single best-matching vector in the item, take that similarity, and add up those best matches across all query vectors. A query vector that has no good match in the item contributes almost nothing, so an item is rewarded only when it can cover every part of the query somewhere.
Two facts about this rule drive every cost decision below:
1. The score is dominated by a small number of strong matches. Most item vectors never become the maximum for any query vector. They sit in the index, consume storage, and are scored, but they do not change the answer. 2. The cost is the product of query vector count and item vector count. Shrink either side and you shrink storage, index size, and scoring time together.
So the whole game is: keep the vectors that win MaxSim, drop the ones that never do, and only pay for high resolution when the query needs it.
Lever 1: Token Pooling, Fewer Vectors per Item
The first lever reduces vectors at index time. The observation is that neighboring patches and tokens are highly redundant. Adjacent patches of the same sky, or three subword tokens of one rare word, point in nearly the same direction. Storing all of them is paying full price for near-duplicates.
Token pooling clusters an item's vectors and replaces each cluster with one representative, usually the mean. A common recipe:
1. Take the item's N vectors (for example 1030 patch vectors).
2. Cluster them into K groups by cosine similarity (hierarchical or k-means).
3. Replace each group with its mean vector, L2-normalized.
4. Index the K pooled vectors instead of the original N.
The striking empirical result, reported across visual and text late-interaction work, is that pooling to roughly one half or one third of the vectors costs almost nothing in retrieval quality. MaxSim only ever used the best match in each region, and the pooled centroid of a redundant region is a faithful stand-in for that best match. You delete vectors that were never going to win.
This is training-free. You do not retrain the encoder. You run a clustering pass during ingestion and store fewer, denser vectors. It composes with everything below.
Where pooling can hurt: regions that are small but semantically critical, such as a single line of fine-print disclaimer text on an otherwise empty page, can get averaged into a neighbor and lose their distinct direction. The guard is to pool by similarity rather than by fixed spatial grid, so a visually distinct region forms its own cluster instead of being folded into a large bland one.
Lever 2: Matryoshka Multi-Vector Embeddings
The second lever is structural and it is the idea behind 2026's flexible late-interaction models such as MetaEmbed. Instead of producing a flat bag of equal vectors, the encoder is trained so that the vectors are ordered by importance, coarse to fine.
The training trick borrows from Matryoshka representation learning, which originally nested information inside a single vector so that the first 64 dimensions were a usable embedding on their own, the first 256 were better, and the full vector best of all. Matryoshka multi-vector retrieval lifts that idea from dimensions to whole vectors. A small set of learnable "meta tokens" is appended to the input, and contrastive training is applied in parallel over nested groups of those tokens: the first group must work alone, the first two groups must work together, and so on up to the full set.
The result is an encoder whose output vectors are ranked. The top few vectors carry the gist. Adding more vectors adds resolution. Because the ordering is learned, the first vector is genuinely the most useful one to keep, not an arbitrary token.
Item vectors after Matryoshka training (ordered by importance):
v1 v2 v3 v4 v5 v6 v7 v8
v9 ... v32
coarse add add full detail
(gist) structure nuance
Pick a prefix length at query time:
budget = tight -> use v1..v2
budget = normal -> use v1..v8
budget = max -> use v1..v32
This converts the vector count from a fixed property baked in at index time into a dial you set at query time. You index the full set once. You decide per query how many of those ordered vectors to actually load and score.
Lever 3: Flexible Late Interaction at Query Time
With ordered vectors in hand, flexible late interaction is straightforward: run MaxSim using only the first k query vectors against the first m item vectors, where k and m are chosen from your budget rather than fixed.
flex_score(query, item, k, m) =
sum over first k query_vectors q of (
max over first m item_vectors d of ( similarity(q, d) )
)
The accuracy-versus-cost curve is smooth. Small k and m give you most of the quality for a fraction of the compute, and you climb toward full late-interaction accuracy as you raise them. In practice you choose the operating point per workload:
| Workload | Vectors used | Why |
| Autocomplete or typeahead | top 1 to 2 | Latency budget is a few milliseconds; coarse match is fine |
| Interactive agent search | top 8 to 16 | Balanced; covers most compositional queries |
| High-stakes review or recall sweep | full set | Maximum recall matters more than cost |
Lever 4: Two-Tier Retrieve Then Rerank
The three levers above shrink the per-item work. The fourth lever shrinks the candidate set, and it is the pattern you should reach for first because it gives the largest win for the least change.
Full multi-vector scoring over a whole corpus is wasteful because almost every item is obviously irrelevant. The fix is a cheap first stage that produces a short candidate list, followed by expensive late interaction applied only to those candidates.
Stage 1 (cheap, whole corpus):
single-vector ANN search, OR
pooled / coarse multi-vector search (top few vectors only)
-> return top 200 candidates
Stage 2 (expensive, candidates only):
full or near-full multi-vector late interaction
rerank the 200 -> return top 10
The first stage can be a pooled single vector per item, a mean of the item's vectors, or a fixed-dimensional encoding that approximates the multi-vector set so a standard nearest-neighbor index can retrieve candidates. Approaches such as MUVERA build exactly such a fixed encoding so the candidate stage stays a normal vector search. The second stage then pays the full late-interaction price on 200 items instead of one million.
Two design rules keep this honest:
This two-tier shape is the same skeleton as multi-stage retrieval generally. The multi-vector specific point is that stages one and two can use the same ordered embedding at different prefix lengths: a tiny prefix to retrieve, the full set to rerank.
Putting the Levers Together
The four levers stack, and a sensible default stack looks like this:
1. At ingestion, pool each item's vectors down to roughly one half, using similarity-based clustering so distinct regions survive. This is a permanent storage and index-size win. 2. Train or choose a Matryoshka multi-vector encoder if you can, so the surviving vectors are ordered by importance and the query-time dial is meaningful. 3. At query time, run a cheap first stage over the whole corpus to get a few hundred candidates. 4. Rerank the candidates with flexible late interaction, choosing the prefix length from the query's budget.
The compounding effect is large. Halving vectors at ingestion, retrieving with a coarse prefix, and reranking only a few hundred candidates can drop end-to-end cost by an order of magnitude while keeping you within a point or two of full multi-vector accuracy. That is the difference between multi-vector retrieval being a benchmark result and being something you can afford to run on every agent query.
What to Measure
Do not trust any of these optimizations without a recall check, because every one of them trades quality for cost somewhere.
Pick an operating point by walking the curve: start from full multi-vector accuracy, turn one lever at a time, and stop when the recall delta crosses the line your application can tolerate.
Doing This in Mixpeek
In Mixpeek, the vector budget lives in two places: the collection decides what gets embedded and pooled at ingestion, and the retriever decides how candidates are fetched and reranked at query time. You set the ingestion policy once and every document inherits it, then the retriever stages express the cheap-first, expensive-on-candidates pattern.
from mixpeek import Mixpeek
client = Mixpeek(api_key="mxp_sk_...")
# A collection that keeps multi-vector detail for late interaction,
# but pools vectors at ingestion so the index stays affordable.
collection = client.collections.create(
namespace_id="ns_docs",
collection_name="visual-contracts",
source={"type": "documents"},
feature_extractors=[
{
"feature_extractor_name": "visual_multivector_embedding",
"parameters": {
"embedding_model": "colqwen", # multi-vector visual encoder
"token_pooling": "similarity", # cluster + mean redundant patches
"pooling_factor": 0.5, # keep about half the vectors
},
},
{
"feature_extractor_name": "text_chunk_embedding",
"parameters": {
"chunk_strategy": "semantic",
"embedding_model": "dense-text", # single vector, cheap first stage
},
},
],
)
The retriever then runs the two-tier plan: a cheap single-vector stage over the whole collection to gather candidates, then a multi-vector late-interaction rerank applied only to those candidates.
# Two-tier: cheap dense candidate fetch, then multi-vector late-interaction rerank.
retriever = client.retrievers.create(
namespace_id="ns_docs",
collection_ids=[collection.collection_id],
retriever_name="visual-contracts-budgeted",
stages=[
# Stage 1: cheap, whole corpus. Sets the recall ceiling.
{"stage_name": "knn_search", "parameters": {"top_k": 200}},
# Stage 2: expensive late interaction, candidates only. Sets precision.
{
"stage_name": "late_interaction_rerank",
"parameters": {
"feature": "visual_multivector_embedding",
"max_query_vectors": 16, # flexible late interaction budget
"top_k": 10,
},
},
{"stage_name": "score_threshold", "parameters": {"min_score": 0.4}},
],
)
# Cheap queries can run a tighter budget; high-stakes queries raise it.
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
inputs={"text": "indemnification clause capping liability at $2M on page two"},
top_k=10,
)
Treat the budget choices as versioned config, not as a tweak. Changing the pooling factor or the rerank vector count changes what is findable and what each query costs, so measure recall at the stage-one cutoff before and after, and roll the change like a migration. Browse available multi-vector and single-vector models on the Models page to pick the encoder pair that fits your accuracy and cost target.