NEWVectors or files. Pick a path.Start →
    Retrieval
    17 min read
    Updated 2026-06-18

    Budget-Aware Multi-Vector Retrieval: How to Make Late Interaction Affordable at Scale

    Multi-vector retrieval lets an agent search images, documents, and video at the patch and frame level, but it can store hundreds of vectors per item. This guide teaches the algorithms that control that cost: token pooling, Matryoshka multi-vector embeddings, flexible late interaction, and two-tier retrieve-then-rerank, then shows the pattern in Mixpeek.

    Late Interaction
    Multi-Vector Retrieval
    MetaEmbed
    Token Pooling
    Matryoshka Embeddings
    Retrieval

    The Problem: Multi-Vector Retrieval Is Accurate and Expensive



    Single-vector retrieval compresses an entire item into one point. A document becomes 1024 numbers, an image becomes 1024 numbers, and a video clip becomes 1024 numbers. Search is then one nearest-neighbor lookup per item. It is cheap, and for many queries it is good enough.

    It breaks on compositional queries. Ask for "a blue chart on the second page next to a handwritten note" and a single vector for the page has already blurred those three conditions into one average. The information that would let you tell a matching page from a near-miss has been pooled away before the query ever arrives.

    Multi-vector retrieval keeps the detail. Instead of one vector per item, it stores many: one per text token, one per image patch, one per video frame. Late interaction then scores a query against an item by matching each query vector to its best counterpart in the item and summing those matches. This is the family that includes ColBERT for text, ColPali and ColQwen for visual documents, and the multimodal encoders that embed text, image patches, and video frames into one shared space.

    The accuracy gain is real and well established. The cost is the catch, and it shows up in three places at once:

  1. Storage. A page rendered as an image can produce 1030 patch vectors. At 128 dimensions in float32 that is roughly 527 KB per page. One million pages is about 527 GB of vectors, before any index overhead. The same corpus in single-vector form is a few GB.
  2. Memory and index size. Every one of those vectors lives in the nearest-neighbor index. The index does not grow with your number of items, it grows with your number of vectors, which is hundreds of times larger.
  3. Scoring compute. Late interaction compares every query vector to every candidate vector. A 32-token query against a 1030-vector page is 32 times 1030 dot products for a single item. Multiply by your candidate set and per-query latency climbs fast.


  4. The naive reaction is to give up multi-vector retrieval and go back to single vectors. The better reaction, and the dominant research thread of 2025 and 2026, is to treat the number of vectors as a knob you can turn per collection and per query. That is what this guide teaches.

    A Refresher on Late Interaction Scoring



    You need the scoring rule in your head before the cost optimizations make sense. Late interaction scores a query against an item with MaxSim:

    score(query, item) =
      sum over each query_vector q of (
        max over each item_vector d of ( similarity(q, d) )
      )
    


    In words: for every query vector, find its single best-matching vector in the item, take that similarity, and add up those best matches across all query vectors. A query vector that has no good match in the item contributes almost nothing, so an item is rewarded only when it can cover every part of the query somewhere.

    Two facts about this rule drive every cost decision below:

    1. The score is dominated by a small number of strong matches. Most item vectors never become the maximum for any query vector. They sit in the index, consume storage, and are scored, but they do not change the answer. 2. The cost is the product of query vector count and item vector count. Shrink either side and you shrink storage, index size, and scoring time together.

    So the whole game is: keep the vectors that win MaxSim, drop the ones that never do, and only pay for high resolution when the query needs it.

    Lever 1: Token Pooling, Fewer Vectors per Item



    The first lever reduces vectors at index time. The observation is that neighboring patches and tokens are highly redundant. Adjacent patches of the same sky, or three subword tokens of one rare word, point in nearly the same direction. Storing all of them is paying full price for near-duplicates.

    Token pooling clusters an item's vectors and replaces each cluster with one representative, usually the mean. A common recipe:

    1. Take the item's N vectors (for example 1030 patch vectors).
    2. Cluster them into K groups by cosine similarity (hierarchical or k-means).
    3. Replace each group with its mean vector, L2-normalized.
    4. Index the K pooled vectors instead of the original N.
    


    The striking empirical result, reported across visual and text late-interaction work, is that pooling to roughly one half or one third of the vectors costs almost nothing in retrieval quality. MaxSim only ever used the best match in each region, and the pooled centroid of a redundant region is a faithful stand-in for that best match. You delete vectors that were never going to win.

    This is training-free. You do not retrain the encoder. You run a clustering pass during ingestion and store fewer, denser vectors. It composes with everything below.

    Where pooling can hurt: regions that are small but semantically critical, such as a single line of fine-print disclaimer text on an otherwise empty page, can get averaged into a neighbor and lose their distinct direction. The guard is to pool by similarity rather than by fixed spatial grid, so a visually distinct region forms its own cluster instead of being folded into a large bland one.

    Lever 2: Matryoshka Multi-Vector Embeddings



    The second lever is structural and it is the idea behind 2026's flexible late-interaction models such as MetaEmbed. Instead of producing a flat bag of equal vectors, the encoder is trained so that the vectors are ordered by importance, coarse to fine.

    The training trick borrows from Matryoshka representation learning, which originally nested information inside a single vector so that the first 64 dimensions were a usable embedding on their own, the first 256 were better, and the full vector best of all. Matryoshka multi-vector retrieval lifts that idea from dimensions to whole vectors. A small set of learnable "meta tokens" is appended to the input, and contrastive training is applied in parallel over nested groups of those tokens: the first group must work alone, the first two groups must work together, and so on up to the full set.

    The result is an encoder whose output vectors are ranked. The top few vectors carry the gist. Adding more vectors adds resolution. Because the ordering is learned, the first vector is genuinely the most useful one to keep, not an arbitrary token.

    Item vectors after Matryoshka training (ordered by importance):
      v1  v2   
    v3 v4v5 v6 v7 v8
    v9 ... v32 coarse add add full detail (gist) structure nuance

    Pick a prefix length at query time: budget = tight -> use v1..v2 budget = normal -> use v1..v8 budget = max -> use v1..v32


    This converts the vector count from a fixed property baked in at index time into a dial you set at query time. You index the full set once. You decide per query how many of those ordered vectors to actually load and score.

    Lever 3: Flexible Late Interaction at Query Time



    With ordered vectors in hand, flexible late interaction is straightforward: run MaxSim using only the first k query vectors against the first m item vectors, where k and m are chosen from your budget rather than fixed.

    flex_score(query, item, k, m) =
      sum over first k query_vectors q of (
        max over first m item_vectors d of ( similarity(q, d) )
      )
    


    The accuracy-versus-cost curve is smooth. Small k and m give you most of the quality for a fraction of the compute, and you climb toward full late-interaction accuracy as you raise them. In practice you choose the operating point per workload:

    WorkloadVectors usedWhy
    Autocomplete or typeaheadtop 1 to 2Latency budget is a few milliseconds; coarse match is fine
    Interactive agent searchtop 8 to 16Balanced; covers most compositional queries
    High-stakes review or recall sweepfull setMaximum recall matters more than cost
    The agent can even pick the budget from the query itself. A short factual lookup runs cheap. A long multi-condition query, or a query the agent flags as high-stakes, runs at full resolution. The retrieval cost now tracks the value of the question instead of being a flat tax on every call.

    Lever 4: Two-Tier Retrieve Then Rerank



    The three levers above shrink the per-item work. The fourth lever shrinks the candidate set, and it is the pattern you should reach for first because it gives the largest win for the least change.

    Full multi-vector scoring over a whole corpus is wasteful because almost every item is obviously irrelevant. The fix is a cheap first stage that produces a short candidate list, followed by expensive late interaction applied only to those candidates.

    Stage 1  (cheap, whole corpus):
      single-vector ANN search, OR
      pooled / coarse multi-vector search (top few vectors only)
      -> return top 200 candidates

    Stage 2 (expensive, candidates only): full or near-full multi-vector late interaction rerank the 200 -> return top 10


    The first stage can be a pooled single vector per item, a mean of the item's vectors, or a fixed-dimensional encoding that approximates the multi-vector set so a standard nearest-neighbor index can retrieve candidates. Approaches such as MUVERA build exactly such a fixed encoding so the candidate stage stays a normal vector search. The second stage then pays the full late-interaction price on 200 items instead of one million.

    Two design rules keep this honest:

  5. The first stage sets your recall ceiling. If a relevant item is not in the 200 candidates, no amount of careful reranking in stage two can recover it. Tune stage-one candidate count by measuring recall at that cutoff, not by guessing.
  6. The second stage sets your precision. This is where the fine-grained MaxSim matching earns its cost, separating the truly matching item from the near-misses the cheap stage could not tell apart.


  7. This two-tier shape is the same skeleton as multi-stage retrieval generally. The multi-vector specific point is that stages one and two can use the same ordered embedding at different prefix lengths: a tiny prefix to retrieve, the full set to rerank.

    Putting the Levers Together



    The four levers stack, and a sensible default stack looks like this:

    1. At ingestion, pool each item's vectors down to roughly one half, using similarity-based clustering so distinct regions survive. This is a permanent storage and index-size win. 2. Train or choose a Matryoshka multi-vector encoder if you can, so the surviving vectors are ordered by importance and the query-time dial is meaningful. 3. At query time, run a cheap first stage over the whole corpus to get a few hundred candidates. 4. Rerank the candidates with flexible late interaction, choosing the prefix length from the query's budget.

    The compounding effect is large. Halving vectors at ingestion, retrieving with a coarse prefix, and reranking only a few hundred candidates can drop end-to-end cost by an order of magnitude while keeping you within a point or two of full multi-vector accuracy. That is the difference between multi-vector retrieval being a benchmark result and being something you can afford to run on every agent query.

    What to Measure



    Do not trust any of these optimizations without a recall check, because every one of them trades quality for cost somewhere.

  8. Recall at the stage-one cutoff. The single most important number. It caps everything downstream.
  9. End-to-end recall and precision at k, with the full uncompressed multi-vector pipeline as your ground-truth reference. Each lever should be evaluated as a delta against that reference.
  10. Cost per query, broken into storage footprint, candidate-stage latency, and rerank-stage latency, so you know which lever to turn when you need to go faster or cheaper.
  11. Tail latency, not just the average. The high-resolution queries are the ones that matter most and also the slowest. Watch the p95.


  12. Pick an operating point by walking the curve: start from full multi-vector accuracy, turn one lever at a time, and stop when the recall delta crosses the line your application can tolerate.

    Doing This in Mixpeek



    In Mixpeek, the vector budget lives in two places: the collection decides what gets embedded and pooled at ingestion, and the retriever decides how candidates are fetched and reranked at query time. You set the ingestion policy once and every document inherits it, then the retriever stages express the cheap-first, expensive-on-candidates pattern.

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="mxp_sk_...")

    # A collection that keeps multi-vector detail for late interaction, # but pools vectors at ingestion so the index stays affordable. collection = client.collections.create( namespace_id="ns_docs", collection_name="visual-contracts", source={"type": "documents"}, feature_extractors=[ { "feature_extractor_name": "visual_multivector_embedding", "parameters": { "embedding_model": "colqwen", # multi-vector visual encoder "token_pooling": "similarity", # cluster + mean redundant patches "pooling_factor": 0.5, # keep about half the vectors }, }, { "feature_extractor_name": "text_chunk_embedding", "parameters": { "chunk_strategy": "semantic", "embedding_model": "dense-text", # single vector, cheap first stage }, }, ], )


    The retriever then runs the two-tier plan: a cheap single-vector stage over the whole collection to gather candidates, then a multi-vector late-interaction rerank applied only to those candidates.

    # Two-tier: cheap dense candidate fetch, then multi-vector late-interaction rerank.
    retriever = client.retrievers.create(
        namespace_id="ns_docs",
        collection_ids=[collection.collection_id],
        retriever_name="visual-contracts-budgeted",
        stages=[
            # Stage 1: cheap, whole corpus. Sets the recall ceiling.
            {"stage_name": "knn_search", "parameters": {"top_k": 200}},
            # Stage 2: expensive late interaction, candidates only. Sets precision.
            {
                "stage_name": "late_interaction_rerank",
                "parameters": {
                    "feature": "visual_multivector_embedding",
                    "max_query_vectors": 16,  # flexible late interaction budget
                    "top_k": 10,
                },
            },
            {"stage_name": "score_threshold", "parameters": {"min_score": 0.4}},
        ],
    )

    # Cheap queries can run a tighter budget; high-stakes queries raise it. results = client.retrievers.execute( retriever_id=retriever.retriever_id, inputs={"text": "indemnification clause capping liability at $2M on page two"}, top_k=10, )


    Treat the budget choices as versioned config, not as a tweak. Changing the pooling factor or the rerank vector count changes what is findable and what each query costs, so measure recall at the stage-one cutoff before and after, and roll the change like a migration. Browse available multi-vector and single-vector models on the Models page to pick the encoder pair that fits your accuracy and cost target.

    Further Reading



  13. Late Interaction Retrieval -- the architecture this guide makes affordable, with MaxSim and the ColBERT to ColQwen evolution
  14. Multi-Stage Retrieval -- the general cheap-first, expensive-on-candidates pattern
  15. Cross-Encoder Reranking -- the other way to spend compute precisely on a small candidate set
  16. Embedding Quantization and Compression -- shrinking each vector, complementary to shrinking the vector count
  17. Approximate Nearest Neighbor Search -- the index that powers the cheap candidate stage
  18. Calibrating Similarity Scores -- making the scores from each stage comparable and meaningful
  19. Models -- browse multi-vector and single-vector embedding models
  20. Managed Mixpeek

    Put multimodal search to work

    Connect a bucket and Mixpeek runs the whole multimodal search pipeline for you: extraction, indexing, and search over your own objects. No models to wire up, nothing to host.

    Start with Managed
    MVS · bring your own

    Already have vectors?

    Keep your embeddings on your own cloud and run dense, sparse, and BM25 search directly on object storage. First 1M vectors free.

    Start with MVS

    Build a Multimodal Search Pipeline

    Give agents searchable access to video, image, audio, and document evidence with Mixpeek.

    Start BuildingRead Docs

    Related guides

    Retrieval

    Late Interaction Retrieval: How ColBERT, ColPali, and ColQwen Search Without Losing Token-Level Detail

    A technical guide to late interaction retrieval models — the architecture that keeps per-token embeddings instead of compressing everything into a single vector. Covers how MaxSim scoring works, why it outperforms dense retrieval on complex queries, the evolution from ColBERT (text) to ColPali (visual documents) to ColQwen (multimodal), and how to build retrieval pipelines that combine late interaction with dense search.

    Read guide →
    Retrieval

    BM25 and the Inverted Index: The Lexical Retriever Every Hybrid Search Treats as a Black Box

    Every hybrid search pipeline pairs dense vectors with BM25, but almost no one can say where the BM25 number actually comes from, which is exactly why fusion, tuning, and exact-match failures stay mysterious. This guide opens the box: how an inverted index turns transcripts and OCR text into posting lists, the precise BM25 scoring formula with its term-frequency saturation and length normalization, what the k1 and b parameters really do, and why the tokenizer is the silent decider of whether an agent ever finds a serial number.

    Read guide →
    Retrieval

    Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking

    An agent searching transcripts, OCR text, and captions needs both meaning (dense vectors) and exact terms (BM25), but the two return scores on incompatible scales that you cannot simply add. This guide teaches the real fusion mechanics: why score distributions make naive normalization fail, the exact math of Reciprocal Rank Fusion and how its k parameter behaves, weighted convex combination with proper normalization, and how to choose and tune a fusion method against a labeled set.

    Read guide →