We benchmarked every viable approach to multimodal document retrieval on financial tables (ViDoRe/TabFQuAD) and found a combination that hasn't been published before: ColQwen2 + MUVERA. It retains 99.4% of brute-force quality at a fraction of the cost, and obliterates OCR-based search.
The Problem
Late interaction models like ColBERT and ColPali represent documents as sets of vectors—one per token or image patch. At query time, every query token finds its best-matching document token (MaxSim/Chamfer similarity). This gives near cross-encoder accuracy, but exhaustive retrieval over n documents costs O(|Q| × |P| × n)—intractable at scale.
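For concreteness, a minimal NumPy sketch of Chamfer/MaxSim scoring; the shapes are illustrative, not tied to any particular model:

```python
import numpy as np

def maxsim(Q, D):
    """Chamfer/MaxSim: each query vector takes its best match in the document.

    Q: (|Q|, d) query token embeddings
    D: (|P|, d) document token/patch embeddings
    """
    # (|Q|, |P|) similarity matrix -> best doc match per query token -> sum
    return (Q @ D.T).max(axis=1).sum()

# Brute-force retrieval scores every document, i.e. O(|Q| * |P| * n) dot products:
rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128))                          # e.g. 32 query tokens, 128-dim
docs = [rng.standard_normal((620, 128)) for _ in range(100)]  # e.g. 100 pages of patches
scores = [maxsim(Q, D) for D in docs]
best = int(np.argmax(scores))
```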
The prior solution, PLAID, uses heuristic centroid pruning with no theoretical guarantees. It degrades unpredictably on some datasets.
MUVERA: The Fix
MUVERA (Google Research, NeurIPS 2024) converts any multi-vector set into a single fixed-dimensional encoding (FDE) whose inner product provably approximates Chamfer similarity. This means you can use standard ANN engines (HNSW, DiskANN) for first-pass retrieval, then re-rank a small candidate set with true MaxSim.
The key insight is asymmetric encoding: documents get per-cluster centroids with empty-cluster filling (preserves information), queries get per-cluster sums with no filling (preserves distribution). Both sides use random hyperplane (SimHash) partitioning followed by dimensionality reduction, repeated R times and concatenated. The math gives you an ε-approximation guarantee—the first such result for multi-vector retrieval.
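A simplified sketch of that construction. The `fde` function and its empty-cluster fill are illustrative, not MUVERA's reference implementation (the paper's fill uses Hamming-nearest partitions; the final projection details are also simplified here):

```python
import numpy as np

def fde(vectors, k_sim=5, d_proj=16, r_reps=20, is_query=False, seed=0):
    """Fixed-dimensional encoding of an (n, d) multi-vector set.

    Output dim = 2**k_sim * d_proj * r_reps. Simplified MUVERA-style sketch.
    """
    rng = np.random.default_rng(seed)  # shared seed: query and doc sides must agree
    n, d = vectors.shape
    blocks = []
    for _ in range(r_reps):
        H = rng.standard_normal((d, k_sim))                     # random hyperplanes (SimHash)
        P = rng.standard_normal((d, d_proj)) / np.sqrt(d_proj)  # inner projection
        bits = (vectors @ H > 0).astype(int)                    # (n, k_sim) sign pattern
        bucket = bits @ (1 << np.arange(k_sim))                 # bucket id in [0, 2**k_sim)
        block = np.zeros((2 ** k_sim, d_proj))
        for b in range(2 ** k_sim):
            members = vectors[bucket == b]
            if len(members):
                # the asymmetry: sum for queries, centroid for documents
                agg = members.sum(0) if is_query else members.mean(0)
                block[b] = agg @ P
            elif not is_query:
                # doc-side empty-cluster fill: borrow the Hamming-nearest vector
                target = (b >> np.arange(k_sim)) & 1
                block[b] = vectors[np.abs(bits - target).sum(1).argmin()] @ P
        blocks.append(block.ravel())
    return np.concatenate(blocks)
```

With these defaults the FDE is 32 × 16 × 20 = 10,240-dimensional, and the inner product of `fde(Q, is_query=True)` with `fde(D)` stands in for Chamfer(Q, D) in the first-pass search.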
The Benchmark
We ran ColPali-v1.2 and ColQwen2-v1.0 against BM25 (over Tesseract-OCR'd text) on the ViDoRe TabFQuAD dataset—70 financial table images, 280 queries. This is the hard case: charts, multi-column tables, footnotes, and mixed text+visual content where OCR systematically fails.
MUVERA config: k_sim=5, d_proj=16, r_reps=20 → 10,240-dimensional FDE per document.
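The FDE dimensionality falls directly out of that config: each of the r_reps repetitions contributes 2^k_sim buckets of d_proj projected dimensions.

```python
k_sim, d_proj, r_reps = 5, 16, 20
buckets = 2 ** k_sim                 # 32 SimHash buckets per repetition
fde_dim = buckets * d_proj * r_reps  # 32 * 16 * 20
print(fde_dim)  # 10240
```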
| Method | R@1 | R@5 | NDCG@10 | MRR | Latency |
|---|---|---|---|---|---|
| BM25 (OCR text) | 0.425 | 0.650 | 0.570 | 0.531 | 0.4ms |
| ColPali-v1.2 brute-force | 0.825 | 0.929 | 0.890 | 0.872 | 26.4ms |
| ColQwen2-v1.0 brute-force | 0.839 | 0.932 | 0.896 | 0.883 | 42.8ms |
| MUVERA FDE only | 0.693 | 0.854 | 0.791 | 0.759 | 0.2ms |
| MUVERA + rerank (ColQwen2) | 0.836 | 0.925 | 0.891 | 0.877 | 30.2ms |
What the Numbers Mean
BM25 is not competitive. OCR + keyword search reaches only 63.6% of the best visual model's NDCG@10 on financial tables. If your pipeline is “extract text then BM25,” you're leaving 36% of retrieval quality on the floor for any document with visual structure.
MUVERA + rerank = 99.4% quality retention. The FDE narrows 70 documents to 50 candidates in 0.2ms, then Chamfer re-ranking recovers essentially all of the brute-force accuracy. At 1M documents, brute-force becomes seconds; MUVERA stays at milliseconds.
MUVERA FDE-only at 179× speedup. For applications where you can tolerate ~12% quality loss, pure FDE search gives sub-millisecond retrieval. This is the operating point for real-time serving at scale.
ColQwen2 > ColPali by +0.7% NDCG@10 on this dataset, with a larger gap (~6%) on the full ViDoRe average. ColQwen2 is Apache 2.0 licensed and 2B parameters—smaller than ColPali's 3B.
The Two-Tier Architecture
The production pattern that falls out of this:
- Offline: Embed documents with ColQwen2 → ~620 patch vectors per page (128-dim each). Generate MUVERA FDE (10,240-dim single vector). Index FDEs in any standard ANN engine.
- Tier 1—candidate generation: Query → ColQwen2 query embedding → MUVERA query FDE → ANN search → top-K candidates. Cost: O(log n). Latency: <1ms.
- Tier 2—precision re-ranking: Load candidate multi-vectors from storage → true Chamfer/MaxSim scoring → final ranked list. Cost: O(K × |patches|). Latency: ~30ms for K=50.
FDEs go into your vector index as ordinary single vectors. Multi-vectors stay in object storage (S3/parquet) and only get loaded for the re-rank stage. No new infrastructure—just a smarter encoding layer.
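The two tiers glued together in a sketch. The dense `doc_fdes @ query_fde` matmul stands in for a real ANN index (HNSW/DiskANN), and `doc_store` stands in for object storage; both names, and the random FDEs, are illustrative:

```python
import numpy as np

def two_tier_search(query_multi, query_fde, doc_fdes, doc_store, k=50):
    """Tier 1: single-vector FDE scoring (stand-in for an ANN index).
    Tier 2: exact Chamfer/MaxSim over only the top-k candidates."""
    coarse = doc_fdes @ query_fde                  # one inner product per document
    candidates = np.argsort(-coarse)[:k]           # tier-1 shortlist
    exact = {i: (query_multi @ doc_store[i].T).max(1).sum() for i in candidates}
    return sorted(exact, key=exact.get, reverse=True)  # final ranking by MaxSim

# Tiny synthetic corpus with the shapes from the text: 128-dim patches, 10,240-dim FDEs.
rng = np.random.default_rng(0)
doc_store = [rng.standard_normal((620, 128)) for _ in range(100)]  # "object storage"
doc_fdes = rng.standard_normal((100, 10_240))  # would come from the MUVERA encoder
q_multi = rng.standard_normal((32, 128))
q_fde = rng.standard_normal(10_240)
ranking = two_tier_search(q_multi, q_fde, doc_fdes, doc_store, k=50)
```

Only the 50 candidate multi-vector sets are ever touched by the expensive MaxSim pass; everything else is a single inner product per document.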
What This Means for Multimodal Search
The combination of a strong vision-language model (ColQwen2) with a theoretically grounded retrieval engine (MUVERA) makes multi-vector search practical at scale for the first time. Prior approaches either sacrificed quality (single-vector), sacrificed speed (brute-force), or sacrificed guarantees (PLAID).
The verticals where this matters most: financial document search (tables, charts, filings), medical imaging (radiology reports with embedded scans), legal discovery (scanned contracts with annotations), and any domain where OCR is the current bottleneck.
Links