We benchmarked every viable approach to multimodal document retrieval on financial tables (ViDoRe/TabFQuAD) and found a combination that hasn't been published before: ColQwen2 + MUVERA. It retains 99.4% of brute-force quality at a fraction of the cost, and obliterates OCR-based search.
The Problem
Late interaction models like ColBERT and ColPali represent documents as sets of vectors—one per token or image patch. At query time, every query token finds its best-matching document token (MaxSim/Chamfer similarity). This gives near cross-encoder accuracy, but exhaustive retrieval over n documents costs O(|Q| × |P| × n)—intractable at scale.
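For concreteness, a minimal NumPy sketch of Chamfer/MaxSim scoring; the shapes are illustrative, not tied to any particular model:

```python
import numpy as np

def maxsim(Q, D):
    """Chamfer/MaxSim: each query vector takes its best match in the document.

    Q: (|Q|, d) query token embeddings
    D: (|P|, d) document token/patch embeddings
    """
    # (|Q|, |P|) similarity matrix -> best doc match per query token -> sum
    return (Q @ D.T).max(axis=1).sum()

# Brute-force retrieval scores every document, i.e. O(|Q| * |P| * n) dot products:
rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128))                          # e.g. 32 query tokens, 128-dim
docs = [rng.standard_normal((620, 128)) for _ in range(100)]  # e.g. 100 pages of patches
scores = [maxsim(Q, D) for D in docs]
best = int(np.argmax(scores))
```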
The prior solution, PLAID, uses heuristic centroid pruning with no theoretical guarantees. It degrades unpredictably on some datasets.
MUVERA: The Fix
MUVERA (Google Research, NeurIPS 2024) converts any multi-vector set into a single fixed-dimensional encoding (FDE) whose inner product provably approximates Chamfer similarity. This means you can use standard ANN engines (HNSW, DiskANN) for first-pass retrieval, then re-rank a small candidate set with true MaxSim.
The key insight is asymmetric encoding: documents get per-cluster centroids with empty-cluster filling (preserves information), queries get per-cluster sums with no filling (preserves distribution). Both sides use random hyperplane (SimHash) partitioning followed by dimensionality reduction, repeated R times and concatenated. The math gives you an ε-approximation guarantee—the first such result for multi-vector retrieval.
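A simplified sketch of that construction. The `fde` function and its empty-cluster fill are illustrative, not MUVERA's reference implementation (the paper's fill uses Hamming-nearest partitions; the final projection details are also simplified here):

```python
import numpy as np

def fde(vectors, k_sim=5, d_proj=16, r_reps=20, is_query=False, seed=0):
    """Fixed-dimensional encoding of an (n, d) multi-vector set.

    Output dim = 2**k_sim * d_proj * r_reps. Simplified MUVERA-style sketch.
    """
    rng = np.random.default_rng(seed)  # shared seed: query and doc sides must agree
    n, d = vectors.shape
    blocks = []
    for _ in range(r_reps):
        H = rng.standard_normal((d, k_sim))                     # random hyperplanes (SimHash)
        P = rng.standard_normal((d, d_proj)) / np.sqrt(d_proj)  # inner projection
        bits = (vectors @ H > 0).astype(int)                    # (n, k_sim) sign pattern
        bucket = bits @ (1 << np.arange(k_sim))                 # bucket id in [0, 2**k_sim)
        block = np.zeros((2 ** k_sim, d_proj))
        for b in range(2 ** k_sim):
            members = vectors[bucket == b]
            if len(members):
                # the asymmetry: sum for queries, centroid for documents
                agg = members.sum(0) if is_query else members.mean(0)
                block[b] = agg @ P
            elif not is_query:
                # doc-side empty-cluster fill: borrow the Hamming-nearest vector
                target = (b >> np.arange(k_sim)) & 1
                block[b] = vectors[np.abs(bits - target).sum(1).argmin()] @ P
        blocks.append(block.ravel())
    return np.concatenate(blocks)
```

With these defaults the FDE is 32 × 16 × 20 = 10,240-dimensional, and the inner product of `fde(Q, is_query=True)` with `fde(D)` stands in for Chamfer(Q, D) in the first-pass search.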
The Benchmark
We ran ColPali-v1.2 and ColQwen2-v1.0 against BM25 (over Tesseract-OCR'd text) on the ViDoRe TabFQuAD dataset—70 financial table images, 280 queries. This is the hard case: charts, multi-column tables, footnotes, and mixed text+visual content where OCR systematically fails.
MUVERA config: k_sim=5, d_proj=16, r_reps=20 → 10,240-dimensional FDE per document.
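The FDE dimensionality falls directly out of that config: each of the r_reps repetitions contributes 2^k_sim buckets of d_proj projected dimensions.

```python
k_sim, d_proj, r_reps = 5, 16, 20
buckets = 2 ** k_sim                 # 32 SimHash buckets per repetition
fde_dim = buckets * d_proj * r_reps  # 32 * 16 * 20
print(fde_dim)  # 10240
```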
| Method | R@1 | R@5 | NDCG@10 | MRR | Latency |
|---|---|---|---|---|---|
| BM25 (OCR text) | 0.425 | 0.650 | 0.570 | 0.531 | 0.4ms |
| ColPali-v1.2 brute-force | 0.825 | 0.929 | 0.890 | 0.872 | 26.4ms |
| ColQwen2-v1.0 brute-force | 0.839 | 0.932 | 0.896 | 0.883 | 42.8ms |
| MUVERA FDE only | 0.693 | 0.854 | 0.791 | 0.759 | 0.2ms |
| MUVERA + rerank (ColQwen2) | 0.836 | 0.925 | 0.891 | 0.877 | 30.2ms |
What the Numbers Mean
BM25 is not competitive. OCR + keyword search reaches only 63.6% of the best visual model's NDCG@10 on financial tables. If your pipeline is “extract text then BM25,” you're leaving 36% of retrieval quality on the floor for any document with visual structure.
MUVERA + rerank = 99.4% quality retention. The FDE narrows 70 documents to 50 candidates in 0.2ms, then Chamfer re-ranking recovers essentially all of the brute-force accuracy. At 1M documents, brute-force becomes seconds; MUVERA stays at milliseconds.
MUVERA FDE-only at 179× speedup. For applications where you can tolerate ~12% quality loss, pure FDE search gives sub-millisecond retrieval. This is the operating point for real-time serving at scale.
ColQwen2 > ColPali by +0.7% NDCG@10 on this dataset, with a larger gap (~6%) on the full ViDoRe average. ColQwen2 is Apache 2.0 licensed and 2B parameters—smaller than ColPali's 3B.
The Two-Tier Architecture
The production pattern that falls out of this:
- Offline: Embed documents with ColQwen2 → ~620 patch vectors per page (128-dim each). Generate MUVERA FDE (10,240-dim single vector). Index FDEs in any standard ANN engine.
- Tier 1—candidate generation: Query → ColQwen2 query embedding → MUVERA query FDE → ANN search → top-K candidates. Cost: O(log n). Latency: <1ms.
- Tier 2—precision re-ranking: Load candidate multi-vectors from storage → true Chamfer/MaxSim scoring → final ranked list. Cost: O(K × |patches|). Latency: ~30ms for K=50.
FDEs go into your vector index as ordinary single vectors. Multi-vectors stay in object storage (S3/parquet) and only get loaded for the re-rank stage. No new infrastructure—just a smarter encoding layer.
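The two tiers glued together in a sketch. The dense `doc_fdes @ query_fde` matmul stands in for a real ANN index (HNSW/DiskANN), and `doc_store` stands in for object storage; both names, and the random FDEs, are illustrative:

```python
import numpy as np

def two_tier_search(query_multi, query_fde, doc_fdes, doc_store, k=50):
    """Tier 1: single-vector FDE scoring (stand-in for an ANN index).
    Tier 2: exact Chamfer/MaxSim over only the top-k candidates."""
    coarse = doc_fdes @ query_fde                  # one inner product per document
    candidates = np.argsort(-coarse)[:k]           # tier-1 shortlist
    exact = {i: (query_multi @ doc_store[i].T).max(1).sum() for i in candidates}
    return sorted(exact, key=exact.get, reverse=True)  # final ranking by MaxSim

# Tiny synthetic corpus with the shapes from the text: 128-dim patches, 10,240-dim FDEs.
rng = np.random.default_rng(0)
doc_store = [rng.standard_normal((620, 128)) for _ in range(100)]  # "object storage"
doc_fdes = rng.standard_normal((100, 10_240))  # would come from the MUVERA encoder
q_multi = rng.standard_normal((32, 128))
q_fde = rng.standard_normal(10_240)
ranking = two_tier_search(q_multi, q_fde, doc_fdes, doc_store, k=50)
```

Only the 50 candidate multi-vector sets are ever touched by the expensive MaxSim pass; everything else is a single inner product per document.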
What This Means for Multimodal Search
The combination of a strong vision-language model (ColQwen2) with a theoretically grounded retrieval engine (MUVERA) makes multi-vector search practical at scale for the first time. Prior approaches either sacrificed quality (single-vector), sacrificed speed (brute-force), or sacrificed guarantees (PLAID).
The verticals where this matters most: financial document search (tables, charts, filings), medical imaging (radiology reports with embedded scans), legal discovery (scanned contracts with annotations), and any domain where OCR is the current bottleneck.
Links