Evaluating Multimodal Retrieval: Metrics, Benchmarks, and Ground Truth

One ranking, four metrics: with four relevant items in total, ranked at positions 1, 3, 7, and 10, Precision@5 = 0.40, Recall@5 = 0.50, MRR = 1.0, and nDCG@10 = 0.83 (binary relevance). The right metric depends on how users read your results.
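The four figures can be reproduced in a few lines. This is a minimal sketch that assumes binary relevance and assumes the four ranked items are the only relevant items in the corpus; the function names are illustrative and not taken from any library.

```python
import math

def precision_at_k(relevant_ranks, k):
    """Fraction of the top-k positions that hold a relevant item."""
    return sum(1 for r in relevant_ranks if r <= k) / k

def recall_at_k(relevant_ranks, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    return sum(1 for r in relevant_ranks if r <= k) / total_relevant

def reciprocal_rank(relevant_ranks):
    """1 / rank of the first relevant item (MRR is the mean of this over queries)."""
    return 1.0 / min(relevant_ranks) if relevant_ranks else 0.0

def binary_ndcg_at_k(relevant_ranks, k, total_relevant):
    """nDCG with gain 1 for relevant items and 0 otherwise."""
    dcg = sum(1.0 / math.log2(r + 1) for r in relevant_ranks if r <= k)
    ideal_hits = min(k, total_relevant)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

relevant_ranks = [1, 3, 7, 10]  # 1-based positions of the relevant results
total_relevant = 4              # assumption: no relevant item was missed

print(round(precision_at_k(relevant_ranks, 5), 2))                     # 0.4
print(round(recall_at_k(relevant_ranks, 5, total_relevant), 2))        # 0.5
print(round(reciprocal_rank(relevant_ranks), 2))                       # 1.0
print(round(binary_ndcg_at_k(relevant_ranks, 10, total_relevant), 2))  # 0.83
```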

Why Multimodal Retrieval Evaluation Is Different

Standard information retrieval evaluation assumes a text query, a ranked list of text documents, and binary or graded relevance judgments. Multimodal retrieval breaks all three assumptions.

Mixed-modality relevance. A video result might have the right visual content but the wrong audio. An image-text pair might match the query concept visually while its text describes something else. Relevance is no longer a single dimension -- it decomposes across modalities, and a human assessor must decide whether partial relevance counts.

Temporal alignment. Video and audio are temporal media. A 30-minute video might contain 10 seconds of relevant content. Do you score the entire video as relevant, or only the segment? If your retrieval returns segment-level results, you need metrics that account for temporal precision.

Subjective ground truth. "Find clips that feel energetic" is a valid multimodal query. Unlike factoid search where relevance is objective, perceptual queries introduce annotator disagreement that metrics must accommodate.

Representation diversity. The same concept might be represented as text, an image caption, a video frame embedding, or an audio transcript. Evaluation must test whether retrieval surfaces relevant content regardless of which modality stores it.

These differences mean you cannot simply reuse a TREC-style evaluation harness. You need metrics adapted for partial relevance, ground truth strategies designed for unstructured data, and pipeline-aware attribution that tells you which stage is failing.

Core Metrics Adapted for Multimodal

The foundational ranking metrics -- nDCG, MRR, Recall@k, MAP -- remain the right starting point. The adaptation is in how you assign relevance grades and how you define the unit of retrieval.

nDCG (Normalized Discounted Cumulative Gain)

nDCG measures ranking quality with graded relevance. For multimodal retrieval, expand the relevance scale to capture modality-specific matching:

Grade 3 (Fully relevant): All modalities match the query intent. A video search for "dog catching frisbee" returns a video where the visual shows a dog catching a frisbee and the audio/caption confirms it.

Grade 2 (Partially relevant): Primary modality matches, but secondary modalities do not. The visual shows a dog catching a frisbee, but the transcript discusses something unrelated.

Grade 1 (Weakly relevant): Conceptually related but not a direct match. A video of a dog playing fetch (no frisbee).

Grade 0 (Not relevant): No modality matches.
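One way to make this scale operational is to record per-modality judgments and map them to a grade with a fixed rule. The sketch below is an assumption about how such a rule could look, not a prescribed protocol: the function name and its three boolean inputs are hypothetical, and the handling of "secondary modality matches but primary does not" (treated here as grade 1) is a choice the scale above leaves open.

```python
def combine_grade(primary_match, secondary_match, conceptually_related):
    """Map per-modality judgments onto the 0-3 relevance scale.

    primary_match: the main modality (e.g. visual) matches the query intent
    secondary_match: the supporting modalities (audio, caption) also match
    conceptually_related: related to the query but not a direct match
    """
    if primary_match and secondary_match:
        return 3  # fully relevant
    if primary_match:
        return 2  # partially relevant
    if secondary_match or conceptually_related:
        return 1  # weakly relevant (assumed rule for secondary-only matches)
    return 0      # not relevant

# "dog catching frisbee": visual matches, transcript is about something else
print(combine_grade(primary_match=True, secondary_match=False,
                    conceptually_related=True))  # 2
```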

The formula stays the same: DCG@k = sum of (2^rel_i - 1) / log2(i + 1) for positions 1..k, normalized by the ideal ranking's DCG. What changes is the annotation protocol -- annotators must assess each modality independently, then combine into a single grade.
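The formula translates directly into code. The following is a minimal sketch using the exponential gain stated above; `all_grades` is an assumed optional argument holding the grades of every judged item for the query, so that the ideal ranking can include relevant items the system failed to return.

```python
import math

def dcg_at_k(grades, k):
    """DCG@k = sum over positions i=1..k of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(grades[:k], start=1))

def ndcg_at_k(ranked_grades, k, all_grades=None):
    """Normalize DCG@k by the DCG of the ideal (descending-grade) ordering."""
    pool = ranked_grades if all_grades is None else all_grades
    idcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / idcg if idcg > 0 else 0.0

# Grades of the returned results, in ranked order
ranked = [2, 3, 0, 1]
print(round(ndcg_at_k(ranked, k=4), 3))
```

Swapping the first two results (so the grade-3 item leads) yields a perfect score of 1.0, which shows how strongly the exponential gain rewards putting fully relevant items first.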

MRR (Mean Reciprocal Rank)

MRR asks: how far down the list is the first relevant result? For multimodal search, define "relevant" carefully. If you use the fully-relevant threshold (grade 3), MRR penalizes systems that surface partially-relevant results early. If you use a lower threshold, MRR rewards systems that find any modality match quickly.

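Recommendation: Report MRR at multiple relevance thresholds. MRR with a grade-3 threshold (fully relevant only) tells you about precision; MRR with a grade-1+ threshold (any relevance) tells you about coverage. Note that the threshold here is a relevance grade, not a rank cutoff as in Recall@k. The gap between the two values reveals how often your system surfaces partially matching results ahead of fully matching ones.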
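Reporting MRR at two relevance thresholds might look like the sketch below. It assumes each query's results are already annotated with grades on the 0-3 scale; the function name and the `min_grade` parameter are illustrative.

```python
def mrr(ranked_grades_per_query, min_grade):
    """Mean reciprocal rank, counting a result as relevant if grade >= min_grade."""
    if not ranked_grades_per_query:
        return 0.0
    total = 0.0
    for grades in ranked_grades_per_query:
        for rank, grade in enumerate(grades, start=1):
            if grade >= min_grade:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_grades_per_query)

# Three queries, each with graded results in ranked order
runs = [[2, 3, 0], [0, 1, 3], [3, 0, 0]]
strict = mrr(runs, min_grade=3)   # first fully relevant hit: ranks 2, 3, 1
lenient = mrr(runs, min_grade=1)  # first hit of any relevance: ranks 1, 2, 1
print(round(strict, 3), round(lenient, 3))  # 0.611 0.833
```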

Recall@k

Recall@k measures the fraction of all relevant items that appear in the top k results. In multimodal retrieval, the denominator is tricky: how many items in your corpus are relevant to the query?

For large corpora, exhaustive annotation is impractical. Two strategies:

1. Pooling: Collect the top-k results from multiple retrieval systems, annotate only the union, and compute recall over the annotated pool. This is the standard TREC approach and works well for multimodal when you have multiple system variants to pool from.

2. Recall relative to a known set: If you have a curated dataset with known-relevant items (e.g., you know 50 videos in the corpus contain dogs catching frisbees), compute recall against that known set. This avoids exhaustive annotation but requires upfront investment in building the known set.
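Pooling can be sketched as follows. The system names, item IDs, and the `judged_relevant` set are made up for illustration; in practice the relevant set comes from annotators who review only the pooled items.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` item IDs across all systems' rankings."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

def pooled_recall_at_k(ranking, judged_relevant, k):
    """Recall@k measured against the relevant items found in the pool."""
    if not judged_relevant:
        return 0.0
    return len(set(ranking[:k]) & judged_relevant) / len(judged_relevant)

runs = {
    "visual": ["a", "b", "c", "d"],
    "transcript": ["c", "e", "a", "f"],
}
pool = build_pool(runs, depth=2)  # items to send to annotators
judged_relevant = {"a", "e"}      # assumption: the annotators' verdict on the pool
print(sorted(pool))                                               # ['a', 'b', 'c', 'e']
print(pooled_recall_at_k(runs["visual"], judged_relevant, k=2))   # 0.5
```

Recall computed this way is an upper bound on true recall, because relevant items that no pooled system retrieved are missing from the denominator.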

MAP (Mean Average Precision)

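MAP is the mean, over all queries, of average precision (AP). For a single query, AP averages the precision measured at the rank of each relevant item, with relevant items that are never retrieved contributing zero. It is sensitive to the entire ranking, not just the top few positions. For multimodal retrieval, MAP is most useful when you care about exhaustive recall (find ALL relevant items), which is common in legal discovery, compliance auditing, and media asset search.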
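A minimal sketch of the computation, assuming binary relevance and a known set of relevant item IDs per query (the function names are illustrative):

```python
def average_precision(ranking, relevant):
    """Average of precision@rank taken at the rank of each relevant item.

    Relevant items missing from the ranking contribute zero, because the
    sum is divided by the total number of relevant items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevant_sets):
    """Mean of per-query average precision."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps) if aps else 0.0

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision(["a", "x", "b", "y"], {"a", "b"}), 3))  # 0.833
```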

Building Ground Truth for Unstructured Data

The hardest part of multimodal evaluation is creating reliable ground truth. Text corpora can be annotated by reading; image corpora require visual inspection; video corpora demand watching. At scale, manual annotation is prohibitively expensive.

Strategy 1: LLM-as-Judge

Use a vision-language model (VLM) to generate relevance judgments. The approach:

1. Extract features from each item in your corpus using your existing pipeline -- frame descriptions, transcripts, detected objects, OCR text.

2. For each (query, result) pair, prompt a VLM with the extracted features and the query. Ask it to rate relevance on your grading scale.

3. Calibrate by comparing VLM judgments against a small set of human annotations (100-200 pairs). Compute Cohen's kappa to measure agreement.
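The calibration step can be sketched without external dependencies. This computes unweighted Cohen's kappa over two parallel lists of grades; the sample labels are invented for illustration. Because the grades are ordinal, a weighted kappa may be a better fit in practice, which is not shown here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two annotators' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of pairs given the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

human = [3, 2, 0, 1, 3, 0]  # hypothetical human grades
vlm = [3, 2, 0, 0, 3, 1]    # hypothetical VLM grades for the same pairs
print(round(cohens_kappa(human, vlm), 3))  # 0.538
```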

Strengths: Scales to millions of pairs. Can handle complex queries. Recent VLMs (GPT-4o, Gemini, Claude) can reach a kappa of 0.7 or higher against human annotators on visual relevance tasks, though agreement varies by domain and query type.

Weaknesses: VLMs inherit training biases. They may over-index on text features and under-weight visual composition. Always validate against human judgments for your specific domain.

Strategy 2: Synthetic Query Generation

Instead of annotating existing queries, generate queries from your corpus:

1. Pick a random item (video, image, document).

2. Use a VLM or feature extraction pipeline to describe the item.

3. Generate a query that this item would be relevant to.

4. The (query, item) pair is now a positive example. Other items are negatives (with varying hardness).

This gives you automatic ground truth, but with a bias toward queries your system can already answer. To counter this, also generate "hard negative" queries -- modify the generated query slightly so the item becomes irrelevant.
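The generation loop can be sketched with the model calls abstracted away. Here `describe` and `write_query` are hypothetical callables standing in for your VLM or feature extraction pipeline; the lambdas in the example are trivial stand-ins so the sketch runs on its own.

```python
import random

def build_synthetic_pairs(corpus, describe, write_query, n, seed=0):
    """Sample items and derive one (query, positive item) pair from each.

    corpus: dict mapping item ID to item content
    describe: callable(item) -> text description (e.g. a VLM caption)
    write_query: callable(description) -> query the item should answer
    """
    rng = random.Random(seed)  # fixed seed keeps the evaluation set reproducible
    sampled_ids = rng.sample(sorted(corpus), k=min(n, len(corpus)))
    pairs = []
    for item_id in sampled_ids:
        description = describe(corpus[item_id])
        pairs.append({"query": write_query(description), "positive_id": item_id})
    return pairs

corpus = {
    "v1": "dog catching frisbee in a park",
    "v2": "red sports car on highway",
}
pairs = build_synthetic_pairs(
    corpus,
    describe=lambda item: item,              # stand-in for a VLM description
    write_query=lambda d: "find: " + d,      # stand-in for an LLM query writer
    n=2,
)
print(pairs)
```

Hard-negative queries would follow the same shape, with a third callable that perturbs the generated query so the sampled item no longer satisfies it.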

Strategy 3: Cross-Modal Agreement Scoring

When your pipeline extracts features across multiple modalities (visual embeddings, transcript text, detected objects), you can use agreement across modalities as a weak supervision signal:

1. For a given query, retrieve results using each modality independently.

2. Items that appear in the top-k of multiple modality-specific retrievals are likely relevant (high agreement).

3. Items that appear in only one modality's top-k are candidates for partial relevance.

4. Items that appear in no top-k are likely irrelevant.

This doesn't produce perfect ground truth, but it produces useful training data for learning-to-rank models and calibration signals for offline evaluation.
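The agreement rule can be sketched as a vote count over per-modality top-k lists. The modality names, item IDs, and label strings below are illustrative assumptions.

```python
from collections import Counter

def agreement_labels(per_modality_rankings, k):
    """Weak labels from how many modality-specific top-k lists contain an item.

    Items absent from the returned dict appeared in no top-k and are
    treated as likely irrelevant.
    """
    votes = Counter()
    for ranking in per_modality_rankings.values():
        votes.update(set(ranking[:k]))  # one vote per modality per item
    return {
        item: "likely_relevant" if count >= 2 else "partial_candidate"
        for item, count in votes.items()
    }

rankings = {
    "visual": ["a", "b", "c"],
    "transcript": ["b", "d", "a"],
    "objects": ["b", "e", "f"],
}
labels = agreement_labels(rankings, k=2)
print(labels["b"], labels["a"])  # likely_relevant partial_candidate
```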

Strategy 4: Clustering-Based Evaluation

If your corpus has been clustered or taxonomized (by topic, visual similarity, or detected entities), clusters provide a natural evaluation structure:

1. A query that maps to a specific cluster should retrieve items from that cluster.

2. Precision = fraction of top-k results from the correct cluster.

3. Recall = fraction of cluster items in top-k.

This is particularly effective for exploratory search evaluation, where queries like "find all product photos with red backgrounds" map cleanly to visual clusters.
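Cluster-based precision and recall can be computed as in the sketch below. The cluster names and item IDs are invented, and the mapping from a query to its target cluster is assumed to be known in advance.

```python
def cluster_precision_recall(ranking, item_cluster, target_cluster, k):
    """Precision and recall of the top-k results against one cluster.

    ranking: item IDs in ranked order
    item_cluster: dict mapping item ID to cluster label
    """
    top_k = ranking[:k]
    hits = sum(1 for item in top_k if item_cluster.get(item) == target_cluster)
    cluster_size = sum(1 for c in item_cluster.values() if c == target_cluster)
    # Divide by the number of results actually returned (may be fewer than k)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / cluster_size if cluster_size else 0.0
    return precision, recall

item_cluster = {"p1": "red_bg", "p2": "red_bg", "p3": "blue_bg", "p4": "red_bg"}
ranking = ["p1", "p3", "p2", "p4"]
precision, recall = cluster_precision_recall(ranking, item_cluster, "red_bg", k=2)
print(precision, round(recall, 3))  # 0.5 0.333
```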

Benchmark Suites Worth Knowing

Several benchmark suites have emerged for multimodal retrieval evaluation. Each tests different capabilities:

MMEB (Massive Multimodal Embedding Benchmark)

MMEB evaluates multimodal embedding models across 36 datasets spanning 4 meta-categories: classification, retrieval, visual question answering, and visual grounding. It tests whether a single embedding model can handle diverse tasks. Use MMEB when: you are selecting or comparing embedding models for a general-purpose multimodal pipeline. Version to use: MMEB-V2, which extends the original image-focused suite with video and visual-document tasks.

ViDoRe (Visual Document Retrieval Benchmark)

ViDoRe evaluates retrieval over document images -- PDFs rendered as images, scanned pages, infographics. It tests whether a model can find relevant documents without OCR. Use ViDoRe when: your corpus is primarily documents (invoices, contracts, reports) and you want to compare visual document retrieval models like ColPali, ColQwen, or ColMate.

BEIR (Benchmarking IR)

BEIR is a text-only benchmark with 18 diverse retrieval datasets. While not multimodal, it remains the standard for evaluating text embedding and reranking models. Use BEIR when: evaluating the text retrieval component of your multimodal pipeline, especially for comparing rerankers.

MRAG-Bench

MRAG-Bench specifically tests multimodal retrieval-augmented generation -- whether retrieved multimodal context actually improves LLM answer quality. Use MRAG-Bench when: your end goal is RAG (not just retrieval) and you want to measure whether better retrieval translates to better answers.

Domain-Specific Benchmarks

VTAB-1k -- Visual task adaptation for image classification and retrieval

MSR-VTT -- Video-text retrieval (10K video clips with captions)

AudioCaps -- Audio-text retrieval (50K audio clips with descriptions)

DocVQA -- Document visual question answering

Practical advice: Start with one general benchmark (MMEB for embeddings, BEIR for text) and one domain-specific benchmark that matches your corpus type. Do not try to optimize for all benchmarks simultaneously -- they test different capabilities and may have conflicting optimal configurations.

Evaluating Multi-Stage Retrieval Pipelines

Modern retrieval pipelines have multiple stages: a first-stage retriever (embedding search), optional filters, and a reranker. Evaluating only the final output hides where quality is lost.

Stage-Level Attribution

For each stage, measure:

1. Retriever recall@100: What fraction of relevant items does the first stage surface in the top 100? If recall@100 is low, the reranker cannot fix it -- relevant items were never in the candidate set.

2. Reranker nDCG@10 on retrieved set: Given the retriever's candidates, how well does the reranker order them? This isolates reranker quality from retriever quality.

3. End-to-end nDCG@10: The final metric that users experience.

The diagnostic pattern:

Retriever recall@100	Reranker nDCG@10	Diagnosis
High	High	System works well
High	Low	Reranker is the bottleneck -- try a stronger cross-encoder
Low	High	Retriever is the bottleneck -- improve embeddings, increase k, or add query expansion
Low	Low	Both stages need work -- start with the retriever
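The diagnostic pattern can be encoded as a small lookup. What counts as "high" depends on your corpus and product, so the two floor values below are placeholder assumptions to tune, not recommended thresholds.

```python
def diagnose(retriever_recall_at_100, reranker_ndcg_at_10,
             recall_floor=0.9, ndcg_floor=0.6):
    """Map the two stage-level metrics onto the four diagnostic cases.

    recall_floor and ndcg_floor are hypothetical cutoffs for "high".
    """
    recall_ok = retriever_recall_at_100 >= recall_floor
    ndcg_ok = reranker_ndcg_at_10 >= ndcg_floor
    if recall_ok and ndcg_ok:
        return "system works well"
    if recall_ok:
        return "reranker is the bottleneck"
    if ndcg_ok:
        return "retriever is the bottleneck"
    return "both stages need work; start with the retriever"

print(diagnose(0.95, 0.41))  # reranker is the bottleneck
print(diagnose(0.62, 0.78))  # retriever is the bottleneck
```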

Cross-Modal Stage Attribution

In multimodal pipelines, you may have separate retrieval paths for different modalities (text search, visual search, audio search) that are fused before reranking. Evaluate each path independently:

1. Run text-only retrieval, measure recall.

2. Run visual-only retrieval, measure recall.

3. Run the fused retrieval, measure recall.

If fusion recall exceeds each individual recall, the modalities are complementary. If fusion recall is lower than the best individual recall, your fusion strategy is actively harmful -- likely due to score normalization issues or modality weighting imbalance.
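The comparison can be sketched as follows, assuming binary relevance and per-path rankings for the same query. The path names and item IDs are illustrative, and the fused ranking is taken as given rather than produced by any particular fusion method.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)

def fusion_verdict(path_recalls, fused_recall):
    """Compare fused recall against the best single-modality path."""
    best = max(path_recalls.values())
    if fused_recall > best:
        return "complementary"
    if fused_recall < best:
        return "harmful"
    return "no gain"

relevant = {"a", "b", "c", "d"}
paths = {
    "text": ["a", "x", "b"],
    "visual": ["c", "y", "a"],
}
fused = ["a", "c", "b"]

path_recalls = {name: recall_at_k(r, relevant, k=3) for name, r in paths.items()}
fused_recall = recall_at_k(fused, relevant, k=3)
print(path_recalls, fused_recall)                  # {'text': 0.5, 'visual': 0.5} 0.75
print(fusion_verdict(path_recalls, fused_recall))  # complementary
```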

Latency-Quality Tradeoff

Retrieval quality and latency trade off against each other: larger candidate sets improve recall but increase reranker latency. Track:

p50 and p99 latency at each stage

Quality at each latency budget: If you constrain total latency to 200ms, what is the best achievable nDCG@10?

Marginal quality per millisecond: Adding a reranker stage that costs 50ms and improves nDCG by 0.02 may not be worth it for your use case.
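These three quantities can be computed from logged latencies and offline scores. In this sketch the configuration names and numbers are invented, and the latency budget is applied to p99; applying it to p50 or to the mean is an equally valid assumption depending on your service-level goals.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p99) from raw latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]  # 50th and 99th of the 99 cut points

def best_under_budget(configs, budget_ms):
    """Highest-nDCG configuration whose p99 latency fits the budget."""
    eligible = [c for c in configs if c["p99_ms"] <= budget_ms]
    return max(eligible, key=lambda c: c["ndcg_at_10"]) if eligible else None

def gain_per_ms(ndcg_gain, added_latency_ms):
    """Marginal quality bought per millisecond of added latency."""
    return ndcg_gain / added_latency_ms

configs = [
    {"name": "retriever_only", "p99_ms": 80, "ndcg_at_10": 0.61},
    {"name": "small_reranker", "p99_ms": 150, "ndcg_at_10": 0.66},
    {"name": "large_reranker", "p99_ms": 320, "ndcg_at_10": 0.70},
]
print(best_under_budget(configs, budget_ms=200)["name"])  # small_reranker
print(gain_per_ms(0.02, 50))                              # 0.0004
```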

Online Evaluation: Measuring What Users Actually Experience

Offline metrics tell you how well your system performs on a static evaluation set. Online metrics tell you how well it performs in production.

Implicit Feedback Signals

When users interact with search results, they generate implicit relevance signals:

Click-through rate (CTR): The fraction of displayed results that users click. CTR that is highest at the top positions and falls off with rank indicates good ranking. CTR that is uniform across ranks suggests the ranking is no better than random.

Dwell time: How long a user spends on a clicked result. Longer dwell time suggests the result was relevant. Short dwell time (< 5 seconds) followed by a return to results suggests a "bounce" -- the result looked relevant but was not.

Scroll depth: How far down the results page a user scrolls before clicking (or abandoning). Deep scrolling suggests the top results did not satisfy the query.

Reformulation rate: How often users modify their query after seeing results. High reformulation suggests the initial results were off-target.
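Two of these signals can be derived from a simple impression log, as in the sketch below. The record fields (`rank`, `clicked`) and the 5-second bounce threshold mirror the descriptions above but are otherwise assumptions about your logging schema.

```python
from collections import defaultdict

def ctr_by_rank(impressions):
    """Click-through rate per result position.

    impressions: iterable of dicts with 'rank' (1-based) and 'clicked' (bool)
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for imp in impressions:
        shown[imp["rank"]] += 1
        if imp["clicked"]:
            clicked[imp["rank"]] += 1
    return {rank: clicked[rank] / shown[rank] for rank in sorted(shown)}

def bounce_rate(dwell_seconds, threshold=5.0):
    """Fraction of clicks whose dwell time fell below the threshold."""
    if not dwell_seconds:
        return 0.0
    return sum(1 for d in dwell_seconds if d < threshold) / len(dwell_seconds)

impressions = [
    {"rank": 1, "clicked": True},
    {"rank": 1, "clicked": False},
    {"rank": 2, "clicked": False},
    {"rank": 2, "clicked": False},
]
print(ctr_by_rank(impressions))              # {1: 0.5, 2: 0.0}
print(bounce_rate([2.0, 30.0, 4.0, 12.0]))   # 0.5
```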

Agent Task Success Rate

For AI agent workloads, the ultimate evaluation metric is task completion. An agent that searches your corpus and synthesizes an answer is only as good as the retrieval that supports it. Measure:

Answer correctness: Does the agent's answer match the ground truth? (Requires labeled question-answer pairs.)

Retrieval attribution: When the answer is wrong, was the relevant information in the retrieved context? If yes, the generation failed. If no, the retrieval failed.

Tool call efficiency: How many retrieval calls does the agent make before it has enough context? Fewer calls for the same answer quality indicates better retrieval.
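Retrieval attribution reduces to a set check once each labeled question carries the IDs of the items that contain its answer. The sketch below assumes such gold evidence IDs exist; the function name and label strings are illustrative.

```python
def attribute_failure(answer_correct, retrieved_ids, gold_evidence_ids):
    """Classify one agent run as success, generation failure, or retrieval failure."""
    if answer_correct:
        return "success"
    # Wrong answer: was the needed evidence in the retrieved context?
    if set(retrieved_ids) & set(gold_evidence_ids):
        return "generation_failure"  # evidence was there, the answer was still wrong
    return "retrieval_failure"       # evidence never reached the agent

print(attribute_failure(False, ["v3", "v7"], ["v7"]))  # generation_failure
print(attribute_failure(False, ["v3", "v5"], ["v7"]))  # retrieval_failure
```

This treats any overlap with the gold evidence as sufficient; a stricter variant could require all gold items to be present.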

A/B Testing Retrieval Changes

When you change your retrieval pipeline (new embeddings, different reranker, updated chunking), run an A/B test:

1. Route a fraction of traffic to the new pipeline.

2. Compare online metrics (CTR, dwell time, task success) between control and treatment.

3. Also compare offline metrics (nDCG on a held-out evaluation set) to correlate online and offline signals.
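When comparing a rate metric such as CTR between control and treatment, it helps to check that the difference is larger than sampling noise. The text above does not prescribe a test; the sketch below uses a standard two-proportion z-test as one reasonable choice, with invented traffic numbers.

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two click-through rates.

    Returns (z, p_value); positive z means arm B has the higher rate.
    """
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, nothing to test
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Control: 100 clicks in 1000 searches; treatment: 150 clicks in 1000 searches
z, p = two_proportion_z(100, 1000, 150, 1000)
print(round(z, 2), p < 0.01)  # 3.38 True
```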

Critical pitfall: Online metrics are incomplete. A user might click a bad result, read it, and leave dissatisfied without producing any measurable signal. Always pair online metrics with periodic offline evaluation against updated ground truth.

Building an Evaluation Harness

Bringing it all together, here is a practical workflow for evaluating a multimodal retrieval pipeline:

Step 1: Define Your Evaluation Set

Start with 50-100 queries that represent your actual user workload. Include:

Unambiguous queries with clear relevant items (e.g., "red sports car on highway")

Cross-modal queries that require multiple modalities to answer (e.g., "interview where CEO discusses layoffs" -- needs transcript + face detection)

Hard negatives -- queries where similar-but-irrelevant items exist (e.g., "dog catching frisbee" in a corpus that also contains "dog catching ball")

Step 2: Generate Ground Truth

Use a combination of strategies:

Manual annotation for 50-100 queries (gold standard)

LLM-as-judge for 500-1000 queries (silver standard)

Synthetic query generation for continuous regression testing

Step 3: Compute Offline Metrics

For each pipeline configuration, compute:

Recall@10, Recall@100 (retriever quality)

nDCG@10 (ranking quality)

MRR (first-hit quality)

Stage-level attribution (where is quality lost?)

Step 4: Instrument Online Metrics

Log every search interaction with:

Query text and modality

Returned results (IDs and ranks)

User actions (clicks, dwell time, reformulations)

For agent workloads: task success and retrieval attribution
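A log record covering these fields might be shaped like the sketch below. The class name, field names, and the JSON-lines serialization are assumptions for illustration; result ranks are implied by the order of `result_ids`.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class SearchLogRecord:
    query_text: str
    query_modality: str                 # e.g. "text", "image", "audio"
    result_ids: List[str]               # in ranked order, position 0 is rank 1
    clicked_ranks: List[int] = field(default_factory=list)
    dwell_seconds: List[float] = field(default_factory=list)
    reformulated: bool = False
    # Agent workloads only; left as None for human search sessions
    task_success: Optional[bool] = None
    retrieval_attribution: Optional[str] = None

def to_json_line(record):
    """Serialize one record as a single JSON line for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)

record = SearchLogRecord(
    query_text="red sports car on highway",
    query_modality="text",
    result_ids=["v12", "v7", "v31"],
    clicked_ranks=[2],
    dwell_seconds=[41.5],
)
print(to_json_line(record))
```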

Step 5: Close the Loop

Use online signals to update ground truth:

Highly-clicked results are likely relevant -- add them to ground truth.

Consistently-skipped results at high ranks may be false positives -- review and downgrade.

New queries from production expand your evaluation set.
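The first two update rules can be sketched as a function that proposes changes for human review rather than applying them automatically. All thresholds and the shape of `stats` are assumptions; click data carries position bias, so the proposals are candidates, not verdicts.

```python
def propose_updates(stats, min_impressions=50, promote_ctr=0.5,
                    demote_ctr=0.02, high_rank=3):
    """Suggest ground-truth changes from aggregated click statistics.

    stats: dict mapping (query, item_id) to a dict with
           'impressions', 'clicks', and 'mean_rank'
    Returns (promote, review): pairs to add as relevant, and pairs shown
    at high ranks but consistently skipped.
    """
    promote, review = [], []
    for key, s in stats.items():
        if s["impressions"] < min_impressions:
            continue  # too little evidence either way
        ctr = s["clicks"] / s["impressions"]
        if ctr >= promote_ctr:
            promote.append(key)
        elif ctr <= demote_ctr and s["mean_rank"] <= high_rank:
            review.append(key)
    return promote, review

query = "dog catching frisbee"
stats = {
    (query, "v7"): {"impressions": 200, "clicks": 130, "mean_rank": 2.1},
    (query, "v9"): {"impressions": 180, "clicks": 2, "mean_rank": 1.4},
    (query, "v3"): {"impressions": 10, "clicks": 9, "mean_rank": 5.0},
}
promote, review = propose_updates(stats)
print(promote)  # [('dog catching frisbee', 'v7')]
print(review)   # [('dog catching frisbee', 'v9')]
```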

This creates a flywheel: better evaluation drives better pipeline decisions, which drive better user experience, which generates better evaluation signals.