Quickstart
Create a ground truth dataset
Each query pairs an input with the document IDs that should be returned.
Run the evaluation
Execute your retriever against every query in the dataset. The evaluation runs asynchronously and returns a `task_id` for progress tracking.
Metrics
Every metric is computed at each K value you specify. The defaults cover most use cases:
| Metric | What It Measures | Formula |
|---|---|---|
| Precision@K | Accuracy of top results | relevant in top K ÷ K |
| Recall@K | Coverage of relevant documents | relevant in top K ÷ total relevant |
| F1@K | Balanced precision/recall | harmonic mean of P and R |
| F2@K | Recall-weighted balance | weighted harmonic mean (β=2), penalizes missed docs 4× more than false positives |
| MAP | Ranking quality across all queries | mean over queries of the average precision at each relevant doc’s position |
| MRR | How quickly users find a relevant result | mean over queries of 1 ÷ rank of first relevant document |
| NDCG@K | Ranking quality with graded relevance | normalized discounted cumulative gain |
F2 vs F1: Use F2 when missing a relevant document is worse than showing an irrelevant one — the common case in search, recommendations, and discovery. F1 treats both errors equally.
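The formulas in the table can be sketched in a few lines of Python. This is an illustrative reference implementation for binary relevance, not the evaluation service's own code; the function names are hypothetical, and average precision is normalized by the total number of relevant documents, which is one common convention and an assumption here.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top K, divided by K."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top K, divided by all relevant documents."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean: beta=1 gives F1, beta=2 gives F2."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each relevant document's position."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Made-up example: one query with three relevant documents, two of them retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p = precision_at_k(ranked, relevant, 5)
r = recall_at_k(ranked, relevant, 5)
print(round(p, 3), round(r, 3), round(f_beta(p, r, 1), 3), round(f_beta(p, r, 2), 3))
```

For this sample ranking, Precision@5 is 0.4 and Recall@5 is about 0.667, so F1 is 0.5 while F2 is about 0.588: because recall is the higher of the two, the recall-weighted score lands closer to it. MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries in the dataset.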
Reading Your Scores
| Score | What It Tells You |
|---|---|
| NDCG@10 = 0.89 | Your top-10 ranking captures 89% of the ideal ordering. Relevant docs appear near the top. |
| Precision@5 = 0.85 | On average, a little over 4 of every 5 results are relevant. Users see high-quality results. |
| Recall@20 = 0.95 | You surface 95% of all relevant documents within the top 20. Strong coverage. |
| F2@10 = 0.85 | Recall-weighted balance is strong — few relevant documents are being missed. |
| MRR = 0.93 | The first relevant result typically appears at position 1 or 2. |
| MAP = 0.71 | Overall ranking quality is solid but there’s room to improve ordering. |
Graded Relevance
When you provide `relevance_scores` in your dataset, NDCG uses graded relevance instead of binary. This distinguishes “exactly right” from “somewhat relevant”:
| Score | Meaning |
|---|---|
| 5 | Perfect match |
| 3–4 | Highly relevant |
| 1–2 | Marginally relevant |
| 0 | Not relevant |
Without `relevance_scores`, all metrics use binary relevance (relevant or not).
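To make the effect of graded scores concrete, here is a minimal NDCG@K sketch. It assumes a linear gain (the raw 0–5 score) and a log2 position discount; the service may use a different gain function such as 2^score − 1, so treat the exact numbers as illustrative. The function names and the dict shape of `relevance_scores` (document ID to score) are assumptions for this example.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked: Sequence[str], relevance_scores: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    # Documents without a score are treated as not relevant (gain 0).
    gains = [relevance_scores.get(doc_id, 0.0) for doc_id in ranked]
    ideal_gains = sorted(relevance_scores.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal


# Hypothetical graded judgments for one query.
scores = {"a": 5, "b": 3, "c": 1}
print(ndcg_at_k(["a", "b", "c"], scores, 3))  # ideal order
print(ndcg_at_k(["c", "b", "a"], scores, 3))  # same documents, worst order
```

Both rankings retrieve the same three documents, so binary precision and recall cannot tell them apart. With graded scores, the ideal order scores 1.0 and the reversed order scores lower, which is the distinction graded relevance is meant to capture.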
Comparing Retrievers
Run the same dataset against different retriever configurations to find the best pipeline, then compare `metrics_by_k` side by side:
| Metric | Baseline | + Reranker | Delta |
|---|---|---|---|
| NDCG@10 | 0.78 | 0.89 | +14% |
| Precision@5 | 0.72 | 0.85 | +18% |
| F2@10 | 0.76 | 0.85 | +12% |
| MRR | 0.81 | 0.93 | +15% |
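The Delta column is consistent with relative change, (candidate − baseline) ÷ baseline. A small sketch that reproduces it from the table values follows; the flat dictionaries are a simplification for illustration and are not the actual shape of the evaluation response.

```python
def relative_delta(baseline: float, candidate: float) -> float:
    """Relative change from baseline to candidate, in percent."""
    return (candidate - baseline) / baseline * 100


# Scores copied from the comparison table above.
baseline = {"NDCG@10": 0.78, "Precision@5": 0.72, "F2@10": 0.76, "MRR": 0.81}
reranked = {"NDCG@10": 0.89, "Precision@5": 0.85, "F2@10": 0.85, "MRR": 0.93}

for name in baseline:
    delta = relative_delta(baseline[name], reranked[name])
    print(f"{name}: {baseline[name]:.2f} -> {reranked[name]:.2f} ({delta:+.0f}%)")
```

Reporting relative rather than absolute change makes metrics with different baselines easier to compare, but it also inflates gains on metrics that start low, so it helps to keep the raw scores next to the percentages.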
Ground Truth Datasets
Dataset Requirements
- At least 1 query (aim for 50+ for statistically meaningful results)
- Each query must have at least 1 relevant document
- `query_input` must match your retriever’s input schema
- `relevance_scores`, if provided, must cover all `relevant_documents`
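A quick local check of these requirements before uploading can catch problems early. The sketch below assumes each query is a dict with `relevant_documents` as a list of document IDs and `relevance_scores` as an optional mapping from document ID to score; those shapes, and the `validate_dataset` helper itself, are assumptions for illustration. The `query_input` schema check is omitted because it depends on your retriever.

```python
from typing import Any, Dict, List


def validate_dataset(queries: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []
    if len(queries) < 1:
        problems.append("dataset must contain at least 1 query")
    for i, query in enumerate(queries):
        relevant = query.get("relevant_documents") or []
        if len(relevant) < 1:
            problems.append(f"query {i}: needs at least 1 relevant document")
        scores = query.get("relevance_scores")
        if scores is not None:
            # Every relevant document needs a score when scores are provided.
            missing = [doc_id for doc_id in relevant if doc_id not in scores]
            if missing:
                problems.append(f"query {i}: relevance_scores missing for {missing}")
    return problems


# Hypothetical dataset: the second query lacks a score for doc_9.
dataset = [
    {
        "query_input": {"text": "wireless headphones"},
        "relevant_documents": ["doc_1", "doc_2"],
        "relevance_scores": {"doc_1": 5, "doc_2": 3},
    },
    {
        "query_input": {"text": "usb-c hub"},
        "relevant_documents": ["doc_7", "doc_9"],
        "relevance_scores": {"doc_7": 4},
    },
]
for problem in validate_dataset(dataset):
    print(problem)
```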
Managing Datasets
Building Good Datasets
Include query diversity
Cover head queries (popular), torso (moderate), and tail (rare/specific). Don’t just test easy cases.
Use graded relevance
Binary relevant/not-relevant misses nuance. Score documents 0–5 so NDCG can distinguish good rankings from great ones.
Match real traffic
Sample queries from production logs. Synthetic queries test what you think users ask, not what they actually ask.
Version your datasets
Keep datasets stable across evaluations so you can track metric trends over time. Create new versions for schema changes.
Configuration Reference
Name of the ground truth dataset to evaluate against.
Cutoff positions for @K metrics. Include the K values that match your UI: if you show 10 results per page, include `10`.
Metrics to compute. Available: `precision`, `recall`, `f1`, `f2`, `map`, `ndcg`, `mrr`.
Related
- Retriever Benchmarks — replay live sessions against candidate retrievers
- Improve Relevance — interaction signals, fusion strategies, and the feedback loop
- API Reference: Run Evaluation — full endpoint specification
- API Reference: Create Dataset — dataset creation endpoint

