Evaluations run your retriever against a curated set of queries with known-relevant documents, then compute standard information retrieval metrics at multiple cutoff points. Use them to quantify retriever quality, compare configurations, and catch regressions before they reach production.

Quickstart

1. Create a ground truth dataset

Each query pairs an input with the document IDs that should be returned.
curl -X POST "$MP_API_URL/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "product-search-golden",
    "queries": [
      {
        "query_id": "q1",
        "query_input": {"query": "wireless earbuds"},
        "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"],
        "relevance_scores": {"doc_a1": 5, "doc_a2": 3, "doc_a3": 2}
      },
      {
        "query_id": "q2",
        "query_input": {"query": "noise canceling headphones"},
        "relevant_documents": ["doc_b1", "doc_b4", "doc_b7"]
      }
    ]
  }'
2. Run the evaluation

Execute your retriever against every query in the dataset. The evaluation runs asynchronously and returns a task_id for progress tracking.
curl -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "product-search-golden",
    "evaluation_config": {
      "k_values": [1, 5, 10, 20],
      "metrics": ["precision", "recall", "f1", "f2", "ndcg", "map", "mrr"]
    }
  }'
3. Get results

Poll the evaluation until status is completed.
curl "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations/{evaluation_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
Example response
{
  "evaluation_id": "eval_abc123",
  "status": "completed",
  "query_count": 50,
  "overall_metrics": {
    "precision_at_5": 0.85,
    "recall_at_5": 0.72,
    "f1_at_5": 0.78,
    "f2_at_5": 0.74,
    "ndcg_at_5": 0.81,
    "map": 0.71,
    "mrr": 0.93
  },
  "metrics_by_k": {
    "1":  {"precision": 0.90, "recall": 0.18, "f1": 0.30, "f2": 0.21, "ndcg": 0.90},
    "5":  {"precision": 0.85, "recall": 0.72, "f1": 0.78, "f2": 0.74, "ndcg": 0.81},
    "10": {"precision": 0.75, "recall": 0.88, "f1": 0.81, "f2": 0.85, "ndcg": 0.89},
    "20": {"precision": 0.62, "recall": 0.95, "f1": 0.75, "f2": 0.86, "ndcg": 0.91}
  }
}
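
The polling step can be sketched in Python as below. This is a minimal sketch, not an official client: `poll_until_done` and `make_fetch` are hypothetical helper names, the `failed` status is an assumption (only `completed` is documented here), and the interval and timeout defaults are arbitrary.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def make_fetch(api_url: str, api_key: str, namespace: str,
               retriever_id: str, evaluation_id: str) -> Callable[[], Dict[str, Any]]:
    """Build a zero-argument function that GETs the evaluation resource."""
    url = f"{api_url}/v1/retrievers/{retriever_id}/evaluations/{evaluation_id}"
    headers = {"Authorization": f"Bearer {api_key}", "X-Namespace": namespace}

    def fetch() -> Dict[str, Any]:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    return fetch


def poll_until_done(fetch: Callable[[], Dict[str, Any]],
                    interval_s: float = 2.0,
                    timeout_s: float = 300.0,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call fetch() repeatedly until status is 'completed' or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumed terminal status, not confirmed by the docs
            raise RuntimeError(f"evaluation failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

Passing `fetch` as a callable keeps the loop independent of the HTTP client, so it can be exercised with a stub before pointing it at the real endpoint.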

Metrics

Metrics with an @K suffix are computed at each K value you specify; MAP and MRR are reported once per evaluation. The defaults cover most use cases:
| Metric | What It Measures | Formula |
| --- | --- | --- |
| Precision@K | Accuracy of top results | relevant in top K ÷ K |
| Recall@K | Coverage of relevant documents | relevant in top K ÷ total relevant |
| F1@K | Balanced precision/recall | harmonic mean of P and R |
| F2@K | Recall-weighted balance | weighted harmonic mean (β=2); weights missed docs 4× as heavily as false positives |
| MAP | Ranking quality across all queries | mean over queries of the average precision at each relevant doc's position |
| MRR | How quickly users find a relevant result | mean over queries of 1 ÷ rank of first relevant document |
| NDCG@K | Ranking quality with graded relevance | normalized discounted cumulative gain |
F2 vs F1: Use F2 when missing a relevant document is worse than showing an irrelevant one — the common case in search, recommendations, and discovery. F1 treats both errors equally.
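
To make the formulas above concrete, here is a small per-query sketch using binary relevance. It follows the textbook definitions; the platform's exact conventions (for example, how it handles result lists shorter than K) are not documented here, so treat those details as assumptions. The function names and the sample ranking are illustrative only.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Relevant docs in the top k, divided by k (assumes k > 0)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Relevant docs in the top k, divided by the total number of relevant docs."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta=2 favors recall."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Illustrative query: 2 of the 3 relevant docs are retrieved, at ranks 1 and 3.
ranked = ["doc_a1", "doc_x", "doc_a3", "doc_y", "doc_z"]
relevant = {"doc_a1", "doc_a2", "doc_a3"}
p, r = precision_at_k(ranked, relevant, 5), recall_at_k(ranked, relevant, 5)
print(p, r, f_beta(p, r, 1.0), f_beta(p, r, 2.0))
```

MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all queries in the dataset. In this example F2 comes out higher than F1 because recall (2/3) exceeds precision (2/5).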

Reading Your Scores

| Score | What It Tells You |
| --- | --- |
| NDCG@10 = 0.89 | Your top-10 ranking achieves 89% of the score an ideal ordering would get. Relevant docs appear near the top. |
| Precision@5 = 0.85 | On average, more than 4 of the top 5 results are relevant. Users see high-quality results. |
| Recall@20 = 0.95 | You surface 95% of all relevant documents within the top 20. Strong coverage. |
| F2@10 = 0.85 | Recall-weighted balance is strong: few relevant documents are being missed. |
| MRR = 0.93 | The first relevant result typically appears at position 1 or 2. |
| MAP = 0.71 | Overall ranking quality is solid, but there's room to improve ordering. |

Graded Relevance

When you provide relevance_scores in your dataset, NDCG uses graded relevance instead of binary. This distinguishes “exactly right” from “somewhat relevant”:
{
  "query_id": "q1",
  "query_input": {"query": "wireless earbuds"},
  "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"],
  "relevance_scores": {
    "doc_a1": 5,
    "doc_a2": 3,
    "doc_a3": 1
  }
}
| Score | Meaning |
| --- | --- |
| 5 | Perfect match |
| 3–4 | Highly relevant |
| 1–2 | Marginally relevant |
| 0 | Not relevant |
Without relevance_scores, all metrics use binary relevance (relevant or not).
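
The effect of graded scores on NDCG can be illustrated with a short sketch. It assumes linear gain (the raw relevance score) and a log2 position discount; some implementations use an exponential gain of 2^score − 1 instead, and which variant this platform uses is not stated here. The function names are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: List[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc, 0.0) for doc in ranked]  # unlisted docs score 0
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


graded = {"doc_a1": 5, "doc_a2": 3, "doc_a3": 1}
binary = {doc: 1 for doc in graded}  # what you get without relevance_scores

# The best document is ranked second and an irrelevant one third.
ranked = ["doc_a2", "doc_a1", "doc_x", "doc_a3"]
print(ndcg_at_k(ranked, graded, 3))  # penalized for placing doc_a1 below doc_a2
print(ndcg_at_k(ranked, binary, 3))  # binary relevance cannot see that swap
```

With binary relevance, swapping `doc_a1` and `doc_a2` leaves the score unchanged; with graded scores, the ranking that puts the score-5 document first scores higher.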

Comparing Retrievers

Run the same dataset against different retriever configurations to find the best pipeline:
# Evaluate baseline (vector search only)
curl -X POST "$MP_API_URL/v1/retrievers/ret_baseline/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"dataset_name": "product-search-golden"}'

# Evaluate candidate (vector search + reranker)
curl -X POST "$MP_API_URL/v1/retrievers/ret_reranked/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"dataset_name": "product-search-golden"}'
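Compare the results of the two evaluations side by side. Deltas are relative to the baseline:

| Metric | Baseline | + Reranker | Delta |
| --- | --- | --- | --- |
| NDCG@10 | 0.78 | 0.89 | +14% |
| Precision@5 | 0.72 | 0.85 | +18% |
| F2@10 | 0.76 | 0.85 | +12% |
| MRR | 0.81 | 0.93 | +15% |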
Run the same dataset after every pipeline change — adding stages, swapping models, adjusting fusion weights — to quantify the impact before deploying.
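
The deltas in the comparison table are consistent with relative change, (candidate − baseline) ÷ baseline. A minimal sketch of that arithmetic follows; the flat metric names such as `ndcg@10` are hypothetical labels chosen for this example, not keys returned by the API.

```python
from typing import Dict


def relative_deltas(baseline: Dict[str, float],
                    candidate: Dict[str, float]) -> Dict[str, float]:
    """Relative change per metric, for metrics present in both runs."""
    deltas = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        base = baseline[name]
        if base == 0:
            continue  # relative change is undefined for a zero baseline
        deltas[name] = (candidate[name] - base) / base
    return deltas


baseline = {"ndcg@10": 0.78, "precision@5": 0.72, "f2@10": 0.76, "mrr": 0.81}
reranked = {"ndcg@10": 0.89, "precision@5": 0.85, "f2@10": 0.85, "mrr": 0.93}

for name, delta in relative_deltas(baseline, reranked).items():
    print(f"{name}: {delta:+.0%}")
```

Reporting relative rather than absolute change makes metrics with different baselines easier to compare, but a small absolute gain on a low baseline can look large, so it is worth keeping both numbers in view.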

Ground Truth Datasets

Dataset Requirements

  • At least 1 query (aim for 50+ for statistically meaningful results)
  • Each query must have at least 1 relevant document
  • query_input must match your retriever’s input schema
  • relevance_scores, if provided, must cover all relevant_documents
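
These requirements can be checked locally before uploading. The sketch below covers the three structural rules; whether `query_input` matches your retriever's input schema depends on the retriever and is not checked. `validate_dataset` is a hypothetical helper, and the API may enforce additional rules not listed here.

```python
from typing import Any, Dict, List


def validate_dataset(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    errors: List[str] = []
    queries = payload.get("queries") or []
    if not queries:
        errors.append("dataset must contain at least 1 query")
    for index, query in enumerate(queries):
        qid = query.get("query_id", f"#{index}")
        relevant = query.get("relevant_documents") or []
        if not relevant:
            errors.append(f"{qid}: needs at least 1 relevant document")
        scores = query.get("relevance_scores")
        if scores is not None:
            # Every relevant document must have a score when scores are given.
            missing = [doc for doc in relevant if doc not in scores]
            if missing:
                errors.append(f"{qid}: relevance_scores missing {missing}")
    return errors


payload = {
    "dataset_name": "product-search-golden",
    "queries": [
        {"query_id": "q1", "query_input": {"query": "wireless earbuds"},
         "relevant_documents": ["doc_a1", "doc_a2"],
         "relevance_scores": {"doc_a1": 5}},  # doc_a2 has no score
    ],
}
for problem in validate_dataset(payload):
    print(problem)
```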

Managing Datasets

# List all datasets
curl "$MP_API_URL/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

# Get a specific dataset
curl "$MP_API_URL/v1/retrievers/evaluations/datasets/product-search-golden" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

Building Good Datasets

Include query diversity

Cover head queries (popular), torso (moderate), and tail (rare/specific). Don’t just test easy cases.

Use graded relevance

Binary relevant/not-relevant misses nuance. Score documents 0–5 so NDCG can distinguish good rankings from great ones.

Match real traffic

Sample queries from production logs. Synthetic queries test what you think users ask, not what they actually ask.

Version your datasets

Keep datasets stable across evaluations so you can track metric trends over time. Create new versions for schema changes.

Configuration Reference

dataset_name (string, required)
Name of the ground truth dataset to evaluate against.

evaluation_config.k_values (integer[], default: [1, 5, 10, 20])
Cutoff positions for @K metrics. Include the K values that match your UI: if you show 10 results per page, include 10.

evaluation_config.metrics (string[])
Metrics to compute. Available: precision, recall, f1, f2, map, ndcg, mrr.