Evaluations

Evaluations run your retriever against a curated set of queries with known-relevant documents, then compute standard information retrieval metrics at multiple cutoff points. Use them to quantify retriever quality, compare configurations, and catch regressions before they reach production.

Quickstart

Create a ground truth dataset

Each query pairs an input with the document IDs that should be returned.

curl -X POST "$MP_API_URL/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "product-search-golden",
    "queries": [
      {
        "query_id": "q1",
        "query_input": {"query": "wireless earbuds"},
        "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"],
        "relevance_scores": {"doc_a1": 5, "doc_a2": 3, "doc_a3": 2}
      },
      {
        "query_id": "q2",
        "query_input": {"query": "noise canceling headphones"},
        "relevant_documents": ["doc_b1", "doc_b4", "doc_b7"]
      }
    ]
  }'

dataset = client.retrievers.evaluations.create_dataset(
    dataset_name="product-search-golden",
    queries=[
        {
            "query_id": "q1",
            "query_input": {"query": "wireless earbuds"},
            "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"],
            "relevance_scores": {"doc_a1": 5, "doc_a2": 3, "doc_a3": 2},
        },
        {
            "query_id": "q2",
            "query_input": {"query": "noise canceling headphones"},
            "relevant_documents": ["doc_b1", "doc_b4", "doc_b7"],
        },
    ],
)

const dataset = await client.retrievers.evaluations.createDataset({
  datasetName: "product-search-golden",
  queries: [
    {
      queryId: "q1",
      queryInput: { query: "wireless earbuds" },
      relevantDocuments: ["doc_a1", "doc_a2", "doc_a3"],
      relevanceScores: { doc_a1: 5, doc_a2: 3, doc_a3: 2 },
    },
    {
      queryId: "q2",
      queryInput: { query: "noise canceling headphones" },
      relevantDocuments: ["doc_b1", "doc_b4", "doc_b7"],
    },
  ],
});

Run the evaluation

Execute your retriever against every query in the dataset. The evaluation runs asynchronously: the response includes a task_id for progress tracking and an evaluation_id for fetching results.

curl -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "product-search-golden",
    "evaluation_config": {
      "k_values": [1, 5, 10, 20],
      "metrics": ["precision", "recall", "f1", "f2", "ndcg", "map", "mrr"]
    }
  }'

evaluation = client.retrievers.evaluations.run(
    retriever_id="ret_abc123",
    dataset_name="product-search-golden",
    evaluation_config={
        "k_values": [1, 5, 10, 20],
        "metrics": ["precision", "recall", "f1", "f2", "ndcg", "map", "mrr"],
    },
)
# evaluation.task_id, evaluation.evaluation_id

const evaluation = await client.retrievers.evaluations.run({
  retrieverId: "ret_abc123",
  datasetName: "product-search-golden",
  evaluationConfig: {
    kValues: [1, 5, 10, 20],
    metrics: ["precision", "recall", "f1", "f2", "ndcg", "map", "mrr"],
  },
});
// evaluation.taskId, evaluation.evaluationId

Get results

Poll the evaluation until status is completed.

curl "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations/{evaluation_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

Example response

{
  "evaluation_id": "eval_abc123",
  "status": "completed",
  "query_count": 50,
  "overall_metrics": {
    "precision_at_5": 0.85,
    "recall_at_5": 0.72,
    "f1_at_5": 0.78,
    "f2_at_5": 0.74,
    "ndcg_at_5": 0.81,
    "map": 0.71,
    "mrr": 0.93
  },
  "metrics_by_k": {
    "1":  {"precision": 0.90, "recall": 0.18, "f1": 0.30, "f2": 0.21, "ndcg": 0.90},
    "5":  {"precision": 0.85, "recall": 0.72, "f1": 0.78, "f2": 0.74, "ndcg": 0.81},
    "10": {"precision": 0.75, "recall": 0.88, "f1": 0.81, "f2": 0.85, "ndcg": 0.89},
    "20": {"precision": 0.62, "recall": 0.95, "f1": 0.75, "f2": 0.86, "ndcg": 0.91}
  }
}
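The polling step can be sketched in Python as a small loop. Here `fetch_evaluation` is a hypothetical callable standing in for whatever performs the GET request above and returns the parsed JSON body. The `completed` status comes from the example response; the `failed` status, the interval, and the timeout are assumptions.

```python
import time
from typing import Callable, Dict


def wait_for_evaluation(
    fetch_evaluation: Callable[[], Dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> Dict:
    """Call fetch_evaluation() until the evaluation reaches a terminal status.

    fetch_evaluation is a hypothetical callable that returns the parsed JSON
    body of GET /v1/retrievers/{retriever_id}/evaluations/{evaluation_id}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_evaluation()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumed failure status, not confirmed by the docs
            raise RuntimeError(f"evaluation failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation still {status!r} after {timeout_s}s")
        time.sleep(interval_s)
```

Passing the fetch step in as a callable keeps the loop independent of any particular HTTP client or SDK method.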

Metrics

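Each @K metric (Precision, Recall, F1, F2, NDCG) is computed at every K value you specify. MAP and MRR have no cutoff and are reported once per evaluation. The defaults cover most use cases: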

Metric	What It Measures	Formula
Precision@K	Accuracy of top results	relevant in top K ÷ K
Recall@K	Coverage of relevant documents	relevant in top K ÷ total relevant
F1@K	Balanced precision/recall	harmonic mean of P and R
F2@K	Recall-weighted balance	weighted harmonic mean (β=2), penalizes missed docs 4× more than false positives
MAP	Ranking quality across all queries	mean over queries of average precision (the average of precision at each relevant doc’s position)
MRR	How quickly users find a relevant result	mean over queries of 1 ÷ rank of first relevant document
NDCG@K	Ranking quality with graded relevance	normalized discounted cumulative gain

F2 vs F1: Use F2 when missing a relevant document is worse than showing an irrelevant one — the common case in search, recommendations, and discovery. F1 treats both errors equally.
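The formulas in the table can be written out for a single query as follows. This is a minimal sketch using binary relevance, not the service's implementation. The function names are illustrative, and MAP and MRR would be the means of `average_precision` and `reciprocal_rank` over all queries.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top K, divided by K (assumes k >= 1)."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top K, divided by total relevant (assumes >= 1)."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta=2 gives F2."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision at each relevant document's rank, averaged over all relevant docs.

    Relevant documents that never appear in the ranking contribute 0.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


ranked = ["doc_a1", "doc_x", "doc_a2", "doc_y", "doc_z"]
relevant = {"doc_a1", "doc_a2", "doc_a3"}
p5 = precision_at_k(ranked, relevant, 5)  # 2 of 5 results are relevant
r5 = recall_at_k(ranked, relevant, 5)     # 2 of 3 relevant docs are found
f1, f2 = f_beta(p5, r5, 1.0), f_beta(p5, r5, 2.0)
```

In this example recall (2/3) is higher than precision (2/5), so F2 comes out above F1: the recall-weighted score rewards the ranking for what it found more than it penalizes the irrelevant results.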

Reading Your Scores

Score	What It Tells You
NDCG@10 = 0.89	Your top-10 ranking earns 89% of the score an ideal ordering would get. Relevant docs appear near the top.
Precision@5 = 0.85	4–5 of every 5 results are relevant. Users see high-quality results.
Recall@20 = 0.95	You surface 95% of all relevant documents within the top 20. Strong coverage.
F2@10 = 0.85	Recall-weighted balance is strong — few relevant documents are being missed.
MRR = 0.93	The first relevant result typically appears at position 1 or 2.
MAP = 0.71	Overall ranking quality is solid but there’s room to improve ordering.

Graded Relevance

When you provide relevance_scores in your dataset, NDCG uses graded relevance instead of binary. This distinguishes “exactly right” from “somewhat relevant”:

{
  "query_id": "q1",
  "query_input": {"query": "wireless earbuds"},
  "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"],
  "relevance_scores": {
    "doc_a1": 5,
    "doc_a2": 3,
    "doc_a3": 1
  }
}

Score	Meaning
5	Perfect match
3–4	Highly relevant
1–2	Marginally relevant
0	Not relevant

Without relevance_scores, all metrics use binary relevance (relevant or not).
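To see how graded scores change NDCG, here is a minimal sketch. It assumes linear gain (the score itself) and a log2 rank discount. The service may use a different gain function, such as 2^score − 1, so treat the numbers as illustrative only.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked: List[str], scores: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking.

    Assumption: linear gain. Documents missing from `scores` count as 0.
    """
    gains = [scores.get(doc, 0.0) for doc in ranked]
    ideal = sorted(scores.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


scores = {"doc_a1": 5, "doc_a2": 3, "doc_a3": 1}
best_first = ndcg_at_k(["doc_a1", "doc_a2", "doc_a3"], scores, k=3)
worst_first = ndcg_at_k(["doc_a3", "doc_a2", "doc_a1"], scores, k=3)
print(f"best first: {best_first:.2f}, worst first: {worst_first:.2f}")
```

Under these assumptions the reversed ordering scores about 0.73 instead of 1.0. With binary relevance both orderings would score 1.0, because all three documents are relevant and their order would not matter.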

Comparing Retrievers

Run the same dataset against different retriever configurations to find the best pipeline:

# Evaluate baseline (vector search only)
curl -X POST "$MP_API_URL/v1/retrievers/ret_baseline/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"dataset_name": "product-search-golden"}'

# Evaluate candidate (vector search + reranker)
curl -X POST "$MP_API_URL/v1/retrievers/ret_reranked/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"dataset_name": "product-search-golden"}'

Compare the metrics_by_k side by side:

Metric	Baseline	+ Reranker	Delta (relative)
NDCG@10	0.78	0.89	+14%
Precision@5	0.72	0.85	+18%
F2@10	0.76	0.85	+12%
MRR	0.81	0.93	+15%

Run the same dataset after every pipeline change — adding stages, swapping models, adjusting fusion weights — to quantify the impact before deploying.
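The delta figures in the comparison table are relative changes from the baseline. A small helper like this one (illustrative, not part of the SDK) reproduces them from two sets of metric values; the dictionary keys are made-up labels.

```python
from typing import Dict


def relative_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Percent change of each candidate metric relative to the baseline."""
    return {
        name: (candidate[name] - baseline[name]) / baseline[name] * 100.0
        for name in baseline
        if name in candidate and baseline[name] != 0
    }


baseline = {"ndcg@10": 0.78, "precision@5": 0.72, "f2@10": 0.76, "mrr": 0.81}
candidate = {"ndcg@10": 0.89, "precision@5": 0.85, "f2@10": 0.85, "mrr": 0.93}

deltas = relative_deltas(baseline, candidate)
for name, pct in deltas.items():
    print(f"{name}: {pct:+.0f}%")
```

Relative deltas make metrics with different baselines comparable, but it is worth reporting the absolute values alongside them: a +14% gain means something different starting from 0.78 than from 0.30.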

Ground Truth Datasets

Dataset Requirements

At least 1 query (aim for 50+ for statistically meaningful results)
Each query must have at least 1 relevant document
query_input must match your retriever’s input schema
relevance_scores, if provided, must cover all relevant_documents
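These requirements can be checked locally before uploading a dataset. The validator below is a sketch with a hypothetical function name. It covers the three structural rules; it does not check query_input, because that schema depends on your retriever.

```python
from typing import Dict, List


def validate_dataset(queries: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not queries:
        problems.append("dataset must contain at least 1 query")
    for query in queries:
        qid = query.get("query_id", "<missing query_id>")
        relevant = query.get("relevant_documents") or []
        if not relevant:
            problems.append(f"{qid}: needs at least 1 relevant document")
        scores = query.get("relevance_scores")
        if scores is not None:
            # Every relevant document must have a score when scores are given.
            missing = [doc for doc in relevant if doc not in scores]
            if missing:
                problems.append(f"{qid}: relevance_scores missing {missing}")
    return problems


queries = [
    {
        "query_id": "q1",
        "query_input": {"query": "wireless earbuds"},
        "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"],
        "relevance_scores": {"doc_a1": 5, "doc_a2": 3},  # doc_a3 has no score
    },
    {"query_id": "q2", "query_input": {"query": "headphones"}, "relevant_documents": []},
]
problems = validate_dataset(queries)
```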

Managing Datasets

# List all datasets
curl "$MP_API_URL/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

# Get a specific dataset
curl "$MP_API_URL/v1/retrievers/evaluations/datasets/product-search-golden" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

Building Good Datasets

Include query diversity

Cover head queries (popular), torso (moderate), and tail (rare/specific). Don’t just test easy cases.

Use graded relevance

Binary relevant/not-relevant misses nuance. Score documents 0–5 so NDCG can distinguish good rankings from great ones.

Match real traffic

Sample queries from production logs. Synthetic queries test what you think users ask, not what they actually ask.
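One way to combine query diversity with real traffic is to bucket logged queries by their share of traffic and then sample from each bucket. The sketch below assumes the log is a plain list of query strings; the 50% and 30% thresholds for head and torso are arbitrary choices, not recommendations from this page.

```python
from collections import Counter
from typing import Dict, List


def bucket_by_traffic(
    query_log: List[str],
    head_share: float = 0.5,   # assumed: queries covering the first 50% of traffic
    torso_share: float = 0.3,  # assumed: queries covering the next 30%
) -> Dict[str, List[str]]:
    """Split distinct queries into head, torso, and tail by cumulative traffic."""
    counts = Counter(query_log)
    total = sum(counts.values())
    buckets: Dict[str, List[str]] = {"head": [], "torso": [], "tail": []}
    seen = 0
    for query, n in counts.most_common():
        # Bucket is decided by the traffic share already covered by more popular queries.
        share_before = seen / total
        if share_before < head_share:
            buckets["head"].append(query)
        elif share_before < head_share + torso_share:
            buckets["torso"].append(query)
        else:
            buckets["tail"].append(query)
        seen += n
    return buckets


log = (
    ["wireless earbuds"] * 6
    + ["noise canceling headphones"] * 3
    + ["usb-c dac for iem"]
)
buckets = bucket_by_traffic(log)
```

Drawing a fixed number of queries from each bucket keeps rare, specific queries in the dataset even though they make up a small fraction of the log.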

Version your datasets

Keep datasets stable across evaluations so you can track metric trends over time. Create new versions for schema changes.

Configuration Reference

dataset_name (string, required)

Name of the ground truth dataset to evaluate against.

evaluation_config.k_values (integer[], default: [1, 5, 10, 20])

Cutoff positions for @K metrics. Include the K values that match your UI: if you show 10 results per page, include 10.

evaluation_config.metrics (string[])

Metrics to compute. Available: precision, recall, f1, f2, map, ndcg, mrr.

Related

Retriever Benchmarks — replay live sessions against candidate retrievers
Improve Relevance — interaction signals, fusion strategies, and the feedback loop
API Reference: Run Evaluation — full endpoint specification
API Reference: Create Dataset — dataset creation endpoint
