Why Agent Perception Needs Its Own Evals
Most agent evals ask whether the final answer is correct. Most retrieval evals ask whether the right document appeared in the top k. Neither test is enough when an agent is supposed to see, hear, or search unstructured content.
An agent perception system has to do more than answer:
That is why perception evals sit between retrieval evals and full agent evals. They measure the whole path from raw media to grounded agent behavior.
The purpose is direct: agent perception evals exist so that AI agents can see, hear, and search unstructured content reliably.
Three Evaluation Levels
A production agent that searches media should be evaluated at three levels.
1. Retrieval Quality
Retrieval quality asks: did the search system return the right evidence?
Typical metrics:
This is the layer covered by traditional retrieval evaluation systems. It is necessary, but it misses agent-specific failures. A retriever can return the right clip while the agent still ignores it, over-calls a tool, cites the wrong timestamp, or spends ten times the intended budget.
2. Tool Trajectory
Tool trajectory asks: did the agent search in the right way?
OpenAI trace grading, LangChain AgentEvals, LangSmith, and similar systems all point in the same direction: agent quality depends on the sequence of model calls, tool calls, guardrails, and intermediate decisions, not only on the final text answer.
For perception systems, the trajectory includes:
This catches failures that ranking metrics alone cannot see.
3. Evidence Grounding
Evidence grounding asks: is the final answer supported by the retrieved media evidence?
For text RAG, grounding often means citation to a document chunk. For multimodal agents, grounding is richer:
An agent perception eval should fail an answer that cites the right file but the wrong timestamp. It should fail an answer that says a speaker made a claim without identifying the speaker turn. It should fail an answer that describes an object without a frame, crop, or detector signal.
The Failure Taxonomy
Build evals around failure modes. A useful perception eval suite should tell you where the system failed, not only that it failed.
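As a minimal sketch of what "where the system failed" can mean in practice, the snippet below attributes a failed example to the earliest failing stage of the taxonomy that follows. The stage keys and the `first_failed_stage` helper are illustrative assumptions, not part of any specific eval library.

```python
from typing import Optional

# Pipeline order (assumed): an upstream failure makes downstream checks uninformative.
STAGES = ["ingestion", "representation", "retrieval", "tool_behavior", "grounding"]


def first_failed_stage(checks: dict) -> Optional[str]:
    """Return the earliest stage whose check failed, or None if none failed.

    `checks` maps a stage name to True (passed) or False (failed).
    A stage missing from `checks` is treated as not evaluated and is skipped.
    """
    for stage in STAGES:
        if checks.get(stage) is False:
            return stage
    return None


# Hypothetical per-example check results.
example = {"ingestion": True, "representation": True, "retrieval": False, "grounding": False}
print(first_failed_stage(example))  # retrieval
```

Reporting the earliest failing stage keeps a retrieval miss from being double-counted as a grounding failure.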
Ingestion Failure
The raw file arrived, but the searchable observations are missing or incomplete.
Examples:
Metric examples:
Representation Failure
The media was processed, but the representation does not preserve the signal needed for search.
Examples:
Metric examples:
Retrieval Failure
The right evidence exists in the index, but the search system does not retrieve it.
Examples:
Metric examples:
Tool Behavior Failure
The retriever is capable, but the agent uses it poorly.
Examples:
Metric examples:
Grounding Failure
The final answer is not tied to the returned evidence.
Examples:
Metric examples:
Build a Perception Eval Dataset
Do not start with one giant benchmark. Start with a compact, diagnostic dataset that covers the agent's real jobs.
A useful perception eval dataset contains query classes like:
Each query should identify the expected evidence, not only the expected answer.
{
  "query_id": "q_cross_modal_017",
  "query_input": {
    "text": "Find the moment where the customer says the device is flashing red and the red light is visible"
  },
  "expected_evidence": [
    {
      "source_uri": "s3://support-calls/call-42.mp4",
      "start_time": "00:08:28",
      "end_time": "00:08:42",
      "modalities": ["transcript", "visual"],
      "required_fields": ["timestamp", "speaker", "keyframe", "source_uri"],
      "relevance": 5
    }
  ],
  "negative_evidence": [
    {
      "source_uri": "s3://support-calls/call-42.mp4",
      "start_time": "00:04:10",
      "end_time": "00:04:30",
      "reason": "Transcript mentions red, but no visual red light is present"
    }
  ]
}
The negative evidence matters. It teaches the eval to distinguish a real cross-modal match from a coincidental keyword hit.
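One way to use such a record, sketched under assumptions: label each retrieved span as a hit on expected evidence, a hit on a known negative, or unlabeled. The retrieved-result shape (`source_uri`, `start_time`, `end_time`) and the helper names are hypothetical, and any time overlap counts as a match here, which is a deliberately loose rule.

```python
def to_seconds(hms: str) -> int:
    """Convert an "HH:MM:SS" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def overlaps(result: dict, span: dict) -> bool:
    """True when both spans share a source and their time ranges intersect."""
    if result["source_uri"] != span["source_uri"]:
        return False
    return (
        to_seconds(result["start_time"]) < to_seconds(span["end_time"])
        and to_seconds(span["start_time"]) < to_seconds(result["end_time"])
    )


def classify_result(result: dict, example: dict) -> str:
    """Label a retrieved span as "expected", "negative", or "unlabeled"."""
    if any(overlaps(result, span) for span in example["expected_evidence"]):
        return "expected"
    if any(overlaps(result, span) for span in example.get("negative_evidence", [])):
        return "negative"
    return "unlabeled"
```

A result labeled "negative" at rank 1 is a stronger regression signal than an unlabeled one, because it is a known keyword-only match.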
Metrics That Actually Diagnose Perception
Use standard retrieval metrics, but add perception-specific measures.
Coverage Metrics
Coverage metrics answer whether the corpus is searchable at all.
Coverage is an upstream gate. If only 60 percent of videos have transcripts, no agent eval can rescue audio search quality.
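A coverage check can be as simple as counting which assets carry each required feature. The sketch below assumes each asset is a dict with a `features` mapping; that shape and the `feature_coverage` name are illustrative, not a real API.

```python
def feature_coverage(assets: list, required: list) -> dict:
    """Return, per required feature, the fraction of assets that have it.

    A feature counts as present when its value in asset["features"] is truthy
    (for example a non-empty transcript).
    """
    total = len(assets)
    coverage = {}
    for feature in required:
        present = sum(1 for asset in assets if asset.get("features", {}).get(feature))
        coverage[feature] = present / total if total else 0.0
    return coverage


# Hypothetical corpus: 3 of 5 videos have a transcript.
assets = [
    {"id": "v1", "features": {"transcript": "hello", "keyframes": ["k1"]}},
    {"id": "v2", "features": {"transcript": "", "keyframes": ["k2"]}},
    {"id": "v3", "features": {"transcript": "red light", "keyframes": ["k3"]}},
    {"id": "v4", "features": {"keyframes": ["k4"]}},
    {"id": "v5", "features": {"transcript": "ok", "keyframes": []}},
]
print(feature_coverage(assets, ["transcript", "keyframes"]))
```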
Retrieval Metrics
Retrieval metrics answer whether the right evidence is returned.
For multimodal retrieval, compute metrics by query class. A single aggregate nDCG score can hide that visual queries improved while OCR queries regressed.
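A sketch of per-class reporting, assuming each eval run records a `query_class`, the ranked result IDs, and graded relevance labels. The field names are hypothetical. nDCG here uses linear gain with a log2 rank discount, and it assumes `ranked_ids` contains no duplicates.

```python
import math
from collections import defaultdict


def ndcg_at_k(ranked_ids: list, relevance: dict, k: int) -> float:
    """nDCG@k with linear gain: sum of rel / log2(rank + 1), rank starting at 1."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(position + 2) for position, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def ndcg_by_class(runs: list, k: int = 5) -> dict:
    """Average nDCG@k separately for each query class."""
    scores = defaultdict(list)
    for run in runs:
        scores[run["query_class"]].append(ndcg_at_k(run["ranked_ids"], run["relevance"], k))
    return {query_class: sum(vals) / len(vals) for query_class, vals in scores.items()}


# Hypothetical runs: the visual query ranks its evidence first, the OCR query second.
runs = [
    {"query_class": "visual", "ranked_ids": ["a", "b"], "relevance": {"a": 5}},
    {"query_class": "ocr", "ranked_ids": ["x", "y"], "relevance": {"y": 3}},
]
print(ndcg_by_class(runs))
```

Comparing these per-class dictionaries between two pipeline versions surfaces a class-level regression that the pooled average would smooth over.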
Localization Metrics
Localization metrics answer whether the system found the right part of the media.
This is where many media systems fail quietly. A clip-level answer can look plausible while being thirty seconds off.
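Temporal intersection-over-union is a common way to score localization. A minimal sketch, with times in seconds and an assumed function name:

```python
def temporal_iou(pred_start: float, pred_end: float,
                 gold_start: float, gold_end: float) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


# Gold span 00:08:28 to 00:08:42 is 508 s to 522 s.
print(temporal_iou(508, 522, 508, 522))  # exact match
print(temporal_iou(515, 529, 508, 522))  # shifted by 7 s
print(temporal_iou(538, 552, 508, 522))  # thirty seconds off, no overlap
```

A prediction that is thirty seconds off a fourteen-second span scores zero, even though a clip-level metric would still count it as a hit on the right file.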
Tool Metrics
Tool metrics answer whether the agent behaved like a competent search user.
These metrics need traces. An output-only eval cannot tell whether the agent got lucky after wasting ten bad tool calls.
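As an illustration, a few tool metrics can be derived from a list of recorded tool calls. The call shape (`tool`, `args`) and the `max_calls` budget are assumptions for this sketch, and `args` must be JSON-serializable for the duplicate check to work.

```python
import json


def tool_metrics(tool_calls: list, max_calls: int) -> dict:
    """Summarize tool usage: total calls, exact-duplicate calls, and budget status."""
    seen = set()
    duplicates = 0
    for call in tool_calls:
        # Canonicalize arguments so key order does not hide a repeated call.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "total_calls": len(tool_calls),
        "duplicate_calls": duplicates,
        "over_budget": len(tool_calls) > max_calls,
    }


# Hypothetical trace: the second call repeats the first with reordered arguments.
calls = [
    {"tool": "search_media", "args": {"query": "red light", "modalities": ["visual"]}},
    {"tool": "search_media", "args": {"modalities": ["visual"], "query": "red light"}},
    {"tool": "search_media", "args": {"query": "flashing red", "modalities": ["transcript"]}},
]
print(tool_metrics(calls, max_calls=2))
```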
Grounding Metrics
Grounding metrics answer whether the final response is evidence-backed.
For high-stakes workflows, grounding should be stricter than answer correctness. A correct answer without inspectable evidence may still be unusable.
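A structural grounding check can be sketched as follows. It assumes the answer has already been split into claims, each carrying the evidence IDs it cites; that claim shape is hypothetical. It verifies only that citations exist and point at evidence the agent actually retrieved, not that the evidence semantically supports the claim.

```python
def grounding_report(claims: list, retrieved_ids: list) -> dict:
    """Report which claims lack a citation to evidence the agent retrieved."""
    retrieved = set(retrieved_ids)
    unsupported = []
    for claim in claims:
        cited = set(claim.get("evidence_ids", []))
        # Supported only if the claim cites something and every cited item
        # was actually returned to the agent.
        if not cited or not cited.issubset(retrieved):
            unsupported.append(claim["text"])
    total = len(claims)
    return {
        "supported_fraction": (total - len(unsupported)) / total if total else 1.0,
        "unsupported_claims": unsupported,
    }


# Hypothetical claims: one cited correctly, one uncited, one citing unseen evidence.
claims = [
    {"text": "The customer reports a flashing red light.", "evidence_ids": ["e1"]},
    {"text": "The device was purchased last month.", "evidence_ids": []},
    {"text": "The light turns green after reset.", "evidence_ids": ["e9"]},
]
print(grounding_report(claims, retrieved_ids=["e1", "e2"]))
```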
Trace-Based Agent Perception Evals
Trace-based evals inspect the agent's path, not just the final response. This is important because a perception agent often succeeds or fails before it writes a final answer.
A good trace record contains:
A perception-specific trace grader can score:
1. Did the agent choose the right tool?
2. Did the query preserve the user's intent?
3. Did it search the necessary modalities?
4. Did it apply the right source, tenant, permission, or namespace filters?
5. Did it inspect evidence before answering?
6. Did it cite exact source locations?
7. Did it avoid unsupported claims?
This aligns with the current agent ecosystem. MCP standardizes how tools and resources are exposed to agents. Tracing in the OpenAI Agents SDK records model calls, tool calls, guardrails, handoffs, and audio spans. LangChain's AgentEvals scores tool-call trajectories against references or rubrics. LlamaIndex retrieval evals expose standard retrieval metrics like hit rate and MRR. The missing piece is usually the perception-specific rubric.
A Minimal Rubric
Use a rubric that separates retrieval, trajectory, and grounding.
{
  "retrieval": {
    "correct_evidence_in_top_5": 1,
    "correct_modality_used": 1,
    "temporal_iou": 0.72
  },
  "trajectory": {
    "right_tool": true,
    "filters_correct": true,
    "unnecessary_tool_calls": 0,
    "budget_exceeded": false
  },
  "grounding": {
    "all_claims_cited": true,
    "timestamp_citation_present": true,
    "speaker_citation_present": true,
    "unsupported_claims": 0
  },
  "pass": true
}
This rubric is intentionally simple. Add complexity only when it changes engineering decisions.
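Rather than setting `pass` by hand, it can be derived from the three levels. The sketch below is one possible rule: the 0.5 temporal IoU threshold and the requirement of zero unnecessary tool calls are assumptions to tune per workload, not values from any standard.

```python
def score_rubric(rubric: dict, min_temporal_iou: float = 0.5) -> dict:
    """Return a pass/fail verdict per level plus an overall verdict."""
    retrieval = rubric["retrieval"]
    trajectory = rubric["trajectory"]
    grounding = rubric["grounding"]
    levels = {
        "retrieval": (
            retrieval["correct_evidence_in_top_5"] == 1
            and retrieval["correct_modality_used"] == 1
            and retrieval["temporal_iou"] >= min_temporal_iou
        ),
        "trajectory": (
            trajectory["right_tool"]
            and trajectory["filters_correct"]
            and trajectory["unnecessary_tool_calls"] == 0
            and not trajectory["budget_exceeded"]
        ),
        "grounding": (
            grounding["all_claims_cited"]
            and grounding["timestamp_citation_present"]
            and grounding["speaker_citation_present"]
            and grounding["unsupported_claims"] == 0
        ),
    }
    levels["pass"] = all(levels.values())
    return levels


rubric = {
    "retrieval": {"correct_evidence_in_top_5": 1, "correct_modality_used": 1, "temporal_iou": 0.72},
    "trajectory": {"right_tool": True, "filters_correct": True,
                   "unnecessary_tool_calls": 0, "budget_exceeded": False},
    "grounding": {"all_claims_cited": True, "timestamp_citation_present": True,
                  "speaker_citation_present": True, "unsupported_claims": 0},
}
print(score_rubric(rubric))
```

Keeping the per-level verdicts alongside the overall one preserves the separation the rubric is meant to provide.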
Offline, Online, and Replay Evals
You need three eval modes.
Offline Golden Sets
Use hand-labeled examples before shipping a model, retriever, extractor, or prompt change.
Best for:
Weakness:
Online Signals
Use production behavior to identify weak areas.
Signals include:
Online signals are noisy. They should not replace labeled evals, but they are the best early warning for drift.
Replay Benchmarks
Replay historical queries and interactions against candidate pipelines before shipping.
Best for:
Replay is especially valuable for agents because the agent's tool path is often sensitive to small ranking changes. If a relevant result drops from rank 3 to rank 18, the agent may never inspect it.
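A replay comparison can flag exactly this case. The sketch assumes the agent inspects only the top `inspect_depth` results, an illustrative parameter, and that both pipelines return ranked lists of result IDs.

```python
from typing import Optional


def rank_of(doc_id: str, ranked_ids: list) -> Optional[int]:
    """1-based rank of doc_id in ranked_ids, or None if it is absent."""
    try:
        return ranked_ids.index(doc_id) + 1
    except ValueError:
        return None


def dropped_out_of_view(baseline: list, candidate: list,
                        relevant_id: str, inspect_depth: int = 5) -> bool:
    """True if the relevant result was within inspect_depth before but is not now."""
    before = rank_of(relevant_id, baseline)
    after = rank_of(relevant_id, candidate)
    was_visible = before is not None and before <= inspect_depth
    still_visible = after is not None and after <= inspect_depth
    return was_visible and not still_visible


# Hypothetical replay: the relevant feature moves from rank 3 to rank 18.
baseline = ["d1", "d2", "rel"] + [f"x{i}" for i in range(20)]
candidate = [f"x{i}" for i in range(17)] + ["rel", "d1", "d2"]
print(rank_of("rel", baseline), rank_of("rel", candidate))
print(dropped_out_of_view(baseline, candidate, "rel"))
```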
How This Maps to Mixpeek
Mixpeek's perception layer decomposes media into searchable features, and the retriever evaluation framework measures search quality over those features.
A typical flow:
1. Ingest media with the features required by the agent.
2. Create a ground-truth dataset with query inputs and relevant document or feature IDs.
3. Run the retriever evaluation against the target retriever.
4. Track standard ranking metrics.
5. Add trace-level and grounding checks around the agent that calls the retriever.
6. Replay historical sessions when changing models, extractors, or retrieval stages.
Example dataset creation:
curl -X POST "https://api.mixpeek.com/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: ns_agent_media" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "agent_perception_v1",
    "description": "Cross-modal video, audio, OCR, and object retrieval evals for an agent tool",
    "queries": [
      {
        "query_id": "q_red_light_visible",
        "query_input": {
          "query": "customer says the device is flashing red and the red light is visible"
        },
        "relevant_documents": ["feat_call_42_00_08_28"],
        "relevance_scores": {
          "feat_call_42_00_08_28": 5,
          "feat_call_42_00_08_20": 3
        }
      }
    ],
    "metadata": {
      "requires_modalities": ["transcript", "visual"],
      "requires_evidence_fields": ["source_uri", "timestamp", "speaker", "keyframe"]
    }
  }'
Run the evaluation:
curl -X POST "https://api.mixpeek.com/v1/retrievers/ret_agent_media/evaluations" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: ns_agent_media" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "agent_perception_v1",
    "evaluation_config": {
      "k_values": [1, 5, 10, 20],
      "metrics": ["precision", "recall", "f1", "map", "ndcg", "mrr"]
    }
  }'
Then wrap the agent tool with a trace grader:
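def grade_perception_trace(trace):
    """Grade one agent trace on tool use, budget, and grounding.

    `trace` and `count_unsupported_claims` are placeholders for whatever
    your tracing library and claim checker provide.
    """
    result = {
        # Trajectory: did the agent call the media search tool?
        "right_tool": trace.used_tool("search_media"),
        # Trajectory: did it search every modality the query requires?
        "searched_required_modalities": trace.tool_args_include(
            "search_media",
            "modalities",
            ["transcript", "visual"],
        ),
        # Trajectory: did it stay within the latency budget?
        "budget_exceeded": trace.budget_used_ms > trace.budget_limit_ms,
        # Grounding: does the answer cite a timestamp, with no unsupported claims?
        "has_timestamp_citation": trace.final_answer.has_field("timestamp"),
        "unsupported_claims": count_unsupported_claims(trace),
    }
    result["pass"] = (
        result["right_tool"]
        and result["searched_required_modalities"]
        and not result["budget_exceeded"]
        and result["has_timestamp_citation"]
        and result["unsupported_claims"] == 0
    )
    return result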
The exact trace library can vary. The principle is the same: score the retrieval result, the tool path, and the final grounded answer separately.
What to Monitor in Production
Perception evals should become production telemetry.
Track:
Alert on regressions that change behavior, not only on infrastructure failures. For example:
Design Checklist
Key Takeaways
1. Agent perception evals measure the path from raw media to grounded agent behavior.
2. Standard retrieval metrics are necessary but incomplete for agents.
3. The best datasets label evidence spans: timestamps, pages, boxes, speakers, and feature IDs.
4. Trace-based grading catches tool-selection, filter, budget, cancellation, and citation failures.
5. Production monitoring should track extraction coverage and retrieval drift before users notice bad answers.
6. A correct final answer is not enough. The agent must show the evidence it used.