Agent Perception Evals: Testing Whether AI Agents Can See, Hear, and Search

Why Agent Perception Needs Its Own Evals

Most agent evals ask whether the final answer is correct. Most retrieval evals ask whether the right document appeared in the top k. Neither test is enough when an agent is supposed to see, hear, or search unstructured content.

An agent perception system has to do more than answer:

Ingest raw media into searchable observations.

Retrieve the right evidence across text, image, video, audio, OCR, objects, speakers, and timestamps.

Select the right tool with the right filters and budgets.

Cite the exact source span, frame, region, or speaker turn.

Avoid acting when the evidence is weak.

Keep working after models, media, tenants, schemas, and access controls change.

That is why perception evals sit between retrieval evals and full agent evals. They measure the whole path from raw media to grounded agent behavior.

The goal is direct: agent perception evals exist so that AI agents can see, hear, and search unstructured content reliably.

Three Evaluation Levels

A production agent that searches media should be evaluated at three levels.

1. Retrieval Quality

Retrieval quality asks: did the search system return the right evidence?

Typical metrics:

Recall@k

Precision@k

nDCG@k

MRR

MAP

Hit rate

Temporal IoU for video or audio segments

Box IoU for visual localization

This is the layer covered by traditional retrieval evaluation systems. It is necessary, but it misses agent-specific failures. A retriever can return the right clip while the agent still ignores it, over-calls a tool, cites the wrong timestamp, or spends ten times the intended budget.

2. Tool Trajectory

Tool trajectory asks: did the agent search in the right way?

OpenAI trace grading, LangChain AgentEvals, LangSmith, and similar systems all point in the same direction: agent quality depends on the sequence of model calls, tool calls, guardrails, and intermediate decisions, not only on the final text answer.

For perception systems, the trajectory includes:

Which retrieval tool was selected.

Whether the agent searched the correct modality.

Whether it applied the right filters.

Whether it expanded or narrowed the query appropriately.

Whether it cancelled stale searches.

Whether it requested neighboring context before answering.

Whether it stopped when evidence was insufficient.

This catches failures that ranking metrics alone cannot see.

3. Evidence Grounding

Evidence grounding asks: is the final answer supported by the retrieved media evidence?

For text RAG, grounding often means citation to a document chunk. For multimodal agents, grounding is richer:

Video: start time, end time, source URI, scene ID, keyframe, object boxes.

Audio: transcript span, speaker label, time range, confidence.

Image: object box, OCR region, face region, detected class, feature ID.

Document: page number, table cell, figure region, extracted text span.

An agent perception eval should fail an answer that cites the right file but the wrong timestamp. It should fail an answer that says a speaker made a claim without identifying the speaker turn. It should fail an answer that describes an object without a frame, crop, or detector signal.
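The per-modality requirements above can be turned into a simple completeness check. The following is a minimal sketch, assuming citations arrive as plain dictionaries; the `REQUIRED_FIELDS` table and the field names (`start_time`, `speaker`, `box`, `page`) are illustrative assumptions, not a fixed schema.

```python
# Hypothetical minimum citation fields per modality; adjust to your own schema.
REQUIRED_FIELDS = {
    "video": {"source_uri", "start_time", "end_time"},
    "audio": {"source_uri", "start_time", "end_time", "speaker"},
    "image": {"source_uri", "box"},
    "document": {"source_uri", "page"},
}


def missing_fields(citation: dict) -> list:
    """Return the required fields that are absent or None for this citation."""
    required = REQUIRED_FIELDS[citation["modality"]]
    return sorted(name for name in required if citation.get(name) is None)


# An audio citation that names the file and time range but not the speaker.
audio_citation = {
    "modality": "audio",
    "source_uri": "s3://support-calls/call-42.mp4",
    "start_time": "00:08:28",
    "end_time": "00:08:42",
}
audio_gaps = missing_fields(audio_citation)
```

An eval can fail any answer whose citations return a non-empty list here, before checking whether the cited span is actually correct.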

The Failure Taxonomy

Build evals around failure modes. A useful perception eval suite should tell you where the system failed, not only that it failed.

Ingestion Failure

The raw file arrived, but the searchable observations are missing or incomplete.

Examples:

A video was indexed without audio transcription.

OCR ran at the wrong frame sampling rate.

A document table was flattened into unreadable text.

A face detector missed low-resolution faces.

The collection points to stale model versions after a reindex.

Metric examples:

Extraction coverage by modality.

Percent of files with at least one transcript, caption, OCR span, object, or embedding.

Failed extraction rate by file type.

Average observations per minute of media.

Model version coverage.

Representation Failure

The media was processed, but the representation does not preserve the signal needed for search.

Examples:

The chunk is too long, so a ten-second event is buried inside a five-minute segment.

The embedding model captures visual style but not the object category.

Speaker diarization splits one speaker into multiple identities.

OCR text exists but coordinates are missing.

Video frames are sampled too sparsely to catch fast actions.

Metric examples:

Query coverage by representation type.

Temporal localization error.

Speaker attribution accuracy.

OCR region accuracy.

Segment length distribution.

Retrieval Failure

The right evidence exists in the index, but the search system does not retrieve it.

Examples:

Dense search misses exact product IDs.

BM25 finds the transcript but not the matching frame.

Filters exclude the relevant collection.

Reranking demotes the correct visual result.

Multi-index fusion overweights one modality.

Metric examples:

Recall@5 and Recall@20 by modality.

nDCG@k with graded relevance.

Per-stage contribution to final rank.

False negative rate for known evidence.

Empty-result rate by query class.

Tool Behavior Failure

The retriever is capable, but the agent uses it poorly.

Examples:

The agent searches video captions when it should search transcript.

It repeats the same query instead of inspecting returned evidence.

It ignores budget exhaustion.

It keeps a long-running search alive after the answer is found.

It asks for global search when a namespace filter is required.

Metric examples:

Tool selection accuracy.

Query rewrite quality.

Filter correctness.

Budget adherence.

Stale work ratio.

Cancellation rate for superseded searches.

Grounding Failure

The final answer is not tied to the returned evidence.

Examples:

The answer cites a video but not a timestamp.

It names the wrong speaker.

It describes an object that appears near the clip but not in the selected window.

It summarizes a document page but cites the whole PDF.

It combines evidence from two incompatible sources.

Metric examples:

Supported claim rate.

Citation precision.

Timestamp error.

Bounding-box IoU.

Speaker turn match rate.

Unsupported claim count per answer.
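One way to make the taxonomy actionable is to attribute each failed case to the earliest stage that broke. This is a rough sketch, assuming every eval case has already been reduced to five boolean checks; the `CaseChecks` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseChecks:
    """Per-case boolean checks. Field names are illustrative assumptions."""
    features_extracted: bool   # ingestion: required observations exist
    signal_preserved: bool     # representation: evidence is findable in principle
    evidence_retrieved: bool   # retrieval: expected evidence is in the top k
    tool_use_correct: bool     # tool behavior: right tool, filters, and budget
    answer_grounded: bool      # grounding: claims cite the returned evidence


# Stages ordered from upstream to downstream.
STAGES = [
    ("ingestion", "features_extracted"),
    ("representation", "signal_preserved"),
    ("retrieval", "evidence_retrieved"),
    ("tool_behavior", "tool_use_correct"),
    ("grounding", "answer_grounded"),
]


def first_failure(checks: CaseChecks) -> Optional[str]:
    """Return the earliest failing stage, or None if every check passed."""
    for stage, field_name in STAGES:
        if not getattr(checks, field_name):
            return stage
    return None


# Retrieval and grounding both failed; the case is attributed to retrieval.
example_stage = first_failure(CaseChecks(True, True, False, True, False))
```

Attributing to the earliest stage reflects the idea that downstream failures are often consequences: an answer cannot be grounded in evidence that was never retrieved.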

Build a Perception Eval Dataset

Do not start with one giant benchmark. Start with a compact, diagnostic dataset that covers the agent's real jobs.

A useful perception eval dataset contains query classes like:

Visual semantic: "Find the scene where the operator opens the red panel."

Audio semantic: "Find the call where the customer says setup failed."

OCR: "Find frames where the screen shows error code E113."

Object grounded: "Find clips with a forklift near a pedestrian."

Speaker grounded: "Find where Dana explains the refund policy."

Temporal: "Find the moment after the package is scanned but before it is loaded."

Cross-modal: "Find the moment where the speaker says the light is red and the red light is visible."

Negative: "There should be no clip where a person enters the restricted zone."

Adversarial: "Find the logo on the box, not the logo in the background poster."

Access-control: "Search only the customer-approved folder."

Each query should identify the expected evidence, not only the expected answer.

{
  "query_id": "q_cross_modal_017",
  "query_input": {
    "text": "Find the moment where the customer says the device is flashing red and the red light is visible"
  },
  "expected_evidence": [
    {
      "source_uri": "s3://support-calls/call-42.mp4",
      "start_time": "00:08:28",
      "end_time": "00:08:42",
      "modalities": ["transcript", "visual"],
      "required_fields": ["timestamp", "speaker", "keyframe", "source_uri"],
      "relevance": 5
    }
  ],
  "negative_evidence": [
    {
      "source_uri": "s3://support-calls/call-42.mp4",
      "start_time": "00:04:10",
      "end_time": "00:04:30",
      "reason": "Transcript mentions red, but no visual red light is present"
    }
  ]
}

The negative evidence matters. It teaches the eval to distinguish a real cross-modal match from a coincidental keyword hit.
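As a sketch of how such a record might be used, the snippet below labels a retrieved segment as expected, negative, or unlabeled by checking time overlap within the same source file. It assumes results carry the same `source_uri`, `start_time`, and `end_time` fields as the dataset entry; the helper names are illustrative.

```python
def to_seconds(hms: str) -> int:
    """Convert an "HH:MM:SS" string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def overlaps(result: dict, evidence: dict) -> bool:
    """True if both segments come from the same file and share any time."""
    if result["source_uri"] != evidence["source_uri"]:
        return False
    r_start, r_end = to_seconds(result["start_time"]), to_seconds(result["end_time"])
    e_start, e_end = to_seconds(evidence["start_time"]), to_seconds(evidence["end_time"])
    return r_start < e_end and e_start < r_end


def label_result(result: dict, case: dict) -> str:
    """Label a result against the case's expected and negative evidence."""
    if any(overlaps(result, e) for e in case.get("expected_evidence", [])):
        return "expected"
    if any(overlaps(result, e) for e in case.get("negative_evidence", [])):
        return "negative"
    return "unlabeled"


uri = "s3://support-calls/call-42.mp4"
case = {
    "expected_evidence": [{"source_uri": uri, "start_time": "00:08:28", "end_time": "00:08:42"}],
    "negative_evidence": [{"source_uri": uri, "start_time": "00:04:10", "end_time": "00:04:30"}],
}
keyword_hit = {"source_uri": uri, "start_time": "00:04:15", "end_time": "00:04:25"}
keyword_hit_label = label_result(keyword_hit, case)
```

A result labeled "negative" in the top ranks is a stronger regression signal than a merely unlabeled one, because it is a known near miss.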

Metrics That Actually Diagnose Perception

Use standard retrieval metrics, but add perception-specific measures.

Coverage Metrics

Coverage metrics answer whether the corpus is searchable at all.

Extraction coverage: percent of files with required features.

Observation density: features per minute of video or audio, per page of document, or per image.

Feature freshness: percent of features produced by the current model and extractor version.

Lineage completeness: percent of observations with source URI, timestamp or page, model ID, and feature URI.

Coverage is an upstream gate. If only 60 percent of videos have transcripts, no agent eval can rescue audio search quality.
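Extraction coverage is straightforward to compute once each file lists the feature types that were produced for it. A minimal sketch, assuming a hypothetical per-file `features` list:

```python
def extraction_coverage(files: list, required: set) -> float:
    """Fraction of files whose extracted feature types include every required one."""
    if not files:
        return 0.0
    covered = sum(1 for f in files if required <= set(f["features"]))
    return covered / len(files)


# Five videos, three of which have transcripts.
videos = [
    {"uri": "v1.mp4", "features": ["transcript", "keyframe"]},
    {"uri": "v2.mp4", "features": ["keyframe"]},
    {"uri": "v3.mp4", "features": ["transcript", "keyframe", "ocr"]},
    {"uri": "v4.mp4", "features": ["transcript"]},
    {"uri": "v5.mp4", "features": []},
]
transcript_coverage = extraction_coverage(videos, {"transcript"})
```

Running the same function with a larger required set, such as `{"transcript", "keyframe"}`, shows how quickly coverage falls for cross-modal queries.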

Retrieval Metrics

Retrieval metrics answer whether the right evidence is returned.

Recall@k: what fraction of the expected evidence appeared in the top k? (Hit rate@k is the looser variant: did any expected evidence appear?)

nDCG@k: did highly relevant evidence rank above weak evidence?

MRR: how quickly does the first correct result appear?

Modality hit rate: did the correct modality contribute to the result?

Stage attribution: which stage produced or removed the relevant result?

For multimodal retrieval, compute metrics by query class. A single aggregate nDCG score can hide that visual queries improved while OCR queries regressed.
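The per-class breakdown can be sketched in a few functions. This example assumes each case carries a hypothetical `query_class` label, a ranked list of result IDs, and a `gains` map of graded relevance; it uses linear gain for nDCG (some implementations use an exponential gain instead) and treats every ID in `gains` as relevant for recall and MRR.

```python
import math
from collections import defaultdict


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant IDs that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    relevant = set(relevant)
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """nDCG@k with linear gain and a log2 position discount."""
    def dcg(values):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(values, start=1))

    actual = dcg([gains.get(item, 0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def metrics_by_class(cases, k):
    """Average recall, MRR, and nDCG separately for each query class."""
    grouped = defaultdict(list)
    for case in cases:
        grouped[case["query_class"]].append(case)
    report = {}
    for query_class, group in grouped.items():
        n = len(group)
        report[query_class] = {
            "recall": sum(recall_at_k(c["ranked"], c["gains"], k) for c in group) / n,
            "mrr": sum(reciprocal_rank(c["ranked"], c["gains"]) for c in group) / n,
            "ndcg": sum(ndcg_at_k(c["ranked"], c["gains"], k) for c in group) / n,
        }
    return report


cases = [
    {"query_class": "visual", "ranked": ["a", "b", "c"], "gains": {"a": 5, "b": 3}},
    {"query_class": "ocr", "ranked": ["x", "a"], "gains": {"a": 5}},
]
report = metrics_by_class(cases, k=5)
```

In this toy input the visual query ranks its evidence perfectly while the OCR query finds its evidence only at rank 2, a gap that a single pooled average would blur.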

Localization Metrics

Localization metrics answer whether the system found the right part of the media.

Temporal IoU: overlap between predicted and expected time ranges.

Timestamp error: absolute difference between predicted and expected start time.

Bounding-box IoU: overlap between predicted and expected object region.

Page or region accuracy: correct page, table cell, figure, or OCR box.

Speaker turn accuracy: correct speaker and time span.

This is where many media systems fail quietly. A clip-level answer can look plausible while being thirty seconds off.
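Both IoU measures reduce to intersection over union on intervals or rectangles. A minimal sketch, assuming time ranges are `(start, end)` pairs in seconds and boxes are `(x1, y1, x2, y2)` tuples:

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) time ranges in seconds."""
    (p_start, p_end), (t_start, t_end) = pred, truth
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


expected = (508, 522)                                  # 00:08:28 to 00:08:42
close_prediction = temporal_iou((510, 524), expected)  # two seconds late
late_prediction = temporal_iou((538, 552), expected)   # thirty seconds late
```

The two-second shift still overlaps most of the expected window, while the thirty-second shift scores zero even though it cites the right file.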

Tool Metrics

Tool metrics answer whether the agent behaved like a competent search user.

Tool selection accuracy: correct retrieval tool chosen.

Argument correctness: query, filters, modalities, top-k, and budget are appropriate.

Search depth: number of tool calls before sufficient evidence.

Budget adherence: work stays within limits.

Cancellation quality: stale work is cancelled when a better path appears.

Retry discipline: retries are idempotent and do not duplicate writes.

These metrics need traces. An output-only eval cannot tell whether the agent got lucky after wasting ten bad tool calls.
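As a sketch of what trace-derived tool metrics can look like, the function below summarizes a list of tool calls. The call record shape (`tool`, `args`, `cost_ms`) is an assumption for illustration and does not correspond to any particular tracing library.

```python
import json


def tool_metrics(calls, expected_tool, budget_ms):
    """Summarize a trace's tool calls: selection, depth, repeats, and budget."""
    seen = set()
    repeated = 0
    for call in calls:
        # Two calls are duplicates if the tool and its arguments are identical.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            repeated += 1
        seen.add(key)
    used_ms = sum(call["cost_ms"] for call in calls)
    return {
        "right_tool": any(call["tool"] == expected_tool for call in calls),
        "search_depth": len(calls),
        "repeated_calls": repeated,
        "budget_used_ms": used_ms,
        "budget_exceeded": used_ms > budget_ms,
    }


calls = [
    {"tool": "search_media", "args": {"query": "red light", "modalities": ["visual"]}, "cost_ms": 400},
    {"tool": "search_media", "args": {"query": "red light", "modalities": ["visual"]}, "cost_ms": 400},
    {"tool": "search_media", "args": {"query": "flashing red", "modalities": ["transcript", "visual"]}, "cost_ms": 500},
]
summary = tool_metrics(calls, expected_tool="search_media", budget_ms=1000)
```

Here the agent picked the right tool but repeated an identical query and overran its budget, which an output-only eval would not reveal.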

Grounding Metrics

Grounding metrics answer whether the final response is evidence-backed.

Citation precision: cited source spans actually support the answer.

Citation recall: all material claims have evidence.

Unsupported claim rate: claims with no retrieved support.

Evidence completeness: timestamps, speaker labels, frame IDs, boxes, or page numbers are present when needed.

Abstention quality: the agent refuses or asks for review when evidence is insufficient.

For high-stakes workflows, grounding should be stricter than answer correctness. A correct answer without inspectable evidence may still be unusable.
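The first three grounding metrics can be computed from judged claims. This sketch assumes each claim lists the evidence IDs it cites and, separately, the evidence IDs a human or model grader judged as actually supporting it; the `citations` and `supported_by` field names are hypothetical.

```python
def grounding_metrics(claims):
    """Citation precision, citation recall, and unsupported claim rate."""
    total_citations = sum(len(c["citations"]) for c in claims)
    supporting = sum(len(set(c["citations"]) & set(c["supported_by"])) for c in claims)
    backed_claims = sum(1 for c in claims if set(c["citations"]) & set(c["supported_by"]))
    n_claims = len(claims)
    precision = supporting / total_citations if total_citations else 0.0
    recall = backed_claims / n_claims if n_claims else 0.0
    return {
        "citation_precision": precision,
        "citation_recall": recall,
        "unsupported_claim_rate": 1.0 - recall if n_claims else 0.0,
    }


claims = [
    # Cites two spans, but only e1 actually supports the claim.
    {"text": "The customer says the device is flashing red.", "citations": ["e1", "e2"], "supported_by": ["e1"]},
    # Makes a claim with no citation at all.
    {"text": "The red light is visible on the device.", "citations": [], "supported_by": []},
]
scores = grounding_metrics(claims)
```

The quality of these numbers depends entirely on the `supported_by` judgments, so the grader itself needs spot checks against human labels.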

Trace-Based Agent Perception Evals

Trace-based evals inspect the agent's path, not just the final response. This is important because a perception agent often succeeds or fails before it writes a final answer.

A good trace record contains:

The user task.

The selected tool and tool schema.

Query rewrites.

Filters and namespaces.

Retrieval stages used.

Partial results and final results.

Budget used.

Cancellation events.

Evidence passed into the model.

Final answer and citations.

A perception-specific trace grader can score:

1. Did the agent choose the right tool?

2. Did the query preserve the user's intent?

3. Did it search the necessary modalities?

4. Did it apply the right source, tenant, permission, or namespace filters?

5. Did it inspect evidence before answering?

6. Did it cite exact source locations?

7. Did it avoid unsupported claims?

This aligns with the current agent ecosystem. MCP standardizes how tools and resources are exposed to agents. OpenAI Agents tracing records model calls, tool calls, guardrails, handoffs, and audio spans. LangChain AgentEvals score tool-call trajectories against references or rubrics. LlamaIndex retrieval evals expose standard retrieval metrics like hit rate and MRR. The missing piece is usually the perception-specific rubric.

A Minimal Rubric

Use a rubric that separates retrieval, trajectory, and grounding.

{
  "retrieval": {
    "correct_evidence_in_top_5": 1,
    "correct_modality_used": 1,
    "temporal_iou": 0.72
  },
  "trajectory": {
    "right_tool": true,
    "filters_correct": true,
    "unnecessary_tool_calls": 0,
    "budget_exceeded": false
  },
  "grounding": {
    "all_claims_cited": true,
    "timestamp_citation_present": true,
    "speaker_citation_present": true,
    "unsupported_claims": 0
  },
  "pass": true
}

This rubric is intentionally simple. Add complexity only when it changes engineering decisions.
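The overall pass flag can be derived from the three sections rather than set by hand. A minimal sketch over a rubric shaped like the one above; the 0.5 temporal IoU threshold is an assumed default, not a recommended value.

```python
def evaluate_rubric(rubric, min_temporal_iou=0.5):
    """Derive per-section and overall pass flags from a rubric record."""
    retrieval = rubric["retrieval"]
    trajectory = rubric["trajectory"]
    grounding = rubric["grounding"]
    sections = {
        "retrieval": (
            retrieval["correct_evidence_in_top_5"] == 1
            and retrieval["correct_modality_used"] == 1
            and retrieval["temporal_iou"] >= min_temporal_iou
        ),
        "trajectory": (
            trajectory["right_tool"]
            and trajectory["filters_correct"]
            and trajectory["unnecessary_tool_calls"] == 0
            and not trajectory["budget_exceeded"]
        ),
        "grounding": (
            grounding["all_claims_cited"]
            and grounding["timestamp_citation_present"]
            and grounding["speaker_citation_present"]
            and grounding["unsupported_claims"] == 0
        ),
    }
    sections["pass"] = all(sections.values())
    return sections


rubric = {
    "retrieval": {"correct_evidence_in_top_5": 1, "correct_modality_used": 1, "temporal_iou": 0.72},
    "trajectory": {"right_tool": True, "filters_correct": True, "unnecessary_tool_calls": 0, "budget_exceeded": False},
    "grounding": {"all_claims_cited": True, "timestamp_citation_present": True, "speaker_citation_present": True, "unsupported_claims": 0},
}
verdict = evaluate_rubric(rubric)
```

Keeping the per-section flags alongside the overall verdict preserves the diagnostic value of the rubric when a case fails.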

Offline, Online, and Replay Evals

You need three eval modes.

Offline Golden Sets

Use hand-labeled examples before shipping a model, retriever, extractor, or prompt change.

Best for:

Comparing model versions.

Testing chunking strategies.

Validating new extractors.

Checking temporal grounding.

Preventing obvious regressions.

Weakness:

Golden sets get stale if the corpus or user behavior changes.

Online Signals

Use production behavior to identify weak areas.

Signals include:

Clicks, long views, exports, saves, purchases, or human approvals.

Query reformulations.

Empty-result retries.

Abandoned searches.

Human override decisions.

Reviewer disagreement.

Online signals are noisy. They should not replace labeled evals, but they are the best early warning for drift.

Replay Benchmarks

Replay historical queries and interactions against candidate pipelines before shipping.

Best for:

Testing reranker changes.

Comparing hybrid weights.

Measuring latency and cost differences.

Detecting changes that promote or demote interacted results.

Validating that a pipeline change helps real sessions, not only a curated dataset.

Replay is especially valuable for agents because the agent's tool path is often sensitive to small ranking changes. If a relevant result drops from rank 3 to rank 18, the agent may never inspect it.
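A replay check for this kind of demotion can be sketched as a rank comparison. The example assumes you have, per historical query, the baseline ranking, the candidate ranking, and the IDs the user or agent interacted with; `inspect_depth` is a hypothetical stand-in for how many results the agent typically reads.

```python
def rank_of(ranked, item_id):
    """1-based rank of item_id in the list, or None if it is absent."""
    return ranked.index(item_id) + 1 if item_id in ranked else None


def demoted_out_of_view(baseline, candidate, interacted, inspect_depth=10):
    """Interacted results that were within inspect_depth before but are not now."""
    demoted = []
    for item_id in interacted:
        before = rank_of(baseline, item_id)
        after = rank_of(candidate, item_id)
        was_visible = before is not None and before <= inspect_depth
        still_visible = after is not None and after <= inspect_depth
        if was_visible and not still_visible:
            demoted.append({"id": item_id, "before": before, "after": after})
    return demoted


baseline = ["a", "b", "clip_7"]
candidate = [f"other_{i}" for i in range(17)] + ["clip_7"]
demoted = demoted_out_of_view(baseline, candidate, interacted=["clip_7"])
```

Counting these cases across a replayed session log gives a concrete number to compare between pipeline versions.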

How This Maps to Mixpeek

Mixpeek's perception layer decomposes media into searchable features, and the retriever evaluation framework measures search quality over those features.

A typical flow:

1. Ingest media with the features required by the agent.

2. Create a ground-truth dataset with query inputs and relevant document or feature IDs.

3. Run the retriever evaluation against the target retriever.

4. Track standard ranking metrics.

5. Add trace-level and grounding checks around the agent that calls the retriever.

6. Replay historical sessions when changing models, extractors, or retrieval stages.

Example dataset creation:

curl -X POST "https://api.mixpeek.com/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: ns_agent_media" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "agent_perception_v1",
    "description": "Cross-modal video, audio, OCR, and object retrieval evals for an agent tool",
    "queries": [
      {
        "query_id": "q_red_light_visible",
        "query_input": {
          "query": "customer says the device is flashing red and the red light is visible"
        },
        "relevant_documents": ["feat_call_42_00_08_28"],
        "relevance_scores": {
          "feat_call_42_00_08_28": 5,
          "feat_call_42_00_08_20": 3
        }
      }
    ],
    "metadata": {
      "requires_modalities": ["transcript", "visual"],
      "requires_evidence_fields": ["source_uri", "timestamp", "speaker", "keyframe"]
    }
  }'

Run the evaluation:

curl -X POST "https://api.mixpeek.com/v1/retrievers/ret_agent_media/evaluations" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: ns_agent_media" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "agent_perception_v1",
    "evaluation_config": {
      "k_values": [1, 5, 10, 20],
      "metrics": ["precision", "recall", "f1", "map", "ndcg", "mrr"]
    }
  }'

Then wrap the agent tool with a trace grader:

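def grade_perception_trace(trace):
    """Grade one agent trace on tool use, budget, and grounding.

    `trace` stands for a wrapper around your tracing library's records.
    used_tool, tool_args_include, budget_used_ms, budget_limit_ms, and
    final_answer.has_field are illustrative names, not a specific library's
    API. count_unsupported_claims is a helper you supply, for example a
    human- or model-graded claim checker.
    """
    result = {
        # Did the agent call the media search tool at all?
        "right_tool": trace.used_tool("search_media"),
        # Did the search cover every modality the query needs?
        "searched_required_modalities": trace.tool_args_include(
            "search_media",
            "modalities",
            ["transcript", "visual"]
        ),
        # Did the total work exceed the configured budget?
        "budget_exceeded": trace.budget_used_ms > trace.budget_limit_ms,
        # Does the final answer cite a timestamp?
        "has_timestamp_citation": trace.final_answer.has_field("timestamp"),
        # How many claims have no retrieved support?
        "unsupported_claims": count_unsupported_claims(trace),
    }

    # The trace passes only if every individual check passes.
    result["pass"] = (
        result["right_tool"]
        and result["searched_required_modalities"]
        and not result["budget_exceeded"]
        and result["has_timestamp_citation"]
        and result["unsupported_claims"] == 0
    )
    return result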

The exact trace library can vary. The principle is the same: score the retrieval result, the tool path, and the final grounded answer separately.

What to Monitor in Production

Perception evals should become production telemetry.

Track:

Extraction failure rate by feature extractor and file type.

Percent of new media with required features.

Empty-result rate by retriever.

Query class distribution.

Recall and nDCG on scheduled golden sets.

Replay benchmark deltas before pipeline changes.

Unsupported claim rate from trace graders.

Percent of answers with source, timestamp, page, region, or speaker citations.

Cost per successful grounded answer.

P95 and p99 latency by retrieval stage.

Cancellation and retry rates.

Alert on regressions that change behavior, not only on infrastructure failures. For example:

Video transcript coverage drops below 95 percent.

Cross-modal eval Recall@10 drops more than 5 percent.

Agent tool calls per task double after a prompt change.

Timestamp citation precision falls below the review threshold.

Replay benchmark shows interacted results demoted.
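Several of these alerts reduce to comparing a current metrics snapshot against a baseline. The sketch below covers three of them under stated assumptions: the metric key names are hypothetical, and the Recall@10 rule is read as a relative drop of more than 5 percent (an absolute, percentage-point reading would need a different comparison).

```python
def check_alerts(baseline, current, min_coverage=0.95, max_recall_drop=0.05, max_call_growth=2.0):
    """Return the names of metrics that crossed their alert thresholds."""
    alerts = []
    # Transcript coverage has an absolute floor.
    if current["transcript_coverage"] < min_coverage:
        alerts.append("transcript_coverage")
    # Recall@10 is compared to the baseline as a relative drop.
    if baseline["recall_at_10"] > 0:
        drop = (baseline["recall_at_10"] - current["recall_at_10"]) / baseline["recall_at_10"]
        if drop > max_recall_drop:
            alerts.append("recall_at_10")
    # Tool calls per task alert when they reach a multiple of the baseline.
    base_calls = baseline["tool_calls_per_task"]
    if base_calls > 0 and current["tool_calls_per_task"] >= max_call_growth * base_calls:
        alerts.append("tool_calls_per_task")
    return alerts


baseline = {"transcript_coverage": 0.98, "recall_at_10": 0.80, "tool_calls_per_task": 3.0}
current = {"transcript_coverage": 0.93, "recall_at_10": 0.74, "tool_calls_per_task": 6.5}
fired = check_alerts(baseline, current)
```

Thresholds like these are starting points; tune them against the normal variance of each metric so that alerts stay meaningful.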

Design Checklist

Define query classes before choosing metrics.

Label evidence spans, not only answer text.

Include negative examples and near misses.

Track ingestion coverage before retrieval quality.

Compute retrieval metrics by modality and query class.

Add localization metrics for timestamps, boxes, pages, and speaker turns.

Grade tool trajectories, not only final answers.

Require exact evidence citations for media claims.

Separate offline golden-set evals from online signals.

Replay historical sessions before changing models, extractors, chunking, or fusion weights.

Monitor cost, latency, stale work, and cancellation alongside quality.

Key Takeaways

1. Agent perception evals measure the path from raw media to grounded agent behavior.

2. Standard retrieval metrics are necessary but incomplete for agents.

3. The best datasets label evidence spans: timestamps, pages, boxes, speakers, and feature IDs.

4. Trace-based grading catches tool-selection, filter, budget, cancellation, and citation failures.

5. Production monitoring should track extraction coverage and retrieval drift before users notice bad answers.

6. A correct final answer is not enough. The agent must show the evidence it used.

