Audio-Visual Retrieval for AI Agents: How to Search What Happened, Not Just What Was Said

Why Audio-Visual Retrieval Is Different

Most video search systems start with transcripts. That works when the important evidence is spoken. It fails when the event is visible, audible, or only meaningful when sound and motion are interpreted together.

Examples:

A support agent needs the clip where a device clicks twice, flashes red, and then shuts down.

A safety agent needs near misses where a forklift horn sounds before a worker steps back.

A media agent needs the crowd reaction after a shot, not just the commentator saying it was a shot.

A robotics agent needs the moment a motor begins squealing before the arm stalls.

In all four cases, the evidence is not a document. It is an observation with time, pixels, sound, speech, objects, and source lineage. The agent does not need a vague summary. It needs a bounded retrieval tool that returns exact clips and explains which signals matched.

That is the bar for this topic: audio-visual retrieval should let an AI agent see, hear, and search unstructured content.

The Core Mental Model

Think of video with audio as a stream of observations. Retrieval turns that stream into searchable evidence packets.

raw media
  -> temporal chunks
  -> feature channels
  -> per-channel indexes
  -> fusion and reranking
  -> evidence packet for the agent

The evidence packet is the unit the agent consumes. It should contain:

source URI

start and end timestamps

transcript span when speech is present

visual frame samples

audio event evidence

object, OCR, face, or speaker metadata when available

feature provenance: model ID, extractor version, score, and stage

neighboring context the agent can request next

This structure is more important than any single model. Models change quickly. The retrieval architecture should keep source media, timestamps, features, and evaluation labels stable enough to survive model upgrades.
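As a minimal sketch, the packet described above could be modeled as a small typed record. The class and field names here (`EvidencePacket`, `Provenance`, `frame_uris`, `neighbor_ids`) are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Provenance:
    """Which extractor produced a match (hypothetical field names)."""
    stage: str
    model_id: str
    extractor_version: str
    score: float


@dataclass
class EvidencePacket:
    """One timestamped unit of evidence handed to the agent."""
    source_uri: str
    start_sec: float
    end_sec: float
    transcript: Optional[str] = None  # set only when speech is present
    frame_uris: list[str] = field(default_factory=list)  # sampled frames
    audio_events: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)  # object, OCR, face, speaker
    provenance: list[Provenance] = field(default_factory=list)
    neighbor_ids: list[str] = field(default_factory=list)  # context to request next

    def __post_init__(self) -> None:
        # A packet without a valid time span cannot be inspected by the agent.
        if self.end_sec <= self.start_sec:
            raise ValueError("end_sec must be greater than start_sec")

    @property
    def duration_sec(self) -> float:
        return self.end_sec - self.start_sec
```

Keeping provenance as its own record means a model upgrade changes `model_id` and `extractor_version` without touching the packet shape.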

Step 1: Chunk Time Before You Embed

Continuous video is not directly searchable. You need temporal units.

Common chunking strategies:

Fixed windows: 2, 5, 10, or 30 second windows with overlap.

Shot boundaries: cuts detected from color histograms, embeddings, optical flow, or scene-change models.

Speaker turns: transcript segments bounded by diarization or ASR timestamps.

Audio events: sound activity windows from energy, spectral change, or audio-event models.

Object tracks: windows derived from when an object appears, moves, disappears, or interacts.

Fixed windows are predictable and easy to batch. Shot boundaries better preserve visual meaning. Speaker turns are useful for conversation search. Audio-event windows catch sounds that ASR ignores. Object tracks give agents spatial continuity.

In practice, use more than one segmentation layer:

short overlapping windows for precise retrieval

longer scene windows for context

transcript or speaker turns for speech search

object-track windows for grounded visual events

Every feature should carry timestamps. Without timestamps, the agent can retrieve a file but cannot inspect the moment.
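The fixed-window strategy is the simplest one to sketch. The helper below is a hypothetical function, not taken from any library. It produces overlapping `(start, end)` spans in seconds and trims the last window to the media duration.

```python
def fixed_windows(
    duration_sec: float, window_sec: float, overlap_sec: float
) -> list[tuple[float, float]]:
    """Return overlapping (start, end) windows covering [0, duration_sec]."""
    if window_sec <= 0 or not 0 <= overlap_sec < window_sec:
        raise ValueError("need window_sec > 0 and 0 <= overlap_sec < window_sec")
    stride = window_sec - overlap_sec
    windows: list[tuple[float, float]] = []
    start = 0.0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        windows.append((start, end))
        if end >= duration_sec:
            break  # the final window already reaches the end of the media
        start += stride
    return windows
```

For example, `fixed_windows(12.0, 5.0, 2.0)` yields four windows starting at 0, 3, 6, and 9 seconds, so an event that straddles the 5-second mark still falls fully inside the second window.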

Step 2: Extract Separate Feature Channels

Audio-visual retrieval is usually multi-index retrieval, not one magic vector. Each channel preserves a different kind of evidence.

Transcript Channel

ASR converts speech into text spans. It is strong for named entities, exact phrases, instructions, decisions, and conversational search.

Weaknesses:

misses non-speech sound

loses tone, timing, and visual context

fails when speech is noisy, multilingual, overlapping, or off-camera

cannot prove that a visible event actually happened

Use transcript search as a high-recall text channel, not as the whole system.

Audio Embedding Channel

Audio embeddings represent sound events, music, environmental noise, alarms, mechanical patterns, and speech acoustics. CLAP-style models align audio and text. Newer audio-video models such as PE-AV and WAVE also align audio with visual clips.

This matters when the query names an event by sound:

"glass breaking"

"siren before impact"

"machine squeal"

"applause gets louder"

"customer sounds frustrated"

An agent should be able to search these sounds even when no one says those words.

Video Embedding Channel

Video embeddings preserve motion and scene dynamics. Image embeddings over keyframes are useful, but they miss events that require motion. Video encoders such as V-JEPA 2, VideoPrism, and VideoLLaMA-style models help represent action, motion, and temporal state.

Use video embeddings for:

actions and gestures

sports plays

manufacturing process steps

camera motion

object movement

before-and-after visual changes

For fast events, short clips matter. For procedures, longer clips matter. Store both when the agent may need precise evidence and surrounding context.

Object, OCR, Face, and Scene Channels

Dense embeddings are good at fuzzy similarity. Structured channels are good at constraints.

Examples:

object: "forklift", "helmet", "red box"

OCR: "E113", "subtotal", "approved"

face: known person or cast member when permitted

scene: "warehouse dock", "checkout counter", "sports court"

metadata: camera ID, tenant, object URI, campaign, date, product SKU

Agents need these channels because tool calls usually include filters. A query like "find clips where a forklift moves near a worker in dock-3 after 8 PM" should not rely only on nearest-neighbor similarity.
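One way to read that forklift query is as hard constraints first and similarity second. The sketch below assumes each clip is a plain dict with hypothetical keys (`camera_id`, `objects`, `started_at`), and that the similarity score comes from some other channel. The time check compares time of day only, so it does not handle ranges that cross midnight.

```python
from datetime import datetime, time
from typing import Callable


def passes_filters(
    clip: dict, camera_id: str, required_objects: set[str], not_before: time
) -> bool:
    """Apply structured constraints that nearest-neighbor search cannot guarantee."""
    started = datetime.fromisoformat(clip["started_at"]).time()
    return (
        clip["camera_id"] == camera_id
        and required_objects <= set(clip["objects"])  # all required objects present
        and started >= not_before
    )


def filtered_search(
    clips: list[dict], similarity: Callable[[dict], float], **filters
) -> list[dict]:
    """Filter first, then rank the survivors by a similarity score."""
    kept = [clip for clip in clips if passes_filters(clip, **filters)]
    return sorted(kept, key=similarity, reverse=True)
```

A clip with high similarity from the wrong camera or the wrong hour never reaches the ranking step, which is the behavior the agent's filter arguments imply.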

Step 3: Search Channels Independently

A robust retrieval pipeline starts by searching several channels separately.

user task
  -> query planner
  -> transcript search
  -> audio embedding search
  -> video embedding search
  -> object/OCR/filter search
  -> candidate pool

Each stage should return candidates with:

evidence ID

timestamp

score

stage name

modality

feature version

source URI

Do not throw away provenance. When the agent later sees a result, it should know whether the match came from a transcript phrase, an audio event, a visual motion pattern, or a structured filter.
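A rough sketch of a candidate pool that keeps provenance: instead of overwriting a clip when a second channel finds it, every hit is appended under the same evidence ID. The `Candidate` fields mirror the list above, with the timestamp stored as a start and end. The names are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    evidence_id: str
    start_sec: float
    end_sec: float
    score: float
    stage: str
    modality: str
    feature_version: str
    source_uri: str


def pool_candidates(stage_results: list[list[Candidate]]) -> dict[str, list[Candidate]]:
    """Group hits by evidence ID without discarding which stage produced them."""
    pool: dict[str, list[Candidate]] = defaultdict(list)
    for results in stage_results:
        for candidate in results:
            pool[candidate.evidence_id].append(candidate)
    return dict(pool)


def matched_modalities(hits: list[Candidate]) -> list[str]:
    """Distinct modalities that matched a clip, in sorted order."""
    return sorted({hit.modality for hit in hits})
```

With this shape, the agent can later see that one clip was matched by both the transcript and audio stages while another was matched by video alone.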

Step 4: Fuse Results Without Pretending Scores Are Comparable

Scores from transcript search, vector search, object filters, and rerankers are not naturally comparable. A cosine score of 0.31 from an audio model is not the same thing as a BM25 score of 14 or a visual reranker score of 0.72.

Use rank-based fusion when scores come from different systems.

Reciprocal Rank Fusion

Reciprocal Rank Fusion is simple and strong:

RRF(candidate) = sum over result lists 1 / (k + rank_in_list)

The constant k, often 60, reduces the dominance of the top few positions. RRF works well because it only needs ranks, not calibrated scores.

Use RRF when:

multiple modalities return candidate lists

score scales are not calibrated

you need a stable first implementation
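The RRF formula above fits in a few lines. This is a minimal sketch that takes one ranked list of evidence IDs per channel. The function name and the tie-break by ID are choices made here for determinism, not part of the method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of evidence IDs using only their ranks."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, evidence_id in enumerate(ranked, start=1):
            scores[evidence_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A clip ranked second by transcript search and first by both audio and video search scores 1/62 + 1/61 + 1/61, which beats any clip that appears in only one list.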

Weighted Fusion

Weighted fusion lets the query planner emphasize channels:

speech-heavy query: transcript weight high

sound-event query: audio weight high

action query: video weight high

compliance query: filters and OCR weight high

Weighted fusion is powerful but risky. If weights are hand-tuned globally, one modality can dominate. Track per-query-class metrics so improvements in one class do not hide regressions in another.
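One hedged way to combine the two ideas is to keep rank-based fusion and let the query planner scale each channel's contribution. The weight table below is made up for illustration. Real weights should come from per-query-class evaluation, not from this sketch.

```python
# Hypothetical per-query-class channel weights; tune these against your own eval set.
CHANNEL_WEIGHTS: dict[str, dict[str, float]] = {
    "speech": {"transcript": 2.0, "audio": 0.5, "video": 0.5},
    "sound_event": {"transcript": 0.5, "audio": 2.0, "video": 1.0},
    "action": {"transcript": 0.5, "audio": 1.0, "video": 2.0},
}


def weighted_rrf(
    ranked_by_channel: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Rank fusion where each channel's reciprocal rank is scaled by a weight."""
    scores: dict[str, float] = {}
    for channel, ranked in ranked_by_channel.items():
        weight = weights.get(channel, 1.0)  # unlisted channels count once
        for rank, evidence_id in enumerate(ranked, start=1):
            scores[evidence_id] = scores.get(evidence_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because the same candidate lists produce different orderings under different weight rows, logging the query class next to each result makes regressions traceable to a specific row.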

Diversification

Agents often need evidence variety, not twenty near-duplicates. Maximal Marginal Relevance helps balance relevance and diversity:

select next = argmax over unselected candidates of (relevance_to_query - lambda * max similarity_to_selected_results)

Use diversification when returning clips from long videos. It helps the agent inspect different moments before deciding whether to expand context.
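A greedy selection loop is a common way to apply this idea. The sketch below assumes each clip has a relevance score and an embedding vector, and it measures redundancy as the maximum cosine similarity to any clip already selected. Both are assumptions of this example, and other similarity measures work too.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    relevance: dict[str, float],
    vectors: dict[str, list[float]],
    top_n: int,
    lam: float = 0.5,
) -> list[str]:
    """Greedily pick clips that are relevant but not redundant with earlier picks."""
    selected: list[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < top_n:
        def mmr_score(cid: str) -> float:
            redundancy = max(
                (cosine(vectors[cid], vectors[s]) for s in selected), default=0.0
            )
            return relevance[cid] - lam * redundancy

        best = max(sorted(remaining), key=mmr_score)  # sorted() keeps ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=0` this reduces to plain relevance ordering. Raising `lam` pushes near-duplicate windows further down the list.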

Step 5: Rerank With the Full Question and Evidence

First-stage retrieval should optimize recall. Reranking should optimize precision.

Rerankers can inspect richer inputs:

user query

transcript span

keyframe captions

object names

OCR text

audio event labels

neighboring context

source metadata

For multimodal search, reranking can be:

text cross-encoder over query plus transcript and generated captions

vision-language reranker over query plus keyframes

late-interaction document or visual retriever for pages and screenshots

LLM or VLM verifier for small candidate sets

Keep reranking bounded. An agent retrieval tool should return fast enough for iterative use. Rerank the top 50, not the top 5,000. Cache features, and do not send raw video to a VLM unless the candidate set is already small.
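As a sketch of what "bounded" can mean in code, the helper below caps both the number of candidates scored and the wall-clock time spent. `score_fn` stands in for whatever reranker you use, and the function name and defaults are assumptions of this example.

```python
import time
from typing import Callable


def bounded_rerank(
    candidates: list[str],
    score_fn: Callable[[str], float],
    max_candidates: int = 50,
    budget_ms: float = 3000.0,
) -> list[str]:
    """Rerank only the head of the list, and stop scoring when the budget runs out."""
    deadline = time.monotonic() + budget_ms / 1000.0
    head, tail = candidates[:max_candidates], candidates[max_candidates:]
    scored: list[tuple[float, str]] = []
    unscored: list[str] = []
    for index, candidate in enumerate(head):
        if time.monotonic() >= deadline:
            unscored = head[index:]  # keep first-stage order for what was not scored
            break
        scored.append((score_fn(candidate), candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored] + unscored + tail
```

Candidates that were never scored keep their first-stage order after the reranked ones, so a timeout degrades precision without dropping results.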

Step 6: Expand Time After Retrieval

The top result is rarely the exact context an agent needs. A five-second audio event may require the preceding thirty seconds to explain what caused it.

Use temporal expansion after ranking:

include the matching window

include one or two neighboring windows

include parent scene boundaries

include speaker turn before and after

include object tracks that overlap the time range

Return the match and the context separately. The agent should know what matched and what is surrounding evidence.

{
  "match": {
    "source_uri": "s3://ops/video/cam-4.mp4",
    "start_sec": 184.0,
    "end_sec": 191.0,
    "matched_modalities": ["audio", "video", "object"],
    "matched_stages": ["pe_av_embedding", "object_filter"]
  },
  "context": {
    "before_sec": 30,
    "after_sec": 20,
    "parent_scene_id": "scene_00042"
  }
}

This prevents a common failure: the retriever finds the right moment, but the agent answers from too narrow a clip.
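A minimal sketch of that separation: the matched span is returned untouched, and the context span is padded on both sides and clamped to the media duration. The function and key names are illustrative assumptions.

```python
def expand_context(
    match_start: float,
    match_end: float,
    before_sec: float,
    after_sec: float,
    media_duration_sec: float,
) -> dict:
    """Return the matched span and a wider context span as separate entries."""
    return {
        "match": {"start_sec": match_start, "end_sec": match_end},
        "context": {
            # Clamp so the context never runs past the start or end of the media.
            "start_sec": max(0.0, match_start - before_sec),
            "end_sec": min(media_duration_sec, match_end + after_sec),
        },
    }
```

Snapping the context span to a parent scene boundary would be a natural extension, but it depends on how scenes are stored, so it is left out here.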

Design the Agent Tool Surface

An agent should not call a vague "search everything" function. It should call a bounded tool with explicit arguments.

{
  "tool": "search_audio_visual_evidence",
  "input_schema": {
    "query": "string",
    "collections": ["string"],
    "time_range": {"from": "string", "to": "string"},
    "modalities": ["transcript", "audio", "video", "object", "ocr"],
    "filters": {},
    "top_k": 20,
    "budget_ms": 3000,
    "include_context": true
  }
}
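A bounded tool also needs to enforce its bounds on the server side. The sketch below validates arguments shaped like the schema above. The caps (`max_top_k`, `max_budget_ms`) and the rule that at least one collection is required are assumptions chosen for illustration.

```python
ALLOWED_MODALITIES = {"transcript", "audio", "video", "object", "ocr"}


def validate_tool_input(
    args: dict, max_top_k: int = 50, max_budget_ms: int = 10_000
) -> dict:
    """Reject vague calls and clamp limits before any search runs."""
    if not str(args.get("query") or "").strip():
        raise ValueError("query must be a non-empty string")
    if not args.get("collections"):
        raise ValueError("at least one collection is required")
    modalities = args.get("modalities") or sorted(ALLOWED_MODALITIES)
    unknown = set(modalities) - ALLOWED_MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return {
        **args,
        "modalities": list(modalities),
        # Clamp rather than reject, so an over-eager agent still gets an answer.
        "top_k": max(1, min(int(args.get("top_k", 20)), max_top_k)),
        "budget_ms": max(1, min(int(args.get("budget_ms", 3000)), max_budget_ms)),
        "include_context": bool(args.get("include_context", True)),
    }
```

Clamping `top_k` and `budget_ms` keeps a single tool call from turning into a full-corpus scan, while the hard errors force the agent to name a query and a collection.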

The output should be evidence, not a final answer:

{
  "results": [
    {
      "source_uri": "s3://support/call-42.mp4",
      "start_sec": 512.0,
      "end_sec": 526.0,
      "summary": "Customer says the unit clicks, then a red light flashes twice.",
      "matched_modalities": ["transcript", "audio", "video"],
      "scores": {
        "rrf": 0.041,
        "rerank": 0.83
      },
      "provenance": [
        {"stage": "asr_bm25", "rank": 2},
        {"stage": "audio_video_embedding", "rank": 1},
        {"stage": "vl_reranker", "rank": 1}
      ]
    }
  ],
  "next_actions": ["expand_context", "retrieve_frames", "request_human_review"]
}

This matches the direction of modern agent systems. MCP exposes tools through schemas. LangChain and LlamaIndex agents call tools with structured arguments. OpenAI trace grading and agent evals inspect tool trajectories, not just final answers. Audio-visual retrieval should be built with the same discipline.

Evaluation: What to Measure

Do not evaluate audio-visual retrieval with one aggregate relevance score. Break the eval by query class.

Query Classes

transcript-only: "where does the customer mention a refund"

audio-only: "glass breaking"

visual-only: "operator opens the red panel"

audio-video: "horn sounds as forklift enters aisle"

OCR plus video: "screen shows E113 before shutdown"

object plus speech: "speaker says approved while signing the form"

negative: "there should be no clip where the logo appears"

Metrics

Recall@k by modality

nDCG@k by query class

MRR for exact evidence

temporal IoU between retrieved and labeled time spans

modality attribution accuracy

false positive rate for negative queries

average cost per successful evidence retrieval

p95 latency by stage

share of stale work cancelled when the agent changes direction
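Three of these metrics are short enough to sketch directly. The helpers below use standard definitions. The function names and the convention of representing spans as `(start, end)` tuples in seconds are assumptions of this example.

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection over union of two (start, end) spans in seconds."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant evidence IDs found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; average over queries to get MRR."""
    for rank, evidence_id in enumerate(retrieved, start=1):
        if evidence_id in relevant:
            return 1.0 / rank
    return 0.0
```

Temporal IoU is the one that transcripts-only evals tend to skip. A retrieved span of 184 to 191 seconds against a labeled span of 186 to 193 seconds overlaps for 5 seconds out of a 9-second union, giving 5/9.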

Ablations

Run ablations before making architecture claims:

transcript only

audio embedding only

video embedding only

transcript plus video

audio plus video

all channels plus reranker

A good system should show where each modality helps. If audio never improves recall on audio-event queries, the audio channel is not pulling its weight. If video improves recall but hurts latency too much, use it only for query classes that need motion.
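An ablation report can be as simple as mean Recall@k per configuration and query class. The sketch below assumes each eval row is a dict with hypothetical keys: `query_class`, a non-empty `relevant` set of evidence IDs, and `retrieved`, which maps a configuration name to its ranked results.

```python
from collections import defaultdict


def ablation_table(eval_rows: list[dict], k: int = 10) -> dict[str, dict[str, float]]:
    """Mean Recall@k for every (configuration, query class) pair."""
    recalls: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for row in eval_rows:
        relevant = row["relevant"]  # assumed non-empty for labeled queries
        for config, retrieved in row["retrieved"].items():
            hits = len(set(retrieved[:k]) & relevant)
            recalls[config][row["query_class"]].append(hits / len(relevant))
    return {
        config: {qc: sum(vals) / len(vals) for qc, vals in per_class.items()}
        for config, per_class in recalls.items()
    }
```

Reading the table by column shows which channel carries each query class, which is the evidence needed before claiming a modality helps.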

Common Failure Modes

Transcript tunnel vision. The system retrieves what was said, not what happened. Add audio and video query classes to the eval set.

Clip boundary loss. The event crosses a chunk boundary. Use overlapping windows and parent scene expansion.

Score scale confusion. Audio, video, and text scores are fused as if they are comparable. Use rank fusion or calibrated per-channel scores.

Duplicate evidence. Top results all come from adjacent windows around the same moment. Add diversification and collapse overlapping windows.

Missing provenance. The agent cannot tell why a clip matched. Store stage, modality, rank, score, model, and feature version.

Over-broad tool calls. The agent searches every collection with every modality. Add required filters, top-k, budget, and cancellation.

Wrong context width. A retrieved clip is correct but too short for reasoning. Return matched evidence separately from expanded context.
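The duplicate-evidence fix can be sketched as a greedy pass that keeps the best-scoring window and drops any lower-scoring window from the same source that overlaps it in time. The result shape (dicts with `source_uri`, `start_sec`, `end_sec`, `score`) is an assumption of this example.

```python
def collapse_overlaps(results: list[dict]) -> list[dict]:
    """Keep the highest-scoring window from each group of overlapping windows."""
    kept: list[dict] = []
    for result in sorted(results, key=lambda r: r["score"], reverse=True):
        overlaps_kept = any(
            other["source_uri"] == result["source_uri"]
            and result["start_sec"] < other["end_sec"]
            and other["start_sec"] < result["end_sec"]
            for other in kept
        )
        if not overlaps_kept:
            kept.append(result)
    return kept
```

Windows from different sources are never collapsed, even when their timestamps coincide, because they are different pieces of evidence.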

Mixpeek Implementation Pattern

In Mixpeek, model the pipeline as feature extraction plus a retriever that searches multiple feature channels. The exact model choices depend on your corpus, but the structure is stable.

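from mixpeek import Mixpeek

# Placeholder key and IDs: replace them with values from your own account.
mx = Mixpeek(api_key="mxp_sk_...")

# Create a collection that runs scene segmentation over media in the source bucket.
mx.collections.create(
    namespace_id="my-namespace",
    collection_name="my-collection",
    source={"type": "bucket", "bucket_ids": ["bkt_your_bucket"]},
    feature_extractor={"feature_extractor_name": "scene_segmentation", "version": "v1"},
)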

Then expose a retriever as an agent tool:

results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="high pitched motor squeal followed by abrupt arm movement",
)

The agent receives timestamped evidence with provenance. It can answer only when the evidence supports the claim, expand context when needed, or ask a human to review uncertain clips.

Design Checklist

Preserve raw media and source URIs.

Segment into short windows and parent scenes.

Store timestamps on every feature.

Use transcript, audio, video, object, OCR, and metadata channels where relevant.

Search channels independently before fusion.

Use RRF or calibrated fusion across score systems.

Rerank a bounded candidate set.

Collapse duplicates and diversify results.

Return matched evidence separately from expanded context.

Log modality, stage, score, model ID, and extractor version.

Evaluate by query class, not only aggregate quality.

Expose the retriever as a bounded agent tool with filters, limits, budgets, and cancellation.

Key Takeaways

1. Audio-visual retrieval is about timestamped evidence, not just video summaries.

2. Transcripts are useful but incomplete. Agents need audio, video, object, OCR, and metadata channels.

3. Multi-index retrieval works best when each modality searches independently, then fusion and reranking combine the evidence.

4. Temporal expansion after retrieval is what lets the agent reason from context instead of isolated clips.

5. The safest agent interface returns evidence, provenance, budgets, and next actions. It does not hide uncertainty behind a final answer.