    Agent Perception
    17 min read
    Updated 2026-06-04

    Audio-Visual Retrieval for AI Agents: How to Search What Happened, Not Just What Was Said

    A technical guide to building retrieval over video with sound. Learn chunking, audio-video embeddings, transcript and object channels, fusion, reranking, evaluation, and agent tool design.

    AI Agents
    Audio
    Video
    Multimodal Retrieval
    Agent Perception

    Why Audio-Visual Retrieval Is Different



    Most video search systems start with transcripts. That works when the important evidence is spoken. It fails when the event is visible, audible, or only meaningful when sound and motion are interpreted together.

    Examples:

  1. A support agent needs the clip where a device clicks twice, flashes red, and then shuts down.
  2. A safety agent needs near misses where a forklift horn sounds before a worker steps back.
  3. A media agent needs the crowd reaction after a shot, not just the commentator saying it was a shot.
  4. A robotics agent needs the moment a motor begins squealing before the arm stalls.


    In all four cases, the evidence is not a document. It is an observation with time, pixels, sound, speech, objects, and source lineage. The agent does not need a vague summary. It needs a bounded retrieval tool that returns exact clips and explains which signals matched.

    That is the bar for this topic: audio-visual retrieval has to help an AI agent see, hear, and search unstructured content.

    The Core Mental Model



    Think of video with audio as a stream of observations. Retrieval turns that stream into searchable evidence packets.

    raw media
      -> temporal chunks
      -> feature channels
      -> per-channel indexes
      -> fusion and reranking
      -> evidence packet for the agent
    


    The evidence packet is the unit the agent consumes. It should contain:

  6. source URI
  7. start and end timestamps
  8. transcript span when speech is present
  9. visual frame samples
  10. audio event evidence
  11. object, OCR, face, or speaker metadata when available
  12. feature provenance: model ID, extractor version, score, and stage
  13. neighboring context the agent can request next


    This structure is more important than any single model. Models change quickly. The retrieval architecture should keep source media, timestamps, features, and evaluation labels stable enough to survive model upgrades.
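    A minimal sketch of the evidence packet as a Python data structure. The class and field names (`EvidencePacket`, `Provenance`, `frame_uris`, `neighbor_ids`, and so on) are illustrative assumptions, not a fixed schema; they only mirror the list above.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Provenance:
    stage: str              # hypothetical stage name, e.g. "audio_embedding"
    model_id: str           # model that produced the feature
    extractor_version: str  # version of the extraction pipeline
    score: float            # raw score, only meaningful within this stage


@dataclass
class EvidencePacket:
    source_uri: str
    start_sec: float
    end_sec: float
    transcript_span: Optional[str] = None                       # only when speech is present
    frame_uris: list[str] = field(default_factory=list)         # sampled frames
    audio_events: list[str] = field(default_factory=list)       # e.g. ["horn"]
    structured: dict[str, list[str]] = field(default_factory=dict)  # object/OCR/face/speaker
    provenance: list[Provenance] = field(default_factory=list)
    neighbor_ids: list[str] = field(default_factory=list)       # context the agent can request

    def __post_init__(self) -> None:
        # A packet without a valid time span cannot be inspected by the agent.
        if self.end_sec <= self.start_sec:
            raise ValueError("end_sec must be greater than start_sec")

    @property
    def duration_sec(self) -> float:
        return self.end_sec - self.start_sec


packet = EvidencePacket(
    source_uri="s3://ops/video/cam-4.mp4",
    start_sec=184.0,
    end_sec=191.0,
    audio_events=["horn"],
    provenance=[Provenance("audio_embedding", "example-audio-model", "v1", 0.31)],
)
packet_as_dict = asdict(packet)  # plain dict, ready to serialize for the agent
```

    Keeping provenance as a list lets one packet record every stage that matched it, which is what survives a model upgrade.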

    Step 1: Chunk Time Before You Embed



    Continuous video is not directly searchable. You need temporal units.

    Common chunking strategies:

  15. Fixed windows: 2, 5, 10, or 30 second windows with overlap.
  16. Shot boundaries: cuts detected from color histograms, embeddings, optical flow, or scene-change models.
  17. Speaker turns: transcript segments bounded by diarization or ASR timestamps.
  18. Audio events: sound activity windows from energy, spectral change, or audio-event models.
  19. Object tracks: windows derived from when an object appears, moves, disappears, or interacts.


    Fixed windows are predictable and easy to batch. Shot boundaries better preserve visual meaning. Speaker turns are useful for conversation search. Audio-event windows catch sounds that ASR ignores. Object tracks give agents spatial continuity.

    In practice, use more than one segmentation layer:

  21. short overlapping windows for precise retrieval
  22. longer scene windows for context
  23. transcript or speaker turns for speech search
  24. object-track windows for grounded visual events


    Every feature should carry timestamps. Without timestamps, the agent can retrieve a file but cannot inspect the moment.
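    The simplest of these strategies, fixed windows with overlap, can be sketched as follows. The function name and signature are assumptions for illustration; the last window is truncated at the end of the media rather than padded.

```python
def fixed_windows(duration_sec: float, window_sec: float, overlap_sec: float):
    """Return (start, end) tuples covering [0, duration_sec] with overlap."""
    if window_sec <= 0:
        raise ValueError("window_sec must be positive")
    if not 0 <= overlap_sec < window_sec:
        raise ValueError("overlap_sec must be in [0, window_sec)")

    step = window_sec - overlap_sec  # how far each window advances
    windows = []
    start = 0.0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        windows.append((start, end))
        if end >= duration_sec:
            break  # the media is fully covered
        start += step
    return windows


# A 20-second clip, 6-second windows, 2 seconds of overlap.
print(fixed_windows(20.0, 6.0, 2.0))
# [(0.0, 6.0), (4.0, 10.0), (8.0, 14.0), (12.0, 18.0), (16.0, 20.0)]
```

    Because every window carries its own start and end, any feature extracted from it inherits timestamps for free.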

    Step 2: Extract Separate Feature Channels



    Audio-visual retrieval is usually multi-index retrieval, not one magic vector. Each channel preserves a different kind of evidence.

    Transcript Channel



    ASR converts speech into text spans. It is strong for named entities, exact phrases, instructions, decisions, and conversational search.

    Weaknesses:

  26. misses non-speech sound
  27. loses tone, timing, and visual context
  28. fails when speech is noisy, multilingual, overlapping, or off-camera
  29. cannot prove that a visible event actually happened


    Use transcript search as a high-recall text channel, not as the whole system.

    Audio Embedding Channel



    Audio embeddings represent sound events, music, environmental noise, alarms, mechanical patterns, and speech acoustics. CLAP-style models align audio and text. Newer audio-video models such as PE-AV and WAVE also align audio with visual clips.

    This matters when the query names an event by sound:

  31. "glass breaking"
  32. "siren before impact"
  33. "machine squeal"
  34. "applause gets louder"
  35. "customer sounds frustrated"


    An agent should be able to search these sounds even when no one says those words.

    Video Embedding Channel



    Video embeddings preserve motion and scene dynamics. Image embeddings over keyframes are useful, but they miss events that require motion. Video encoders such as V-JEPA 2, VideoPrism, and VideoLLaMA-style models help represent action, motion, and temporal state.

    Use video embeddings for:

  37. actions and gestures
  38. sports plays
  39. manufacturing process steps
  40. camera motion
  41. object movement
  42. before-and-after visual changes


    For fast events, short clips matter. For procedures, longer clips matter. Store both when the agent may need precise evidence and surrounding context.

    Object, OCR, Face, and Scene Channels



    Dense embeddings are good at fuzzy similarity. Structured channels are good at constraints.

    Examples:

  44. object: "forklift", "helmet", "red box"
  45. OCR: "E113", "subtotal", "approved"
  46. face: known person or cast member when permitted
  47. scene: "warehouse dock", "checkout counter", "sports court"
  48. metadata: camera ID, tenant, object URI, campaign, date, product SKU


    Agents need these channels because tool calls usually include filters. A query like "find clips where a forklift moves near a worker in dock-3 after 8 PM" should not rely only on nearest-neighbor similarity.

    Step 3: Search Channels Independently



    A robust retrieval pipeline starts by searching several channels separately.

    user task
      -> query planner
      -> transcript search
      -> audio embedding search
      -> video embedding search
      -> object/OCR/filter search
      -> candidate pool
    


    Each stage should return candidates with:

  50. evidence ID
  51. timestamp
  52. score
  53. stage name
  54. modality
  55. feature version
  56. source URI


    Do not throw away provenance. When the agent later sees a result, it should know whether the match came from a transcript phrase, an audio event, a visual motion pattern, or a structured filter.

    Step 4: Fuse Results Without Pretending Scores Are Comparable



    Scores from transcript search, vector search, object filters, and rerankers are not naturally comparable. A cosine score of 0.31 from an audio model is not the same thing as a BM25 score of 14 or a visual reranker score of 0.72.

    Use rank-based fusion when scores come from different systems.

    Reciprocal Rank Fusion



    Reciprocal Rank Fusion is simple and strong:

    RRF(candidate) = sum over result lists 1 / (k + rank_in_list)
    


    The constant k, often 60, reduces the dominance of the top few positions. RRF works well because it only needs ranks, not calibrated scores.

    Use RRF when:

  58. multiple modalities return candidate lists
  59. score scales are not calibrated
  60. you need a stable first implementation
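    A minimal sketch of Reciprocal Rank Fusion over per-channel result lists. The function name and the channel and candidate IDs are made up for illustration; ties are broken by candidate ID so the output is deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: dict[str, list[str]], k: int = 60):
    """Fuse ranked candidate-ID lists; only ranks are used, never raw scores."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, candidate_id in enumerate(ranking, start=1):
            scores[candidate_id] += 1.0 / (k + rank)
    # Highest fused score first, candidate ID as a deterministic tie-break.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion({
    "transcript": ["c1", "c2", "c3"],
    "audio": ["c2", "c4"],
    "video": ["c2", "c1"],
})
print([candidate_id for candidate_id, _ in fused])
# ['c2', 'c1', 'c4', 'c3']
```

    `c2` wins because it ranks well in all three lists, even though it is not first in the transcript channel.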


    Weighted Fusion



    Weighted fusion lets the query planner emphasize channels:

  62. speech-heavy query: transcript weight high
  63. sound-event query: audio weight high
  64. action query: video weight high
  65. compliance query: filters and OCR weight high


    Weighted fusion is powerful but risky. If weights are hand-tuned globally, one modality can dominate. Track per-query-class metrics so improvements in one class do not hide regressions in another.
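    One possible way to implement weighted fusion is to scale each channel's reciprocal-rank contribution by a per-query-class weight. This is a sketch under that assumption; the weight values and the `CHANNEL_WEIGHTS` table are hypothetical and would need tuning against per-class metrics.

```python
# Hypothetical weights per query class; the numbers are illustrative only.
CHANNEL_WEIGHTS = {
    "speech": {"transcript": 2.0, "audio": 0.5, "video": 0.5, "ocr": 0.5},
    "sound_event": {"transcript": 0.5, "audio": 2.0, "video": 1.0, "ocr": 0.5},
    "action": {"transcript": 0.5, "audio": 1.0, "video": 2.0, "ocr": 0.5},
    "compliance": {"transcript": 1.0, "audio": 0.5, "video": 0.5, "ocr": 2.0},
}


def weighted_rank_fusion(ranked_lists, weights, k=60):
    """Rank fusion where each channel's vote is scaled by its weight."""
    scores = {}
    for channel, ranking in ranked_lists.items():
        weight = weights.get(channel, 1.0)  # unknown channels count once
        for rank, candidate_id in enumerate(ranking, start=1):
            scores[candidate_id] = scores.get(candidate_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lists = {"transcript": ["a", "b"], "audio": ["b", "a"]}
speech_top = weighted_rank_fusion(lists, CHANNEL_WEIGHTS["speech"])[0][0]
sound_top = weighted_rank_fusion(lists, CHANNEL_WEIGHTS["sound_event"])[0][0]
print(speech_top, sound_top)  # a b
```

    The same two result lists produce a different winner depending on the query class, which is exactly why per-class evaluation matters before trusting a weight table.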

    Diversification



    Agents often need evidence variety, not twenty near-duplicates. Maximal Marginal Relevance helps balance relevance and diversity:

    select next = argmax over unselected candidates of: relevance_to_query - lambda * max_similarity_to_selected_results
    


    Use diversification when returning clips from long videos. It helps the agent inspect different moments before deciding whether to expand context.
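    A small sketch of greedy diversification in the form given above: at each step, pick the candidate whose relevance minus `lam` times its highest similarity to anything already selected is largest. The function names, the toy embeddings, and the use of cosine similarity are assumptions for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(relevance, embeddings, top_k, lam=0.5):
    """Greedy selection: relevance minus lam * redundancy with chosen results."""
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < top_k:
        def mmr_score(candidate_id):
            redundancy = max(
                (cosine(embeddings[candidate_id], embeddings[s]) for s in selected),
                default=0.0,  # nothing selected yet, so no redundancy penalty
            )
            return relevance[candidate_id] - lam * redundancy

        best = max(sorted(remaining), key=mmr_score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}  # "b" nearly duplicates "a"

print(mmr_select(relevance, embeddings, top_k=2, lam=0.5))  # ['a', 'c']
print(mmr_select(relevance, embeddings, top_k=2, lam=0.0))  # ['a', 'b']
```

    With `lam=0.0` the selection is pure relevance ranking; raising it pushes the near-duplicate `b` below the less relevant but distinct `c`.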

    Step 5: Rerank With the Full Question and Evidence



    First-stage retrieval should optimize recall. Reranking should optimize precision.

    Rerankers can inspect richer inputs:

  67. user query
  68. transcript span
  69. keyframe captions
  70. object names
  71. OCR text
  72. audio event labels
  73. neighboring context
  74. source metadata


    For multimodal search, the reranker can be any of the following:

  76. text cross-encoder over query plus transcript and generated captions
  77. vision-language reranker over query plus keyframes
  78. late-interaction document or visual retriever for pages and screenshots
  79. LLM or VLM verifier for small candidate sets


    Keep reranking bounded. An agent retrieval tool should return fast enough for iterative use. Rerank top 50, not top 5,000. Cache features and do not send raw video to a VLM unless the candidate set is already small.

    Step 6: Expand Time After Retrieval



    The top result is rarely the exact context an agent needs. A five-second audio event may require the preceding thirty seconds to explain what caused it.

    Use temporal expansion after ranking:

  81. include the matching window
  82. include one or two neighboring windows
  83. include parent scene boundaries
  84. include speaker turn before and after
  85. include object tracks that overlap the time range


    Return the match and the context separately. The agent should know what matched and what is surrounding evidence.

    {
      "match": {
        "source_uri": "s3://ops/video/cam-4.mp4",
        "start_sec": 184.0,
        "end_sec": 191.0,
        "matched_modalities": ["audio", "video"],
        "matched_stages": ["pe_av_embedding", "object_filter"]
      },
      "context": {
        "before_sec": 30,
        "after_sec": 20,
        "parent_scene_id": "scene_00042"
      }
    }
    


    This prevents a common failure: the retriever finds the right moment, but the agent answers from too narrow a clip.
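    A sketch of how the expansion could be computed, assuming a simple symmetric-padding policy clamped to the media bounds. The function name and the output layout are illustrative; they keep the match span and the context span as separate fields, as recommended above.

```python
def expand_context(match_start, match_end, before_sec, after_sec, media_duration):
    """Pad a matched span with context, clamped to [0, media_duration]."""
    context_start = max(0.0, match_start - before_sec)
    context_end = min(media_duration, match_end + after_sec)
    return {
        # What actually matched the query.
        "match": {"start_sec": match_start, "end_sec": match_end},
        # Surrounding evidence the agent may read but should not treat as the hit.
        "context": {"start_sec": context_start, "end_sec": context_end},
    }


# Match near the end of a 200-second file: the trailing context is clamped.
print(expand_context(184.0, 191.0, before_sec=30, after_sec=20, media_duration=200.0))
# {'match': {'start_sec': 184.0, 'end_sec': 191.0}, 'context': {'start_sec': 154.0, 'end_sec': 200.0}}
```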

    Design the Agent Tool Surface



    An agent should not call a vague "search everything" function. It should call a bounded tool with explicit arguments.

    {
      "tool": "search_audio_visual_evidence",
      "input_schema": {
        "query": "string",
        "collections": ["string"],
        "time_range": {"from": "string", "to": "string"},
        "modalities": ["transcript", "audio", "video", "object", "ocr"],
        "filters": {},
        "top_k": 20,
        "budget_ms": 3000,
        "include_context": true
      }
    }
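    One way to keep the tool bounded is to validate and clamp the arguments before any search runs. The sketch below assumes the input fields from the schema above; the limits `MAX_TOP_K` and `MAX_BUDGET_MS` and the function name are hypothetical values chosen for illustration.

```python
ALLOWED_MODALITIES = {"transcript", "audio", "video", "object", "ocr"}
MAX_TOP_K = 50          # assumed ceiling, not taken from any specification
MAX_BUDGET_MS = 10_000  # assumed ceiling, not taken from any specification


def validate_search_request(request: dict) -> dict:
    """Reject unbounded calls and clamp limits to safe ranges."""
    query = str(request.get("query", "")).strip()
    if not query:
        raise ValueError("query is required")

    collections = request.get("collections") or []
    if not collections:
        raise ValueError("at least one collection is required")

    modalities = request.get("modalities") or sorted(ALLOWED_MODALITIES)
    unknown = set(modalities) - ALLOWED_MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")

    return {
        "query": query,
        "collections": list(collections),
        "time_range": request.get("time_range"),
        "modalities": list(modalities),
        "filters": dict(request.get("filters") or {}),
        "top_k": max(1, min(int(request.get("top_k", 20)), MAX_TOP_K)),
        "budget_ms": max(1, min(int(request.get("budget_ms", 3000)), MAX_BUDGET_MS)),
        "include_context": bool(request.get("include_context", True)),
    }


clean = validate_search_request(
    {"query": "forklift horn", "collections": ["dock-3"], "top_k": 5000}
)
print(clean["top_k"], clean["budget_ms"])  # 50 3000
```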
    


    The output should be evidence, not a final answer:

    {
      "results": [
        {
          "source_uri": "s3://support/call-42.mp4",
          "start_sec": 512.0,
          "end_sec": 526.0,
          "summary": "Customer says the unit clicks, then a red light flashes twice.",
          "matched_modalities": ["transcript", "audio", "video"],
          "scores": {
            "rrf": 0.041,
            "rerank": 0.83
          },
          "provenance": [
            {"stage": "asr_bm25", "rank": 2},
            {"stage": "audio_video_embedding", "rank": 1},
            {"stage": "vl_reranker", "rank": 1}
          ]
        }
      ],
      "next_actions": ["expand_context", "retrieve_frames", "request_human_review"]
    }
    


    This matches the direction of modern agent systems. MCP exposes tools through schemas. LangChain and LlamaIndex agents call tools with structured arguments. OpenAI trace grading and agent evals inspect tool trajectories, not just final answers. Audio-visual retrieval should be built with the same discipline.

    Evaluation: What to Measure



    Do not evaluate audio-visual retrieval with one aggregate relevance score. Break the eval by query class.

    Query Classes



  87. transcript-only: "where does the customer mention a refund"
  88. audio-only: "glass breaking"
  89. visual-only: "operator opens the red panel"
  90. audio-video: "horn sounds as forklift enters aisle"
  91. OCR plus video: "screen shows E113 before shutdown"
  92. object plus speech: "speaker says approved while signing the form"
  93. negative: "there should be no clip where the logo appears"


    Metrics



  95. Recall@k by modality
  96. nDCG@k by query class
  97. MRR for exact evidence
  98. temporal IoU between retrieved and labeled time spans
  99. modality attribution accuracy
  100. false positive rate for negative queries
  101. average cost per successful evidence retrieval
  102. p95 latency by stage
  103. stale work cancelled when the agent changes direction
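    Three of these metrics are small enough to sketch directly. The function names are illustrative, time spans are `(start_sec, end_sec)` tuples, and relevance labels are assumed to be sets of candidate IDs.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two (start_sec, end_sec) spans."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of labeled-relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result; average over queries for MRR."""
    for rank, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id in relevant_ids:
            return 1.0 / rank
    return 0.0


print(temporal_iou((184.0, 191.0), (186.0, 194.0)))  # 0.5
print(recall_at_k(["a", "b", "c"], {"b", "d"}, k=2))  # 0.5
print(reciprocal_rank(["a", "b", "c"], {"b", "d"}))   # 0.5
```

    Temporal IoU is the one metric here that is specific to time-based media: a retriever can return the right file and still score zero if the span is wrong.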


    Ablations



    Run ablations before making architecture claims:

  105. transcript only
  106. audio embedding only
  107. video embedding only
  108. transcript plus video
  109. audio plus video
  110. all channels plus reranker


    A good system should show where each modality helps. If audio never improves recall on audio-event queries, the audio channel is not pulling its weight. If video improves recall but hurts latency too much, use it only for query classes that need motion.
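    A sketch of how ablation results could be broken down by configuration and query class. The row format `(config, query_class, hit)` and the toy data are assumptions for illustration, not measured results.

```python
from collections import defaultdict


def hit_rate_by_config_and_class(rows):
    """rows: iterable of (config, query_class, hit) with hit as a bool."""
    totals = defaultdict(lambda: [0, 0])  # (config, query_class) -> [hits, count]
    for config, query_class, hit in rows:
        bucket = totals[(config, query_class)]
        bucket[0] += int(hit)
        bucket[1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}


# Toy rows, invented purely to show the shape of the report.
rows = [
    ("transcript_only", "audio-only", False),
    ("transcript_only", "audio-only", False),
    ("all_channels", "audio-only", True),
    ("all_channels", "audio-only", False),
    ("transcript_only", "transcript-only", True),
    ("all_channels", "transcript-only", True),
]
report = hit_rate_by_config_and_class(rows)
for (config, query_class), rate in sorted(report.items()):
    print(f"{config:16s} {query_class:16s} {rate:.2f}")
```

    Reading the table per query class makes it obvious when a channel only helps one class, which an aggregate score would hide.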

    Common Failure Modes



    Transcript tunnel vision. The system retrieves what was said, not what happened. Add audio and video query classes to the eval set.

    Clip boundary loss. The event crosses a chunk boundary. Use overlapping windows and parent scene expansion.

    Score scale confusion. Audio, video, and text scores are fused as if they are comparable. Use rank fusion or calibrated per-channel scores.

    Duplicate evidence. Top results all come from adjacent windows around the same moment. Add diversification and collapse overlapping windows.

    Missing provenance. The agent cannot tell why a clip matched. Store stage, modality, rank, score, model, and feature version.

    Over-broad tool calls. The agent searches every collection with every modality. Add required filters, top-k, budget, and cancellation.

    Wrong context width. A retrieved clip is correct but too short for reasoning. Return matched evidence separately from expanded context.
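    The duplicate-evidence fix can be sketched as a greedy pass that keeps the best-scoring window and drops any later window from the same source that overlaps it in time. The result dictionaries and their field names are assumptions for illustration.

```python
def collapse_overlapping(results):
    """Keep the highest-scoring result per overlapping time span and source."""
    kept = []
    for result in sorted(results, key=lambda r: r["score"], reverse=True):
        overlaps_kept = any(
            result["source_uri"] == other["source_uri"]
            and result["start_sec"] < other["end_sec"]
            and other["start_sec"] < result["end_sec"]
            for other in kept
        )
        if not overlaps_kept:
            kept.append(result)
    return kept


results = [
    {"source_uri": "cam-4.mp4", "start_sec": 184.0, "end_sec": 190.0, "score": 0.9},
    {"source_uri": "cam-4.mp4", "start_sec": 188.0, "end_sec": 194.0, "score": 0.8},  # overlaps the first
    {"source_uri": "cam-4.mp4", "start_sec": 300.0, "end_sec": 306.0, "score": 0.7},
    {"source_uri": "cam-5.mp4", "start_sec": 186.0, "end_sec": 192.0, "score": 0.6},  # different source
]
for r in collapse_overlapping(results):
    print(r["source_uri"], r["start_sec"], r["end_sec"])
# cam-4.mp4 184.0 190.0
# cam-4.mp4 300.0 306.0
# cam-5.mp4 186.0 192.0
```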

    Mixpeek Implementation Pattern



    In Mixpeek, model the pipeline as feature extraction plus a retriever that searches multiple feature channels. The exact model choices depend on your corpus, but the structure is stable.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="mxp_sk_...")

    mx.collections.ingest(
        collection_id="field-video",
        source={"url": "s3://field-recordings/"},
        feature_extractors=[
            {
                "name": "scene_segmentation",
                "version": "v1",
                "params": {"window_seconds": 6, "overlap_seconds": 2}
            },
            {
                "name": "audio_transcription",
                "version": "v1",
                "params": {"model_id": "openai/whisper-large-v3"}
            },
            {
                "name": "audio_embeddings",
                "version": "v1",
                "params": {"model_id": "facebook/pe-av-large"}
            },
            {
                "name": "video_embeddings",
                "version": "v1",
                "params": {"model_id": "facebook/vjepa2-vitl-fpc64-256"}
            },
            {
                "name": "object_detection",
                "version": "v1",
                "params": {"model_id": "IDEA-Research/grounding-dino-base"}
            }
        ]
    )


    Then expose a retriever as an agent tool:

    results = mx.retrievers.retrieve(
        retriever_id="field-video-agent-search",
        queries=[
            {
                "type": "text",
                "value": "high pitched motor squeal followed by abrupt arm movement"
            }
        ],
        filters={
            "camera_id": {"in": ["arm-2", "arm-3"]},
            "timestamp": {"gte": "2026-06-01T00:00:00Z"}
        },
        stages=[
            {"name": "transcript_bm25", "top_k": 100},
            {"name": "audio_video_embedding", "top_k": 100},
            {"name": "object_filter", "required": False},
            {"name": "rrf_fusion", "top_k": 50},
            {"name": "multimodal_rerank", "top_k": 20}
        ],
        include_context=True,
        budget_ms=3000
    )
    


    The agent receives timestamped evidence with provenance. It can answer only when the evidence supports the claim, expand context when needed, or ask a human to review uncertain clips.

    Design Checklist



  112. Preserve raw media and source URIs.
  113. Segment into short windows and parent scenes.
  114. Store timestamps on every feature.
  115. Use transcript, audio, video, object, OCR, and metadata channels where relevant.
  116. Search channels independently before fusion.
  117. Use RRF or calibrated fusion across score systems.
  118. Rerank a bounded candidate set.
  119. Collapse duplicates and diversify results.
  120. Return matched evidence separately from expanded context.
  121. Log modality, stage, score, model ID, and extractor version.
  122. Evaluate by query class, not only aggregate quality.
  123. Expose the retriever as a bounded agent tool with filters, limits, budgets, and cancellation.


    Key Takeaways



    1. Audio-visual retrieval is about timestamped evidence, not just video summaries.

    2. Transcripts are necessary but incomplete. Agents need audio, video, object, OCR, and metadata channels.

    3. Multi-index retrieval works best when each modality searches independently, then fusion and reranking combine the evidence.

    4. Temporal expansion after retrieval is what lets the agent reason from context instead of isolated clips.

    5. The safest agent interface returns evidence, provenance, budgets, and next actions. It does not hide uncertainty behind a final answer.

    Further Reading



  125. Video Scene Segmentation
  126. Video Temporal Grounding
  127. Multi-Index Search Architecture
  128. Multi-Stage Retrieval
  129. Agent Perception Evals
  130. PE-AV on Hugging Face
  131. WAVE-7B on Hugging Face
  132. MCP tool specification