The Problem: Agents Cannot Search Raw Media
An AI agent can call tools, plan multi-step tasks, and summarize retrieved context. But if the underlying evidence is a raw MP4, a scanned PDF, a product image, or a two-hour call recording, the agent still has a perception problem.
Raw media is not queryable. A vector for the whole file is usually too coarse. A transcript alone misses visual evidence. A caption alone hides timestamps, coordinates, confidence, and source lineage. A human can scrub a video and notice that a logo appears at 00:43, behind the speaker, for two seconds. An agent needs that same observation represented as data.
The core architecture is object decomposition plus layered indexing:
1. Decompose each source object into smaller evidence units. 2. Extract observations from each unit. 3. Store those observations with provenance. 4. Build multiple indexes over the same evidence. 5. Let the agent retrieve, filter, join, and cite the evidence instead of guessing from a blob.
This guide is vendor-neutral until the final implementation section. The concepts apply whether you build on open-source models, cloud APIs, or a managed multimodal pipeline.
The Data Model: Source, Segment, Observation, Feature
A useful multimodal retrieval system separates four levels of data.
| Level | What it represents | Examples | Why agents need it |
| Source object | The original asset | video.mp4, call.wav, invoice.pdf, image.jpg | Traceability and permissions |
| Segment | A bounded region of the source | scene, shot, page, audio turn, crop | Retrieval granularity |
| Observation | A model-produced fact about a segment | object box, transcript span, OCR token, face, caption | Structured evidence |
| Feature | A searchable representation of an observation | vector, BM25 text, label, timestamp, bounding box | Indexing and ranking |
The observation is the atomic unit of agent perception.
Step 1: Decompose the Source Object
Decomposition converts a large file into searchable units. The correct unit depends on the modality and the question patterns you expect.
Video
Common video decomposition strategies:
Fixed windows are simple but often split events in the wrong place. Shot boundaries are better for visual retrieval because they align with editing changes. Semantic scenes are better for agents because they preserve context: who appears, what happens, what is spoken, and what objects are present.
Audio
Audio is usually decomposed by:
For agent retrieval, diarized transcript spans are more useful than a single transcript string. "What did the customer say after the pricing objection?" requires speaker turns and ordering, not just semantic similarity.
Documents
Documents decompose into:
The page is often too coarse. A table cell, chart axis, or clause can be the real evidence. A document agent should retrieve the smallest unit that can support the answer, then expand to the surrounding page or section for context.
Images
Images decompose into:
Whole-image search works for broad visual similarity. It fails when the query is about a small region: "find images where a warning label appears on the lower-right corner." That query needs OCR plus coordinates.
Step 2: Extract Observations, Not Just Captions
A caption is useful, but it is not enough. Good multimodal systems extract multiple observation types from the same segment.
For a video scene, the observation set might include:
Each observation should preserve where it came from. A minimal schema looks like this:
{
"observation_id": "obs_9df",
"source_id": "video_123",
"segment_id": "scene_014",
"modality": "visual",
"type": "object_detection",
"label": "safety helmet",
"confidence": 0.94,
"time": { "start_ms": 42100, "end_ms": 43800 },
"region": { "x": 0.62, "y": 0.18, "w": 0.14, "h": 0.21 },
"model": {
"name": "open_vocabulary_detector",
"version": "2026-05-01"
}
}
This schema gives the agent something concrete to use. It can cite a timestamp, crop a region, ask a follow-up model to inspect the object, or join this observation to nearby transcript spans.
Step 3: Build Layered Indexes
One index cannot answer every multimodal question. A production system usually needs several indexes over the same observations.
Dense Vector Index
Dense embeddings support semantic similarity:
Dense vectors are good for meaning, style, and fuzzy matching. They are weak for exact constraints like dates, part numbers, and explicit policy terms.
Sparse or BM25 Index
Sparse retrieval supports exact lexical matching:
For agent workflows, BM25 is often the difference between plausible results and exact evidence.
Structured Metadata Index
Structured filters handle facts:
The agent should not ask a vector index to enforce "speaker is CFO" or "timestamp between minute 12 and minute 18." Those are structured constraints.
Temporal Index
Video and audio require ordering:
Temporal indexing makes queries like "what happened immediately after the alarm?" possible.
Graph or Lineage Index
Lineage connects derived data back to the original object:
This is what lets an agent cite evidence without losing the source.
Step 4: Query Planning Across Layers
An agent question usually maps to more than one retrieval operation.
Question: "Find clips where the host mentions pricing while the product is visible on screen."
A robust plan:
1. Search transcript spans for "pricing", "cost", "plan", "discount", and related terms. 2. Filter visual observations for product detections or product-like regions. 3. Join transcript spans and visual observations by overlapping time windows. 4. Rank joined scenes by transcript relevance, visual confidence, and temporal overlap. 5. Return clips with timestamps, transcript snippets, and product crop evidence.
This is different from asking a single vector index for the whole query. The query contains at least three constraints:
The system needs layered indexes to satisfy all three.
Step 5: Evidence Assembly for Agents
Agents should receive evidence packets, not raw search rows. A good packet includes:
Example:
{
"evidence_id": "ev_42",
"source": "launch_demo.mp4",
"time": "00:12:41-00:12:58",
"why_matched": [
"transcript mentions pricing twice",
"product package detected in 9 of 12 keyframes"
],
"observations": [
{ "type": "transcript", "text": "the starter plan is priced for small teams" },
{ "type": "object", "label": "product package", "confidence": 0.91 }
],
"follow_up_tools": {
"inspect_clip": "clip_00_12_41",
"open_frame": "frame_7312"
}
}
This format gives the agent grounding. It can answer, ask for another retrieval pass, inspect the clip, or cite the source.
Common Failure Modes
One Vector Per File
A single vector for a 40-minute video or 90-page PDF loses most of the evidence. It can tell you the general topic but not the moment, page, or region that supports an answer.
Captions Without Provenance
Caption strings are lossy. If a caption says "a person stands near a machine," the agent still needs the timestamp, frame id, model version, and confidence. Otherwise it cannot audit the claim.
Mixing Model Versions
If embeddings from model v1 and v2 live in the same index, similarity scores become unstable. Store model name and version on every feature. Reindex when the embedding space changes.
Treating Visual Observations as Truth
Object detectors, OCR, diarization, and VLMs all make mistakes. Store confidence, expose uncertainty, and let agents cross-check with other evidence layers.
Ignoring Negative Space
Sometimes the important fact is absence: no helmet, no logo, no disclosure text, no expected speaker. Absence requires knowing which observations were attempted, not just which ones were found.
No Expansion Window
The exact matching segment is often too small for reasoning. Return the match plus neighboring context: previous speaker turn, next scene, surrounding paragraph, or adjacent frames.
A Reference Architecture
The architecture below is a practical starting point for multimodal agent perception:
1. Ingest source objects into object storage. 2. Generate segments by modality: scenes, pages, speaker turns, crops. 3. Run extractors on each segment: ASR, OCR, detection, captioning, embeddings. 4. Write observations into a structured store with provenance. 5. Write features into layered indexes: dense, sparse, structured, temporal. 6. Expose retrieval tools that return evidence packets. 7. Let the agent inspect, refine, cite, or escalate based on evidence quality.
The key design principle: extraction is asynchronous and expensive, retrieval is synchronous and cheap. Do not make the agent wait for heavy perception every time it asks a question. Precompute observations, then let the agent query them quickly.
How This Maps to Mixpeek
Mixpeek models the same architecture with collections, feature extractors, retrievers, and namespaces.
A video collection might extract transcript, visual embeddings, object detections, OCR, and scene captions. A retriever can then combine stages:
from mixpeek import Mixpeek
mx = Mixpeek(api_key="YOUR_API_KEY")
results = mx.retrievers.search(
retriever_id="video-evidence-retriever",
query="clips where the host discusses pricing while the product is visible",
pipeline=[
{
"stage_type": "search",
"stage_id": "transcript",
"feature": "transcription",
"limit": 100
},
{
"stage_type": "filter",
"stage_id": "product_visible",
"feature": "object_detection",
"conditions": [
{"field": "label", "operator": "contains", "value": "product"}
]
},
{
"stage_type": "fusion",
"stage_id": "ranked_evidence",
"method": "reciprocal_rank_fusion",
"limit": 20
}
]
)
For teams bringing their own embeddings, MVS can serve the vector layer. For teams that need managed perception, Mixpeek can run the extraction layer and populate the observations and features from the media already stored in object storage.
The important idea is not the vendor. It is the separation of concerns:
Design Checklist
Use this checklist when designing an agent perception layer:
Key Takeaways
1. Agents need observations, not just files. A raw video or PDF is not a useful retrieval unit.
2. The atomic unit of multimodal search is a localized observation with time, space, confidence, model version, and source lineage.
3. Dense vectors are one layer, not the whole system. Sparse, structured, temporal, and lineage indexes are equally important.
4. Multimodal queries often contain relationships: spoken content plus visual content plus temporal overlap. Single-stage search rarely satisfies all constraints.
5. Retrieval tools for agents should return evidence packets that can be inspected and cited.
6. The best perception systems precompute expensive observations and make retrieval fast enough for iterative agent loops.