Object Decomposition and Layered Indexing for AI Agent Perception

How a video becomes searchable: split into segments by scene, silence, or time; each segment gets keyframe (CLIP/SigLIP 768D), transcript (Whisper to E5 1024D), face (ArcFace 512D), and on-screen text embeddings, you search moments, not files.

The Problem: Agents Cannot Search Raw Media

An AI agent can call tools, plan multi-step tasks, and summarize retrieved context. But if the underlying evidence is a raw MP4, a scanned PDF, a product image, or a two-hour call recording, the agent still has a perception problem.

Raw media is not queryable. A vector for the whole file is usually too coarse. A transcript alone misses visual evidence. A caption alone hides timestamps, coordinates, confidence, and source lineage. A human can scrub a video and notice that a logo appears at 00:43, behind the speaker, for two seconds. An agent needs that same observation represented as data.

The core architecture is object decomposition plus layered indexing:

1. Decompose each source object into smaller evidence units. 2. Extract observations from each unit. 3. Store those observations with provenance. 4. Build multiple indexes over the same evidence. 5. Let the agent retrieve, filter, join, and cite the evidence instead of guessing from a blob.

This guide is vendor-neutral until the final implementation section. The concepts apply whether you build on open-source models, cloud APIs, or a managed multimodal pipeline.

The Data Model: Source, Segment, Observation, Feature

A useful multimodal retrieval system separates four levels of data.

Level

What it represents

Examples

Why agents need it

Source object	The original asset	video.mp4, call.wav, invoice.pdf, image.jpg	Traceability and permissions
Segment	A bounded region of the source	scene, shot, page, audio turn, crop	Retrieval granularity
Observation	A model-produced fact about a segment	object box, transcript span, OCR token, face, caption	Structured evidence
Feature	A searchable representation of an observation	vector, BM25 text, label, timestamp, bounding box	Indexing and ranking

The mistake is storing only source objects and vectors. Agents need observations because observations carry local evidence:

Time: start_ms and end_ms

Space: page, x, y, width, height

Modality: visual, audio, text, layout

Model provenance: extractor name, version, confidence

Parentage: source object and segment id

The observation is the atomic unit of agent perception.

Step 1: Decompose the Source Object

Decomposition converts a large file into searchable units. The correct unit depends on the modality and the question patterns you expect.

Video

Common video decomposition strategies:

Fixed windows: every 2, 5, or 10 seconds

Shot boundaries: cuts detected by frame difference, histogram distance, or learned shot detectors

Semantic scenes: groups of shots with consistent setting, people, or action

Query-aware windows: dynamic windows selected based on a query or task

Fixed windows are simple but often split events in the wrong place. Shot boundaries are better for visual retrieval because they align with editing changes. Semantic scenes are better for agents because they preserve context: who appears, what happens, what is spoken, and what objects are present.

Audio

Audio is usually decomposed by:

Voice activity detection: speech vs. silence

Speaker diarization: who spoke when

ASR timestamp alignment: word-level or phrase-level timestamps

Acoustic event detection: applause, alarms, music, impacts

For agent retrieval, diarized transcript spans are more useful than a single transcript string. "What did the customer say after the pricing objection?" requires speaker turns and ordering, not just semantic similarity.

Documents

Documents decompose into:

Pages

Layout blocks

Tables

Figures

Form fields

OCR text spans

Captions and footnotes

The page is often too coarse. A table cell, chart axis, or clause can be the real evidence. A document agent should retrieve the smallest unit that can support the answer, then expand to the surrounding page or section for context.

Images

Images decompose into:

Whole-image embeddings

Object boxes

Segmentation masks

Faces

Text regions

Crops around detected regions

Whole-image search works for broad visual similarity. It fails when the query is about a small region: "find images where a warning label appears on the lower-right corner." That query needs OCR plus coordinates.

Step 2: Extract Observations, Not Just Captions

A caption is useful, but it is not enough. Good multimodal systems extract multiple observation types from the same segment.

For a video scene, the observation set might include:

Transcript spans from ASR

Speaker turns from diarization

Keyframe captions from a VLM

Object detections with bounding boxes

OCR tokens from on-screen text

Faces or known identities

Audio events

Visual embeddings for frames and crops

Text embeddings for transcript and captions

Each observation should preserve where it came from. A minimal schema looks like this:

{
  "observation_id": "obs_9df",
  "source_id": "video_123",
  "segment_id": "scene_014",
  "modality": "visual",
  "type": "object_detection",
  "label": "safety helmet",
  "confidence": 0.94,
  "time": { "start_ms": 42100, "end_ms": 43800 },
  "region": { "x": 0.62, "y": 0.18, "w": 0.14, "h": 0.21 },
  "model": {
    "name": "open_vocabulary_detector",
    "version": "2026-05-01"
  }
}

This schema gives the agent something concrete to use. It can cite a timestamp, crop a region, ask a follow-up model to inspect the object, or join this observation to nearby transcript spans.

Step 3: Build Layered Indexes

One index cannot answer every multimodal question. A production system usually needs several indexes over the same observations.

Dense Vector Index

Dense embeddings support semantic similarity:

"clips where a person explains a chart"

"images similar to this moodboard"

"documents discussing revenue risk"

Dense vectors are good for meaning, style, and fuzzy matching. They are weak for exact constraints like dates, part numbers, and explicit policy terms.

Sparse or BM25 Index

Sparse retrieval supports exact lexical matching:

product SKUs

legal clause names

ticker symbols

medical codes

phrases from transcripts

For agent workflows, BM25 is often the difference between plausible results and exact evidence.

Structured Metadata Index

Structured filters handle facts:

object labels

speaker ids

confidence thresholds

timestamps

page numbers

bounding boxes

model versions

permissions

The agent should not ask a vector index to enforce "speaker is CFO" or "timestamp between minute 12 and minute 18." Those are structured constraints.

Temporal Index

Video and audio require ordering:

before and after relationships

overlapping observations

nearest transcript span to a visual event

scene transitions

repeated events across time

Temporal indexing makes queries like "what happened immediately after the alarm?" possible.

Graph or Lineage Index

Lineage connects derived data back to the original object:

crop belongs to frame

frame belongs to scene

scene belongs to video

transcript span overlaps scene

detected logo overlaps product crop

This is what lets an agent cite evidence without losing the source.

Step 4: Query Planning Across Layers

An agent question usually maps to more than one retrieval operation.

Question: "Find clips where the host mentions pricing while the product is visible on screen."

A robust plan:

1. Search transcript spans for "pricing", "cost", "plan", "discount", and related terms. 2. Filter visual observations for product detections or product-like regions. 3. Join transcript spans and visual observations by overlapping time windows. 4. Rank joined scenes by transcript relevance, visual confidence, and temporal overlap. 5. Return clips with timestamps, transcript snippets, and product crop evidence.

This is different from asking a single vector index for the whole query. The query contains at least three constraints:

Spoken content: pricing language

Visual content: product visible

Temporal relationship: both happen at the same time

The system needs layered indexes to satisfy all three.

Step 5: Evidence Assembly for Agents

Agents should receive evidence packets, not raw search rows. A good packet includes:

Human-readable summary

Source URL or object id

Timestamp or page reference

Matching observations

Confidence values

Snippets or crops

Enough context before and after the match

Tool-call friendly ids for follow-up inspection

Example:

{
  "evidence_id": "ev_42",
  "source": "launch_demo.mp4",
  "time": "00:12:41-00:12:58",
  "why_matched": [
    "transcript mentions pricing twice",
    "product package detected in 9 of 12 keyframes"
  ],
  "observations": [
    { "type": "transcript", "text": "the starter plan is priced for small teams" },
    { "type": "object", "label": "product package", "confidence": 0.91 }
  ],
  "follow_up_tools": {
    "inspect_clip": "clip_00_12_41",
    "open_frame": "frame_7312"
  }
}

This format gives the agent grounding. It can answer, ask for another retrieval pass, inspect the clip, or cite the source.

Common Failure Modes

One Vector Per File

A single vector for a 40-minute video or 90-page PDF loses most of the evidence. It can tell you the general topic but not the moment, page, or region that supports an answer.

Captions Without Provenance

Caption strings are lossy. If a caption says "a person stands near a machine," the agent still needs the timestamp, frame id, model version, and confidence. Otherwise it cannot audit the claim.

Mixing Model Versions

If embeddings from model v1 and v2 live in the same index, similarity scores become unstable. Store model name and version on every feature. Reindex when the embedding space changes.

Treating Visual Observations as Truth

Object detectors, OCR, diarization, and VLMs all make mistakes. Store confidence, expose uncertainty, and let agents cross-check with other evidence layers.

Ignoring Negative Space

Sometimes the important fact is absence: no helmet, no logo, no disclosure text, no expected speaker. Absence requires knowing which observations were attempted, not just which ones were found.

No Expansion Window

The exact matching segment is often too small for reasoning. Return the match plus neighboring context: previous speaker turn, next scene, surrounding paragraph, or adjacent frames.

A Reference Architecture

The architecture below is a practical starting point for multimodal agent perception:

1. Ingest source objects into object storage. 2. Generate segments by modality: scenes, pages, speaker turns, crops. 3. Run extractors on each segment: ASR, OCR, detection, captioning, embeddings. 4. Write observations into a structured store with provenance. 5. Write features into layered indexes: dense, sparse, structured, temporal. 6. Expose retrieval tools that return evidence packets. 7. Let the agent inspect, refine, cite, or escalate based on evidence quality.

The key design principle: extraction is asynchronous and expensive, retrieval is synchronous and cheap. Do not make the agent wait for heavy perception every time it asks a question. Precompute observations, then let the agent query them quickly.

How This Maps to Mixpeek

Mixpeek models the same architecture with collections, feature extractors, retrievers, and namespaces.

A video collection might extract transcript, visual embeddings, object detections, OCR, and scene captions. A retriever can then combine stages:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

results = mx.retrievers.execute(
    retriever_id="video-evidence-retriever",
    query="clips where the host discusses pricing while the product is visible",
    pipeline=[
        {
            "stage_type": "search",
            "stage_id": "transcript",
            "feature": "transcription",
            "limit": 100
        },
        {
            "stage_type": "filter",
            "stage_id": "product_visible",
            "feature": "object_detection",
            "conditions": [
                {"field": "label", "operator": "contains", "value": "product"}
            ]
        },
        {
            "stage_type": "fusion",
            "stage_id": "ranked_evidence",
            "method": "reciprocal_rank_fusion",
            "limit": 20
        }
    ]
)

For teams bringing their own embeddings, MVS can serve the vector layer. For teams that need managed perception, Mixpeek can run the extraction layer and populate the observations and features from the media already stored in object storage.

The important idea is not the vendor. It is the separation of concerns:

Extract once.

Preserve provenance.

Index each evidence layer correctly.

Let the agent retrieve evidence, not guesses.

Design Checklist

Use this checklist when designing an agent perception layer:

Can every result cite its source object?

Can every visual result cite a timestamp or bounding box?

Can every document result cite a page or layout region?

Are model name and version stored on every feature?

Are confidence scores stored and exposed?

Are dense, sparse, structured, and temporal indexes separated?

Can queries join observations across modalities?

Can the agent request neighboring context?

Can you reindex when an embedding model changes?

Can you audit which extractor produced a claim?

Key Takeaways

1. Agents need observations, not just files. A raw video or PDF is not a useful retrieval unit.

2. The atomic unit of multimodal search is a localized observation with time, space, confidence, model version, and source lineage.

3. Dense vectors are one layer, not the whole system. Sparse, structured, temporal, and lineage indexes are equally important.

4. Multimodal queries often contain relationships: spoken content plus visual content plus temporal overlap. Single-stage search rarely satisfies all constraints.

5. Retrieval tools for agents should return evidence packets that can be inspected and cited.

6. The best perception systems precompute expensive observations and make retrieval fast enough for iterative agent loops.