    Agent Perception
    18 min read
    Updated 2026-06-01

    Object Decomposition and Layered Indexing for AI Agent Perception

    A practical architecture guide for turning video, audio, images, and documents into searchable evidence that agents can inspect, cite, filter, and reason over.


    The Problem: Agents Cannot Search Raw Media



    An AI agent can call tools, plan multi-step tasks, and summarize retrieved context. But if the underlying evidence is a raw MP4, a scanned PDF, a product image, or a two-hour call recording, the agent still has a perception problem.

    Raw media is not queryable. A vector for the whole file is usually too coarse. A transcript alone misses visual evidence. A caption alone hides timestamps, coordinates, confidence, and source lineage. A human can scrub a video and notice that a logo appears at 00:43, behind the speaker, for two seconds. An agent needs that same observation represented as data.

    The core architecture is object decomposition plus layered indexing:

    1. Decompose each source object into smaller evidence units.
    2. Extract observations from each unit.
    3. Store those observations with provenance.
    4. Build multiple indexes over the same evidence.
    5. Let the agent retrieve, filter, join, and cite the evidence instead of guessing from a blob.

    This guide is vendor-neutral until the final implementation section. The concepts apply whether you build on open-source models, cloud APIs, or a managed multimodal pipeline.

    The Data Model: Source, Segment, Observation, Feature



    A useful multimodal retrieval system separates four levels of data.

    | Level | What it represents | Examples | Why agents need it |
    |---|---|---|---|
    | Source object | The original asset | video.mp4, call.wav, invoice.pdf, image.jpg | Traceability and permissions |
    | Segment | A bounded region of the source | scene, shot, page, audio turn, crop | Retrieval granularity |
    | Observation | A model-produced fact about a segment | object box, transcript span, OCR token, face, caption | Structured evidence |
    | Feature | A searchable representation of an observation | vector, BM25 text, label, timestamp, bounding box | Indexing and ranking |

    The mistake is storing only source objects and vectors. Agents need observations because observations carry local evidence:

    - Time: start_ms and end_ms
    - Space: page, x, y, width, height
    - Modality: visual, audio, text, layout
    - Model provenance: extractor name, version, confidence
    - Parentage: source object and segment id

    The observation is the atomic unit of agent perception.

    Step 1: Decompose the Source Object



    Decomposition converts a large file into searchable units. The correct unit depends on the modality and the question patterns you expect.

    Video



    Common video decomposition strategies:

    - Fixed windows: every 2, 5, or 10 seconds
    - Shot boundaries: cuts detected by frame difference, histogram distance, or learned shot detectors
    - Semantic scenes: groups of shots with consistent setting, people, or action
    - Query-aware windows: dynamic windows selected based on a query or task

    Fixed windows are simple but often split events in the wrong place. Shot boundaries are better for visual retrieval because they align with editing changes. Semantic scenes are better for agents because they preserve context: who appears, what happens, what is spoken, and what objects are present.
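    The contrast between fixed windows and shot boundaries can be sketched in a few lines. This is a minimal illustration, not a production detector: it assumes each frame has already been reduced to a normalized color histogram, and the function names and the 0.5 threshold are hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple


def fixed_windows(duration_ms: int, window_ms: int) -> List[Tuple[int, int]]:
    """Cut [0, duration_ms) into equal windows; the last one may be shorter."""
    return [
        (start, min(start + window_ms, duration_ms))
        for start in range(0, duration_ms, window_ms)
    ]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Half the L1 distance between two normalized histograms, in [0, 1]."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def shot_boundaries(
    histograms: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return frame indexes where a new shot starts (frame 0 always starts one)."""
    cuts = [0]
    for i in range(1, len(histograms)):
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts


# Toy input: three dark frames followed by two bright frames.
frames = [[0.9, 0.1], [0.88, 0.12], [0.9, 0.1], [0.1, 0.9], [0.12, 0.88]]
windows = fixed_windows(duration_ms=12_000, window_ms=5_000)
cuts = shot_boundaries(frames)
```

    Fixed windows ignore content entirely, while the histogram approach places a cut only where consecutive frames differ sharply. Semantic scenes would then be built by grouping adjacent shots, which is outside the scope of this sketch.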

    Audio



    Audio is usually decomposed by:

    - Voice activity detection: speech vs. silence
    - Speaker diarization: who spoke when
    - ASR timestamp alignment: word-level or phrase-level timestamps
    - Acoustic event detection: applause, alarms, music, impacts

    For agent retrieval, diarized transcript spans are more useful than a single transcript string. "What did the customer say after the pricing objection?" requires speaker turns and ordering, not just semantic similarity.
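    One way to produce diarized transcript spans is to attach word-level ASR timestamps to diarization turns, then walk the turns in order. The sketch below assumes the ASR and diarization outputs are already available as simple records; the `Word` and `Turn` types and both helper functions are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


@dataclass
class Turn:
    speaker: str
    start_ms: int
    end_ms: int
    text: str = ""


def attach_words(turns: List[Turn], words: List[Word]) -> List[Turn]:
    """Assign each word to the turn that contains the word's midpoint."""
    ordered = sorted(turns, key=lambda t: t.start_ms)
    for word in sorted(words, key=lambda w: w.start_ms):
        mid = (word.start_ms + word.end_ms) // 2
        for turn in ordered:
            if turn.start_ms <= mid < turn.end_ms:
                turn.text = (turn.text + " " + word.text).strip()
                break
    return ordered


def next_turn_by(turns: List[Turn], speaker: str, after_phrase: str) -> Optional[Turn]:
    """First turn by `speaker` that comes after a turn mentioning `after_phrase`."""
    seen_phrase = False
    for turn in turns:
        if seen_phrase and turn.speaker == speaker:
            return turn
        if after_phrase.lower() in turn.text.lower():
            seen_phrase = True
    return None


turns = [Turn("rep", 0, 3000), Turn("customer", 3000, 6000), Turn("rep", 6000, 9000)]
words = [
    Word("price", 500, 900), Word("is", 1000, 1200), Word("fixed", 1300, 1800),
    Word("too", 3200, 3400), Word("expensive", 3500, 4200),
    Word("we", 6100, 6300), Word("offer", 6400, 6800), Word("discounts", 6900, 7600),
]
diarized = attach_words(turns, words)
reply = next_turn_by(diarized, speaker="rep", after_phrase="too expensive")
```

    The ordering question is answered by position in the turn list, which a single embedded transcript string cannot provide.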

    Documents



    Documents decompose into:

    - Pages
    - Layout blocks
    - Tables
    - Figures
    - Form fields
    - OCR text spans
    - Captions and footnotes

    The page is often too coarse. A table cell, chart axis, or clause can be the real evidence. A document agent should retrieve the smallest unit that can support the answer, then expand to the surrounding page or section for context.

    Images



    Images decompose into:

    - Whole-image embeddings
    - Object boxes
    - Segmentation masks
    - Faces
    - Text regions
    - Crops around detected regions

    Whole-image search works for broad visual similarity. It fails when the query is about a small region: "find images where a warning label appears on the lower-right corner." That query needs OCR plus coordinates.
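    The warning-label query can be expressed as a keyword match plus a coordinate test over OCR regions. This sketch assumes normalized coordinates with the origin at the top-left of the image; the `OcrRegion` type and the "center in the lower-right quadrant" rule are illustrative assumptions, not a standard definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrRegion:
    image_id: str
    text: str
    x: float  # left edge, normalized to [0, 1], origin at top-left
    y: float  # top edge, normalized to [0, 1]
    w: float
    h: float


def in_lower_right(region: OcrRegion) -> bool:
    """True when the region's center falls in the lower-right quadrant."""
    center_x = region.x + region.w / 2
    center_y = region.y + region.h / 2
    return center_x >= 0.5 and center_y >= 0.5


def find_images(regions: List[OcrRegion], keyword: str) -> List[str]:
    """Image ids with OCR text containing `keyword` in the lower-right quadrant."""
    hits = {
        r.image_id
        for r in regions
        if keyword.lower() in r.text.lower() and in_lower_right(r)
    }
    return sorted(hits)


regions = [
    OcrRegion("img_1", "WARNING: hot surface", x=0.70, y=0.80, w=0.20, h=0.10),
    OcrRegion("img_2", "Warning", x=0.05, y=0.05, w=0.20, h=0.10),
    OcrRegion("img_3", "Brand logo", x=0.70, y=0.80, w=0.20, h=0.10),
]
matches = find_images(regions, keyword="warning")
```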

    Step 2: Extract Observations, Not Just Captions



    A caption is useful, but it is not enough. Good multimodal systems extract multiple observation types from the same segment.

    For a video scene, the observation set might include:

    - Transcript spans from ASR
    - Speaker turns from diarization
    - Keyframe captions from a VLM
    - Object detections with bounding boxes
    - OCR tokens from on-screen text
    - Faces or known identities
    - Audio events
    - Visual embeddings for frames and crops
    - Text embeddings for transcript and captions

    Each observation should preserve where it came from. A minimal schema looks like this:

    {
      "observation_id": "obs_9df",
      "source_id": "video_123",
      "segment_id": "scene_014",
      "modality": "visual",
      "type": "object_detection",
      "label": "safety helmet",
      "confidence": 0.94,
      "time": { "start_ms": 42100, "end_ms": 43800 },
      "region": { "x": 0.62, "y": 0.18, "w": 0.14, "h": 0.21 },
      "model": {
        "name": "open_vocabulary_detector",
        "version": "2026-05-01"
      }
    }
    


    This schema gives the agent something concrete to use. It can cite a timestamp, crop a region, ask a follow-up model to inspect the object, or join this observation to nearby transcript spans.
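    As a rough sketch of how an agent-side tool might consume that record, the code below turns an observation into a citation string and a pixel-space crop box. It assumes the `region` values are normalized to the frame size, which the schema implies but does not state; the class and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    observation_id: str
    source_id: str
    segment_id: str
    type: str
    label: str
    confidence: float
    start_ms: int
    end_ms: int
    region: Tuple[float, float, float, float]  # x, y, w, h (assumed normalized)


def format_timestamp(ms: int) -> str:
    """Render milliseconds as HH:MM:SS, truncating fractional seconds."""
    total_seconds = ms // 1000
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def pixel_box(obs: Observation, frame_width: int, frame_height: int) -> Tuple[int, int, int, int]:
    """Convert the normalized region to pixel x, y, width, height."""
    x, y, w, h = obs.region
    return (
        round(x * frame_width),
        round(y * frame_height),
        round(w * frame_width),
        round(h * frame_height),
    )


def citation(obs: Observation) -> str:
    """Short human-readable reference to the evidence."""
    span = f"{format_timestamp(obs.start_ms)}-{format_timestamp(obs.end_ms)}"
    return f"{obs.source_id} @ {span} ({obs.label}, {obs.confidence:.2f})"


obs = Observation(
    observation_id="obs_9df",
    source_id="video_123",
    segment_id="scene_014",
    type="object_detection",
    label="safety helmet",
    confidence=0.94,
    start_ms=42100,
    end_ms=43800,
    region=(0.62, 0.18, 0.14, 0.21),
)
box = pixel_box(obs, frame_width=1920, frame_height=1080)
cite = citation(obs)
```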

    Step 3: Build Layered Indexes



    One index cannot answer every multimodal question. A production system usually needs several indexes over the same observations.

    Dense Vector Index



    Dense embeddings support semantic similarity:

    - "clips where a person explains a chart"
    - "images similar to this moodboard"
    - "documents discussing revenue risk"

    Dense vectors are good for meaning, style, and fuzzy matching. They are weak for exact constraints like dates, part numbers, and explicit policy terms.
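    At its simplest, dense retrieval is a nearest-neighbor search by cosine similarity. The sketch below uses brute-force scoring over tiny hand-written vectors; a real system would use an embedding model and an approximate nearest-neighbor index, and the ids and vectors here are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k vectors most similar to the query."""
    scored = sorted(
        ((cosine(query, vec), obs_id) for obs_id, vec in index.items()),
        reverse=True,
    )
    return [obs_id for _, obs_id in scored[:k]]


# Toy 3-dimensional "embeddings" keyed by observation id.
index = {
    "obs_chart_talk": [0.9, 0.1, 0.0],
    "obs_cooking": [0.0, 0.2, 0.9],
    "obs_slide": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
nearest = top_k(query, index, k=2)
```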

    Sparse or BM25 Index



    Sparse retrieval supports exact lexical matching:

    - product SKUs
    - legal clause names
    - ticker symbols
    - medical codes
    - phrases from transcripts

    For agent workflows, BM25 is often the difference between plausible results and exact evidence.
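    To make the lexical side concrete, here is a compact BM25 scorer over a handful of transcript spans. It uses the common Okapi formulation with a non-negative IDF and default parameters k1 = 1.2 and b = 0.75; the whitespace tokenizer and the sample spans are simplifying assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores


spans = {
    "span_1": "please ship sku ab-1042 by friday",
    "span_2": "the sku list is attached",
    "span_3": "we discussed shipping timelines",
}
scores = bm25_scores("AB-1042", spans)
best = max(scores, key=scores.get)
```

    A dense index might rank all three spans as loosely related to shipping; the lexical score is non-zero only for the span that contains the exact part number.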

    Structured Metadata Index



    Structured filters handle facts:

    - object labels
    - speaker ids
    - confidence thresholds
    - timestamps
    - page numbers
    - bounding boxes
    - model versions
    - permissions

    The agent should not ask a vector index to enforce "speaker is CFO" or "timestamp between minute 12 and minute 18." Those are structured constraints.
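    Those two constraints translate directly into a structured filter. The sketch below applies them to in-memory dictionaries; in practice this would be a metadata query against a database or a filterable index, and the field names are assumptions carried over from the schema earlier in this guide.

```python
from typing import Dict, List, Optional, Tuple


def filter_observations(
    observations: List[Dict],
    speaker: Optional[str] = None,
    min_confidence: float = 0.0,
    window_ms: Optional[Tuple[int, int]] = None,
) -> List[Dict]:
    """Keep observations that satisfy every supplied constraint."""
    kept = []
    for obs in observations:
        if speaker is not None and obs.get("speaker") != speaker:
            continue
        if obs["confidence"] < min_confidence:
            continue
        if window_ms is not None:
            lo, hi = window_ms
            if not (lo <= obs["start_ms"] < hi):
                continue
        kept.append(obs)
    return kept


MINUTE_MS = 60_000
observations = [
    {"id": "obs_a", "speaker": "CFO", "confidence": 0.90, "start_ms": 13 * MINUTE_MS},
    {"id": "obs_b", "speaker": "CFO", "confidence": 0.92, "start_ms": 25 * MINUTE_MS},
    {"id": "obs_c", "speaker": "CEO", "confidence": 0.95, "start_ms": 14 * MINUTE_MS},
    {"id": "obs_d", "speaker": "CFO", "confidence": 0.40, "start_ms": 15 * MINUTE_MS},
]
kept = filter_observations(
    observations,
    speaker="CFO",
    min_confidence=0.5,
    window_ms=(12 * MINUTE_MS, 18 * MINUTE_MS),
)
kept_ids = [obs["id"] for obs in kept]
```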

    Temporal Index



    Video and audio require ordering:

    - before and after relationships
    - overlapping observations
    - nearest transcript span to a visual event
    - scene transitions
    - repeated events across time

    Temporal indexing makes queries like "what happened immediately after the alarm?" possible.
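    A small temporal index can be built from a list sorted by start time plus binary search. The `TemporalIndex` class below is a hypothetical sketch: `first_after` answers the "immediately after" question, and `overlapping` uses a linear scan that a larger system would replace with an interval tree or a range query.

```python
import bisect
from typing import Dict, List, Optional


class TemporalIndex:
    """Observations ordered by start time for before/after and overlap lookups."""

    def __init__(self, observations: List[Dict]):
        self._obs = sorted(observations, key=lambda o: o["start_ms"])
        self._starts = [o["start_ms"] for o in self._obs]

    def first_after(self, time_ms: int) -> Optional[Dict]:
        """Earliest observation that starts at or after time_ms."""
        i = bisect.bisect_left(self._starts, time_ms)
        return self._obs[i] if i < len(self._obs) else None

    def overlapping(self, start_ms: int, end_ms: int) -> List[Dict]:
        """Observations whose time span intersects [start_ms, end_ms)."""
        return [
            o for o in self._obs
            if o["start_ms"] < end_ms and o["end_ms"] > start_ms
        ]


timeline = TemporalIndex([
    {"id": "welcome", "type": "transcript", "start_ms": 2000, "end_ms": 4000},
    {"id": "alarm", "type": "audio_event", "start_ms": 10000, "end_ms": 12000},
    {"id": "door", "type": "object", "start_ms": 11000, "end_ms": 13000},
    {"id": "evacuate", "type": "transcript", "start_ms": 12500, "end_ms": 14000},
])
after_alarm = timeline.first_after(12000)
during_alarm = [o["id"] for o in timeline.overlapping(10000, 12000)]
```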

    Graph or Lineage Index



    Lineage connects derived data back to the original object:

    - crop belongs to frame
    - frame belongs to scene
    - scene belongs to video
    - transcript span overlaps scene
    - detected logo overlaps product crop

    This is what lets an agent cite evidence without losing the source.
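    The "belongs to" relationships form a parent chain that can be walked from any derived item back to its source. The sketch below stores the chain as a plain dictionary; the ids are illustrative, and a real lineage store would also hold the "overlaps" edges, which this example leaves out.

```python
from typing import Dict, List


def lineage(node_id: str, parents: Dict[str, str]) -> List[str]:
    """Follow parent links from a derived item up to its source object."""
    chain = [node_id]
    seen = {node_id}
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


# child id -> parent id
parents = {
    "crop_88": "frame_7312",
    "frame_7312": "scene_014",
    "scene_014": "video_123",
}
chain = lineage("crop_88", parents)
source_id = chain[-1]
```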

    Step 4: Query Planning Across Layers



    An agent question usually maps to more than one retrieval operation.

    Question: "Find clips where the host mentions pricing while the product is visible on screen."

    A robust plan:

    1. Search transcript spans for "pricing", "cost", "plan", "discount", and related terms.
    2. Filter visual observations for product detections or product-like regions.
    3. Join transcript spans and visual observations by overlapping time windows.
    4. Rank joined scenes by transcript relevance, visual confidence, and temporal overlap.
    5. Return clips with timestamps, transcript snippets, and product crop evidence.

    This is different from asking a single vector index for the whole query. The query contains at least three constraints:

    - Spoken content: pricing language
    - Visual content: product visible
    - Temporal relationship: both happen at the same time

    The system needs layered indexes to satisfy all three.
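    The join and rank steps of that plan can be sketched as a temporal overlap join. The inputs stand in for results already returned by the transcript search and the visual filter. The scoring formula (relevance times confidence times the fraction of the transcript span that overlaps) is one plausible choice made up for this example, not a recommended standard.

```python
from typing import Dict, List


def overlap_ms(a: Dict, b: Dict) -> int:
    """Length of the time intersection of two observations, or 0 if disjoint."""
    return max(0, min(a["end_ms"], b["end_ms"]) - max(a["start_ms"], b["start_ms"]))


def join_by_overlap(transcript_hits: List[Dict], visual_hits: List[Dict]) -> List[Dict]:
    """Pair transcript and visual hits that co-occur, best score first."""
    joined = []
    for t in transcript_hits:
        for v in visual_hits:
            shared = overlap_ms(t, v)
            if shared == 0:
                continue
            fraction = shared / (t["end_ms"] - t["start_ms"])
            joined.append({
                "transcript_id": t["id"],
                "visual_id": v["id"],
                "start_ms": max(t["start_ms"], v["start_ms"]),
                "end_ms": min(t["end_ms"], v["end_ms"]),
                "score": t["relevance"] * v["confidence"] * fraction,
            })
    joined.sort(key=lambda row: row["score"], reverse=True)
    return joined


transcript_hits = [
    {"id": "t1", "start_ms": 761_000, "end_ms": 778_000, "relevance": 0.8},
    {"id": "t2", "start_ms": 100_000, "end_ms": 105_000, "relevance": 0.9},
]
visual_hits = [
    {"id": "v1", "start_ms": 765_000, "end_ms": 780_000, "confidence": 0.91},
    {"id": "v2", "start_ms": 300_000, "end_ms": 310_000, "confidence": 0.95},
]
clips = join_by_overlap(transcript_hits, visual_hits)
```

    Note that `t2` has the highest transcript relevance but is dropped, because no product detection overlaps it in time.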

    Step 5: Evidence Assembly for Agents



    Agents should receive evidence packets, not raw search rows. A good packet includes:

    - Human-readable summary
    - Source URL or object id
    - Timestamp or page reference
    - Matching observations
    - Confidence values
    - Snippets or crops
    - Enough context before and after the match
    - Tool-call friendly ids for follow-up inspection

    Example:

    {
      "evidence_id": "ev_42",
      "source": "launch_demo.mp4",
      "time": "00:12:41-00:12:58",
      "why_matched": [
        "transcript mentions pricing twice",
        "product package detected in 9 of 12 keyframes"
      ],
      "observations": [
        { "type": "transcript", "text": "the starter plan is priced for small teams" },
        { "type": "object", "label": "product package", "confidence": 0.91 }
      ],
      "follow_up_tools": {
        "inspect_clip": "clip_00_12_41",
        "open_frame": "frame_7312"
      }
    }
    


    This format gives the agent grounding. It can answer, ask for another retrieval pass, inspect the clip, or cite the source.

    Common Failure Modes



    One Vector Per File



    A single vector for a 40-minute video or 90-page PDF loses most of the evidence. It can tell you the general topic but not the moment, page, or region that supports an answer.

    Captions Without Provenance



    Caption strings are lossy. If a caption says "a person stands near a machine," the agent still needs the timestamp, frame id, model version, and confidence. Otherwise it cannot audit the claim.

    Mixing Model Versions



    If embeddings from model v1 and v2 live in the same index, their similarity scores are not comparable, because each model defines its own vector space. Store model name and version on every feature. Reindex when the embedding space changes.
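    One way to enforce this is to bind each vector index to a single model name and version and reject anything else at write time. The `VersionedVectorIndex` class and the model name below are hypothetical; the point is only that the check happens before a mismatched vector can enter the index.

```python
from typing import Dict, List, Tuple


class VersionedVectorIndex:
    """Vector store that accepts features from exactly one embedding model."""

    def __init__(self, model_name: str, model_version: str):
        self.model: Tuple[str, str] = (model_name, model_version)
        self._vectors: Dict[str, List[float]] = {}

    def add(self, feature_id: str, vector: List[float], model_name: str, model_version: str) -> None:
        """Store a vector, refusing features produced by a different model."""
        if (model_name, model_version) != self.model:
            raise ValueError(
                f"index holds {self.model}, got ({model_name!r}, {model_version!r}); "
                "write to a new index and reindex instead"
            )
        self._vectors[feature_id] = vector

    def __len__(self) -> int:
        return len(self._vectors)


index = VersionedVectorIndex("example_encoder", "v1")
index.add("feat_1", [0.1, 0.2], "example_encoder", "v1")
try:
    index.add("feat_2", [0.3, 0.4], "example_encoder", "v2")
    rejected = False
except ValueError:
    rejected = True
```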

    Treating Visual Observations as Truth



    Object detectors, OCR, diarization, and VLMs all make mistakes. Store confidence, expose uncertainty, and let agents cross-check with other evidence layers.

    Ignoring Negative Space



    Sometimes the important fact is absence: no helmet, no logo, no disclosure text, no expected speaker. Absence requires knowing which observations were attempted, not just which ones were found.
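    A minimal way to represent this is an attempt log kept alongside the observations, so "absent" can be told apart from "never checked". The three-state result and the record shapes below are assumptions made for this sketch.

```python
from typing import Dict, List, Set, Tuple


def absence_status(
    segment_id: str,
    extractor: str,
    label: str,
    attempts: Set[Tuple[str, str]],
    observations: List[Dict],
) -> str:
    """Return 'present', 'absent', or 'unknown' for a label in a segment."""
    if (segment_id, extractor) not in attempts:
        # The extractor never ran here, so absence cannot be claimed.
        return "unknown"
    found = any(
        o["segment_id"] == segment_id and o["label"] == label
        for o in observations
    )
    return "present" if found else "absent"


# (segment_id, extractor) pairs that were actually processed.
attempts = {("scene_014", "helmet_detector"), ("scene_015", "helmet_detector")}
observations = [{"segment_id": "scene_014", "label": "safety helmet"}]

statuses = {
    scene: absence_status(scene, "helmet_detector", "safety helmet", attempts, observations)
    for scene in ("scene_014", "scene_015", "scene_016")
}
```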

    No Expansion Window



    The exact matching segment is often too small for reasoning. Return the match plus neighboring context: previous speaker turn, next scene, surrounding paragraph, or adjacent frames.
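    Expansion is a bounded slice around the match in the ordered list of segments. The function below is a small sketch that assumes segments from one source are already sorted in reading or playback order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand(segments: Sequence[T], match_index: int, before: int = 1, after: int = 1) -> List[T]:
    """Return the matching segment plus its neighbors, clamped to the list bounds."""
    lo = max(0, match_index - before)
    hi = min(len(segments), match_index + after + 1)
    return list(segments[lo:hi])


turns = [
    "rep: here is the starter plan",
    "customer: what does it cost?",
    "rep: it is priced for small teams",
    "customer: that works for us",
]
context = expand(turns, match_index=2)
at_start = expand(turns, match_index=0)
```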

    A Reference Architecture



    The architecture below is a practical starting point for multimodal agent perception:

    1. Ingest source objects into object storage.
    2. Generate segments by modality: scenes, pages, speaker turns, crops.
    3. Run extractors on each segment: ASR, OCR, detection, captioning, embeddings.
    4. Write observations into a structured store with provenance.
    5. Write features into layered indexes: dense, sparse, structured, temporal.
    6. Expose retrieval tools that return evidence packets.
    7. Let the agent inspect, refine, cite, or escalate based on evidence quality.

    The key design principle: extraction is asynchronous and expensive; retrieval is synchronous and cheap. Do not make the agent wait for heavy perception every time it asks a question. Precompute observations, then let the agent query them quickly.

    How This Maps to Mixpeek



    Mixpeek models the same architecture with collections, feature extractors, retrievers, and namespaces.

    A video collection might extract transcript, visual embeddings, object detections, OCR, and scene captions. A retriever can then combine stages:

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    results = mx.retrievers.search(
        retriever_id="video-evidence-retriever",
        query="clips where the host discusses pricing while the product is visible",
        pipeline=[
            {
                "stage_type": "search",
                "stage_id": "transcript",
                "feature": "transcription",
                "limit": 100
            },
            {
                "stage_type": "filter",
                "stage_id": "product_visible",
                "feature": "object_detection",
                "conditions": [
                    {"field": "label", "operator": "contains", "value": "product"}
                ]
            },
            {
                "stage_type": "fusion",
                "stage_id": "ranked_evidence",
                "method": "reciprocal_rank_fusion",
                "limit": 20
            }
        ]
    )
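    For readers unfamiliar with the fusion method named in the last stage, the sketch below shows generic reciprocal rank fusion: each item scores the sum of 1 / (k + rank) across the input rankings. It illustrates the general technique with the commonly used constant k = 60 and is not a description of how Mixpeek implements it; the scene ids are invented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda item: (-scores[item], item))


transcript_ranking = ["scene_014", "scene_020", "scene_003"]
visual_ranking = ["scene_014", "scene_007", "scene_020"]
fused = reciprocal_rank_fusion([transcript_ranking, visual_ranking])
```

    Because only ranks are used, the transcript and visual stages do not need comparable score scales.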


    For teams bringing their own embeddings, MVS can serve the vector layer. For teams that need managed perception, Mixpeek can run the extraction layer and populate the observations and features from the media already stored in object storage.

    The important idea is not the vendor. It is the separation of concerns:

    - Extract once.
    - Preserve provenance.
    - Index each evidence layer correctly.
    - Let the agent retrieve evidence, not guesses.

    Design Checklist



    Use this checklist when designing an agent perception layer:

    - Can every result cite its source object?
    - Can every visual result cite a timestamp or bounding box?
    - Can every document result cite a page or layout region?
    - Are model name and version stored on every feature?
    - Are confidence scores stored and exposed?
    - Are dense, sparse, structured, and temporal indexes separated?
    - Can queries join observations across modalities?
    - Can the agent request neighboring context?
    - Can you reindex when an embedding model changes?
    - Can you audit which extractor produced a claim?

    Key Takeaways



    1. Agents need observations, not just files. A raw video or PDF is not a useful retrieval unit.

    2. The atomic unit of multimodal search is a localized observation with time, space, confidence, model version, and source lineage.

    3. Dense vectors are one layer, not the whole system. Sparse, structured, temporal, and lineage indexes are equally important.

    4. Multimodal queries often contain relationships: spoken content plus visual content plus temporal overlap. Single-stage search rarely satisfies all constraints.

    5. Retrieval tools for agents should return evidence packets that can be inspected and cited.

    6. The best perception systems precompute expensive observations and make retrieval fast enough for iterative agent loops.

    Further Reading



    - Multimodal Chunking Strategies
    - Multi-Index Search Architecture
    - Video Temporal Grounding
    - MCP Tool Design for Multimodal Search
    - Late Interaction Retrieval