    Updated 2026-06-07

    Production Ingestion Reliability for Agent Perception: Ledgers, Backfills, and Recall Checks

    A practical guide to making multimodal ingestion trustworthy. Learn the reliability patterns that keep video frames, transcripts, OCR spans, embeddings, payloads, and indexes complete enough for AI agents to search.

    Agent Perception
    Ingestion
    Reliability
    Backfills
    Recall Checks
    MVS

    The Problem: Agents Cannot Search What Was Never Indexed



    Retrieval quality is usually discussed as an embedding problem, a ranking problem, or a chunking problem. In production multimodal systems, there is a lower-level failure mode: the evidence never becomes searchable in the first place.

    An AI agent that reviews videos, listens to calls, reads PDFs, or inspects screenshots depends on an ingestion pipeline. That pipeline has to turn raw objects into searchable evidence:

  1. video frames and scene boundaries
  2. audio spans and transcripts
  3. OCR spans and page regions
  4. detected objects, faces, logos, and attributes
  5. dense, sparse, and lexical indexes
  6. payloads with timestamps, source URIs, model IDs, and permissions


    If any step silently drops rows, the agent may give a confident answer over an incomplete world. It does not know the red-light frame failed to embed. It does not know the transcript backfill skipped a file. It does not know OCR succeeded but the vector payload never committed.

    This guide teaches the reliability layer around multimodal ingestion: ledgers, invariants, idempotent document IDs, backfills, self-recall checks, and agent-facing diagnostics.

    Why Multimodal Ingestion Fails Differently



    Text-only RAG ingestion usually has a simple shape: load a document, split it, embed chunks, insert vectors.

    Multimodal ingestion fans out:

  1. A source object is discovered in object storage.
  2. The object is normalized and validated.
  3. Video is split into scenes, keyframes, audio, and thumbnails.
  4. Audio is chunked and transcribed.
  5. Pages or frames are sent through OCR.
  6. Vision models generate captions, objects, faces, or embeddings.
  7. Derived evidence is written to payload storage.
  8. Vectors are written to one or more indexes.
  9. Metadata is joined back to the source object.
  10. A retriever becomes allowed to search the result.

    Each step can fail independently. A batch can look complete while one extractor swallowed a row-level error. A vector write can succeed while payload metadata is missing. A retry can create duplicate evidence. A downstream index can be stale even though the upstream batch is done.

    That means the ingestion system needs explicit accounting. "The job finished" is not enough. You need to know what was submitted, what succeeded, what failed, what was skipped intentionally, and what disappeared.

    The Evidence Accounting Model



    Use a simple invariant for every batch and every stage:

    submitted = processed + failed + skipped + lost
    


    Definitions:

    Term      | Meaning                                                 | Action
    ----------|---------------------------------------------------------|--------------------------------------------
    submitted | Object, segment, page, or span was assigned to a stage  | Must end in exactly one terminal ledger
    processed | Stage wrote the expected evidence and index records     | Safe to search if downstream indexes are fresh
    failed    | Stage ended with a known error                          | Can be retried, inspected, or escalated
    skipped   | Stage intentionally did no work                         | Should include a reason
    lost      | Submitted work has no terminal record                   | Treat as an ingestion correctness bug

    The "lost" bucket is the important one. It catches silent data loss: rows that were scheduled but never reached success, failure, or skip accounting.

    For agent perception, lost evidence is worse than a visible failure. A visible failure can be retried. A lost row creates false confidence.
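
The invariant can be checked mechanically. The following is a minimal sketch, assuming each stage's ledger is a list of records with `unit_id` and `status` fields; those field names and the `audit_stage` helper are hypothetical and not taken from any specific SDK. Lost units are derived rather than recorded: they are the submitted IDs that have no terminal record.

```python
from collections import Counter

# Terminal states a stage is expected to record. "lost" is never written by a
# worker; it is whatever remains unaccounted for after reconciliation.
TERMINAL_STATUSES = {"processed", "failed", "skipped"}

def audit_stage(submitted_ids, ledger_records):
    """Reconcile submitted unit IDs against terminal ledger records (hypothetical schema)."""
    submitted = set(submitted_ids)
    status_by_unit = {}
    for record in ledger_records:
        unit_id, status = record["unit_id"], record["status"]
        if unit_id not in submitted or status not in TERMINAL_STATUSES:
            continue  # ignore other batches and non-terminal states
        status_by_unit[unit_id] = status  # the last terminal record wins

    counts = Counter(status_by_unit.values())
    lost_ids = sorted(submitted - set(status_by_unit))
    return {
        "submitted": len(submitted),
        "processed": counts["processed"],
        "failed": counts["failed"],
        "skipped": counts["skipped"],
        "lost": len(lost_ids),
        "lost_ids": lost_ids,
    }

audit = audit_stage(
    submitted_ids=["a", "b", "c", "d"],
    ledger_records=[
        {"unit_id": "a", "status": "processed"},
        {"unit_id": "b", "status": "failed"},
        {"unit_id": "c", "status": "skipped"},
    ],
)
```

Here unit `d` was submitted but never reached a terminal state, so it is the one reported in `lost_ids`. Returning the IDs, not only the count, is what makes the gap actionable.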

    Source Object to Searchable Evidence



    Every searchable evidence record should preserve lineage:

    {
      "document_id": "vid_42:scene:0007:caption:qwen3vl8b:v1",
      "source_uri": "s3://media/raw/vid_42.mp4",
      "source_version": "etag:9f18...",
      "object_id": "vid_42",
      "modality": "video",
      "span": {"start_sec": 14.2, "end_sec": 18.9},
      "feature": "scene_caption",
      "model_id": "Qwen/Qwen3-VL-8B-Instruct",
      "extractor_version": "scene_caption@v1",
      "payload_uri": "s3://derived/vid_42/scenes/0007.json",
      "vector_index": "visual-agent-memory",
      "created_at": "2026-06-07T14:12:00Z"
    }
    


    This record answers four operational questions:

  1. What raw object produced this evidence?
  2. Which model and extractor produced it?
  3. Which exact time, page, region, or segment does it describe?
  4. Which index should return it during search?

    Without that lineage, debugging retrieval becomes guesswork.

    Deterministic IDs and Idempotent Writes



    Retries are normal. Workers restart. Object storage calls time out. GPU jobs get preempted. Agents and orchestrators replay steps.

    If retries generate random document IDs, they can create duplicate vectors. If retries overwrite blindly, they can corrupt a newer extraction result. The fix is deterministic identity.

    A robust document ID is derived from stable inputs:

    document_id =
      hash(
        namespace,
        source_uri,
        source_version,
        feature_name,
        extractor_version,
        model_id,
        chunk_key
      )
    


    For a video scene, the chunk key might be `scene_0007_14.2_18.9`. For a PDF page, it might be `page_004_region_12`. For an audio span, it might be `audio_000120_000150`.

    This makes writes idempotent:

  1. Same source, model, extractor, and chunk writes the same document.
  2. Retrying a successful write updates or confirms the same record.
  3. Changing model version creates a new record instead of overwriting old evidence.
  4. Backfills can merge missing features without duplicating existing ones.

    Idempotency is not only an API nicety. It is a correctness requirement for agent memory.
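
One way the ID derivation could look in Python is sketched below. The choice of SHA-256 and of JSON as the canonical encoding are assumptions; the text above only requires a hash over stable inputs. Python's built-in `hash()` is not suitable here because string hashes are randomized per process by default. The `make_document_id` name and the example values (including the `etag:example` version) are placeholders.

```python
import hashlib
import json

def make_document_id(namespace, source_uri, source_version, feature_name,
                     extractor_version, model_id, chunk_key):
    """Derive a stable document ID from the identity fields (SHA-256 is an assumption)."""
    identity = [namespace, source_uri, source_version, feature_name,
                extractor_version, model_id, chunk_key]
    # JSON encoding keeps field boundaries unambiguous, so ("ab", "c") and
    # ("a", "bc") cannot collapse into the same input string.
    canonical = json.dumps(identity, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Placeholder identity fields for one video scene.
base = dict(
    namespace="support_calls",
    source_uri="s3://media/raw/vid_42.mp4",
    source_version="etag:example",
    feature_name="scene_caption",
    extractor_version="scene_caption@v1",
    model_id="Qwen/Qwen3-VL-8B-Instruct",
    chunk_key="scene_0007_14.2_18.9",
)

first = make_document_id(**base)
retry = make_document_id(**base)  # same inputs: same ID
upgraded = make_document_id(**{**base, "model_id": "hypothetical/model-v2"})  # new ID
```

A retry reproduces `first` exactly, while the model change yields a separate ID, so the older evidence is not overwritten.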

    Failure Ledgers



    Every stage should write a terminal record for every submitted unit.

    A failure record should include:

  1. source object ID
  2. stage name
  3. extractor or model ID
  4. error class
  5. human-readable message
  6. retryable boolean
  7. worker or job ID
  8. attempt count
  9. relevant resource signal, such as OOM, timeout, validation error, or object-not-found

    Example:

    {
      "object_id": "vid_42",
      "stage": "transcription",
      "span": {"start_sec": 0, "end_sec": 1800},
      "status": "failed",
      "error_class": "GpuOutOfMemory",
      "message": "Audio chunk batch exceeded available memory",
      "retryable": true,
      "attempt": 2,
      "worker_id": "ray-job-7812"
    }
    


    Do not collapse this into "batch failed." Agents and operators need row-level recovery. If one 30-minute video failed transcription but 999 images indexed correctly, the system should preserve that detail.
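
Row-level records make recovery a simple filter. The sketch below assumes failure records shaped like the example above; the `plan_retries` helper and the `max_attempts` limit are hypothetical and would be tuned per stage.

```python
def plan_retries(failure_records, max_attempts=3):
    """Split row-level failures into a retry queue and an escalation queue.

    max_attempts is a hypothetical policy knob, not part of the ledger schema.
    """
    retry, escalate = [], []
    for record in failure_records:
        if record["retryable"] and record["attempt"] < max_attempts:
            retry.append(record)
        else:
            escalate.append(record)  # non-retryable, or retries exhausted
    return retry, escalate

failures = [
    {"object_id": "vid_42", "stage": "transcription",
     "error_class": "GpuOutOfMemory", "retryable": True, "attempt": 2},
    {"object_id": "vid_43", "stage": "transcription",
     "error_class": "ObjectNotFound", "retryable": False, "attempt": 1},
    {"object_id": "vid_44", "stage": "ocr",
     "error_class": "Timeout", "retryable": True, "attempt": 3},
]

retry, escalate = plan_retries(failures)
```

Only `vid_42` is retried. `vid_43` is not retryable and `vid_44` has used up its attempts, so both go to escalation with their error class intact.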

    Backfills



    A backfill repairs or upgrades an existing index without reprocessing everything from scratch.

    Common backfill triggers:

  1. A new extractor is added.
  2. A model version changes.
  3. A bug dropped vectors or payloads.
  4. A previous extractor skipped a modality.
  5. A stale index needs to be rebuilt from derived evidence.
  6. A transcript merge missed audio spans.

    A safe backfill has these properties:

  1. It selects work from source truth, not from a possibly stale index.
  2. It shards deterministically so the job can run in parallel.
  3. It uses the same deterministic document IDs as normal ingestion.
  4. It writes processed, failed, skipped, and lost accounting.
  5. It can resume from a cursor.
  6. It produces a before-and-after audit.

    Backfills should not be a one-off script with print statements. They are part of the ingestion system.
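
Three of these properties (selecting from source truth, deterministic sharding, and cursor-based resume) can be sketched in a few lines. This is an illustration under assumptions: `shard_for` and `select_backfill_work` are hypothetical helpers, object IDs are plain strings, and the cursor is simply the last object ID a shard finished.

```python
import hashlib

def shard_for(object_id, num_shards):
    """Map an object ID to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def select_backfill_work(source_object_ids, processed_ids, shard, num_shards, cursor=None):
    """Yield the object IDs this shard still has to process, in sorted order.

    source_object_ids: listing taken from source truth (for example the raw
        bucket), never from the index being repaired.
    processed_ids: IDs that already have a "processed" ledger record.
    cursor: last object ID this shard finished; work resumes strictly after it.
    """
    for object_id in sorted(source_object_ids):
        if cursor is not None and object_id <= cursor:
            continue  # finished before the job was interrupted
        if shard_for(object_id, num_shards) != shard:
            continue  # owned by another worker
        if object_id in processed_ids:
            continue  # evidence already exists, nothing to repair
        yield object_id

source = [f"vid_{i:03d}" for i in range(20)]
already_processed = {"vid_003", "vid_007"}
work_by_shard = {
    shard: list(select_backfill_work(source, already_processed, shard, 4))
    for shard in range(4)
}
```

Because the shard assignment depends only on the object ID, rerunning the job with the same shard count hands each worker the same slice, and the union of all slices is exactly the source listing minus what was already processed.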

    Recall Checks



    Ingestion correctness is not proven by worker success logs. Test whether known evidence is searchable.

    Use several recall checks:

    Count Invariants



    Compare expected and indexed counts by stage:

    expected transcript spans: 18,240
    indexed transcript vectors: 18,240
    payload records: 18,240
    searchable records: 18,238
    gap: 2
    


    The gap matters even if it is small. The missing two spans may be the only evidence for a compliance claim.
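
Counts reveal that a gap exists; ID sets reveal which records are missing. A minimal sketch, assuming each layer can list the document IDs it holds (the `find_gaps` helper and the layer names are hypothetical):

```python
def find_gaps(expected_ids, layers):
    """Return, per layer, the expected document IDs that the layer does not hold."""
    expected = set(expected_ids)
    return {name: sorted(expected - set(ids)) for name, ids in layers.items()}

expected = ["t1", "t2", "t3", "t4"]
gaps = find_gaps(expected, {
    "vectors": ["t1", "t2", "t3", "t4"],
    "payloads": ["t1", "t2", "t3", "t4"],
    "searchable": ["t1", "t3"],
})
```

In this toy run the vector and payload layers are complete, while the searchable layer is missing `t2` and `t4`, which points at index refresh rather than extraction.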

    Canary Objects



    Keep a small set of known objects with known evidence:

  1. a video with a known spoken phrase
  2. a PDF with a known table cell
  3. an image with a known object
  4. a screenshot with a known error code
  5. an audio file with a known speaker turn

    After ingestion, query for those facts and assert they return the expected source and span.

    Self-Recall



    For every written vector, run a search with that vector itself or with a known paired query. The expected document should appear at rank 1 or within a strict top-k.

    Self-recall catches vector/payload divergence:

  1. vector exists but payload is missing
  2. payload exists but vector is missing
  3. vector was written to the wrong namespace
  4. model dimension or distance metric is wrong
  5. index refresh has not caught up
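
A self-recall check only needs the vectors the writer believes it stored and a search function. The sketch below uses a toy exact cosine search over a dict as a stand-in for a real index; `self_recall_failures` and `make_bruteforce_search` are hypothetical helpers, and a production check would call the actual search API instead.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def make_bruteforce_search(index):
    """Toy stand-in for a vector index: exact cosine search over a dict."""
    def search(vector, top_k):
        ranked = sorted(index, key=lambda doc_id: cosine(index[doc_id], vector), reverse=True)
        return ranked[:top_k]
    return search

def self_recall_failures(written, search, top_k=1):
    """Return IDs of written vectors that do not retrieve themselves within top_k.

    written: dict of document_id -> vector the writer believes it stored.
    search: callable (vector, top_k) -> list of document IDs, best match first.
    """
    return [doc_id for doc_id, vector in written.items()
            if doc_id not in search(vector, top_k)]

written = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
index_missing_c = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # the write for "c" never landed

failures = self_recall_failures(written, make_bruteforce_search(index_missing_c))
```

Document `c` is reported because searching with its own vector returns a different document. With approximate indexes, a slightly larger `top_k` may be a more realistic threshold than rank 1.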


    Sentinel Queries



    Write a small suite of natural-language queries that represent important agent tasks:

    [
      {
        "query": "invoice page with ACME renewal table",
        "expected_document_id": "pdf_102:page:004:ocr:v1"
      },
      {
        "query": "customer says setup failed while the red indicator light is visible",
        "expected_document_id": "call_88:scene:0012:fused:v1"
      }
    ]
    


    Run sentinel queries after deploys, backfills, model upgrades, and index rebuilds.
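
A sentinel suite can be run with a small harness. This sketch assumes the suite uses the JSON shape shown above and that `search` is any callable returning document IDs for a query; `run_sentinels` and the fake search backend are hypothetical stand-ins for a real retriever.

```python
def run_sentinels(sentinels, search, top_k=5):
    """Run each sentinel query and report those whose expected document is missing."""
    failures = []
    for case in sentinels:
        returned = search(case["query"], top_k)
        if case["expected_document_id"] not in returned:
            failures.append({
                "query": case["query"],
                "expected_document_id": case["expected_document_id"],
                "returned": returned,
            })
    return failures

sentinels = [
    {"query": "invoice page with ACME renewal table",
     "expected_document_id": "pdf_102:page:004:ocr:v1"},
    {"query": "customer says setup failed while the red indicator light is visible",
     "expected_document_id": "call_88:scene:0012:fused:v1"},
]

# Fake backend for illustration: only the first query returns its document.
fake_results = {"invoice page with ACME renewal table": ["pdf_102:page:004:ocr:v1"]}

def fake_search(query, top_k):
    return fake_results.get(query, [])[:top_k]

sentinel_failures = run_sentinels(sentinels, fake_search)
```

Keeping the returned IDs in each failure record shows whether the expected document is absent or merely outranked.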

    Freshness and Read-After-Write



    Object-storage-backed vector systems often separate write durability from search readiness. A write can be committed to a log before all optimized indexes are built.

    That is not inherently bad. It becomes a problem when the system hides the state.

    Expose freshness in the search contract:

    {
      "namespace": "support-videos",
      "write_committed_at": "2026-06-07T14:12:00Z",
      "indexed_through": "2026-06-07T14:11:52Z",
      "unindexed_bytes": 1839204,
      "strong_consistency_available": true
    }
    


    An agent can use this signal:

  1. If it just wrote memory, request a strong-consistency search.
  2. If stale results are acceptable, use the faster normal path.
  3. If a namespace is still indexing, wait, retry, or tell the user evidence is not ready.

    Freshness metadata prevents "I just uploaded it but search cannot find it" from turning into mystery behavior.
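
The decision an agent makes from this signal can be written as a small function. The sketch assumes a freshness payload shaped like the example above; the `choose_search_mode` helper and its three return values are hypothetical names for the normal path, a strong-consistency search, and waiting.

```python
from datetime import datetime

def parse_ts(value):
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def choose_search_mode(freshness, wrote_at=None):
    """Pick "normal", "strong", or "wait" from a freshness payload.

    wrote_at: ISO timestamp of the agent's own most recent write, if it has one
    that it needs to read back.
    """
    if wrote_at is None:
        return "normal"  # stale results are acceptable
    if parse_ts(freshness["indexed_through"]) >= parse_ts(wrote_at):
        return "normal"  # the index already covers the write
    if freshness.get("strong_consistency_available"):
        return "strong"
    return "wait"

freshness = {
    "namespace": "support-videos",
    "write_committed_at": "2026-06-07T14:12:00Z",
    "indexed_through": "2026-06-07T14:11:52Z",
    "unindexed_bytes": 1839204,
    "strong_consistency_available": True,
}

mode = choose_search_mode(freshness, wrote_at="2026-06-07T14:12:00Z")
```

With the example payload the index trails the write by eight seconds, so the function selects the strong-consistency path.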

    Diagnostics for Agents



    Agents need machine-readable failure explanations, not just HTTP errors.

    A good retrieval or ingestion tool should be able to say:

  1. The object was never submitted.
  2. The object was submitted and failed OCR.
  3. The object was processed but is not indexed yet.
  4. The vector exists but the payload is missing.
  5. The filter excluded every matching result.
  6. The query searched the wrong namespace.
  7. The relevant stage was skipped because no audio track was detected.

    Example diagnostic payload:

    {
      "status": "completed_with_errors",
      "coverage": {
        "submitted": 1000,
        "processed": 982,
        "failed": 16,
        "skipped": 2,
        "lost": 0
      },
      "search_readiness": {
        "vectors_indexed": 982,
        "payloads_indexed": 982,
        "fresh": true
      },
      "top_failure_classes": [
        {"class": "ObjectNotFound", "count": 9},
        {"class": "GpuOutOfMemory", "count": 7}
      ]
    }
    


    The agent can route this into recovery:

  1. retry retryable failures
  2. request human review for non-retryable files
  3. search only complete modalities
  4. wait for index freshness
  5. report partial coverage in the final answer
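
A rough sketch of that routing follows, assuming a diagnostic payload shaped like the example above. The `plan_recovery` helper and the action names are hypothetical labels that an orchestrator would map to real steps.

```python
def plan_recovery(diagnostic):
    """Turn a diagnostic payload into an ordered list of recovery actions."""
    coverage = diagnostic["coverage"]
    readiness = diagnostic["search_readiness"]
    actions = []
    if coverage["lost"] > 0:
        actions.append("escalate_lost_rows")  # correctness bug, not a warning
    if coverage["failed"] > 0:
        actions.append("retry_or_review_failures")
    if (not readiness["fresh"]
            or readiness["vectors_indexed"] != readiness["payloads_indexed"]):
        actions.append("wait_for_index")
    if coverage["processed"] < coverage["submitted"]:
        actions.append("report_partial_coverage")
    return actions

diagnostic = {
    "status": "completed_with_errors",
    "coverage": {"submitted": 1000, "processed": 982, "failed": 16, "skipped": 2, "lost": 0},
    "search_readiness": {"vectors_indexed": 982, "payloads_indexed": 982, "fresh": True},
}

actions = plan_recovery(diagnostic)
```

For the example payload this yields a retry-or-review step and a partial-coverage note, and no wait, since the index is fresh and the vector and payload counts match.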


    Agent-Facing Tool Contract



    If an agent uses retrieval as a tool, the tool should return evidence and reliability state.

    Minimal fields:

  1. results
  2. source object IDs
  3. timestamps, pages, or regions
  4. matched stages
  5. model and extractor versions
  6. freshness status
  7. coverage summary
  8. partial failure summary
  9. follow-up inspection handles

    Example:

    {
      "query": "find the clip where the setup failed",
      "results": [
        {
          "document_id": "call_88:scene:0012:fused:v1",
          "source_uri": "s3://calls/call_88.mp4",
          "timestamp": {"start": 122.4, "end": 135.9},
          "matched_stages": ["transcription", "scene_caption"],
          "why": "Transcript says setup failed and scene caption shows red indicator"
        }
      ],
      "coverage": {
        "transcription": "complete",
        "scene_caption": "complete",
        "ocr": "skipped_no_text_regions"
      },
      "freshness": "fresh"
    }
    


    This prevents the agent from treating partial evidence as complete evidence.
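
On the agent side, the reliability state can be reduced to a list of caveats to carry into the final answer. A minimal sketch, assuming the response shape shown above; `coverage_caveats` is a hypothetical helper.

```python
def coverage_caveats(tool_response):
    """List reasons the returned evidence should not be treated as complete."""
    caveats = []
    for stage, state in tool_response.get("coverage", {}).items():
        if state != "complete":
            caveats.append(f"{stage}: {state}")
    if tool_response.get("freshness") != "fresh":
        # A missing freshness field is treated as unknown rather than fresh.
        caveats.append(f"freshness: {tool_response.get('freshness', 'unknown')}")
    return caveats

response = {
    "query": "find the clip where the setup failed",
    "results": [{"document_id": "call_88:scene:0012:fused:v1"}],
    "coverage": {
        "transcription": "complete",
        "scene_caption": "complete",
        "ocr": "skipped_no_text_regions",
    },
    "freshness": "fresh",
}

caveats = coverage_caveats(response)
```

For the example response the only caveat is the skipped OCR stage, which the agent can mention if on-screen text would have mattered to the question.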

    How This Maps to Mixpeek and MVS



    Mixpeek's managed ingestion layer decomposes raw objects into features. MVS stores and searches the vector layer on object storage. The reliability pattern is the same whether you use managed extraction or bring your own vectors.

    For managed ingestion, track per-batch accounting:

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    batch = mx.ingest.videos(
        collection="support_calls",
        source={"type": "s3", "bucket": "support-media", "prefix": "calls/"},
        pipeline={
            "transcription": {"model": "openai/whisper-large-v3"},
            "scene_caption": {"model": "Qwen/Qwen3-VL-8B-Instruct"},
            "visual_embedding": {"model": "Qwen/Qwen3-VL-Embedding-2B"},
        },
        idempotency_key="support_calls_2026_06_07",
    )

    audit = mx.batches.audit(batch.id)

    assert audit["submitted"] == (
        audit["processed"] + audit["failed"] + audit["skipped"] + audit["lost"]
    )
    assert audit["lost"] == 0


    For MVS standalone, make document IDs deterministic and run self-recall checks after upsert:

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    doc_id = "call_88:scene:0012:qwen3vl_embedding:v1"

    # `embedding` is assumed to be the vector your embedding model
    # produced for this scene; it is not defined in this snippet.
    mx.mvs.upsert(
        namespace="support_calls",
        vectors=[{
            "id": doc_id,
            "values": embedding,
            "metadata": {
                "source_uri": "s3://support-media/calls/call_88.mp4",
                "start_sec": 122.4,
                "end_sec": 135.9,
                "feature": "visual_embedding",
                "model_id": "Qwen/Qwen3-VL-Embedding-2B",
            },
        }],
    )

    # Self-recall: searching with the vector just written should return it first.
    hits = mx.mvs.search(
        namespace="support_calls",
        vector=embedding,
        top_k=5,
        consistency="strong",
    )

    assert hits[0]["id"] == doc_id


    The implementation details will vary by SDK version, but the reliability requirements stay stable: deterministic IDs, complete ledgers, backfills, recall checks, and freshness metadata.

    Design Checklist



  1. Every source object has a stable ID and source version.
  2. Every derived observation has a deterministic document ID.
  3. Every stage has submitted, processed, failed, skipped, and lost counts.
  4. Lost work is treated as a correctness bug, not a warning.
  5. Failures store machine-readable error classes and retryability.
  6. Backfills use the same IDs and ledgers as normal ingestion.
  7. Vector records and payload records can be reconciled.
  8. Search responses expose freshness or consistency state.
  9. Canary objects cover video, audio, OCR, images, and documents.
  10. Self-recall checks run after bulk upserts and index rebuilds.
  11. Agent tools return coverage and partial-failure state.
  12. Final agent answers can cite source object, timestamp, page, or region.

    Key Takeaways



    1. Agent perception depends on ingestion correctness. Ranking cannot recover evidence that was never indexed.

    2. Multimodal ingestion needs per-stage accounting because every modality can fail independently.

    3. Use the invariant submitted = processed + failed + skipped + lost. Treat lost rows as silent data loss.

    4. Deterministic IDs make retries, backfills, and model upgrades safe.

    5. Recall checks should search for known evidence, not just inspect worker logs.

    6. Object-storage vector systems should expose freshness so agents know whether new writes are searchable.

    7. Agent retrieval tools should return coverage and diagnostics alongside results.

    Further Reading



  1. Agent Perception Evals
  2. Retrieval Control Planes for AI Agents
  3. Multimodal Chunking Strategies
  4. MCP Tool Design for Multimodal Search
  5. OpenAI tools guide
  6. Model Context Protocol roadmap
  7. LlamaIndex introduction to RAG
  8. turbopuffer concepts