Production Ingestion Reliability for Agent Perception: Ledgers, Backfills, and Recall Checks

The Problem: Agents Cannot Search What Was Never Indexed

Retrieval quality is usually discussed as an embedding problem, a ranking problem, or a chunking problem. In production multimodal systems, there is a lower-level failure mode: the evidence never becomes searchable in the first place.

An AI agent that reviews videos, listens to calls, reads PDFs, or inspects screenshots depends on an ingestion pipeline. That pipeline has to turn raw objects into searchable evidence:

video frames and scene boundaries

audio spans and transcripts

OCR spans and page regions

detected objects, faces, logos, and attributes

dense, sparse, and lexical indexes

payloads with timestamps, source URIs, model IDs, and permissions

If any step silently drops rows, the agent may give a confident answer over an incomplete world. It does not know the red-light frame failed to embed. It does not know the transcript backfill skipped a file. It does not know OCR succeeded but the vector payload never committed.

This guide teaches the reliability layer around multimodal ingestion: ledgers, invariants, idempotent document IDs, backfills, self-recall checks, and agent-facing diagnostics.

Why Multimodal Ingestion Fails Differently

Text-only RAG ingestion usually has a simple shape: load a document, split it, embed chunks, insert vectors.

Multimodal ingestion fans out:

1. A source object is discovered in object storage.

2. The object is normalized and validated.

3. Video is split into scenes, keyframes, audio, and thumbnails.

4. Audio is chunked and transcribed.

5. Pages or frames are sent through OCR.

6. Vision models generate captions, objects, faces, or embeddings.

7. Derived evidence is written to payload storage.

8. Vectors are written to one or more indexes.

9. Metadata is joined back to the source object.

10. A retriever becomes allowed to search the result.

Each step can fail independently. A batch can look complete while one extractor swallowed a row-level error. A vector write can succeed while payload metadata is missing. A retry can create duplicate evidence. A downstream index can be stale even though the upstream batch is done.

That means the ingestion system needs explicit accounting. "The job finished" is not enough. You need to know what was submitted, what succeeded, what failed, what was skipped intentionally, and what disappeared.

The Evidence Accounting Model

Use a simple invariant for every batch and every stage:

submitted = processed + failed + skipped + lost

Definitions:

Term	Meaning	Action

submitted	Object, segment, page, or span was assigned to a stage	Must end with exactly one terminal ledger record
processed	Stage wrote the expected evidence and index records	Safe to search if downstream indexes are fresh
failed	Stage ended with a known error	Can be retried, inspected, or escalated
skipped	Stage intentionally did no work	Should include a reason
lost	Submitted work has no terminal record	Treat as an ingestion correctness bug

The "lost" bucket is the important one. It catches silent data loss: rows that were scheduled but never reached success, failure, or skip accounting.

For agent perception, lost evidence is worse than a visible failure. A visible failure can be retried. A lost row creates false confidence.
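A minimal sketch of how the invariant could be checked for one stage, assuming terminal records are available as (unit_id, status) pairs. The function name audit_stage and the record shape are illustrative assumptions, not part of any specific SDK. The "lost" count is derived rather than recorded: it is whatever was submitted but has no terminal record.

```python
from collections import Counter

TERMINAL_STATUSES = {"processed", "failed", "skipped"}


def audit_stage(submitted_ids, terminal_records):
    """Reconcile submitted unit IDs against terminal ledger records.

    terminal_records: iterable of (unit_id, status) pairs.
    Returns per-status counts plus the IDs that have no terminal record.
    """
    submitted = set(submitted_ids)
    seen = {}
    for unit_id, status in terminal_records:
        if status not in TERMINAL_STATUSES:
            raise ValueError(f"unknown terminal status: {status!r}")
        if unit_id in seen and seen[unit_id] != status:
            raise ValueError(f"{unit_id} has conflicting terminal records")
        seen[unit_id] = status
    # Only count terminal records for units that were actually submitted.
    counts = Counter(s for uid, s in seen.items() if uid in submitted)
    lost_ids = sorted(submitted - set(seen))
    return {
        "submitted": len(submitted),
        "processed": counts["processed"],
        "failed": counts["failed"],
        "skipped": counts["skipped"],
        "lost": len(lost_ids),
        "lost_ids": lost_ids,
    }


audit = audit_stage(
    ["a", "b", "c", "d"],
    [("a", "processed"), ("b", "failed"), ("c", "skipped")],
)
```

Here unit "d" was submitted but never reached a terminal state, so it is reported in lost_ids and can be investigated by name instead of hiding inside a batch total.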

Source Object to Searchable Evidence

Every searchable evidence record should preserve lineage:

{
  "document_id": "vid_42:scene:0007:caption:qwen3vl8b:v1",
  "source_uri": "s3://media/raw/vid_42.mp4",
  "source_version": "etag:9f18...",
  "object_id": "vid_42",
  "modality": "video",
  "span": {"start_sec": 14.2, "end_sec": 18.9},
  "feature": "scene_caption",
  "model_id": "Qwen/Qwen3-VL-8B-Instruct",
  "extractor_version": "scene_caption@v1",
  "payload_uri": "s3://derived/vid_42/scenes/0007.json",
  "vector_index": "visual-agent-memory",
  "created_at": "2026-06-07T14:12:00Z"
}

This record answers four operational questions:

1. What raw object produced this evidence?

2. Which model and extractor produced it?

3. Which exact time, page, region, or segment does it describe?

4. Which index should return it during search?

Without that lineage, debugging retrieval becomes guesswork.
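One way to keep lineage from eroding is to reject evidence records that lack it before they are written. The sketch below assumes the field names from the example record above; the helper name missing_lineage is hypothetical.

```python
REQUIRED_LINEAGE_FIELDS = (
    "document_id", "source_uri", "source_version", "object_id", "modality",
    "span", "feature", "model_id", "extractor_version", "payload_uri",
    "vector_index", "created_at",
)


def missing_lineage(record):
    """Return the lineage fields that are absent or empty in an evidence record."""
    return [
        field for field in REQUIRED_LINEAGE_FIELDS
        if record.get(field) in (None, "", {}, [])
    ]


record = {
    "document_id": "vid_42:scene:0007:caption:qwen3vl8b:v1",
    "source_uri": "s3://media/raw/vid_42.mp4",
    "source_version": "etag:9f18...",
    "object_id": "vid_42",
    "modality": "video",
    "span": {},  # empty span: the record no longer says which segment it describes
    "feature": "scene_caption",
    "model_id": "Qwen/Qwen3-VL-8B-Instruct",
    "extractor_version": "scene_caption@v1",
    "vector_index": "visual-agent-memory",
    "created_at": "2026-06-07T14:12:00Z",
}
problems = missing_lineage(record)
```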

Deterministic IDs and Idempotent Writes

Retries are normal. Workers restart. Object storage calls time out. GPU jobs get preempted. Agents and orchestrators replay steps.

If retries generate random document IDs, they can create duplicate vectors. If retries overwrite blindly, they can corrupt a newer extraction result. The fix is deterministic identity.

A robust document ID is derived from stable inputs:

document_id =
  hash(
    namespace,
    source_uri,
    source_version,
    feature_name,
    extractor_version,
    model_id,
    chunk_key
  )

For a video scene, the chunk key might be scene_0007_14.2_18.9. For a PDF page, it might be page_004_region_12. For an audio span, it might be audio_000120_000150.

This makes writes idempotent:

Same source, model, extractor, and chunk writes the same document.

Retrying a successful write updates or confirms the same record.

Changing model version creates a new record instead of overwriting old evidence.

Backfills can merge missing features without duplicating existing ones.

Idempotency is not only an API nicety. It is a correctness requirement for agent memory.
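The ID derivation above can be sketched with the standard library. This is an illustration under stated assumptions: SHA-256 as the hash, JSON as the canonical encoding, and a helper name (make_document_id) chosen for this example. Any stable hash and unambiguous encoding would serve the same purpose.

```python
import hashlib
import json


def make_document_id(namespace, source_uri, source_version, feature_name,
                     extractor_version, model_id, chunk_key):
    """Derive a stable document ID from the inputs that define a piece of evidence."""
    # Encode the fields as a JSON list so field boundaries are unambiguous:
    # plain concatenation would let ("ab", "c") collide with ("a", "bc").
    canonical = json.dumps(
        [namespace, source_uri, source_version, feature_name,
         extractor_version, model_id, chunk_key],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


common = dict(
    namespace="support_calls",
    source_uri="s3://media/raw/vid_42.mp4",
    source_version="etag:9f18...",
    feature_name="scene_caption",
    extractor_version="scene_caption@v1",
    chunk_key="scene_0007_14.2_18.9",
)
first_attempt = make_document_id(model_id="Qwen/Qwen3-VL-8B-Instruct", **common)
retry = make_document_id(model_id="Qwen/Qwen3-VL-8B-Instruct", **common)
new_model = make_document_id(model_id="some-other-model", **common)
```

A retry produces the same ID as the first attempt, so an upsert lands on the same record, while a different model ID yields a new record that sits alongside the old evidence.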

Failure Ledgers

Every stage should write a terminal record for every submitted unit.

A failure record should include:

source object ID

stage name

extractor or model ID

error class

human-readable message

retryable boolean

worker or job ID

attempt count

relevant resource signal, such as OOM, timeout, validation error, or object-not-found

Example:

{
  "object_id": "vid_42",
  "stage": "transcription",
  "span": {"start_sec": 0, "end_sec": 1800},
  "status": "failed",
  "error_class": "GpuOutOfMemory",
  "message": "Audio chunk batch exceeded available memory",
  "retryable": true,
  "attempt": 2,
  "worker_id": "ray-job-7812"
}

Do not collapse this into "batch failed." Agents and operators need row-level recovery. If one 30-minute video failed transcription but 999 images indexed correctly, the system should preserve that detail.
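Row-level failure records make recovery decisions mechanical. The following sketch models the record as a dataclass and splits failures into a retry queue and an escalation queue. The class name, the plan_recovery helper, and the max_attempts limit of 3 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    object_id: str
    stage: str
    error_class: str
    message: str
    retryable: bool
    attempt: int
    worker_id: str
    status: str = "failed"


def plan_recovery(failures, max_attempts=3):
    """Split failures into those worth retrying and those needing inspection."""
    retry, escalate = [], []
    for failure in failures:
        if failure.retryable and failure.attempt < max_attempts:
            retry.append(failure)
        else:
            escalate.append(failure)
    return retry, escalate


failures = [
    FailureRecord("vid_42", "transcription", "GpuOutOfMemory",
                  "Audio chunk batch exceeded available memory", True, 2, "ray-job-7812"),
    FailureRecord("vid_43", "transcription", "ObjectNotFound",
                  "Source object no longer exists", False, 1, "ray-job-7812"),
    FailureRecord("vid_44", "ocr", "Timeout",
                  "OCR call timed out", True, 3, "ray-job-7813"),
]
retry, escalate = plan_recovery(failures)
```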

Backfills

A backfill repairs or upgrades an existing index without reprocessing everything from scratch.

Common backfill triggers:

A new extractor is added.

A model version changes.

A bug dropped vectors or payloads.

A previous extractor skipped a modality.

A stale index needs to be rebuilt from derived evidence.

A transcript merge missed audio spans.

A safe backfill has these properties:

1. It selects work from source truth, not from a possibly stale index.

2. It shards deterministically so the job can run in parallel.

3. It uses the same deterministic document IDs as normal ingestion.

4. It writes processed, failed, skipped, and lost accounting.

5. It can resume from a cursor.

6. It produces a before-and-after audit.

Backfills should not be a one-off script with print statements. They are part of the ingestion system.
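Three of those properties (selecting from source truth, deterministic sharding, and cursor-based resume) can be sketched in a few lines. The function names and the use of the last completed object ID as the cursor are assumptions for this example; a real system would read the object listing and the ledger from storage.

```python
import hashlib


def shard_of(object_id, num_shards):
    """Stable shard assignment.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep assignments identical across workers.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def plan_backfill(source_object_ids, processed_ids, shard, num_shards, cursor=None):
    """Return the sorted object IDs this shard still needs to process.

    source_object_ids: listing taken from source truth, not from the index.
    processed_ids: IDs that already have a "processed" ledger record.
    cursor: last object ID handled by a previous run, or None to start over.
    """
    done = set(processed_ids)
    todo = sorted(
        oid for oid in set(source_object_ids)
        if oid not in done and shard_of(oid, num_shards) == shard
    )
    if cursor is not None:
        todo = [oid for oid in todo if oid > cursor]
    return todo


source = [f"vid_{i}" for i in range(20)]
processed = {"vid_3", "vid_7"}
plans = [plan_backfill(source, processed, shard, 4) for shard in range(4)]
```

Because the plan is sorted and sharded deterministically, rerunning the planner after a crash yields the same work list, and the cursor simply trims what was already handled.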

Recall Checks

Ingestion correctness is not proven by worker success logs. Test whether known evidence is searchable.

Use several recall checks:

Count Invariants

Compare expected and indexed counts by stage:

expected transcript spans: 18,240
indexed transcript vectors: 18,240
payload records: 18,240
searchable records: 18,238
gap: 2

The gap matters even if it is small. The missing two spans may be the only evidence for a compliance claim.
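A count check like the one above is easy to automate. This sketch assumes the counts have already been collected into a dictionary; the key names and the count_gaps helper are illustrative.

```python
def count_gaps(counts, baseline="expected"):
    """Return {stage: shortfall} for every count that differs from the baseline."""
    expected = counts[baseline]
    return {
        name: expected - value
        for name, value in counts.items()
        if name != baseline and value != expected
    }


gaps = count_gaps({
    "expected": 18240,
    "indexed_vectors": 18240,
    "payload_records": 18240,
    "searchable_records": 18238,
})
```

A non-empty result should fail the run or page an operator; a tolerance threshold is deliberately left out because even a two-record gap can matter.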

Canary Objects

Keep a small set of known objects with known evidence:

a video with a known spoken phrase

a PDF with a known table cell

an image with a known object

a screenshot with a known error code

an audio file with a known speaker turn

After ingestion, query for those facts and assert they return the expected source and span.

Self-Recall

For every written vector, run a search using that vector itself or a known paired query. The expected document should appear at rank 1 or within a strict top-k.

Self-recall catches vector/payload divergence:

vector exists but payload is missing

payload exists but vector is missing

vector was written to the wrong namespace

model dimension or distance metric is wrong

index refresh has not caught up
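The check can be sketched against any search function. The example below uses a toy in-memory cosine search so it runs on its own; in practice the search callable would wrap the real index. The function names and the (doc_id, score) result shape are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_recall_failures(vectors, payloads, search, top_k=1):
    """Return {doc_id: reason} for every record that fails self-recall."""
    failures = {}
    for doc_id, vector in vectors.items():
        if doc_id not in payloads:
            failures[doc_id] = "payload_missing"
            continue
        hit_ids = [hit_id for hit_id, _score in search(vector, top_k)]
        if doc_id not in hit_ids:
            failures[doc_id] = "not_in_top_k"
    # Payloads with no corresponding written vector.
    for doc_id in payloads.keys() - vectors.keys():
        failures[doc_id] = "vector_missing"
    return failures


written = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
indexed = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # "c" never reached the index
payloads = {"a": {}, "c": {}, "d": {}}


def toy_search(query, top_k):
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in indexed.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


failures = self_recall_failures(written, payloads, toy_search)
```

Each failure reason maps to one of the divergence cases listed above, which tells the operator whether to repair the payload store, the vector write, or the index.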

Sentinel Queries

Write a small suite of natural-language queries that represent important agent tasks:

[
  {
    "query": "invoice page with ACME renewal table",
    "expected_document_id": "pdf_102:page:004:ocr:v1"
  },
  {
    "query": "customer says setup failed while the red indicator light is visible",
    "expected_document_id": "call_88:scene:0012:fused:v1"
  }
]

Run sentinel queries after deploys, backfills, model upgrades, and index rebuilds.
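A sentinel suite can be run with a small harness. This sketch assumes a search callable that takes a query string and top_k and returns a list of document IDs; that signature, the run_sentinels name, and the stubbed index are illustrative.

```python
def run_sentinels(sentinels, search, top_k=5):
    """Return the sentinel cases whose expected document is not in the top-k."""
    misses = []
    for case in sentinels:
        hit_ids = search(case["query"], top_k)
        if case["expected_document_id"] not in hit_ids:
            misses.append({**case, "returned": hit_ids})
    return misses


sentinels = [
    {
        "query": "invoice page with ACME renewal table",
        "expected_document_id": "pdf_102:page:004:ocr:v1",
    },
    {
        "query": "customer says setup failed while the red indicator light is visible",
        "expected_document_id": "call_88:scene:0012:fused:v1",
    },
]

# Stand-in for a real retriever: only the first query has indexed evidence.
fake_index = {
    "invoice page with ACME renewal table": ["pdf_102:page:004:ocr:v1"],
}


def fake_search(query, top_k):
    return fake_index.get(query, [])[:top_k]


misses = run_sentinels(sentinels, fake_search)
```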

Freshness and Read-After-Write

Object-storage-backed vector systems often separate write durability from search readiness. A write can be committed to a log before all optimized indexes are built.

That is not inherently bad. It becomes a problem when the system hides the state.

Expose freshness in the search contract:

{
  "namespace": "support-videos",
  "write_committed_at": "2026-06-07T14:12:00Z",
  "indexed_through": "2026-06-07T14:11:52Z",
  "unindexed_bytes": 1839204,
  "strong_consistency_available": true
}

An agent can use this signal:

If it just wrote memory, request a strong-consistency search.

If stale results are acceptable, use the faster normal path.

If a namespace is still indexing, wait, retry, or tell the user evidence is not ready.

Freshness metadata prevents "I just uploaded it but search cannot find it" from turning into mystery behavior.
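The decision an agent makes from that freshness payload can be sketched as a small function. The field names follow the example payload above; the three return values ("normal", "strong", "wait") and the function name are assumptions for illustration.

```python
from datetime import datetime


def parse_ts(value):
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def choose_search_mode(freshness, needs_own_writes):
    """Pick a search path from freshness metadata."""
    committed = parse_ts(freshness["write_committed_at"])
    indexed = parse_ts(freshness["indexed_through"])
    caught_up = indexed >= committed and freshness.get("unindexed_bytes", 0) == 0
    if caught_up or not needs_own_writes:
        return "normal"
    if freshness.get("strong_consistency_available"):
        return "strong"
    return "wait"


freshness = {
    "namespace": "support-videos",
    "write_committed_at": "2026-06-07T14:12:00Z",
    "indexed_through": "2026-06-07T14:11:52Z",
    "unindexed_bytes": 1839204,
    "strong_consistency_available": True,
}
mode = choose_search_mode(freshness, needs_own_writes=True)
```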

Diagnostics for Agents

Agents need machine-readable failure explanations, not just HTTP errors.

A good retrieval or ingestion tool should be able to say:

The object was never submitted.

The object was submitted and failed OCR.

The object was processed but is not indexed yet.

The vector exists but the payload is missing.

The filter excluded every matching result.

The query searched the wrong namespace.

The relevant stage was skipped because no audio track was detected.

Example diagnostic payload:

{
  "status": "completed_with_errors",
  "coverage": {
    "submitted": 1000,
    "processed": 982,
    "failed": 16,
    "skipped": 2,
    "lost": 0
  },
  "search_readiness": {
    "vectors_indexed": 982,
    "payloads_indexed": 982,
    "fresh": true
  },
  "top_failure_classes": [
    {"class": "ObjectNotFound", "count": 9},
    {"class": "GpuOutOfMemory", "count": 7}
  ]
}

The agent can route this into recovery:

retry retryable failures

request human review for non-retryable files

search only complete modalities

wait for index freshness

report partial coverage in the final answer
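A diagnostic payload of that shape can be turned into a list of recovery actions. The action names and the plan_actions helper below are hypothetical; the input keys match the example payload above.

```python
def plan_actions(diag):
    """Map a diagnostic payload to an ordered list of recovery actions."""
    cov = diag["coverage"]
    ready = diag["search_readiness"]
    accounted = cov["processed"] + cov["failed"] + cov["skipped"] + cov["lost"]
    actions = []
    if accounted != cov["submitted"]:
        actions.append("escalate_accounting_mismatch")
    if cov["lost"] > 0:
        actions.append("escalate_lost_rows")
    if cov["failed"] > 0:
        actions.append("retry_or_review_failures")
    if ready["vectors_indexed"] != ready["payloads_indexed"]:
        actions.append("reconcile_vectors_and_payloads")
    if not ready["fresh"]:
        actions.append("wait_for_index")
    if cov["processed"] < cov["submitted"]:
        actions.append("report_partial_coverage")
    return actions


diag = {
    "status": "completed_with_errors",
    "coverage": {"submitted": 1000, "processed": 982, "failed": 16,
                 "skipped": 2, "lost": 0},
    "search_readiness": {"vectors_indexed": 982, "payloads_indexed": 982,
                         "fresh": True},
}
actions = plan_actions(diag)
```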

Agent-Facing Tool Contract

If an agent uses retrieval as a tool, the tool should return evidence and reliability state.

Minimal fields:

results

source object IDs

timestamps, pages, or regions

matched stages

model and extractor versions

freshness status

coverage summary

partial failure summary

follow-up inspection handles

Example:

{
  "query": "find the clip where the setup failed",
  "results": [
    {
      "document_id": "call_88:scene:0012:fused:v1",
      "source_uri": "s3://calls/call_88.mp4",
      "timestamp": {"start": 122.4, "end": 135.9},
      "matched_stages": ["transcription", "visual_caption"],
      "why": "Transcript says setup failed and scene caption shows red indicator"
    }
  ],
  "coverage": {
    "transcription": "complete",
    "scene_caption": "complete",
    "ocr": "skipped_no_text_regions"
  },
  "freshness": "fresh"
}

This prevents the agent from treating partial evidence as complete evidence.
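On the agent side, the reliability fields only help if they are read. A minimal sketch, assuming the response shape from the example above, that extracts the caveats an agent should carry into its final answer (the coverage_caveats name is hypothetical):

```python
def coverage_caveats(response):
    """List the stages and freshness states that make the evidence partial."""
    caveats = []
    for stage, state in sorted(response.get("coverage", {}).items()):
        if state != "complete":
            caveats.append(f"{stage}: {state}")
    freshness = response.get("freshness", "unknown")
    if freshness != "fresh":
        caveats.append(f"index freshness: {freshness}")
    return caveats


response = {
    "query": "find the clip where the setup failed",
    "results": [{"document_id": "call_88:scene:0012:fused:v1"}],
    "coverage": {
        "transcription": "complete",
        "scene_caption": "complete",
        "ocr": "skipped_no_text_regions",
    },
    "freshness": "fresh",
}
caveats = coverage_caveats(response)
```

An empty list means the agent can treat the results as drawn from complete evidence; anything else should be surfaced alongside the answer.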

How This Maps to Mixpeek and MVS

Mixpeek's managed ingestion layer decomposes raw objects into features. MVS stores and searches the vector layer on object storage. The reliability pattern is the same whether you use managed extraction or bring your own vectors.

For managed ingestion, track per-batch accounting:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

batch = mx.ingest.videos(
    collection="support_calls",
    source={"type": "s3", "bucket": "support-media", "prefix": "calls/"},
    pipeline={
        "transcription": {"model": "openai/whisper-large-v3"},
        "scene_caption": {"model": "Qwen/Qwen3-VL-8B-Instruct"},
        "visual_embedding": {"model": "Qwen/Qwen3-VL-Embedding-2B"}
    },
    idempotency_key="support_calls_2026_06_07"
)

audit = mx.batches.audit(batch.id)

assert audit["submitted"] == (
    audit["processed"] + audit["failed"] + audit["skipped"] + audit["lost"]
)
assert audit["lost"] == 0

For MVS standalone, make document IDs deterministic and run self-recall checks after upsert:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

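doc_id = "call_88:scene:0012:qwen3vl_embedding:v1"

# `embedding` is assumed to be the vector already computed for this scene
# by the embedding model named in the metadata below.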

mx.mvs.upsert(
    namespace="support_calls",
    vectors=[{
        "id": doc_id,
        "values": embedding,
        "metadata": {
            "source_uri": "s3://support-media/calls/call_88.mp4",
            "start_sec": 122.4,
            "end_sec": 135.9,
            "feature": "visual_embedding",
            "model_id": "Qwen/Qwen3-VL-Embedding-2B"
        }
    }]
)

hits = mx.mvs.search(
    namespace="support_calls",
    vector=embedding,
    top_k=5,
    consistency="strong"
)

assert hits[0]["id"] == doc_id

The implementation details will vary by SDK version, but the reliability requirements stay stable: deterministic IDs, complete ledgers, backfills, recall checks, and freshness metadata.

Design Checklist

Every source object has a stable ID and source version.

Every derived observation has a deterministic document ID.

Every stage has submitted, processed, failed, skipped, and lost counts.

Lost work is treated as a correctness bug, not a warning.

Failures store machine-readable error classes and retryability.

Backfills use the same IDs and ledgers as normal ingestion.

Vector records and payload records can be reconciled.

Search responses expose freshness or consistency state.

Canary objects cover video, audio, OCR, images, and documents.

Self-recall checks run after bulk upserts and index rebuilds.

Agent tools return coverage and partial-failure state.

Final agent answers can cite source object, timestamp, page, or region.

Key Takeaways

1. Agent perception depends on ingestion correctness. Ranking cannot recover evidence that was never indexed.

2. Multimodal ingestion needs per-stage accounting because every modality can fail independently.

3. Use the invariant submitted = processed + failed + skipped + lost. Treat lost rows as silent data loss.

4. Deterministic IDs make retries, backfills, and model upgrades safe.

5. Recall checks should search for known evidence, not just inspect worker logs.

6. Object-storage vector systems should expose freshness so agents know whether new writes are searchable.

7. Agent retrieval tools should return coverage and diagnostics alongside results.
