The Problem: Agents Cannot Search What Was Never Indexed
Retrieval quality is usually discussed as an embedding problem, a ranking problem, or a chunking problem. In production multimodal systems, there is a lower-level failure mode: the evidence never becomes searchable in the first place.
An AI agent that reviews videos, listens to calls, reads PDFs, or inspects screenshots depends on an ingestion pipeline. That pipeline has to turn raw objects into searchable evidence, one step at a time.
If any step silently drops rows, the agent may give a confident answer over an incomplete world. It does not know the red-light frame failed to embed. It does not know the transcript backfill skipped a file. It does not know OCR succeeded but the vector payload never committed.
This guide teaches the reliability layer around multimodal ingestion: ledgers, invariants, idempotent document IDs, backfills, self-recall checks, and agent-facing diagnostics.
Why Multimodal Ingestion Fails Differently
Text-only RAG ingestion usually has a simple shape: load a document, split it, embed chunks, insert vectors.
Multimodal ingestion fans out:
1. A source object is discovered in object storage.
2. The object is normalized and validated.
3. Video is split into scenes, keyframes, audio, and thumbnails.
4. Audio is chunked and transcribed.
5. Pages or frames are sent through OCR.
6. Vision models generate captions, objects, faces, or embeddings.
7. Derived evidence is written to payload storage.
8. Vectors are written to one or more indexes.
9. Metadata is joined back to the source object.
10. A retriever becomes allowed to search the result.
Each step can fail independently. A batch can look complete while one extractor swallowed a row-level error. A vector write can succeed while payload metadata is missing. A retry can create duplicate evidence. A downstream index can be stale even though the upstream batch is done.
That means the ingestion system needs explicit accounting. "The job finished" is not enough. You need to know what was submitted, what succeeded, what failed, what was skipped intentionally, and what disappeared.
The Evidence Accounting Model
Use a simple invariant for every batch and every stage:
submitted = processed + failed + skipped + lost
Definitions:
| Term | Meaning | Action |
| --- | --- | --- |
| submitted | Object, segment, page, or span was assigned to a stage | Must end in exactly one terminal ledger |
| processed | Stage wrote the expected evidence and index records | Safe to search if downstream indexes are fresh |
| failed | Stage ended with a known error | Can be retried, inspected, or escalated |
| skipped | Stage intentionally did no work | Should include a reason |
| lost | Submitted work has no terminal record | Treat as an ingestion correctness bug |
For agent perception, lost evidence is worse than a visible failure. A visible failure can be retried. A lost row creates false confidence.
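A minimal sketch of the accounting check, assuming the ledger is available as a plain mapping from unit ID to terminal status. The names `account`, `TERMINAL_STATUSES`, and the sample IDs are illustrative, not part of any SDK.

```python
from collections import Counter

TERMINAL_STATUSES = {"processed", "failed", "skipped"}

def account(submitted_ids, ledger):
    """Classify every submitted unit and return counts plus the lost IDs.

    `ledger` maps unit ID -> terminal status. A submitted ID with no
    terminal status is counted as lost.
    """
    counts = Counter(processed=0, failed=0, skipped=0, lost=0)
    lost_ids = []
    for unit_id in submitted_ids:
        status = ledger.get(unit_id)
        if status in TERMINAL_STATUSES:
            counts[status] += 1
        else:
            counts["lost"] += 1
            lost_ids.append(unit_id)
    counts["submitted"] = len(submitted_ids)
    # The sum holds by construction here; the actionable signal is lost > 0.
    assert counts["submitted"] == (
        counts["processed"] + counts["failed"] + counts["skipped"] + counts["lost"]
    )
    return dict(counts), lost_ids

# Hypothetical batch: unit "d" was submitted but never reached a terminal state.
submitted = ["a", "b", "c", "d"]
ledger = {"a": "processed", "b": "failed", "c": "skipped"}
report, lost_ids = account(submitted, ledger)
```

Deriving `lost` from the submitted set, rather than trusting a counter the workers report, is what lets the check surface rows that no worker ever acknowledged.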
Source Object to Searchable Evidence
Every searchable evidence record should preserve lineage:
{
  "document_id": "vid_42:scene:0007:caption:qwen3vl8b:v1",
  "source_uri": "s3://media/raw/vid_42.mp4",
  "source_version": "etag:9f18...",
  "object_id": "vid_42",
  "modality": "video",
  "span": {"start_sec": 14.2, "end_sec": 18.9},
  "feature": "scene_caption",
  "model_id": "Qwen/Qwen3-VL-8B-Instruct",
  "extractor_version": "scene_caption@v1",
  "payload_uri": "s3://derived/vid_42/scenes/0007.json",
  "vector_index": "visual-agent-memory",
  "created_at": "2026-06-07T14:12:00Z"
}
This record answers four operational questions:
1. What raw object produced this evidence?
2. Which model and extractor produced it?
3. Which exact time, page, region, or segment does it describe?
4. Which index should return it during search?
Without that lineage, debugging retrieval becomes guesswork.
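One way to enforce lineage is to reject records that lack it before they are written. The sketch below assumes records are plain dictionaries shaped like the example above; `REQUIRED_LINEAGE_FIELDS` and `missing_lineage` are hypothetical names chosen for illustration.

```python
# Fields needed to answer the four operational questions above.
REQUIRED_LINEAGE_FIELDS = (
    "document_id", "source_uri", "source_version", "span",
    "feature", "model_id", "extractor_version", "vector_index",
)

def missing_lineage(record):
    """Return the lineage fields that are absent or empty in a record."""
    return [field for field in REQUIRED_LINEAGE_FIELDS if not record.get(field)]

# Hypothetical record that lost its span and index assignment.
incomplete = {
    "document_id": "vid_42:scene:0007:caption:qwen3vl8b:v1",
    "source_uri": "s3://media/raw/vid_42.mp4",
    "source_version": "etag:9f18...",
    "feature": "scene_caption",
    "model_id": "Qwen/Qwen3-VL-8B-Instruct",
    "extractor_version": "scene_caption@v1",
}
gaps = missing_lineage(incomplete)
```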
Deterministic IDs and Idempotent Writes
Retries are normal. Workers restart. Object storage calls time out. GPU jobs get preempted. Agents and orchestrators replay steps.
If retries generate random document IDs, they can create duplicate vectors. If retries overwrite blindly, they can corrupt a newer extraction result. The fix is deterministic identity.
A robust document ID is derived from stable inputs:
document_id =
  hash(
    namespace,
    source_uri,
    source_version,
    feature_name,
    extractor_version,
    model_id,
    chunk_key
  )
For a video scene, the chunk key might be `scene_0007_14.2_18.9`. For a PDF page, it might be `page_004_region_12`. For an audio span, it might be `audio_000120_000150`.
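This makes writes idempotent: a retry with the same inputs produces the same ID and rewrites the same record instead of adding a duplicate, while a new source version, extractor version, or model produces a new ID instead of overwriting the old result.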
Idempotency is not only an API nicety. It is a correctness requirement for agent memory.
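The ID recipe above can be sketched with the standard library. This is one possible encoding, not a prescribed one: the choice of SHA-256, the JSON canonicalization, and the in-memory `store` dictionary standing in for an index are all assumptions made for illustration.

```python
import hashlib
import json

def make_document_id(namespace, source_uri, source_version,
                     feature_name, extractor_version, model_id, chunk_key):
    """Derive a stable document ID from the inputs that define the evidence."""
    # Encode the parts as a JSON list so field boundaries stay unambiguous;
    # plain concatenation could make ("ab", "c") collide with ("a", "bc").
    canonical = json.dumps(
        [namespace, source_uri, source_version, feature_name,
         extractor_version, model_id, chunk_key],
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Simulate the same write being retried three times against a keyed store.
store = {}
for attempt in range(3):
    doc_id = make_document_id(
        "support_calls", "s3://media/raw/vid_42.mp4", "etag:9f18...",
        "scene_caption", "scene_caption@v1", "Qwen/Qwen3-VL-8B-Instruct",
        "scene_0007_14.2_18.9",
    )
    store[doc_id] = {"attempt": attempt}  # upsert keyed by the deterministic ID
```

Because every retry lands on the same key, the store ends with one record. Changing any input, such as the extractor version, yields a different ID, so an upgraded extraction is written alongside the old one instead of over it.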
Failure Ledgers
Every stage should write a terminal record for every submitted unit.
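A failure record should include the object ID, the stage, the affected span, the status, an error class, a human-readable message, a retryable flag, the attempt number, and the worker ID.
Example: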
{
  "object_id": "vid_42",
  "stage": "transcription",
  "span": {"start_sec": 0, "end_sec": 1800},
  "status": "failed",
  "error_class": "GpuOutOfMemory",
  "message": "Audio chunk batch exceeded available memory",
  "retryable": true,
  "attempt": 2,
  "worker_id": "ray-job-7812"
}
Do not collapse this into "batch failed." Agents and operators need row-level recovery. If one 30-minute video failed transcription but 999 images indexed correctly, the system should preserve that detail.
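Row-level records make row-level recovery straightforward to plan. The sketch below assumes ledger rows shaped like the example above; the retry budget `MAX_ATTEMPTS` and the function name `plan_recovery` are hypothetical.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3  # hypothetical retry budget per unit

def plan_recovery(ledger_records):
    """Split failed ledger rows into retry and escalate queues, per stage."""
    plan = defaultdict(lambda: {"retry": [], "escalate": []})
    for rec in ledger_records:
        if rec["status"] != "failed":
            continue  # processed and skipped rows need no recovery
        can_retry = rec["retryable"] and rec["attempt"] < MAX_ATTEMPTS
        bucket = "retry" if can_retry else "escalate"
        plan[rec["stage"]][bucket].append(rec["object_id"])
    return dict(plan)

# Hypothetical ledger: one retryable failure, one permanent failure, one success.
records = [
    {"object_id": "vid_42", "stage": "transcription", "status": "failed",
     "retryable": True, "attempt": 2},
    {"object_id": "vid_43", "stage": "transcription", "status": "failed",
     "retryable": False, "attempt": 1},
    {"object_id": "img_01", "stage": "visual_embedding", "status": "processed",
     "retryable": False, "attempt": 1},
]
recovery_plan = plan_recovery(records)
```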
Backfills
A backfill repairs or upgrades an existing index without reprocessing everything from scratch.
Common backfill triggers:
A safe backfill has these properties:
1. It selects work from source truth, not from a possibly stale index.
2. It shards deterministically so the job can run in parallel.
3. It uses the same deterministic document IDs as normal ingestion.
4. It writes processed, failed, skipped, and lost accounting.
5. It can resume from a cursor.
6. It produces a before-and-after audit.
Backfills should not be a one-off script with print statements. They are part of the ingestion system.
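Two of the properties above, deterministic sharding and cursor-based resume, can be sketched in a few lines. This assumes object IDs are strings that sort in a stable order and that `process` is whatever callable performs the real extraction; both function names are hypothetical.

```python
import hashlib

def shard_of(object_id, num_shards):
    """Assign an object to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    digest is used to keep assignments identical across runs and workers.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def backfill_shard(source_ids, shard, num_shards, process, cursor=None):
    """Process one shard in sorted order, resuming after `cursor` if given.

    Returns a ledger (object ID -> terminal status) and the last ID handled.
    """
    ledger = {}
    for object_id in sorted(source_ids):
        if shard_of(object_id, num_shards) != shard:
            continue
        if cursor is not None and object_id <= cursor:
            continue  # already handled by a previous run
        try:
            process(object_id)
            ledger[object_id] = "processed"
        except Exception as exc:
            # Record the failure so the row is accounted for, not dropped.
            ledger[object_id] = f"failed:{type(exc).__name__}"
        cursor = object_id
    return ledger, cursor
```

Persisting the returned cursor after each run lets a restarted job call `backfill_shard` again with the same arguments and skip everything at or before that ID.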
Recall Checks
Ingestion correctness is not proven by worker success logs. Test whether known evidence is searchable.
Use several recall checks:
Count Invariants
Compare expected and indexed counts by stage:
expected transcript spans: 18,240
indexed transcript vectors: 18,240
payload records: 18,240
searchable records: 18,238
gap: 2
The gap matters even if it is small. The missing two spans may be the only evidence for a compliance claim.
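Counts show that a gap exists; comparing ID sets shows which records are missing. A minimal sketch, assuming each stage can list the document IDs it holds (`find_gaps` and the stage names are illustrative):

```python
def find_gaps(expected_ids, stage_ids):
    """Report, per stage, which expected document IDs are missing.

    `stage_ids` maps a stage name to the collection of document IDs
    present at that stage. Stages with no gap are omitted.
    """
    expected = set(expected_ids)
    gaps = {}
    for stage, present in stage_ids.items():
        missing = sorted(expected - set(present))
        if missing:
            gaps[stage] = missing
    return gaps

# Hypothetical stage listings: one span never became searchable.
expected = ["span_1", "span_2", "span_3"]
stages = {
    "vectors": ["span_1", "span_2", "span_3"],
    "payloads": ["span_1", "span_2", "span_3"],
    "searchable": ["span_1", "span_3"],
}
gaps_by_stage = find_gaps(expected, stages)
```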
Canary Objects
Keep a small set of known objects with known evidence:
After ingestion, query for those facts and assert they return the expected source and span.
Self-Recall
For every written vector, search for itself or a known paired query. The expected document should appear at rank 1 or within a strict top-k.
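Self-recall catches vector/payload divergence, such as a vector that was written while its payload never committed, or a payload whose vector never reached the index.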
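A self-recall check can be sketched against an in-memory index. The brute-force `search` below is a stand-in for a real vector search call, and the dictionaries `written`, `index`, and `payloads` are hypothetical stores used only to make the example runnable.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query, top_k):
    """Brute-force stand-in for a real vector search: return the top-k IDs."""
    ranked = sorted(index, key=lambda doc_id: cosine(index[doc_id], query), reverse=True)
    return ranked[:top_k]

def self_recall_failures(written, index, payloads, top_k=1):
    """Return IDs whose own vector does not retrieve them, or whose payload is missing."""
    failures = []
    for doc_id, vector in written.items():
        hits = search(index, vector, top_k)
        if doc_id not in hits or doc_id not in payloads:
            failures.append(doc_id)
    return failures

# Hypothetical state: "c" never reached the index, "b" has no payload.
written = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
index = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
payloads = {"a": {}, "c": {}}
failed_ids = self_recall_failures(written, index, payloads)
```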
Sentinel Queries
Write a small suite of natural-language queries that represent important agent tasks:
[
  {
    "query": "invoice page with ACME renewal table",
    "expected_document_id": "pdf_102:page:004:ocr:v1"
  },
  {
    "query": "customer says setup failed while the red indicator light is visible",
    "expected_document_id": "call_88:scene:0012:fused:v1"
  }
]
Run sentinel queries after deploys, backfills, model upgrades, and index rebuilds.
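A sentinel suite can be run by a small harness. The sketch assumes the retriever is exposed as a callable `search_fn(query, top_k)` that returns a list of document IDs; that signature and the `fake_search` stub are assumptions made for illustration, not a real client API.

```python
def run_sentinels(sentinels, search_fn, top_k=5):
    """Run each sentinel query and collect the ones that miss their expected document."""
    misses = []
    for case in sentinels:
        hit_ids = search_fn(case["query"], top_k)
        if case["expected_document_id"] not in hit_ids:
            misses.append(case["query"])
    return misses

sentinels = [
    {"query": "invoice page with ACME renewal table",
     "expected_document_id": "pdf_102:page:004:ocr:v1"},
    {"query": "customer says setup failed while the red indicator light is visible",
     "expected_document_id": "call_88:scene:0012:fused:v1"},
]

def fake_search(query, top_k):
    """Stub retriever that only knows about the invoice page."""
    known = {"invoice page with ACME renewal table": ["pdf_102:page:004:ocr:v1"]}
    return known.get(query, [])[:top_k]

missed_queries = run_sentinels(sentinels, fake_search)
```

Failing the deploy or backfill when `missed_queries` is non-empty turns the suite into a gate instead of a report.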
Freshness and Read-After-Write
Object-storage-backed vector systems often separate write durability from search readiness. A write can be committed to a log before all optimized indexes are built.
That is not inherently bad. It becomes a problem when the system hides the state.
Expose freshness in the search contract:
{
  "namespace": "support-videos",
  "write_committed_at": "2026-06-07T14:12:00Z",
  "indexed_through": "2026-06-07T14:11:52Z",
  "unindexed_bytes": 1839204,
  "strong_consistency_available": true
}
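An agent can use this signal to decide whether to search immediately, request a strongly consistent read, or wait for indexing to catch up.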
Freshness metadata prevents "I just uploaded it but search cannot find it" from turning into mystery behavior.
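One possible decision rule over the freshness payload above is sketched here. The three return values and the function name `choose_read_mode` are hypothetical; a real client would map them onto whatever consistency options its vector store offers.

```python
from datetime import datetime

def parse_ts(value):
    """Parse an ISO 8601 timestamp, normalizing a trailing "Z" to an offset."""
    # datetime.fromisoformat() rejects "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def choose_read_mode(freshness):
    """Decide how to query, given namespace freshness metadata."""
    lag = parse_ts(freshness["write_committed_at"]) - parse_ts(freshness["indexed_through"])
    caught_up = lag.total_seconds() <= 0 and freshness.get("unindexed_bytes", 0) == 0
    if caught_up:
        return "default"          # the index covers the latest committed write
    if freshness.get("strong_consistency_available"):
        return "strong"           # ask for a read that includes unindexed writes
    return "wait_and_retry"       # report staleness and try again later

freshness = {
    "namespace": "support-videos",
    "write_committed_at": "2026-06-07T14:12:00Z",
    "indexed_through": "2026-06-07T14:11:52Z",
    "unindexed_bytes": 1839204,
    "strong_consistency_available": True,
}
mode = choose_read_mode(freshness)
```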
Diagnostics for Agents
Agents need machine-readable failure explanations, not just HTTP errors.
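A good retrieval or ingestion tool should be able to say how much work was submitted, how much was processed, failed, skipped, or lost, whether the results are ready to search, and which failure classes dominate.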
Example diagnostic payload:
{
  "status": "completed_with_errors",
  "coverage": {
    "submitted": 1000,
    "processed": 982,
    "failed": 16,
    "skipped": 2,
    "lost": 0
  },
  "search_readiness": {
    "vectors_indexed": 982,
    "payloads_indexed": 982,
    "fresh": true
  },
  "top_failure_classes": [
    {"class": "ObjectNotFound", "count": 9},
    {"class": "GpuOutOfMemory", "count": 7}
  ]
}
The agent can route this into recovery:
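As a rough sketch of that routing, the function below turns a diagnostic payload into a list of next steps. The action names and the `RECOVERY_ACTIONS` mapping are hypothetical; a real system would define its own actions per failure class.

```python
# Hypothetical mapping from failure class to recovery action.
RECOVERY_ACTIONS = {
    "ObjectNotFound": "verify_source_uri",
    "GpuOutOfMemory": "retry_with_smaller_batch",
}

def route_recovery(diagnostic):
    """Translate a diagnostic payload into an ordered list of recovery actions."""
    actions = []
    if diagnostic["coverage"]["lost"] > 0:
        actions.append("escalate_lost_rows")  # lost rows are a correctness bug
    for failure in diagnostic.get("top_failure_classes", []):
        actions.append(RECOVERY_ACTIONS.get(failure["class"], "escalate_unknown_failure"))
    if not diagnostic["search_readiness"]["fresh"]:
        actions.append("wait_for_index")
    return actions

diagnostic = {
    "status": "completed_with_errors",
    "coverage": {"submitted": 1000, "processed": 982, "failed": 16, "skipped": 2, "lost": 0},
    "search_readiness": {"vectors_indexed": 982, "payloads_indexed": 982, "fresh": True},
    "top_failure_classes": [
        {"class": "ObjectNotFound", "count": 9},
        {"class": "GpuOutOfMemory", "count": 7},
    ],
}
next_steps = route_recovery(diagnostic)
```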
Agent-Facing Tool Contract
If an agent uses retrieval as a tool, the tool should return evidence and reliability state.
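Minimal fields: the query, ranked results (each with a document ID, source URI, timestamp span, matched stages, and a short explanation of the match), per-stage coverage, and index freshness.
Example: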
{
  "query": "find the clip where the setup failed",
  "results": [
    {
      "document_id": "call_88:scene:0012:fused:v1",
      "source_uri": "s3://calls/call_88.mp4",
      "timestamp": {"start": 122.4, "end": 135.9},
      "matched_stages": ["transcription", "visual_caption"],
      "why": "Transcript says setup failed and scene caption shows red indicator"
    }
  ],
  "coverage": {
    "transcription": "complete",
    "scene_caption": "complete",
    "ocr": "skipped_no_text_regions"
  },
  "freshness": "fresh"
}
This prevents the agent from treating partial evidence as complete evidence.
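On the agent side, the coverage and freshness fields can be reduced to a list of caveats to attach to the answer. A minimal sketch, assuming responses shaped like the example above (`coverage_caveats` is a hypothetical helper name):

```python
def coverage_caveats(response):
    """List the reasons the agent should not treat the results as complete."""
    caveats = [
        f"{stage}: {state}"
        for stage, state in response.get("coverage", {}).items()
        if state != "complete"
    ]
    if response.get("freshness") != "fresh":
        caveats.append(f"freshness: {response.get('freshness')}")
    return caveats

response = {
    "query": "find the clip where the setup failed",
    "results": [],
    "coverage": {
        "transcription": "complete",
        "scene_caption": "complete",
        "ocr": "skipped_no_text_regions",
    },
    "freshness": "fresh",
}
caveats = coverage_caveats(response)
```

An empty list means every stage reported complete coverage over a fresh index; anything else should be surfaced in the agent's answer.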
How This Maps to Mixpeek and MVS
Mixpeek's managed ingestion layer decomposes raw objects into features. MVS stores and searches the vector layer on object storage. The reliability pattern is the same whether you use managed extraction or bring your own vectors.
For managed ingestion, track per-batch accounting:
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Submit the batch with an idempotency key so a replayed submission is recognizable.
batch = mx.ingest.videos(
    collection="support_calls",
    source={"type": "s3", "bucket": "support-media", "prefix": "calls/"},
    pipeline={
        "transcription": {"model": "openai/whisper-large-v3"},
        "scene_caption": {"model": "Qwen/Qwen3-VL-8B-Instruct"},
        "visual_embedding": {"model": "Qwen/Qwen3-VL-Embedding-2B"},
    },
    idempotency_key="support_calls_2026_06_07",
)

# Every submitted unit must land in exactly one terminal bucket.
audit = mx.batches.audit(batch.id)
assert audit["submitted"] == (
    audit["processed"] + audit["failed"] + audit["skipped"] + audit["lost"]
)
# Lost rows mean silent data loss, so fail the run.
assert audit["lost"] == 0
For MVS standalone, make document IDs deterministic and run self-recall checks after upsert:
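from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Deterministic ID: the same scene, feature, and version always map to this key.
doc_id = "call_88:scene:0012:qwen3vl_embedding:v1"

# `embedding` is assumed to be the precomputed vector for this scene.
mx.mvs.upsert(
    namespace="support_calls",
    vectors=[{
        "id": doc_id,
        "values": embedding,
        "metadata": {
            "source_uri": "s3://support-media/calls/call_88.mp4",
            "start_sec": 122.4,
            "end_sec": 135.9,
            "feature": "visual_embedding",
            "model_id": "Qwen/Qwen3-VL-Embedding-2B",
        },
    }],
)

# Self-recall: searching with the vector just written should return it first.
hits = mx.mvs.search(
    namespace="support_calls",
    vector=embedding,
    top_k=5,
    consistency="strong",
)
# Check for an empty result first so a miss fails the assert, not an IndexError.
assert hits and hits[0]["id"] == doc_id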
The implementation details will vary by SDK version, but the reliability requirements stay stable: deterministic IDs, complete ledgers, backfills, recall checks, and freshness metadata.
Design Checklist
Key Takeaways
1. Agent perception depends on ingestion correctness. Ranking cannot recover evidence that was never indexed.
2. Multimodal ingestion needs per-stage accounting because every modality can fail independently.
3. Use the invariant submitted = processed + failed + skipped + lost. Treat lost rows as silent data loss.
4. Deterministic IDs make retries, backfills, and model upgrades safe.
5. Recall checks should search for known evidence, not just inspect worker logs.
6. Object-storage vector systems should expose freshness so agents know whether new writes are searchable.
7. Agent retrieval tools should return coverage and diagnostics alongside results.