Why Multimodal RAG Is Hard
Text-only RAG is well understood: chunk documents, embed them, retrieve the top-k, feed them to an LLM. But most enterprise data is not text. It is video recordings, product images, audio calls, PDFs with diagrams, and slide decks with screenshots. A text-only RAG pipeline ignores most of the information an organization actually has.
The challenge with multimodal RAG is not conceptual. The retrieval-augmented generation pattern applies regardless of modality. The challenge is operational: each modality needs its own ingestion, feature extraction, and indexing path, and those paths must converge into a single retrieval pipeline.
This guide walks through each stage of a multimodal RAG pipeline, from ingestion to generation, with concrete implementation patterns.
The Five Stages of Multimodal RAG
Every multimodal RAG system, regardless of scale, follows five stages:
1. Ingest -- Get raw files into the system and normalize them
2. Perceive -- Extract features, embeddings, and metadata from each modality
3. Index -- Store extracted representations for fast retrieval
4. Retrieve -- Find the most relevant pieces given a query
5. Generate -- Synthesize a response using retrieved context
The rest of this guide covers each stage in detail.
Stage 1: Ingestion and Normalization
Ingestion is where most teams underestimate complexity. Raw media files come in dozens of formats, codecs, resolutions, and encodings. A production pipeline needs to handle all of them without manual intervention.
File Type Detection
Do not trust file extensions. A file named `report.pdf` might be a scanned image masquerading as a PDF. A `.mp4` might use an unsupported codec. Always detect the actual file type from the binary header:
import magic
mime = magic.from_file("upload.pdf", mime=True)
# "application/pdf" -- genuine PDF
# "image/jpeg" -- scanned image with wrong extension
Modality Routing
Once you know the true file type, route to the appropriate processing pipeline:
| MIME Type Pattern | Modality | Processing Path |
| --- | --- | --- |
| video/* | Video | Temporal segmentation + frame extraction + audio extraction |
| image/* | Image | Region detection + captioning + OCR |
| audio/* | Audio | Transcription + speaker diarization + audio fingerprinting |
| application/pdf | Document | Page extraction + layout analysis + table detection |
| text/* | Text | Chunking + entity extraction |
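The routing table above can be expressed as a small dispatcher. A minimal sketch; the pipeline names are illustrative, not part of any SDK:

```python
def route_modality(mime_type):
    """Map a detected MIME type to a processing pipeline name.
    Pipeline names here are illustrative, not a fixed API."""
    if mime_type == "application/pdf":
        return "document"
    prefix = mime_type.split("/", 1)[0]
    routes = {"video": "video", "image": "image", "audio": "audio", "text": "text"}
    # Fall back to the document path for unknown types rather than dropping the file
    return routes.get(prefix, "document")
```

Check the PDF case before splitting on the prefix, since `application/*` covers many unrelated formats.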
Temporal Segmentation for Video
Video is the most complex modality because it contains multiple signals: visual frames, audio, speech, on-screen text, and motion. The first step is breaking a long video into semantically coherent segments.
Scene-based segmentation detects visual transitions (cuts, fades, dissolves) and splits the video at those boundaries. This works well for edited content like movies, commercials, and news broadcasts.
Fixed-window segmentation splits the video into equal-length chunks (e.g., 10-second windows with 2-second overlap). This is simpler and works for surveillance footage, webcam recordings, and other unedited content.
Speech-based segmentation uses voice activity detection and speaker diarization to split at natural pause points. This is ideal for meetings, interviews, and podcasts where the visual track is secondary.
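Of the three strategies, fixed-window segmentation is simple enough to sketch directly. A minimal illustration (the parameter names are ours):

```python
def fixed_windows(duration_s, window_s=10.0, overlap_s=2.0):
    """Yield (start, end) boundaries for overlapping fixed-length windows.

    Consecutive windows advance by (window_s - overlap_s), so each
    window shares overlap_s seconds with the previous one.
    """
    step = window_s - overlap_s
    start = 0.0
    while start < duration_s:
        # Clamp the final window to the end of the media
        yield (start, min(start + window_s, duration_s))
        start += step
```

The overlap ensures that speech or action straddling a boundary is fully contained in at least one window.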
# Mixpeek handles segmentation automatically based on collection config
collection = client.collections.create(
namespace_id=ns.namespace_id,
collection_name="video-archive",
feature_extractors=[{
"extractor_type": "multimodal_embedding",
"model": "mixpeek-embed-v2",
"config": {
"chunk_strategy": "scene",
"chunk_duration_seconds": 15,
"chunk_overlap_seconds": 2
}
}]
)
Stage 2: Feature Extraction (Perceive)
Feature extraction converts raw media into searchable representations. Each modality produces different types of features:
Visual Features
Frame and region embeddings, object and face detection, scene classification, and OCR of on-screen text.
Audio Features
Speech transcripts, speaker diarization labels, and audio fingerprints for matching non-speech content.
Document Features
Extracted text with layout structure preserved, detected tables, and captions for embedded figures.
The Embedding Alignment Problem
When you embed a video frame with CLIP and a transcript chunk with a text embedding model, the resulting vectors live in different spaces. A cosine similarity between them is meaningless.
Three approaches to solving this:
Shared-space models like CLIP, ImageBind, and Mixpeek's multimodal embeddings project all modalities into a single vector space. A text query and an image live in the same space, so cross-modal retrieval works directly.
Late fusion retrieves from each modality independently and combines results at the ranking stage. You search the video index, the transcript index, and the document index separately, then merge the result lists using reciprocal rank fusion or a learned re-ranker.
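Reciprocal rank fusion itself is only a few lines. A sketch, using the conventional constant k = 60:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of document IDs from independent retrievers.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, so items ranked highly by multiple retrievers win.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF operates on ranks rather than raw scores, it needs no calibration between the video, transcript, and document indexes.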
Cross-encoders take a query and a candidate (from any modality) and produce a relevance score directly. These are more accurate than bi-encoders but too slow for first-stage retrieval. Use them as re-rankers on the top-k results from a faster first stage.
# Multi-stage retrieval with Mixpeek: embed + rerank
retriever = client.retrievers.create(
namespace_id=ns.namespace_id,
retriever_name="multimodal-search",
stages=[
{
"stage_type": "embedding",
"model": "mixpeek-embed-v2",
"limit": 100
},
{
"stage_type": "rerank",
"model": "mixpeek-rerank-v1",
"limit": 10
}
]
)
Stage 3: Indexing
Indexing is where extracted features become searchable. The storage layer must support fast approximate nearest-neighbor search, metadata filtering alongside vector similarity, and enough payload storage to link every vector back to its source media.
Namespace Design
A namespace is a logical container for related vectors. Design your namespaces around query patterns, not organizational hierarchy:
By use case: `brand-safety-assets`, `product-catalog`, `support-recordings`
By modality: `video-frames`, `transcripts`, `documents` (useful when different modalities need different embedding models)
By tenant: `tenant-acme`, `tenant-globex` (required for multi-tenant SaaS applications)
Avoid mixing unrelated data in a single namespace. Retrieval quality degrades when the index contains semantically diverse content because the nearest neighbors become less meaningful.
Storage Tiering
Not all vectors need to be in hot storage. A production system should tier data by access frequency:
| Tier | Storage | Latency | Cost | Use Case |
| --- | --- | --- | --- | --- |
| Hot | In-memory vector DB (Qdrant, Pinecone) | <10ms | $$ | Active search workloads |
| Warm | Disk-backed vector DB or S3 Vectors | 50-200ms | $ | Archives, infrequent queries |
| Cold | Object storage (S3) with on-demand loading | 1-5s | $ | Compliance retention, backup |
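A simple age-based policy can decide which tier a vector belongs in. The thresholds below are illustrative, not recommendations; tune them to your actual access patterns:

```python
from datetime import datetime, timedelta, timezone

def storage_tier(last_accessed, now=None):
    """Pick a storage tier from last-access recency.

    Thresholds (30 / 180 days) are illustrative defaults only.
    """
    now = now or datetime.now(timezone.utc)
    age = now - last_accessed
    if age < timedelta(days=30):
        return "hot"
    if age < timedelta(days=180):
        return "warm"
    return "cold"
```

A background job can run this over the index periodically and migrate vectors whose tier assignment has changed.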
Stage 4: Retrieval
Retrieval is the most nuanced stage. A naive "embed the query, find the nearest vectors" approach works for demos but fails in production for several reasons.
Why Single-Stage Retrieval Fails
1. Vocabulary mismatch. A user searching for "red sedan" will miss a video tagged with "crimson car." Semantic embeddings help but do not eliminate this entirely.
2. Modality mismatch. A text query cannot directly match against audio fingerprints or object detection bounding boxes.
3. Precision vs. recall tradeoff. Embedding search optimizes for recall (finding anything remotely relevant). Production use cases often need precision (finding exactly the right thing).
Multi-Stage Retrieval
The solution is a pipeline of retrieval stages, each refining the results of the previous one:
Stage 1: Broad recall. Use embedding search with a generous limit (100-500 candidates). This casts a wide net and ensures you do not miss relevant results.
Stage 2: Metadata filtering. Apply hard filters: date range, file type, source bucket, content labels, compliance flags. This eliminates candidates that are semantically similar but contextually irrelevant.
Stage 3: Re-ranking. Use a cross-encoder or learned re-ranker to score the remaining candidates against the query with higher fidelity. Cross-encoders attend to fine-grained interactions between query and document that bi-encoders miss.
Stage 4: Deduplication. Remove near-duplicate results (common when the same content appears in multiple formats or when overlapping video segments match).
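The deduplication stage can be a greedy pass that drops any candidate too similar to one already kept. A sketch, assuming each result arrives score-ordered and carries its embedding vector:

```python
def dedupe(results, threshold=0.95):
    """Greedy near-duplicate removal over (doc_id, vector) pairs.

    Results must be ordered best-first; a candidate is kept only if
    its cosine similarity to every kept result is below threshold.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    kept = []
    for doc_id, vec in results:
        if all(cos(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((doc_id, vec))
    return [doc_id for doc_id, _ in kept]
```

Greedy deduplication is O(n * k) in the kept-set size, which is fine after re-ranking has cut candidates to the top handful.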
# Full multi-stage retrieval pipeline
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
query="person explaining quarterly revenue growth",
filters={
"file_type": {"$in": ["video/mp4", "video/webm"]},
"created_at": {"$gte": "2026-01-01"}
}
)
for doc in results.documents:
print(f"{doc.score:.3f} | {doc.metadata['source_file']}")
print(f" Segment: {doc.metadata.get('start_time', 'N/A')}s")
print(f" Transcript: {doc.content[:200]}")
Hybrid Search
Combine vector similarity with keyword matching for the best of both worlds. This catches exact terminology (product names, model numbers, legal terms) that embeddings might conflate with semantically similar but incorrect matches.
# Hybrid retrieval: semantic + keyword
retriever = client.retrievers.create(
namespace_id=ns.namespace_id,
retriever_name="hybrid-search",
stages=[
{
"stage_type": "hybrid",
"semantic_weight": 0.7,
"keyword_weight": 0.3,
"limit": 50
},
{
"stage_type": "rerank",
"model": "mixpeek-rerank-v1",
"limit": 10
}
]
)
Stage 5: Generation
The final stage feeds retrieved context to an LLM to produce a response. The key decisions here are context formatting and prompt construction.
Context Formatting
LLMs process text. Multimodal context must be serialized into a format the model can reason over:
For video segments: Include the transcript text, a natural-language description of the visual content, detected objects and faces, and the timestamp range. Do not pass raw frames unless you are using a vision-language model with sufficient context window.
For images: Include the caption, OCR text, detected objects with bounding boxes (as text descriptions), and any relevant EXIF metadata.
For audio: Include the transcript with speaker labels and timestamps. Note any significant non-speech sounds.
For documents: Include the extracted text with section headings preserved. For tables, use markdown table format. For figures, include the generated caption.
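Putting the video guidance into practice, one way to serialize a retrieved segment into LLM-readable text (the field names are hypothetical, matching no particular schema):

```python
def format_video_context(segment):
    """Serialize one retrieved video segment into plain text for an LLM.

    Field names ('start_s', 'visual_description', etc.) are
    illustrative; adapt them to your extraction schema.
    """
    lines = [
        f"[Video segment {segment['start_s']}s-{segment['end_s']}s "
        f"from {segment['source_file']}]",
        f"Visual: {segment['visual_description']}",
        f"Objects: {', '.join(segment['objects'])}",
        f"Transcript: {segment['transcript']}",
    ]
    return "\n".join(lines)
```

Keeping the timestamp range in the header line is what later makes source attribution possible in the generated answer.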
Context Window Management
A single video can produce thousands of transcript words and hundreds of keyframe descriptions. You cannot pass all of this to the LLM. Strategies for fitting within the context window:
1. Rank and truncate. Only include the top-k most relevant chunks. Simple but loses context.
2. Summarize then retrieve. Pre-compute summaries at multiple granularities (segment, scene, full video) and retrieve at the appropriate level.
3. Hierarchical context. Include a high-level summary of all retrieved documents, plus full detail for the top 3-5.
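Rank-and-truncate, the first strategy, is straightforward to sketch. The words-to-tokens factor of 1.3 is a rough rule of thumb, not an exact count:

```python
def fit_context(chunks, max_tokens=4000):
    """Rank-and-truncate: keep the highest-scoring chunks that fit a budget.

    Token counts are approximated as word count * 1.3 (a rough
    heuristic; use a real tokenizer in production).
    """
    budget = max_tokens
    kept = []
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = int(len(chunk["text"].split()) * 1.3)
        if cost <= budget:
            kept.append(chunk)
            budget -= cost
    return kept
```

Note that an oversized chunk is skipped rather than truncated mid-sentence, so a lower-scoring but smaller chunk can still make it in.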
Grounding and Attribution
Always include source references in the generated output. Users need to verify claims, and downstream systems need to link back to the original media:
Based on the Q3 earnings call recording (2026-07-15, timestamp 12:34-13:01),
the CFO stated that revenue grew 23% year-over-year, driven primarily by
enterprise contract expansion.
Sources:
q3-earnings-call.mp4 [12:34-13:01] (transcript match, score: 0.94)
q3-earnings-deck.pdf [page 7] (table match, score: 0.89)
Production Deployment Patterns
Pattern 1: Batch Ingestion + Real-Time Retrieval
The most common pattern. Files are ingested and processed in batch (hourly, daily, or on-upload), while retrieval and generation happen in real time.
Best for: Media asset management, video archives, document search, knowledge bases.
Architecture:
1. Files uploaded to object storage (S3, GCS)
2. Upload triggers processing pipeline (feature extraction, embedding, indexing)
3. User queries hit a retrieval API that searches the pre-built index
4. Retrieved context is passed to an LLM for generation
Pattern 2: Streaming Ingestion + Real-Time Retrieval
Content is processed as it arrives, with near-zero delay between ingestion and searchability.
Best for: Content moderation, live event monitoring, social media analysis.
Architecture:
1. Media stream (live video, social feed) is segmented in real time
2. Each segment is processed immediately (feature extraction + embedding)
3. Vectors are indexed with sub-second latency
4. Monitoring queries run continuously against the growing index
Pattern 3: Agent-Driven Retrieval
An AI agent decides what to search for, how to refine the query, and when it has enough context to answer. The retrieval pipeline is exposed as a tool the agent can call.
Best for: Complex research tasks, multi-step reasoning, autonomous workflows.
Architecture:
1. Agent receives a task (e.g., "Find all instances of our logo being used incorrectly")
2. Agent formulates an initial query and calls the retrieval tool
3. Agent examines results, refines the query, and retrieves again
4. Agent synthesizes findings into a report
Mixpeek's MCP server exposes retrieval as a tool that any MCP-compatible agent can call, making this pattern straightforward to implement.
Common Mistakes
Embedding everything with the same model. Different modalities benefit from specialized models. Using CLIP for text-heavy documents or a text encoder for product images leaves performance on the table.
Skipping the re-ranking stage. Bi-encoder retrieval is fast but approximate. A re-ranker consistently improves precision by 15-30% on multimodal benchmarks. The latency cost (50-100ms for 100 candidates) is worth it for nearly every production use case.
Ignoring chunk boundaries. A video segment that starts mid-sentence or an image crop that cuts off a product label produces low-quality features. Invest in intelligent segmentation.
Not retaining raw source data. If you only store embeddings, you cannot re-embed when better models become available. Always keep the original files alongside the vectors. See the embedding portability guide for migration strategies.
Treating all queries the same. A keyword-style query ("invoice Q3 2025") and a semantic query ("someone explaining why revenue dropped") require different retrieval strategies. Use query classification to route to the appropriate pipeline.
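Query classification can start as a crude heuristic before graduating to a trained classifier. An illustrative sketch (the rules here are assumptions; tune them on real traffic):

```python
import re

def classify_query(query):
    """Route a query to keyword or semantic search with a crude heuristic.

    Short queries dominated by identifier-like tokens (digits,
    all-caps codes) tend to be exact lookups; longer conversational
    queries are better served semantically. Illustrative only.
    """
    tokens = query.split()
    id_like = sum(bool(re.search(r"\d", t) or t.isupper()) for t in tokens)
    if len(tokens) <= 4 and id_like >= 1:
        return "keyword"
    return "semantic"
```

The routed label can then select between a keyword-heavy hybrid retriever and a purely semantic one.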
Measuring RAG Quality
You cannot improve what you do not measure. Three metrics matter for multimodal RAG:
Retrieval recall at k: Of the relevant documents in the corpus, what fraction appears in the top-k retrieved results? Measure this with a golden test set of queries with known relevant documents.
Answer faithfulness: Does the generated answer only contain claims supported by the retrieved context? Unfaithful answers (hallucinations) are the primary failure mode of RAG systems.
End-to-end latency: Time from query submission to response delivery. Break this down by stage (embedding, retrieval, re-ranking, generation) to identify bottlenecks.
# Build a test harness
test_queries = [
{
"query": "product recall announcement 2025",
"expected_doc_ids": ["vid-8832", "doc-1204"],
},
{
"query": "warehouse safety incident",
"expected_doc_ids": ["vid-2291", "vid-2292", "doc-0887"],
},
]
for test in test_queries:
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
query=test["query"],
limit=20
)
retrieved_ids = [d.document_id for d in results.documents]
recall = len(set(test["expected_doc_ids"]) & set(retrieved_ids)) / len(test["expected_doc_ids"])
print(f"Query: {test['query']} | Recall@20: {recall:.0%}")
