Architecture
    25 min read
    Updated 2026-04-13

    How to Build a Multimodal RAG Pipeline

    A practical guide to retrieval-augmented generation across video, images, audio, and documents. Covers chunking strategies, embedding selection, retriever design, and production deployment patterns.

    RAG
    Multimodal
    Architecture
    Retrieval

    Why Multimodal RAG Is Hard



    Text-only RAG is well understood: chunk documents, embed them, retrieve the top-k, feed them to an LLM. But most enterprise data is not text. It is video recordings, product images, audio calls, PDFs with diagrams, and slide decks with screenshots. A text-only RAG pipeline ignores 80% of the information an organization actually has.

    The challenge with multimodal RAG is not conceptual. The retrieval-augmented generation pattern applies regardless of modality. The challenge is operational:

  1. Different modalities require different chunking strategies. A 45-minute video needs temporal segmentation. A product image needs region-of-interest detection. A podcast needs speaker-diarized transcript segments. You cannot apply a single text splitter to all of these.
  2. Embedding spaces are modality-specific. A CLIP embedding for an image frame lives in a different vector space than a Whisper-derived text embedding from the same video's audio track. Retrieval across these spaces requires alignment or multi-stage fusion.
  3. Context windows have hard limits. You cannot pass a raw video file to an LLM. You need to extract the right signals (keyframes, transcript segments, detected objects) and present them as structured context that fits within the model's input budget.
  4. Latency budgets vary by use case. A compliance scan that runs overnight can afford exhaustive multi-pass retrieval. A customer-facing search must return results in under 500ms. The same pipeline architecture does not serve both.


    This guide walks through each stage of a multimodal RAG pipeline, from ingestion to generation, with concrete implementation patterns.

    The Five Stages of Multimodal RAG



    Every multimodal RAG system, regardless of scale, follows five stages:

  1. Ingest -- Get raw files into the system and normalize them
  2. Perceive -- Extract features, embeddings, and metadata from each modality
  3. Index -- Store extracted representations for fast retrieval
  4. Retrieve -- Find the most relevant pieces given a query
  5. Generate -- Synthesize a response using retrieved context
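    The five stages can be sketched as a simple function pipeline. Everything here is an illustrative stub (toy embeddings, a list as the index), not a real SDK:

```python
# Five-stage multimodal RAG skeleton. All functions are illustrative stubs.

def ingest(path: str) -> dict:
    # Normalize a raw file into a uniform record (MIME type hardcoded here)
    return {"path": path, "mime": "video/mp4"}

def perceive(record: dict) -> dict:
    # Extract features; the embedding is a placeholder vector
    record["embedding"] = [0.1, 0.2, 0.3]
    return record

INDEX: list[dict] = []

def index(record: dict) -> None:
    INDEX.append(record)

def retrieve(query_vec: list[float], k: int = 1) -> list[dict]:
    # Rank by a toy similarity: negative sum of absolute differences
    def score(r: dict) -> float:
        return -sum(abs(a - b) for a, b in zip(r["embedding"], query_vec))
    return sorted(INDEX, key=score, reverse=True)[:k]

def generate(query: str, context: list[dict]) -> str:
    # A real system would call an LLM; here we just cite the sources
    sources = ", ".join(r["path"] for r in context)
    return f"Answer to '{query}' grounded in: {sources}"

index(perceive(ingest("q3-call.mp4")))
print(generate("revenue growth", retrieve([0.1, 0.2, 0.3])))
```

    The point is the shape, not the internals: each stage consumes the previous stage's output, so each can be swapped independently.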

    The rest of this guide covers each stage in detail.

    Stage 1: Ingestion and Normalization



    Ingestion is where most teams underestimate complexity. Raw media files come in dozens of formats, codecs, resolutions, and encodings. A production pipeline needs to handle all of them without manual intervention.

    File Type Detection



    Do not trust file extensions. A file named `report.pdf` might be a scanned image masquerading as a PDF. A `.mp4` might use an unsupported codec. Always detect the actual file type from the binary header:

    import magic

    mime = magic.from_file("upload.pdf", mime=True)
    # "application/pdf" -- genuine PDF
    # "image/jpeg"      -- scanned image with wrong extension


    Modality Routing



    Once you know the true file type, route to the appropriate processing pipeline:

    | MIME Type Pattern | Modality | Processing Path |
    |---|---|---|
    | video/* | Video | Temporal segmentation + frame extraction + audio extraction |
    | image/* | Image | Region detection + captioning + OCR |
    | audio/* | Audio | Transcription + speaker diarization + audio fingerprinting |
    | application/pdf | Document | Page extraction + layout analysis + table detection |
    | text/* | Text | Chunking + entity extraction |
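    The routing table reduces to a prefix match on the detected MIME type. A minimal sketch (handler names are illustrative):

```python
# Map detected MIME types to a processing modality via prefix match.
ROUTES = [
    ("video/", "video"),
    ("image/", "image"),
    ("audio/", "audio"),
    ("application/pdf", "document"),
    ("text/", "text"),
]

def route(mime: str) -> str:
    for prefix, modality in ROUTES:
        if mime.startswith(prefix):
            return modality
    raise ValueError(f"unsupported MIME type: {mime}")

print(route("video/mp4"))        # video
print(route("application/pdf"))  # document
```

    Crucially, `mime` should come from binary header detection (as above), never from the file extension.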

    Temporal Segmentation for Video



    Video is the most complex modality because it contains multiple signals: visual frames, audio, speech, on-screen text, and motion. The first step is breaking a long video into semantically coherent segments.

    Scene-based segmentation detects visual transitions (cuts, fades, dissolves) and splits the video at those boundaries. This works well for edited content like movies, commercials, and news broadcasts.

    Fixed-window segmentation splits the video into equal-length chunks (e.g., 10-second windows with 2-second overlap). This is simpler and works for surveillance footage, webcam recordings, and other unedited content.

    Speech-based segmentation uses voice activity detection and speaker diarization to split at natural pause points. This is ideal for meetings, interviews, and podcasts where the visual track is secondary.
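    Fixed-window segmentation is plain arithmetic and needs no model. A sketch of computing the (start, end) schedule, independent of any SDK:

```python
def fixed_windows(duration_s: float, window_s: float = 10.0, overlap_s: float = 2.0):
    """Yield (start, end) pairs covering a video with overlapping windows."""
    assert overlap_s < window_s, "overlap must be smaller than the window"
    step = window_s - overlap_s
    start = 0.0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        start += step

segments = list(fixed_windows(25.0))
# [(0.0, 10.0), (8.0, 18.0), (16.0, 25.0), (24.0, 25.0)]
```

    The 2-second overlap means content near a boundary appears in two segments, so a match is never lost to an unlucky cut; the cost is a small amount of duplicate indexing.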

    # Mixpeek handles segmentation automatically based on collection config
    collection = client.collections.create(
        namespace_id=ns.namespace_id,
        collection_name="video-archive",
        feature_extractors=[{
            "extractor_type": "multimodal_embedding",
            "model": "mixpeek-embed-v2",
            "config": {
                "chunk_strategy": "scene",
                "chunk_duration_seconds": 15,
                "chunk_overlap_seconds": 2
            }
        }]
    )
    


    Stage 2: Feature Extraction (Perceive)



    Feature extraction converts raw media into searchable representations. Each modality produces different types of features:

    Visual Features



  - Dense embeddings -- A single vector (typically 512-1024 dimensions) that captures the semantic content of an image or video frame. Models like CLIP, SigLIP, and EVA-CLIP produce these.
  - Object detection -- Bounding boxes and labels for objects in the frame. YOLO, DINO, and Grounding DINO are common choices.
  - OCR -- Text detected in the image. Critical for slides, documents, product labels, and street signs.
  - Face detection -- Location and identity of faces. Required for brand safety, compliance, and media asset management.


    Audio Features



  - Transcription -- Speech-to-text output, ideally with timestamps and speaker labels. Whisper and its variants dominate here.
  - Audio fingerprinting -- A compact signature that identifies a specific audio recording regardless of format or bitrate. Used for music identification and content deduplication.
  - Audio classification -- Labels for non-speech sounds: laughter, applause, music, silence, background noise.


    Document Features



  - Layout analysis -- Detecting headers, paragraphs, tables, figures, and captions within a page. This preserves document structure that naive text extraction destroys.
  - Table extraction -- Converting visual tables into structured data (rows and columns) that an LLM can reason over.
  - Figure captioning -- Generating text descriptions of charts, diagrams, and photographs embedded in documents.


    The Embedding Alignment Problem



    When you embed a video frame with CLIP and a transcript chunk with a text embedding model, the resulting vectors live in different spaces. A cosine similarity between them is meaningless.

    Three approaches to solving this:

    Shared-space models like CLIP, ImageBind, and Mixpeek's multimodal embeddings project all modalities into a single vector space. A text query and an image live in the same space, so cross-modal retrieval works directly.

    Late fusion retrieves from each modality independently and combines results at the ranking stage. You search the video index, the transcript index, and the document index separately, then merge the result lists using reciprocal rank fusion or a learned re-ranker.

    Cross-encoders take a query and a candidate (from any modality) and produce a relevance score directly. These are more accurate than bi-encoders but too slow for first-stage retrieval. Use them as re-rankers on the top-k results from a faster first stage.
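    Late fusion is often implemented with reciprocal rank fusion, which needs only the per-modality rank positions, not comparable scores. A minimal sketch (document IDs are made up):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists from independent indexes.
    Each occurrence contributes 1 / (k + rank); k=60 is the commonly used constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

video_hits = ["vid-1", "vid-7", "vid-3"]       # from the frame-embedding index
transcript_hits = ["vid-7", "doc-2", "vid-1"]  # from the text index
print(reciprocal_rank_fusion([video_hits, transcript_hits]))
```

    Because RRF ignores raw scores, it sidesteps the incomparable-vector-spaces problem entirely: a CLIP distance and a BM25 score never have to be put on the same scale.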

    # Multi-stage retrieval with Mixpeek: embed + rerank
    retriever = client.retrievers.create(
        namespace_id=ns.namespace_id,
        retriever_name="multimodal-search",
        stages=[
            {
                "stage_type": "embedding",
                "model": "mixpeek-embed-v2",
                "limit": 100
            },
            {
                "stage_type": "rerank",
                "model": "mixpeek-rerank-v1",
                "limit": 10
            }
        ]
    )
    


    Stage 3: Indexing



    Indexing is where extracted features become searchable. The storage layer must support:

  - Vector search for embedding-based retrieval (approximate nearest neighbor)
  - Full-text search for keyword matching on transcripts and extracted text
  - Metadata filtering for narrowing by date, file type, source, labels, or any extracted attribute
  - Hybrid queries that combine all three in a single request


    Namespace Design



    A namespace is a logical container for related vectors. Design your namespaces around query patterns, not organizational hierarchy:

    By use case: `brand-safety-assets`, `product-catalog`, `support-recordings`

    By modality: `video-frames`, `transcripts`, `documents` (useful when different modalities need different embedding models)

    By tenant: `tenant-acme`, `tenant-globex` (required for multi-tenant SaaS applications)

    Avoid mixing unrelated data in a single namespace. Retrieval quality degrades when the index contains semantically diverse content because the nearest neighbors become less meaningful.

    Storage Tiering



    Not all vectors need to be in hot storage. A production system should tier data by access frequency:

    | Tier | Storage | Latency | Cost | Use Case |
    |---|---|---|---|---|
    | Hot | In-memory vector DB (Qdrant, Pinecone) | <10ms | $$ | Active search workloads |
    | Warm | Disk-backed vector DB or S3 Vectors | 50-200ms | $ | Archives, infrequent queries |
    | Cold | Object storage (S3) with on-demand loading | 1-5s | $ | Compliance retention, backup |
    Mixpeek manages tiering automatically: active namespaces stay in the hot tier, and data can be moved to warm or cold storage based on configurable policies. See the vector storage tiering guide for details.
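    If you roll your own tiering, an age-based policy is the usual starting point. A sketch with illustrative thresholds (these are not Mixpeek's actual defaults):

```python
from datetime import datetime, timedelta, timezone

# Illustrative policy thresholds, not anyone's actual defaults.
WARM_AFTER = timedelta(days=30)
COLD_AFTER = timedelta(days=180)

def tier_for(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier from the time since last access."""
    age = now - last_accessed
    if age >= COLD_AFTER:
        return "cold"
    if age >= WARM_AFTER:
        return "warm"
    return "hot"

now = datetime(2026, 4, 13, tzinfo=timezone.utc)
print(tier_for(datetime(2026, 4, 1, tzinfo=timezone.utc), now))  # hot
print(tier_for(datetime(2025, 6, 1, tzinfo=timezone.utc), now))  # cold
```

    In practice you would also weigh namespace size and query SLAs, not just recency, before demoting a namespace out of the hot tier.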

    Stage 4: Retrieval



    Retrieval is the most nuanced stage. A naive "embed the query, find the nearest vectors" approach works for demos but fails in production for several reasons:

    Why Single-Stage Retrieval Fails



  1. Vocabulary mismatch. A user searching for "red sedan" will miss a video tagged with "crimson car." Semantic embeddings help but do not eliminate this entirely.
  2. Modality mismatch. A text query cannot directly match against audio fingerprints or object detection bounding boxes.
  3. Precision vs. recall tradeoff. Embedding search optimizes for recall (finding anything remotely relevant). Production use cases often need precision (finding exactly the right thing).

    Multi-Stage Retrieval



    The solution is a pipeline of retrieval stages, each refining the results of the previous one:

    Stage 1: Broad recall. Use embedding search with a generous limit (100-500 candidates). This casts a wide net and ensures you do not miss relevant results.

    Stage 2: Metadata filtering. Apply hard filters: date range, file type, source bucket, content labels, compliance flags. This eliminates candidates that are semantically similar but contextually irrelevant.

    Stage 3: Re-ranking. Use a cross-encoder or learned re-ranker to score the remaining candidates against the query with higher fidelity. Cross-encoders attend to fine-grained interactions between query and document that bi-encoders miss.

    Stage 4: Deduplication. Remove near-duplicate results (common when the same content appears in multiple formats or when overlapping video segments match).
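    Deduplication can be as simple as a greedy pass over the ranked candidates, dropping anything too close to an already-kept result. A sketch using cosine similarity on the candidates' own embeddings (field names are illustrative):

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def deduplicate(candidates: list[dict], threshold: float = 0.95) -> list[dict]:
    """Greedy near-duplicate removal: keep a candidate only if it is not
    too similar to any earlier (higher-ranked) kept candidate."""
    kept: list[dict] = []
    for cand in candidates:
        if all(cosine(cand["vec"], k["vec"]) < threshold for k in kept):
            kept.append(cand)
    return kept

results = [
    {"id": "seg-1", "vec": [1.0, 0.0]},
    {"id": "seg-2", "vec": [0.999, 0.01]},  # near-duplicate of seg-1
    {"id": "seg-3", "vec": [0.0, 1.0]},
]
print([r["id"] for r in deduplicate(results)])  # ['seg-1', 'seg-3']
```

    Processing in rank order matters: when two overlapping video segments match, this keeps the higher-scoring one and drops the rest.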

    # Full multi-stage retrieval pipeline
    results = client.retrievers.execute(
        retriever_id=retriever.retriever_id,
        query="person explaining quarterly revenue growth",
        filters={
            "file_type": {"$in": ["video/mp4", "video/webm"]},
            "created_at": {"$gte": "2026-01-01"}
        }
    )

    for doc in results.documents:
        print(f"{doc.score:.3f} | {doc.metadata['source_file']}")
        print(f"  Segment: {doc.metadata.get('start_time', 'N/A')}s")
        print(f"  Transcript: {doc.content[:200]}")


    Hybrid Search



    Combine vector similarity with keyword matching for the best of both worlds. This catches exact terminology (product names, model numbers, legal terms) that embeddings might conflate with semantically similar but incorrect matches.

    # Hybrid retrieval: semantic + keyword
    retriever = client.retrievers.create(
        namespace_id=ns.namespace_id,
        retriever_name="hybrid-search",
        stages=[
            {
                "stage_type": "hybrid",
                "semantic_weight": 0.7,
                "keyword_weight": 0.3,
                "limit": 50
            },
            {
                "stage_type": "rerank",
                "model": "mixpeek-rerank-v1",
                "limit": 10
            }
        ]
    )
    


    Stage 5: Generation



    The final stage feeds retrieved context to an LLM to produce a response. The key decisions here are context formatting and prompt construction.

    Context Formatting



    LLMs process text. Multimodal context must be serialized into a format the model can reason over:

    For video segments: Include the transcript text, a natural-language description of the visual content, detected objects and faces, and the timestamp range. Do not pass raw frames unless you are using a vision-language model with sufficient context window.

    For images: Include the caption, OCR text, detected objects with bounding boxes (as text descriptions), and any relevant EXIF metadata.

    For audio: Include the transcript with speaker labels and timestamps. Note any significant non-speech sounds.

    For documents: Include the extracted text with section headings preserved. For tables, use markdown table format. For figures, include the generated caption.
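    Serializing a video segment into LLM-readable context can be a straightforward template. A sketch (the field names and layout are illustrative, not a fixed schema):

```python
def format_video_segment(seg: dict) -> str:
    """Serialize one retrieved video segment into a text block for the LLM."""
    lines = [
        f"[VIDEO {seg['source']} {seg['start']}s-{seg['end']}s]",
        f"Visual: {seg['visual_description']}",
        f"Objects: {', '.join(seg['objects'])}",
        f"Transcript: {seg['transcript']}",
    ]
    return "\n".join(lines)

seg = {
    "source": "q3-earnings-call.mp4",
    "start": 754,
    "end": 781,
    "visual_description": "Presenter in front of a revenue bar chart",
    "objects": ["person", "chart", "podium"],
    "transcript": "Revenue grew 23% year-over-year...",
}
print(format_video_segment(seg))
```

    Keeping the source file and timestamp range in the serialized block is what makes the attribution pattern below possible: the model can quote them back in its answer.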

    Context Window Management



    A single video can produce thousands of transcript words and hundreds of keyframe descriptions. You cannot pass all of this to the LLM. Strategies for fitting within the context window:

  1. Rank and truncate. Only include the top-k most relevant chunks. Simple but loses context.
  2. Summarize then retrieve. Pre-compute summaries at multiple granularities (segment, scene, full video) and retrieve at the appropriate level.
  3. Hierarchical context. Include a high-level summary of all retrieved documents, plus full detail for the top 3-5.
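    Rank-and-truncate is a greedy budget pack. A sketch that approximates token counts with word counts (a real system would use the model's tokenizer):

```python
def pack_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Greedy rank-and-truncate: take chunks in score order until the
    budget is spent. Token cost is approximated as word count here."""
    packed: list[dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"].split())
        if used + cost > budget_tokens:
            continue  # skip, but keep trying smaller chunks
        packed.append(chunk)
        used += cost
    return packed

chunks = [
    {"score": 0.94, "text": "revenue grew 23 percent year over year"},
    {"score": 0.89, "text": "enterprise contract expansion drove growth"},
    {"score": 0.41, "text": "the weather in the keynote city was mild"},
]
print([c["score"] for c in pack_context(chunks, budget_tokens=13)])  # [0.94, 0.89]
```

    Note the `continue` rather than `break`: a chunk that does not fit should not block smaller, lower-ranked chunks that still would.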

    Grounding and Attribution



    Always include source references in the generated output. Users need to verify claims, and downstream systems need to link back to the original media:

    Based on the Q3 earnings call recording (2026-07-15, timestamp 12:34-13:01),
    the CFO stated that revenue grew 23% year-over-year, driven primarily by
    enterprise contract expansion.

    Sources:
  - q3-earnings-call.mp4 [12:34-13:01] (transcript match, score: 0.94)
  - q3-earnings-deck.pdf [page 7] (table match, score: 0.89)


    Production Deployment Patterns



    Pattern 1: Batch Ingestion + Real-Time Retrieval



    The most common pattern. Files are ingested and processed in batch (hourly, daily, or on-upload), while retrieval and generation happen in real time.

    Best for: Media asset management, video archives, document search, knowledge bases.

    Architecture:

  1. Files uploaded to object storage (S3, GCS)
  2. Upload triggers processing pipeline (feature extraction, embedding, indexing)
  3. User queries hit a retrieval API that searches the pre-built index
  4. Retrieved context is passed to an LLM for generation

    Pattern 2: Streaming Ingestion + Real-Time Retrieval



    Content is processed as it arrives, with near-zero delay between ingestion and searchability.

    Best for: Content moderation, live event monitoring, social media analysis.

    Architecture:

  1. Media stream (live video, social feed) is segmented in real time
  2. Each segment is processed immediately (feature extraction + embedding)
  3. Vectors are indexed with sub-second latency
  4. Monitoring queries run continuously against the growing index

    Pattern 3: Agent-Driven Retrieval



    An AI agent decides what to search for, how to refine the query, and when it has enough context to answer. The retrieval pipeline is exposed as a tool the agent can call.

    Best for: Complex research tasks, multi-step reasoning, autonomous workflows.

    Architecture:

  1. Agent receives a task (e.g., "Find all instances of our logo being used incorrectly")
  2. Agent formulates an initial query and calls the retrieval tool
  3. Agent examines results, refines the query, and retrieves again
  4. Agent synthesizes findings into a report

    Mixpeek's MCP server exposes retrieval as a tool that any MCP-compatible agent can call, making this pattern straightforward to implement.

    Common Mistakes



    Embedding everything with the same model. Different modalities benefit from specialized models. Using CLIP for text-heavy documents or a text encoder for product images leaves performance on the table.

    Skipping the re-ranking stage. Bi-encoder retrieval is fast but approximate. A re-ranker consistently improves precision by 15-30% on multimodal benchmarks. The latency cost (50-100ms for 100 candidates) is worth it for nearly every production use case.

    Ignoring chunk boundaries. A video segment that starts mid-sentence or an image crop that cuts off a product label produces low-quality features. Invest in intelligent segmentation.

    Not retaining raw source data. If you only store embeddings, you cannot re-embed when better models become available. Always keep the original files alongside the vectors. See the embedding portability guide for migration strategies.

    Treating all queries the same. A keyword-style query ("invoice Q3 2025") and a semantic query ("someone explaining why revenue dropped") require different retrieval strategies. Use query classification to route to the appropriate pipeline.
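    A query classifier does not need to be a model to start with; simple heuristics catch most keyword-style queries. A sketch with illustrative thresholds:

```python
import re

def classify_query(query: str) -> str:
    """Heuristic routing: short or identifier-like queries go to keyword
    search, natural-language queries go to semantic search.
    The thresholds here are illustrative starting points."""
    tokens = query.split()
    has_id_like = any(re.search(r"\d", t) for t in tokens)  # digits suggest SKUs, dates, invoice IDs
    short = len(tokens) <= 3
    if short or has_id_like:
        return "keyword"
    return "semantic"

print(classify_query("invoice Q3 2025"))                         # keyword
print(classify_query("someone explaining why revenue dropped"))  # semantic
```

    When the heuristics plateau, the same routing decision can be handed to a small trained classifier without changing the downstream pipelines.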

    Measuring RAG Quality



    You cannot improve what you do not measure. Three metrics matter for multimodal RAG:

    Retrieval recall at k: Of the relevant documents in the corpus, what fraction appears in the top-k retrieved results? Measure this with a golden test set of queries with known relevant documents.

    Answer faithfulness: Does the generated answer only contain claims supported by the retrieved context? Unfaithful answers (hallucinations) are the primary failure mode of RAG systems.

    End-to-end latency: Time from query submission to response delivery. Break this down by stage (embedding, retrieval, re-ranking, generation) to identify bottlenecks.

    # Build a test harness
    test_queries = [
        {
            "query": "product recall announcement 2025",
            "expected_doc_ids": ["vid-8832", "doc-1204"],
        },
        {
            "query": "warehouse safety incident",
            "expected_doc_ids": ["vid-2291", "vid-2292", "doc-0887"],
        },
    ]

    for test in test_queries:
        results = client.retrievers.execute(
            retriever_id=retriever.retriever_id,
            query=test["query"],
            limit=20,
        )
        retrieved_ids = [d.document_id for d in results.documents]
        recall = len(set(test["expected_doc_ids"]) & set(retrieved_ids)) / len(test["expected_doc_ids"])
        print(f"Query: {test['query']} | Recall@20: {recall:.0%}")


    Key Takeaways



  - Multimodal RAG extends text-only RAG with modality-specific chunking, feature extraction, and context formatting. The retrieval-augmented generation pattern itself is unchanged.
  - Ingestion is the hardest stage. File type detection, codec handling, temporal segmentation, and multi-signal extraction all require specialized infrastructure.
  - Multi-stage retrieval (broad recall, filtering, re-ranking, deduplication) consistently outperforms single-stage embedding search.
  - Store raw source data alongside vectors. You will need to re-embed when models improve.
  - Measure retrieval recall, answer faithfulness, and end-to-end latency. Build a golden test set before optimizing.
  - For agent-driven use cases, expose your retrieval pipeline as a tool (via MCP or function calling) rather than hardcoding query logic.


    Related Resources



  - What Is a Multimodal Data Warehouse? -- foundational concepts
  - Build a Multimodal Data Warehouse -- hands-on implementation guide
  - Multimodal Data Warehouse Architecture -- reference architecture patterns
  - Vector Storage Tiering -- hot, warm, and cold storage management
  - Embedding Portability and Versioning -- managing model upgrades
  - MCP Tools for Multimodal AI Agents -- agent integration patterns
  - Multimodal RAG -- glossary definition
  - Retrieval-Augmented Generation -- glossary definition
  - Documentation -- getting started with Mixpeek