Architecture
    25 min read
    Updated 2026-04-13

    How to Build a Multimodal RAG Pipeline

    A practical guide to retrieval-augmented generation across video, images, audio, and documents. Covers chunking strategies, embedding selection, retriever design, and production deployment patterns.

    RAG
    Multimodal
    Architecture
    Retrieval

    Why Multimodal RAG Is Hard



    Text-only RAG is well understood: chunk documents, embed them, retrieve the top-k, feed them to an LLM. But most enterprise data is not text. It is video recordings, product images, audio calls, PDFs with diagrams, and slide decks with screenshots. A text-only RAG pipeline ignores 80% of the information an organization actually has.

    The challenge with multimodal RAG is not conceptual. The retrieval-augmented generation pattern applies regardless of modality. The challenge is operational:

  1. Different modalities require different chunking strategies. A 45-minute video needs temporal segmentation. A product image needs region-of-interest detection. A podcast needs speaker-diarized transcript segments. You cannot apply a single text splitter to all of these.
  2. Embedding spaces are modality-specific. A CLIP embedding for an image frame lives in a different vector space than a Whisper-derived text embedding from the same video's audio track. Retrieval across these spaces requires alignment or multi-stage fusion.
  3. Context windows have hard limits. You cannot pass a raw video file to an LLM. You need to extract the right signals (keyframes, transcript segments, detected objects) and present them as structured context that fits within the model's input budget.
  4. Latency budgets vary by use case. A compliance scan that runs overnight can afford exhaustive multi-pass retrieval. A customer-facing search must return results in under 500ms. The same pipeline architecture does not serve both.


    This guide walks through each stage of a multimodal RAG pipeline, from ingestion to generation, with concrete implementation patterns.

    The Five Stages of Multimodal RAG



    Every multimodal RAG system, regardless of scale, follows five stages:

  1. Ingest -- Get raw files into the system and normalize them
  2. Perceive -- Extract features, embeddings, and metadata from each modality
  3. Index -- Store extracted representations for fast retrieval
  4. Retrieve -- Find the most relevant pieces given a query
  5. Generate -- Synthesize a response using retrieved context
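    The five stages can be sketched as a simple function pipeline. Everything here is an illustrative stub (toy embeddings, a list as the index), not a real SDK:

```python
# Five-stage multimodal RAG skeleton. All functions are illustrative stubs.

def ingest(path: str) -> dict:
    # Normalize a raw file into a uniform record (MIME type hardcoded here)
    return {"path": path, "mime": "video/mp4"}

def perceive(record: dict) -> dict:
    # Extract features; the embedding is a placeholder vector
    record["embedding"] = [0.1, 0.2, 0.3]
    return record

INDEX: list[dict] = []

def index(record: dict) -> None:
    INDEX.append(record)

def retrieve(query_vec: list[float], k: int = 1) -> list[dict]:
    # Rank by a toy similarity: negative sum of absolute differences
    def score(r: dict) -> float:
        return -sum(abs(a - b) for a, b in zip(r["embedding"], query_vec))
    return sorted(INDEX, key=score, reverse=True)[:k]

def generate(query: str, context: list[dict]) -> str:
    # A real system would call an LLM; here we just cite the sources
    sources = ", ".join(r["path"] for r in context)
    return f"Answer to '{query}' grounded in: {sources}"

index(perceive(ingest("q3-call.mp4")))
print(generate("revenue growth", retrieve([0.1, 0.2, 0.3])))
```

    The point is the shape, not the internals: each stage consumes the previous stage's output, so each can be swapped independently.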

    The rest of this guide covers each stage in detail.

    Stage 1: Ingestion and Normalization



    Ingestion is where most teams underestimate complexity. Raw media files come in dozens of formats, codecs, resolutions, and encodings. A production pipeline needs to handle all of them without manual intervention.

    File Type Detection



    Do not trust file extensions. A file named `report.pdf` might be a scanned image masquerading as a PDF. A `.mp4` might use an unsupported codec. Always detect the actual file type from the binary header:

    import magic

    mime = magic.from_file("upload.pdf", mime=True)
    # "application/pdf" -- genuine PDF
    # "image/jpeg"      -- scanned image with wrong extension


    Modality Routing



    Once you know the true file type, route to the appropriate processing pipeline:

    | MIME Type Pattern | Modality | Processing Path |
    |---|---|---|
    | video/* | Video | Temporal segmentation + frame extraction + audio extraction |
    | image/* | Image | Region detection + captioning + OCR |
    | audio/* | Audio | Transcription + speaker diarization + audio fingerprinting |
    | application/pdf | Document | Page extraction + layout analysis + table detection |
    | text/* | Text | Chunking + entity extraction |
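    The routing table reduces to a prefix match on the detected MIME type. A minimal sketch (handler names are illustrative):

```python
# Map detected MIME types to a processing modality via prefix match.
ROUTES = [
    ("video/", "video"),
    ("image/", "image"),
    ("audio/", "audio"),
    ("application/pdf", "document"),
    ("text/", "text"),
]

def route(mime: str) -> str:
    for prefix, modality in ROUTES:
        if mime.startswith(prefix):
            return modality
    raise ValueError(f"unsupported MIME type: {mime}")

print(route("video/mp4"))        # video
print(route("application/pdf"))  # document
```

    Crucially, `mime` should come from binary header detection (as above), never from the file extension.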

    Temporal Segmentation for Video



    Video is the most complex modality because it contains multiple signals: visual frames, audio, speech, on-screen text, and motion. The first step is breaking a long video into semantically coherent segments.

    Scene-based segmentation detects visual transitions (cuts, fades, dissolves) and splits the video at those boundaries. This works well for edited content like movies, commercials, and news broadcasts.

    Fixed-window segmentation splits the video into equal-length chunks (e.g., 10-second windows with 2-second overlap). This is simpler and works for surveillance footage, webcam recordings, and other unedited content.

    Speech-based segmentation uses voice activity detection and speaker diarization to split at natural pause points. This is ideal for meetings, interviews, and podcasts where the visual track is secondary.
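    Fixed-window segmentation is plain arithmetic and needs no model. A sketch of computing the (start, end) schedule, independent of any SDK:

```python
def fixed_windows(duration_s: float, window_s: float = 10.0, overlap_s: float = 2.0):
    """Yield (start, end) pairs covering a video with overlapping windows."""
    assert overlap_s < window_s, "overlap must be smaller than the window"
    step = window_s - overlap_s
    start = 0.0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        start += step

segments = list(fixed_windows(25.0))
# [(0.0, 10.0), (8.0, 18.0), (16.0, 25.0), (24.0, 25.0)]
```

    The 2-second overlap means content near a boundary appears in two segments, so a match is never lost to an unlucky cut; the cost is a small amount of duplicate indexing.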

    # Mixpeek handles segmentation automatically based on collection config
    collection = client.collections.create(
        namespace_id=ns.namespace_id,
        collection_name="video-archive",
        feature_extractors=[{
            "extractor_type": "multimodal_embedding",
            "model": "mixpeek-embed-v2",
            "config": {
                "chunk_strategy": "scene",
                "chunk_duration_seconds": 15,
                "chunk_overlap_seconds": 2
            }
        }]
    )
    


    Stage 2: Feature Extraction (Perceive)



    Feature extraction converts raw media into searchable representations. Each modality produces different types of features:

    Visual Features



  - Dense embeddings -- A single vector (typically 512-1024 dimensions) that captures the semantic content of an image or video frame. Models like CLIP, SigLIP, and EVA-CLIP produce these.
  - Object detection -- Bounding boxes and labels for objects in the frame. YOLO, DINO, and Grounding DINO are common choices.
  - OCR -- Text detected in the image. Critical for slides, documents, product labels, and street signs.
  - Face detection -- Location and identity of faces. Required for brand safety, compliance, and media asset management.


    Audio Features



  - Transcription -- Speech-to-text output, ideally with timestamps and speaker labels. Whisper and its variants dominate here.
  - Audio fingerprinting -- A compact signature that identifies a specific audio recording regardless of format or bitrate. Used for music identification and content deduplication.
  - Audio classification -- Labels for non-speech sounds: laughter, applause, music, silence, background noise.


    Document Features



  - Layout analysis -- Detecting headers, paragraphs, tables, figures, and captions within a page. This preserves document structure that naive text extraction destroys.
  - Table extraction -- Converting visual tables into structured data (rows and columns) that an LLM can reason over.
  - Figure captioning -- Generating text descriptions of charts, diagrams, and photographs embedded in documents.


    The Embedding Alignment Problem



    When you embed a video frame with CLIP and a transcript chunk with a text embedding model, the resulting vectors live in different spaces. A cosine similarity between them is meaningless.

    Three approaches to solving this:

    Shared-space models like CLIP, ImageBind, and Mixpeek's multimodal embeddings project all modalities into a single vector space. A text query and an image live in the same space, so cross-modal retrieval works directly.

    Late fusion retrieves from each modality independently and combines results at the ranking stage. You search the video index, the transcript index, and the document index separately, then merge the result lists using reciprocal rank fusion or a learned re-ranker.

    Cross-encoders take a query and a candidate (from any modality) and produce a relevance score directly. These are more accurate than bi-encoders but too slow for first-stage retrieval. Use them as re-rankers on the top-k results from a faster first stage.
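    Late fusion is often implemented with reciprocal rank fusion, which needs only the per-modality rank positions, not comparable scores. A minimal sketch (document IDs are made up):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists from independent indexes.
    Each occurrence contributes 1 / (k + rank); k=60 is the commonly used constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

video_hits = ["vid-1", "vid-7", "vid-3"]       # from the frame-embedding index
transcript_hits = ["vid-7", "doc-2", "vid-1"]  # from the text index
print(reciprocal_rank_fusion([video_hits, transcript_hits]))
```

    Because RRF ignores raw scores, it sidesteps the incomparable-vector-spaces problem entirely: a CLIP distance and a BM25 score never have to be put on the same scale.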

    # Multi-stage retrieval with Mixpeek: embed + rerank
    retriever = client.retrievers.create(
        namespace_id=ns.namespace_id,
        retriever_name="multimodal-search",
        stages=[
            {
                "stage_type": "embedding",
                "model": "mixpeek-embed-v2",
                "limit": 100
            },
            {
                "stage_type": "rerank",
                "model": "mixpeek-rerank-v1",
                "limit": 10
            }
        ]
    )
    


    Stage 3: Indexing



    Indexing is where extracted features become searchable. The storage layer must support:

  - Vector search for embedding-based retrieval (approximate nearest neighbor)
  - Full-text search for keyword matching on transcripts and extracted text
  - Metadata filtering for narrowing by date, file type, source, labels, or any extracted attribute
  - Hybrid queries that combine all three in a single request


    Namespace Design



    A namespace is a logical container for related vectors. Design your namespaces around query patterns, not organizational hierarchy:

    By use case: `brand-safety-assets`, `product-catalog`, `support-recordings`

    By modality: `video-frames`, `transcripts`, `documents` (useful when different modalities need different embedding models)

    By tenant: `tenant-acme`, `tenant-globex` (required for multi-tenant SaaS applications)

    Avoid mixing unrelated data in a single namespace. Retrieval quality degrades when the index contains semantically diverse content because the nearest neighbors become less meaningful.

    Storage Tiering



    Not all vectors need to be in hot storage. A production system should tier data by access frequency:

    | Tier | Storage | Latency | Cost | Use Case |
    |---|---|---|---|---|
    | Hot | In-memory vector DB (Qdrant, Pinecone) | <10ms | $$ | Active search workloads |
    | Warm | Disk-backed vector DB or S3 Vectors | 50-200ms | $ | Archives, infrequent queries |
    | Cold | Object storage (S3) with on-demand loading | 1-5s | $ | Compliance retention, backup |
    Mixpeek manages tiering automatically: active namespaces stay in the hot tier, and data can be moved to warm or cold storage based on configurable policies. See the vector storage tiering guide for details.
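    If you roll your own tiering, an age-based policy is the usual starting point. A sketch with illustrative thresholds (these are not Mixpeek's actual defaults):

```python
from datetime import datetime, timedelta, timezone

# Illustrative policy thresholds, not anyone's actual defaults.
WARM_AFTER = timedelta(days=30)
COLD_AFTER = timedelta(days=180)

def tier_for(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier from the time since last access."""
    age = now - last_accessed
    if age >= COLD_AFTER:
        return "cold"
    if age >= WARM_AFTER:
        return "warm"
    return "hot"

now = datetime(2026, 4, 13, tzinfo=timezone.utc)
print(tier_for(datetime(2026, 4, 1, tzinfo=timezone.utc), now))  # hot
print(tier_for(datetime(2025, 6, 1, tzinfo=timezone.utc), now))  # cold
```

    In practice you would also weigh namespace size and query SLAs, not just recency, before demoting a namespace out of the hot tier.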

    Stage 4: Retrieval



    Retrieval is the most nuanced stage. A naive "embed the query, find the nearest vectors" approach works for demos but fails in production for several reasons:

    Why Single-Stage Retrieval Fails



  1. Vocabulary mismatch. A user searching for "red sedan" will miss a video tagged with "crimson car." Semantic embeddings help but do not eliminate this entirely.
  2. Modality mismatch. A text query cannot directly match against audio fingerprints or object detection bounding boxes.
  3. Precision vs. recall tradeoff. Embedding search optimizes for recall (finding anything remotely relevant). Production use cases often need precision (finding exactly the right thing).

    Multi-Stage Retrieval



    The solution is a pipeline of retrieval stages, each refining the results of the previous one:

    Stage 1: Broad recall. Use embedding search with a generous limit (100-500 candidates). This casts a wide net and ensures you do not miss relevant results.

    Stage 2: Metadata filtering. Apply hard filters: date range, file type, source bucket, content labels, compliance flags. This eliminates candidates that are semantically similar but contextually irrelevant.

    Stage 3: Re-ranking. Use a cross-encoder or learned re-ranker to score the remaining candidates against the query with higher fidelity. Cross-encoders attend to fine-grained interactions between query and document that bi-encoders miss.

    Stage 4: Deduplication. Remove near-duplicate results (common when the same content appears in multiple formats or when overlapping video segments match).
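    Deduplication can be as simple as a greedy pass over the ranked candidates, dropping anything too close to an already-kept result. A sketch using cosine similarity on the candidates' own embeddings (field names are illustrative):

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def deduplicate(candidates: list[dict], threshold: float = 0.95) -> list[dict]:
    """Greedy near-duplicate removal: keep a candidate only if it is not
    too similar to any earlier (higher-ranked) kept candidate."""
    kept: list[dict] = []
    for cand in candidates:
        if all(cosine(cand["vec"], k["vec"]) < threshold for k in kept):
            kept.append(cand)
    return kept

results = [
    {"id": "seg-1", "vec": [1.0, 0.0]},
    {"id": "seg-2", "vec": [0.999, 0.01]},  # near-duplicate of seg-1
    {"id": "seg-3", "vec": [0.0, 1.0]},
]
print([r["id"] for r in deduplicate(results)])  # ['seg-1', 'seg-3']
```

    Processing in rank order matters: when two overlapping video segments match, this keeps the higher-scoring one and drops the rest.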

    # Full multi-stage retrieval pipeline
    results = client.retrievers.execute(
        retriever_id=retriever.retriever_id,
        query="person explaining quarterly revenue growth",
        filters={
            "file_type": {"$in": ["video/mp4", "video/webm"]},
            "created_at": {"$gte": "2026-01-01"}
        }
    )

    for doc in results.documents:
        print(f"{doc.score:.3f} | {doc.metadata['source_file']}")
        print(f"  Segment: {doc.metadata.get('start_time', 'N/A')}s")
        print(f"  Transcript: {doc.content[:200]}")


    Hybrid Search



    Combine vector similarity with keyword matching for the best of both worlds. This catches exact terminology (product names, model numbers, legal terms) that embeddings might conflate with semantically similar but incorrect matches.

    # Hybrid retrieval: semantic + keyword
    retriever = client.retrievers.create(
        namespace_id=ns.namespace_id,
        retriever_name="hybrid-search",
        stages=[
            {
                "stage_type": "hybrid",
                "semantic_weight": 0.7,
                "keyword_weight": 0.3,
                "limit": 50
            },
            {
                "stage_type": "rerank",
                "model": "mixpeek-rerank-v1",
                "limit": 10
            }
        ]
    )
    


    Stage 5: Generation



    The final stage feeds retrieved context to an LLM to produce a response. The key decisions here are context formatting and prompt construction.

    Context Formatting



    LLMs process text. Multimodal context must be serialized into a format the model can reason over:

    For video segments: Include the transcript text, a natural-language description of the visual content, detected objects and faces, and the timestamp range. Do not pass raw frames unless you are using a vision-language model with sufficient context window.

    For images: Include the caption, OCR text, detected objects with bounding boxes (as text descriptions), and any relevant EXIF metadata.

    For audio: Include the transcript with speaker labels and timestamps. Note any significant non-speech sounds.

    For documents: Include the extracted text with section headings preserved. For tables, use markdown table format. For figures, include the generated caption.
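    Serializing a video segment into LLM-readable context can be a straightforward template. A sketch (the field names and layout are illustrative, not a fixed schema):

```python
def format_video_segment(seg: dict) -> str:
    """Serialize one retrieved video segment into a text block for the LLM."""
    lines = [
        f"[VIDEO {seg['source']} {seg['start']}s-{seg['end']}s]",
        f"Visual: {seg['visual_description']}",
        f"Objects: {', '.join(seg['objects'])}",
        f"Transcript: {seg['transcript']}",
    ]
    return "\n".join(lines)

seg = {
    "source": "q3-earnings-call.mp4",
    "start": 754,
    "end": 781,
    "visual_description": "Presenter in front of a revenue bar chart",
    "objects": ["person", "chart", "podium"],
    "transcript": "Revenue grew 23% year-over-year...",
}
print(format_video_segment(seg))
```

    Keeping the source file and timestamp range in the serialized block is what makes the attribution pattern below possible: the model can quote them back in its answer.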

    Context Window Management



    A single video can produce thousands of transcript words and hundreds of keyframe descriptions. You cannot pass all of this to the LLM. Strategies for fitting within the context window:

  1. Rank and truncate. Only include the top-k most relevant chunks. Simple but loses context.
  2. Summarize then retrieve. Pre-compute summaries at multiple granularities (segment, scene, full video) and retrieve at the appropriate level.
  3. Hierarchical context. Include a high-level summary of all retrieved documents, plus full detail for the top 3-5.
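    Rank-and-truncate is a greedy budget pack. A sketch that approximates token counts with word counts (a real system would use the model's tokenizer):

```python
def pack_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Greedy rank-and-truncate: take chunks in score order until the
    budget is spent. Token cost is approximated as word count here."""
    packed: list[dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"].split())
        if used + cost > budget_tokens:
            continue  # skip, but keep trying smaller chunks
        packed.append(chunk)
        used += cost
    return packed

chunks = [
    {"score": 0.94, "text": "revenue grew 23 percent year over year"},
    {"score": 0.89, "text": "enterprise contract expansion drove growth"},
    {"score": 0.41, "text": "the weather in the keynote city was mild"},
]
print([c["score"] for c in pack_context(chunks, budget_tokens=13)])  # [0.94, 0.89]
```

    Note the `continue` rather than `break`: a chunk that does not fit should not block smaller, lower-ranked chunks that still would.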

    Grounding and Attribution



    Always include source references in the generated output. Users need to verify claims, and downstream systems need to link back to the original media:

    Based on the Q3 earnings call recording (2026-07-15, timestamp 12:34-13:01),
    the CFO stated that revenue grew 23% year-over-year, driven primarily by
    enterprise contract expansion.

    Sources:
  - q3-earnings-call.mp4 [12:34-13:01] (transcript match, score: 0.94)
  - q3-earnings-deck.pdf [page 7] (table match, score: 0.89)


    Production Deployment Patterns



    Pattern 1: Batch Ingestion + Real-Time Retrieval



    The most common pattern. Files are ingested and processed in batch (hourly, daily, or on-upload), while retrieval and generation happen in real time.

    Best for: Media asset management, video archives, document search, knowledge bases.

    Architecture:

  1. Files uploaded to object storage (S3, GCS)
  2. Upload triggers processing pipeline (feature extraction, embedding, indexing)
  3. User queries hit a retrieval API that searches the pre-built index
  4. Retrieved context is passed to an LLM for generation

    Pattern 2: Streaming Ingestion + Real-Time Retrieval



    Content is processed as it arrives, with near-zero delay between ingestion and searchability.

    Best for: Content moderation, live event monitoring, social media analysis.

    Architecture:

  1. Media stream (live video, social feed) is segmented in real time
  2. Each segment is processed immediately (feature extraction + embedding)
  3. Vectors are indexed with sub-second latency
  4. Monitoring queries run continuously against the growing index

    Pattern 3: Agent-Driven Retrieval



    An AI agent decides what to search for, how to refine the query, and when it has enough context to answer. The retrieval pipeline is exposed as a tool the agent can call.

    Best for: Complex research tasks, multi-step reasoning, autonomous workflows.

    Architecture:

  1. Agent receives a task (e.g., "Find all instances of our logo being used incorrectly")
  2. Agent formulates an initial query and calls the retrieval tool
  3. Agent examines results, refines the query, and retrieves again
  4. Agent synthesizes findings into a report

    Mixpeek's MCP server exposes retrieval as a tool that any MCP-compatible agent can call, making this pattern straightforward to implement.

    Common Mistakes



    Embedding everything with the same model. Different modalities benefit from specialized models. Using CLIP for text-heavy documents or a text encoder for product images leaves performance on the table.

    Skipping the re-ranking stage. Bi-encoder retrieval is fast but approximate. A re-ranker consistently improves precision by 15-30% on multimodal benchmarks. The latency cost (50-100ms for 100 candidates) is worth it for nearly every production use case.

    Ignoring chunk boundaries. A video segment that starts mid-sentence or an image crop that cuts off a product label produces low-quality features. Invest in intelligent segmentation.

    Not retaining raw source data. If you only store embeddings, you cannot re-embed when better models become available. Always keep the original files alongside the vectors. See the embedding portability guide for migration strategies.

    Treating all queries the same. A keyword-style query ("invoice Q3 2025") and a semantic query ("someone explaining why revenue dropped") require different retrieval strategies. Use query classification to route to the appropriate pipeline.
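    A query classifier does not need to be a model to start with; simple heuristics catch most keyword-style queries. A sketch with illustrative thresholds:

```python
import re

def classify_query(query: str) -> str:
    """Heuristic routing: short or identifier-like queries go to keyword
    search, natural-language queries go to semantic search.
    The thresholds here are illustrative starting points."""
    tokens = query.split()
    has_id_like = any(re.search(r"\d", t) for t in tokens)  # digits suggest SKUs, dates, invoice IDs
    short = len(tokens) <= 3
    if short or has_id_like:
        return "keyword"
    return "semantic"

print(classify_query("invoice Q3 2025"))                         # keyword
print(classify_query("someone explaining why revenue dropped"))  # semantic
```

    When the heuristics plateau, the same routing decision can be handed to a small trained classifier without changing the downstream pipelines.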

    Measuring RAG Quality



    You cannot improve what you do not measure. Three metrics matter for multimodal RAG:

    Retrieval recall at k: Of the relevant documents in the corpus, what fraction appears in the top-k retrieved results? Measure this with a golden test set of queries with known relevant documents.

    Answer faithfulness: Does the generated answer only contain claims supported by the retrieved context? Unfaithful answers (hallucinations) are the primary failure mode of RAG systems.

    End-to-end latency: Time from query submission to response delivery. Break this down by stage (embedding, retrieval, re-ranking, generation) to identify bottlenecks.

    # Build a test harness
    test_queries = [
        {
            "query": "product recall announcement 2025",
            "expected_doc_ids": ["vid-8832", "doc-1204"],
        },
        {
            "query": "warehouse safety incident",
            "expected_doc_ids": ["vid-2291", "vid-2292", "doc-0887"],
        },
    ]

    for test in test_queries:
        results = client.retrievers.execute(
            retriever_id=retriever.retriever_id,
            query=test["query"],
            limit=20,
        )
        retrieved_ids = [d.document_id for d in results.documents]
        recall = len(set(test["expected_doc_ids"]) & set(retrieved_ids)) / len(test["expected_doc_ids"])
        print(f"Query: {test['query']} | Recall@20: {recall:.0%}")


    Key Takeaways



  - Multimodal RAG extends text-only RAG with modality-specific chunking, feature extraction, and context formatting. The retrieval-augmented generation pattern itself is unchanged.
  - Ingestion is the hardest stage. File type detection, codec handling, temporal segmentation, and multi-signal extraction all require specialized infrastructure.
  - Multi-stage retrieval (broad recall, filtering, re-ranking, deduplication) consistently outperforms single-stage embedding search.
  - Store raw source data alongside vectors. You will need to re-embed when models improve.
  - Measure retrieval recall, answer faithfulness, and end-to-end latency. Build a golden test set before optimizing.
  - For agent-driven use cases, expose your retrieval pipeline as a tool (via MCP or function calling) rather than hardcoding query logic.


    Related Resources



  - What Is a Multimodal Data Warehouse? -- foundational concepts
  - Build a Multimodal Data Warehouse -- hands-on implementation guide
  - Multimodal Data Warehouse Architecture -- reference architecture patterns
  - Vector Storage Tiering -- hot, warm, and cold storage management
  - Embedding Portability and Versioning -- managing model upgrades
  - MCP Tools for Multimodal AI Agents -- agent integration patterns
  - Multimodal RAG -- glossary definition
  - Retrieval-Augmented Generation -- glossary definition
  - Documentation -- getting started with Mixpeek