    Architecture
    18 min read
    Updated 2026-05-30

    Multi-Index Search Architecture: How to Combine Visual, Audio, and Text Embeddings for Rich Media

    A systems-design guide to building search over rich media by decomposing assets into multiple feature streams, storing them in separate indexes, routing queries, and fusing scores. Covers index-per-modality vs. fused-space design, RRF and weighted fusion, query routing heuristics, and production trade-offs.

    Architecture
    Search
    Multimodal
    Embeddings
    Score Fusion

    The Single-Embedding Trap



    The simplest multimodal search system uses one embedding model — typically CLIP — to map everything into a shared vector space. Query with text, match against image embeddings, rank by cosine similarity. It works, and for many use cases it is good enough.

    But the moment your assets carry more than one information channel, a single embedding becomes a bottleneck. Consider a 30-second video ad. It contains:

  1. Visual content: scenes, objects, faces, text overlays, brand logos
  2. Audio content: speech, music, sound effects
  3. Temporal structure: pacing, shot transitions, hook timing
  4. Metadata: resolution, codec, duration, upload date


    A single CLIP embedding captures some visual semantics but discards speech and other audio, flattens temporal structure, and compresses spatial relationships into a fixed-length vector. When an agent searches for "ads where someone mentions free shipping while holding a product," a single visual embedding cannot answer that query, because the spoken phrase never reaches the vector.

    The solution is multi-index architecture: decompose each asset into multiple feature streams, store each stream in its own index, route queries to the relevant indexes, and fuse the results into a single ranked list.

    The Decomposition Pattern



    Multi-index search starts with feature extraction — running multiple specialized models over each asset during ingestion. Each model extracts a different "view" of the same content:

    | Feature Stream | Model Example | Output | What It Captures |
    |---|---|---|---|
    | Visual embedding | CLIP ViT-L/14 | 768-dim vector | Scene semantics, objects, style |
    | Object detections | YOLO26 | Bounding boxes + labels | What objects appear and where |
    | Face identities | RetinaFace + ArcFace | 512-dim face vectors | Who appears in the content |
    | Transcript | Whisper large-v3 | Text + timestamps | What was said and when |
    | Audio embedding | CLAP | 512-dim vector | Music genre, sound events, mood |
    | Scene captions | Florence-2 | Natural language | Dense description of visual content |

    The key insight is that these feature streams are independent. You do not need to force them into a shared embedding space. Each stream has its own dimensionality, its own similarity metric, and its own retrieval characteristics.

    This independence is a feature, not a bug. It means you can:

  1. Upgrade models independently: swap Whisper for a faster ASR model without reindexing visual embeddings
  2. Add new modalities: add audio fingerprinting later without touching existing indexes
  3. Tune retrieval per stream: use HNSW for dense vectors, BM25 for transcripts, exact match for face identities

    Index Design: Separate vs. Fused



    There are two fundamental architectures for storing multi-stream features:

    Separate Indexes (Index-Per-Stream)



    Each feature stream gets its own index. A video with 5 feature streams lives across 5 different indexes, linked by a shared asset ID.

    Asset: video_abc123
      ├── visual_index:    [768-dim CLIP vector]
      ├── transcript_index: [BM25 inverted index over transcript text]
      ├── face_index:       [512-dim ArcFace vectors, one per detected face]
      ├── audio_index:      [512-dim CLAP vector]
      └── object_index:     [structured JSON: labels, bboxes, confidences]
    


    Advantages:
  1. Each index uses the optimal data structure for its modality (ANN for dense vectors, BM25 for text, structured filters for objects)
  2. Independent scaling: the face index can be sharded differently than the visual index
  3. Independent updates: re-extract transcripts without touching visual embeddings
  4. Clear separation of concerns in code

    Disadvantages:
  1. Query execution requires fan-out to multiple indexes
  2. Score fusion introduces complexity (more on this below)
  3. Asset deletion must cascade across all indexes
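
    The index-per-stream layout and the cascading delete it requires can be sketched in a few lines. This is a toy in-memory model, not a production store: `MultiIndexStore` and its method names are hypothetical, and each "index" is a plain dictionary standing in for an ANN index, an inverted index, or a structured store.

```python
from collections import defaultdict
from typing import Any


class MultiIndexStore:
    """Toy index-per-stream store: one dict per stream, keyed by asset ID."""

    def __init__(self, stream_names: list[str]) -> None:
        # Each stream gets its own container; a real system would hold an ANN
        # index, an inverted index, or a structured store here instead.
        self.indexes: dict[str, dict[str, list[Any]]] = {
            name: defaultdict(list) for name in stream_names
        }

    def add(self, stream: str, asset_id: str, feature: Any) -> None:
        # A stream may hold several entries per asset (one per detected face).
        self.indexes[stream][asset_id].append(feature)

    def streams_for(self, asset_id: str) -> list[str]:
        return [name for name, index in self.indexes.items() if asset_id in index]

    def delete_asset(self, asset_id: str) -> int:
        """Cascade the delete across every index; return entries removed."""
        removed = 0
        for index in self.indexes.values():
            removed += len(index.pop(asset_id, []))
        return removed


store = MultiIndexStore(["visual_index", "transcript_index", "face_index"])
store.add("visual_index", "video_abc123", [0.1] * 768)
store.add("transcript_index", "video_abc123", "free shipping on all orders")
store.add("face_index", "video_abc123", [0.2] * 512)
store.add("face_index", "video_abc123", [0.3] * 512)
```

    The shared asset ID is the only link between streams, which is why a delete has to visit every index rather than just one.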


    Fused Space (Single Index)



    Project all features into a single shared vector space using a model like Qwen3-VL-Embedding or ImageBind, then store everything in one index.

    Advantages:
  1. Simple query path: one embedding lookup, one ranking
  2. No score fusion needed
  3. Easier to reason about relevance

    Disadvantages:
  1. The shared space compresses modality-specific information
  2. Upgrading the model means reindexing everything
  3. Cannot use modality-specific retrieval strategies (BM25 for text, exact match for faces)
  4. Quality ceiling is bounded by the fused model's ability to represent all modalities

    The Pragmatic Middle Ground



    Most production systems use a hybrid: fused embedding for coarse retrieval, separate indexes for modality-specific reranking and filtering.

    Stage 1: Coarse retrieval via fused multimodal embedding (top 1000)
    Stage 2: Parallel reranking from separate indexes (transcript, face, audio)
    Stage 3: Score fusion into final ranked list (top 20)
    


    This gives you the simplicity of a single first-stage query with the precision of modality-specific scoring.
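
    A minimal sketch of the three stages, assuming brute-force cosine search in place of an ANN index and precomputed per-modality score tables in place of live index lookups. The function name `hybrid_search`, its parameters, and the choice to fuse with reciprocal rank fusion (including the coarse ranking as one of the lists) are illustrative assumptions, not a prescribed design.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, fused_vectors, rerank_scores,
                  coarse_k=1000, final_k=20, k=60):
    # Stage 1: coarse retrieval over the fused embedding. Brute force here;
    # a real system would query an ANN index instead.
    coarse = sorted(
        fused_vectors,
        key=lambda asset_id: cosine(query_vec, fused_vectors[asset_id]),
        reverse=True,
    )[:coarse_k]

    # Stage 2: each modality-specific score table re-ranks only the candidates.
    ranked_lists = [coarse]
    for scores in rerank_scores.values():
        ranked_lists.append(
            sorted(coarse, key=lambda a: scores.get(a, 0.0), reverse=True)
        )

    # Stage 3: fuse all ranked lists with reciprocal rank fusion.
    fused: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, asset_id in enumerate(ranking, start=1):
            fused[asset_id] = fused.get(asset_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda a: fused[a], reverse=True)[:final_k]


fused_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
rerank_scores = {
    "transcript": {"a": 0.0, "b": 7.5, "c": 9.0},
    "face": {"a": 0.2, "b": 0.8},
}
top = hybrid_search([1.0, 0.0], fused_vectors, rerank_scores, coarse_k=2, final_k=2)
```

    In this toy data, asset `c` has the best transcript score but never reaches stage 2 because the coarse stage drops it. That is the main risk of the hybrid design: later stages can only reorder what the first stage recalls, so `coarse_k` needs to be generous.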

    Query Routing: Deciding Which Indexes to Search



    Not every query needs every index. A text-only query like "quarterly revenue presentation" should hit the transcript index and maybe scene captions, but searching the audio embedding index for music similarity adds noise.

    Query routing is the logic that decides which indexes to search for a given query. There are three approaches:

    Rule-Based Routing



    Parse the query for modality signals and route accordingly:

  1. Query mentions a person's name → add face index
  2. Query mentions a sound or music → add audio index
  3. Query mentions visual attributes (color, object, scene) → add visual index
  4. All queries → always include transcript and visual embedding as baseline

    This is simple, interpretable, and covers 80% of cases. The rules encode domain knowledge: in a media archive, transcript search is almost always relevant, so it stays on by default.
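
    The rules above might look like the following sketch. The keyword set, the `known_people` lookup, and the index names are placeholders chosen for illustration; a real rule set would be tuned to the archive and would likely use entity recognition rather than substring matching.

```python
import re

# Hypothetical keyword list; tune it to the vocabulary of your own archive.
AUDIO_TERMS = {"music", "song", "sound", "jingle", "voiceover", "applause"}
# Transcript and visual embedding are always searched, so the
# visual-attribute rule needs no extra branch in this sketch.
BASELINE = {"transcript", "visual"}


def route_query(query: str, known_people: set[str]) -> set[str]:
    indexes = set(BASELINE)
    lowered = query.lower()
    tokens = set(re.findall(r"[a-z0-9']+", lowered))
    if tokens & AUDIO_TERMS:
        indexes.add("audio")
    # Naive name matching: any known person's full name appearing in the query.
    if any(name.lower() in lowered for name in known_people):
        indexes.add("face")
    return indexes


people = {"Jane Doe"}
routed = route_query("Jane Doe talking over upbeat music", people)
```

    Because every decision is a visible set operation, a wrong routing choice can be traced to a specific rule, which is the main appeal of this approach.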

    Classifier-Based Routing



    Train a lightweight classifier (or use an LLM) to predict which indexes are relevant:

    Input: "find clips where the CEO discusses layoffs near a whiteboard"
    Output: {transcript: 0.95, visual: 0.80, face: 0.70, audio: 0.10}
    


    Indexes above a threshold (e.g., 0.5) get searched. This handles compositional queries better than rules but adds latency and requires training data.
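
    Once the classifier has produced per-index relevance estimates, the selection step is small. The sketch below assumes the scores arrive as a dictionary like the output shown above; the `fallback` parameter is an added assumption so that a query with no confident prediction still searches something.

```python
def select_indexes(
    relevance: dict[str, float],
    threshold: float = 0.5,
    fallback: str = "transcript",
) -> list[str]:
    # Keep indexes whose predicted relevance is above the threshold,
    # ordered from most to least relevant.
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    chosen = [name for name, score in ranked if score > threshold]
    # Assumption: fall back to one default index rather than searching nothing.
    return chosen or [fallback]


predicted = {"transcript": 0.95, "visual": 0.80, "face": 0.70, "audio": 0.10}
selected = select_indexes(predicted)
```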

    Agent-Driven Routing



    Give the AI agent access to each index as a tool. The agent decides which tools to call based on its reasoning:

    Agent thinks: "The user wants clips of a specific person speaking.
      I need: face search (to find the person) + transcript search
      (to find speech about layoffs) + visual search (whiteboard)."
    Agent calls: search_faces(), search_transcripts(), search_visual()
    


    This is the most flexible approach and naturally handles novel queries, but it is slower (multiple LLM calls) and less deterministic.

    Score Fusion: Combining Results from Multiple Indexes



    When you search three indexes and get three ranked lists, you need to combine them into one. This is the score fusion problem, and getting it right is the difference between a search system that works and one that frustrates users.

    The Score Incompatibility Problem



    Scores from different indexes are not comparable:

  1. CLIP cosine similarity: ranges from -1 to 1, typically 0.15–0.40
  2. BM25 scores: unbounded positive numbers, highly variable
  3. Face distance: typically L2 distance, lower is better
  4. Object detection confidence: 0 to 1

    You cannot average these directly. A BM25 score of 25 is not "better" than a cosine similarity of 0.35.

    Reciprocal Rank Fusion (RRF)



    RRF sidesteps the score normalization problem entirely by using only the rank position from each list:

    RRF_score(doc) = sum( 1 / (k + rank_i(doc)) ) for each list i
    


    Where `k` is a constant (typically 60) that dampens the impact of high-rank positions. A document ranked #1 in two lists gets a much higher RRF score than a document ranked #1 in one list and #100 in another.
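
    A minimal implementation of the formula, assuming each index returns an ordered list of document IDs with the best match first. The function name and the sample clip IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a document missing from a list adds nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


visual = ["clip_7", "clip_2", "clip_9"]
transcript = ["clip_2", "clip_7", "clip_4"]
audio = ["clip_2", "clip_5"]
fused = reciprocal_rank_fusion([visual, transcript, audio])
```

    With k = 60, a document ranked #1 in two lists scores 2/61 ≈ 0.0328, while one ranked #1 and #100 scores 1/61 + 1/160 ≈ 0.0226. In the sample above, `clip_2` comes out first because it appears near the top of all three lists.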

    Why RRF works well in practice:
  1. No score normalization needed
  2. No hyperparameter tuning (k=60 works across most domains)
  3. Robust to missing results (a document absent from one list simply gets no contribution from that list)
  4. Simple to implement

    Where RRF falls short:
  1. Treats all indexes as equally important
  2. Ignores the actual scores: a document at rank #2 with a score of 0.99 is treated the same as one at rank #2 with a score of 0.51
  3. Cannot learn domain-specific weighting

    Weighted Linear Combination



    Normalize scores from each index to [0, 1], then compute a weighted sum:

    final_score(doc) = w_visual * norm(visual_score)
                     + w_transcript * norm(transcript_score)
                     + w_audio * norm(audio_score)
    


    Normalization options:
  1. Min-max: `(score - min) / (max - min)` over the result set
  2. Z-score: `(score - mean) / std`, then map into [0, 1], for example by clipping to [-3, 3] and rescaling linearly (clipping the raw z-score to [0, 1] would zero out every below-average result)
  3. Rank-based: convert to percentile rank

    The weights `w_visual`, `w_transcript`, `w_audio` encode how important each modality is for your domain. In a podcast search, transcript weight dominates. In a fashion catalog, visual weight dominates.

    Tuning weights: Start with equal weights, then adjust based on relevance judgments. Even a small labeled set (50–100 queries with relevance labels) is enough to tune weights via grid search.
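
    The sketch below combines min-max normalization with a weighted sum. It assumes every index reports "higher is better" scores (a distance such as the face L2 distance would need to be negated first) and treats a document missing from an index as contributing zero; both choices, like the function names, are assumptions for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when all scores tie, give every document full credit.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(
    per_index_scores: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> list[tuple[str, float]]:
    fused: dict[str, float] = {}
    for index_name, scores in per_index_scores.items():
        if not scores:
            continue  # this index returned nothing for the query
        for doc, normed in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[index_name] * normed
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


scores = {
    "visual": {"a": 0.35, "b": 0.20, "c": 0.25},  # cosine similarities
    "transcript": {"a": 5.0, "b": 25.0},          # BM25 scores
}
ranked = weighted_fusion(scores, {"visual": 0.4, "transcript": 0.6})
```

    Here document `b` wins despite the lowest visual score because the transcript weight is larger. Swapping the weights to 0.6 visual and 0.4 transcript puts `a` first, which is the kind of shift a grid search over labeled queries would explore.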

    Learned Fusion



    Train a model to predict relevance from the raw scores of each index:

    Input features: [visual_score, visual_rank, transcript_score,
                     transcript_rank, audio_score, audio_rank,
                     face_match, object_count, ...]
    Output: relevance score
    


    This is typically a gradient-boosted tree (XGBoost/LightGBM) trained on click logs or human relevance judgments. It can learn non-linear interactions: "high visual score + high transcript score together is more relevant than either alone."

    When to use learned fusion: When you have enough training data (1000+ labeled query-document pairs) and the domain is stable enough that a trained model generalizes.
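
    Before any model can be trained, each query-document pair has to be turned into a fixed-length feature row. The sketch below covers only that step and stops short of training; the resulting rows would then be handed to a gradient-boosted model. The `MISSING_RANK` sentinel and the shape of `results` are assumptions made for the example.

```python
# Assumption: sentinel rank for "this index did not retrieve the document".
MISSING_RANK = 1001


def build_features(
    doc_id: str,
    results: dict[str, dict[str, tuple[float, int]]],
    index_order: list[str],
) -> list[float]:
    """results maps index name -> {doc_id: (raw_score, rank)}."""
    row: list[float] = []
    for name in index_order:
        score, rank = results.get(name, {}).get(doc_id, (0.0, MISSING_RANK))
        row.extend([score, float(rank)])
    return row


results = {
    "visual": {"a": (0.35, 1), "b": (0.20, 3)},
    "transcript": {"a": (25.0, 2)},
}
order = ["visual", "transcript", "audio"]
row_a = build_features("a", results, order)
row_b = build_features("b", results, order)
```

    Keeping `index_order` fixed matters: the model learns what each column means by position, so the same order must be used at training and at query time.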

    Production Considerations



    Latency Budget



    Multi-index search adds fan-out latency. If you search 4 indexes in parallel, the total latency is bounded by the slowest index plus fusion overhead:

    total_latency = max(visual_latency, transcript_latency,
                        audio_latency, face_latency)
                  + fusion_time
                  + overhead
    


    Practical targets:
  1. First-stage retrieval per index: 5–20ms (ANN search or BM25)
  2. Fan-out overhead: 1–3ms
  3. Score fusion: <1ms (RRF) or 2–5ms (learned)
  4. Total p99: 30–50ms for 4 indexes searched in parallel
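
    The parallel fan-out can be sketched with `asyncio.gather`. The `search_index` coroutine below is a stand-in that sleeps to simulate an index call, and the simulated latencies are made-up values rather than measurements.

```python
import asyncio
import time


async def search_index(name: str, latency_s: float) -> tuple[str, list[str]]:
    # Stand-in for a real index call; the sleep simulates network + search time.
    await asyncio.sleep(latency_s)
    return name, [f"{name}_hit_1", f"{name}_hit_2"]


async def fan_out(latencies: dict[str, float]) -> tuple[dict[str, list[str]], float]:
    start = time.perf_counter()
    # All searches start together, so wall time tracks the slowest one.
    pairs = await asyncio.gather(
        *(search_index(name, delay) for name, delay in latencies.items())
    )
    elapsed = time.perf_counter() - start
    return dict(pairs), elapsed


simulated = {"visual": 0.05, "transcript": 0.03, "audio": 0.04, "face": 0.02}
results, elapsed = asyncio.run(fan_out(simulated))
```

    Run sequentially, these four calls would take about 140ms; run in parallel, `elapsed` should land near the 50ms of the slowest index. In production you would likely also wrap each call in a timeout so that one slow index cannot hold up the whole query.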


    Storage Cost



    Multiple embeddings per asset multiply storage linearly. A video that stores 5 feature streams at 768 dimensions each uses 5x the vector storage of a single-embedding system. At billion-document scale, this matters.

    Mitigation strategies:
  1. Quantization: INT8 or binary quantization reduces storage 4–32x per stream relative to float32
  2. Matryoshka dimensions: Use 256-dim instead of 768-dim for streams where precision matters less (this requires a model trained so that its embeddings can be truncated)
  3. Selective indexing: Not every asset needs every stream; index audio only for assets that have audio
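
    A quick back-of-the-envelope calculation makes the storage numbers concrete. It counts raw vector bytes only (no index overhead, replication, or metadata) and assumes one billion assets, 5 streams, and 768 dimensions per stream; the function name is illustrative.

```python
def vector_storage_gb(
    n_assets: int, n_streams: int, dims: int, bytes_per_dim: float
) -> float:
    # Raw vector payload only; ANN graph overhead and replicas are excluded.
    return n_assets * n_streams * dims * bytes_per_dim / 1e9


N = 1_000_000_000
float32_gb = vector_storage_gb(N, 5, 768, 4)        # 4 bytes per dimension
int8_gb = vector_storage_gb(N, 5, 768, 1)           # 1 byte per dimension
binary_gb = vector_storage_gb(N, 5, 768, 1 / 8)     # 1 bit per dimension
matryoshka_gb = vector_storage_gb(N, 5, 256, 4)     # truncated to 256 dims
```

    Under these assumptions the float32 baseline is about 15.4 TB, INT8 brings it to about 3.8 TB (4x), binary to about 0.5 TB (32x), and 256-dim float32 vectors to about 5.1 TB (3x).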


    Consistency



    When an asset is deleted or updated, all indexes must be updated atomically. A stale face embedding pointing to a deleted video produces ghost results.

    Solutions:
  1. Soft delete with TTL: Mark the asset as deleted and let a background cleanup job remove its index entries
  2. Asset version column: Include a version number in each index entry; filter stale versions at query time
  3. Transactional writes: Use a system that supports multi-index transactional updates
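
    The first two options can be combined into a single query-time filter. In this sketch, `Hit`, `filter_stale`, and the two lookup structures are hypothetical; a real system would read current versions and deletion markers from its asset catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    asset_id: str
    version: int
    score: float


def filter_stale(
    hits: list[Hit], current_versions: dict[str, int], deleted: set[str]
) -> list[Hit]:
    # Drop hits for soft-deleted assets and hits whose index entry was
    # written for an older version of the asset.
    return [
        hit
        for hit in hits
        if hit.asset_id not in deleted
        and current_versions.get(hit.asset_id) == hit.version
    ]


hits = [
    Hit("video_1", version=2, score=0.91),
    Hit("video_2", version=1, score=0.88),  # stale: asset is now at version 3
    Hit("video_3", version=1, score=0.75),  # soft-deleted
]
live = filter_stale(
    hits,
    current_versions={"video_1": 2, "video_2": 3, "video_3": 1},
    deleted={"video_3"},
)
```

    Filtering at query time tolerates indexes that lag behind the catalog, at the cost of occasionally returning fewer results than requested, so it helps to over-fetch slightly from each index.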


    How This Works on Mixpeek



    Mixpeek's pipeline architecture maps directly to the multi-index pattern. When you configure a pipeline with multiple extractors, each extractor produces a separate feature stream that gets stored in its own index:

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Each extractor below creates a separate searchable index
    client.pipelines.create(
        alias="media-search",
        extractors=[
            {
                "extractor": "mixpeek://image_extractor@v1/openai_clip_large_v1",
                "output_key": "visual_embedding"
            },
            {
                "extractor": "mixpeek://transcription@v1/openai_whisper_large_v3",
                "output_key": "transcript"
            },
            {
                "extractor": "mixpeek://image_extractor@v1/facebook_dinov2_large_v1",
                "output_key": "scene_features"
            }
        ]
    )


    Mixpeek's multi-stage retrievers handle query routing and score fusion automatically. A retriever can search across multiple feature indexes in a single call, using RRF or weighted fusion:

    results = client.retrievers.search(
        retriever_id="media-search-retriever",
        query="person explaining a chart in a meeting room",
        pipeline=[
            {
                "stage_type": "search",
                "stage_id": "visual_search",
                "model": "mixpeek://image_extractor@v1/openai_clip_large_v1",
                "limit": 100
            },
            {
                "stage_type": "search",
                "stage_id": "transcript_search",
                "model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
                "limit": 100
            },
            {
                "stage_type": "fusion",
                "stage_id": "rrf",
                "method": "reciprocal_rank_fusion",
                "k": 60,
                "limit": 20
            }
        ]
    )
    


    The pipeline-level decomposition during ingestion and the multi-stage retriever during search are two sides of the same architecture: decompose on write, fuse on read.

    Decision Framework



    | Question | Single Index | Multi-Index |
    |---|---|---|
    | Assets have one dominant modality? | Yes | Overkill |
    | Queries span multiple modalities? | Struggles | Designed for this |
    | Need to upgrade models independently? | Full reindex | Per-stream reindex |
    | Storage budget is tight? | Lower cost | 3-5x more vectors |
    | Need sub-10ms latency? | Easier | Requires parallel fan-out |
    | Team has search engineering expertise? | Not needed | Helpful for tuning |

    Start simple, add indexes when recall demands it. Begin with a fused multimodal embedding (CLIP, Qwen3-VL-Embedding) for v1. When users report missed results ("I know this video exists but search doesn't find it"), add a modality-specific index for the feature type they are searching for. Each new index is an incremental improvement, not a rewrite.

    Further Reading



  1. Multi-Stage Retrieval: how agents chain coarse and fine retrieval stages
  2. Omnimodal Embeddings: the fused-space alternative to multi-index
  3. Cross-Encoder Reranking: using rerankers as the fusion stage
  4. Multimodal Chunking Strategies: how to decompose assets into searchable units before indexing
  5. Evaluating Multimodal Retrieval: metrics and benchmarks for measuring multi-index search quality