Multi-Index Search Architecture: How to Combine Visual, Audio, and Text Embeddings for Rich Media

The Single-Embedding Trap

The simplest multimodal search system uses one embedding model, typically CLIP, to map everything into a shared vector space. Query with text, match against image embeddings, rank by cosine similarity. It works, and for many use cases it is good enough.

But the moment your assets carry more than one information channel, a single embedding becomes a bottleneck. Consider a 30-second video ad. It contains:

Visual content: scenes, objects, faces, text overlays, brand logos

Audio content: speech, music, sound effects

Temporal structure: pacing, shot transitions, hook timing

Metadata: resolution, codec, duration, upload date

A single CLIP embedding captures some visual semantics but discards speech and all other audio, flattens temporal structure, and compresses spatial relationships into a fixed-length vector. When an agent searches for "ads where someone mentions free shipping while holding a product," a single visual embedding cannot answer that query, because half of the query lives in the audio track.

The solution is multi-index architecture: decompose each asset into multiple feature streams, store each stream in its own index, route queries to the relevant indexes, and fuse the results into a single ranked list.

The Decomposition Pattern

Multi-index search starts with feature extraction, running multiple specialized models over each asset during ingestion. Each model extracts a different "view" of the same content:

Feature Stream	Model Example	Output	What It Captures
Visual embedding	CLIP ViT-L/14	768-dim vector	Scene semantics, objects, style
Object detections	YOLO26	Bounding boxes + labels	What objects appear and where
Face identities	RetinaFace + ArcFace	512-dim face vectors	Who appears in the content
Transcript	Whisper large-v3	Text + timestamps	What was said and when
Audio embedding	CLAP	512-dim vector	Music genre, sound events, mood
Scene captions	Florence-2	Natural language	Dense description of visual content

The key insight is that these feature streams are independent. You do not need to force them into a shared embedding space. Each stream has its own dimensionality, its own similarity metric, and its own retrieval characteristics.

This independence is a feature, not a bug. It means you can:

1. Upgrade models independently: swap Whisper for a faster ASR model without reindexing visual embeddings

2. Add new modalities: add audio fingerprinting later without touching existing indexes

3. Tune retrieval per stream: use HNSW for dense vectors, BM25 for transcripts, exact match for face identities

Index Design: Separate vs. Fused

There are two fundamental architectures for storing multi-stream features:

Separate Indexes (Index-Per-Stream)

Each feature stream gets its own index. A video with 5 feature streams lives across 5 different indexes, linked by a shared asset ID.

Asset: video_abc123
  ├── visual_index:    [768-dim CLIP vector]
  ├── transcript_index: [BM25 inverted index over transcript text]
  ├── face_index:       [512-dim ArcFace vectors, one per detected face]
  ├── audio_index:      [512-dim CLAP vector]
  └── object_index:     [structured JSON: labels, bboxes, confidences]

Advantages:

Each index uses the optimal data structure for its modality (ANN for dense vectors, BM25 for text, structured filters for objects)

Independent scaling: the face index can be sharded differently than the visual index

Independent updates: re-extract transcripts without touching visual embeddings

Clear separation of concerns in code

Disadvantages:

Query execution requires fan-out to multiple indexes

Score fusion introduces complexity (more on this below)

Asset deletion must cascade across all indexes
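The index-per-stream layout and the cascading delete it requires can be sketched with a toy in-memory store. This is a minimal illustration, not a real index: the class name `MultiIndexStore` and its methods are hypothetical, and each "index" is just a dictionary keyed by the shared asset ID.

```python
from collections import defaultdict


class MultiIndexStore:
    """Toy stand-in for several per-stream indexes linked by a shared asset ID."""

    def __init__(self, streams):
        # One independent mapping per feature stream: asset_id -> list of entries.
        self.indexes = {name: defaultdict(list) for name in streams}

    def add(self, stream, asset_id, entry):
        # A stream may hold several entries per asset (e.g. one vector per face).
        self.indexes[stream][asset_id].append(entry)

    def streams_for(self, asset_id):
        # Membership tests do not create keys on a defaultdict.
        return sorted(s for s, idx in self.indexes.items() if asset_id in idx)

    def delete_asset(self, asset_id):
        # Deletion must cascade: remove the asset from every stream's index.
        for idx in self.indexes.values():
            idx.pop(asset_id, None)


store = MultiIndexStore(["visual", "transcript", "face"])
store.add("visual", "video_abc123", [0.1] * 768)
store.add("face", "video_abc123", [0.2] * 512)
store.add("face", "video_abc123", [0.3] * 512)
```

In a real deployment each mapping would be a separate ANN, BM25, or structured index, and the cascade would be a set of delete calls that can partially fail, which is why the consistency strategies later in this article matter.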

Fused Space (Single Index)

Project all features into a single shared vector space using a model like Qwen3-VL-Embedding or ImageBind, then store everything in one index.

Advantages:

Simple query path: one embedding lookup, one ranking

No score fusion needed

Easier to reason about relevance

Disadvantages:

The shared space compresses modality-specific information

Upgrading the model means reindexing everything

Cannot use modality-specific retrieval strategies (BM25 for text, exact match for faces)

Quality ceiling is bounded by the fused model's ability to represent all modalities

The Pragmatic Middle Ground

Many production systems use a hybrid: a fused embedding for coarse retrieval, and separate indexes for modality-specific reranking and filtering.

Stage 1: Coarse retrieval via fused multimodal embedding (top 1000)
Stage 2: Parallel reranking from separate indexes (transcript, face, audio)
Stage 3: Score fusion into final ranked list (top 20)

This gives you the simplicity of a single first-stage query with the precision of modality-specific scoring.
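The three stages can be sketched in a few lines. This is an illustration under stated assumptions: `hybrid_search` is a hypothetical helper, the coarse stage is a brute-force cosine scan where production would use ANN search, each reranker is a plain function from asset ID to score, and the final stage uses reciprocal rank fusion (covered in the score fusion section).

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, fused_index, rerankers, coarse_k=1000, final_k=20, rrf_k=60):
    # Stage 1: coarse retrieval over the fused embedding (brute force here).
    coarse = sorted(
        fused_index, key=lambda a: cosine(query_vec, fused_index[a]), reverse=True
    )[:coarse_k]
    # Stage 2: each modality-specific reranker orders only the surviving candidates.
    ranked_lists = [sorted(coarse, key=score_fn, reverse=True) for score_fn in rerankers]
    # Stage 3: fuse the reranked lists by rank position.
    fused = {a: 0.0 for a in coarse}
    for ranking in ranked_lists:
        for rank, asset_id in enumerate(ranking, start=1):
            fused[asset_id] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=fused.get, reverse=True)[:final_k]


# Made-up data: two-dimensional "fused" vectors and per-modality scores.
fused_index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
transcript_scores = {"a": 0.1, "b": 0.9}
face_scores = {"a": 0.2, "b": 0.8}
rerankers = [
    lambda asset_id: transcript_scores.get(asset_id, 0.0),
    lambda asset_id: face_scores.get(asset_id, 0.0),
]
top = hybrid_search([1.0, 0.0], fused_index, rerankers, coarse_k=2, final_k=2)
```

Here asset "c" never reaches the rerankers because it fails the coarse cut, which is the main risk of the hybrid: anything the fused embedding misses in stage 1 cannot be recovered later.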

Query Routing: Deciding Which Indexes to Search

Not every query needs every index. A text-only query like "quarterly revenue presentation" should hit the transcript index and maybe scene captions, but searching the audio embedding index for music similarity adds noise.

Query routing is the logic that decides which indexes to search for a given query. There are three approaches:

Rule-Based Routing

Parse the query for modality signals and route accordingly:

Query mentions a person's name → add face index

Query mentions a sound or music → add audio index

Query mentions visual attributes (color, object, scene) → add visual index

All queries → always include transcript and visual embedding as baseline

This is simple, interpretable, and covers 80% of cases. The rules encode domain knowledge: in a media archive, transcript search is almost always relevant, so it stays on by default.
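A rule-based router along these lines might look like the sketch below. The keyword set and the list of known people are placeholders invented for illustration; a real system would draw them from its own domain vocabulary and face gallery. Because the visual index is part of the baseline, the "visual attributes" rule needs no separate branch here.

```python
import re

# Hypothetical vocabularies, for illustration only.
AUDIO_TERMS = {"music", "song", "sound", "jingle", "soundtrack", "voiceover"}
KNOWN_PEOPLE = {"jane doe", "john smith"}  # placeholder names enrolled in the face index


def route_query(query):
    text = query.lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    # Baseline: transcript and visual embedding are always searched.
    indexes = {"transcript", "visual"}
    # A known person's name adds the face index.
    if any(name in text for name in KNOWN_PEOPLE):
        indexes.add("face")
    # Any sound or music term adds the audio index.
    if tokens & AUDIO_TERMS:
        indexes.add("audio")
    return sorted(indexes)
```

For example, `route_query("quarterly revenue presentation")` stays on the two baseline indexes, while a query that names an enrolled person and mentions music fans out to all four.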

Classifier-Based Routing

Train a lightweight classifier (or use an LLM) to predict which indexes are relevant:

Input: "find clips where the CEO discusses layoffs near a whiteboard"
Output: {transcript: 0.95, visual: 0.80, face: 0.70, audio: 0.10}

Indexes above a threshold (e.g., 0.5) get searched. This handles compositional queries better than rules but adds latency and requires training data.
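Turning classifier output into a routing decision is a simple threshold filter. The sketch below assumes the classifier returns a dictionary of per-index relevance probabilities like the one above; the function name and the fallback to a baseline pair of indexes are assumptions added for illustration, not something the classifier approach prescribes.

```python
def select_indexes(relevance, threshold=0.5, fallback=("transcript", "visual")):
    # Keep only indexes whose predicted relevance is above the threshold.
    chosen = [name for name, p in relevance.items() if p > threshold]
    # Assumption: if nothing clears the threshold, search a baseline set
    # rather than returning no indexes at all.
    if not chosen:
        return list(fallback)
    # Most relevant first, so callers can cap the fan-out if needed.
    return sorted(chosen, key=lambda name: relevance[name], reverse=True)


prediction = {"transcript": 0.95, "visual": 0.80, "face": 0.70, "audio": 0.10}
routed = select_indexes(prediction)
```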

Agent-Driven Routing

Give the AI agent access to each index as a tool. The agent decides which tools to call based on its reasoning:

Agent thinks: "The user wants clips of a specific person speaking.
  I need: face search (to find the person) + transcript search
  (to find speech about layoffs) + visual search (whiteboard)."
Agent calls: search_faces(), search_transcripts(), search_visual()

This is the most flexible approach and naturally handles novel queries, but it is slower (multiple LLM calls) and less deterministic.
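On the application side, agent-driven routing reduces to exposing each index as a callable tool and executing whatever calls the agent emits. The sketch below stubs out the three search functions named above; their parameter names, the `TOOLS` registry, and the shape of a tool call (`name` plus `arguments`) are assumptions for illustration and do not follow any particular agent framework.

```python
# Stub tools; real implementations would query the face, transcript, and visual indexes.
def search_faces(person):
    return [f"face:{person}"]


def search_transcripts(text):
    return [f"transcript:{text}"]


def search_visual(description):
    return [f"visual:{description}"]


TOOLS = {
    "search_faces": search_faces,
    "search_transcripts": search_transcripts,
    "search_visual": search_visual,
}


def run_tool_calls(tool_calls):
    """Execute the tool calls an agent emitted and collect results per tool."""
    results = {}
    for call in tool_calls:
        name = call["name"]
        if name not in TOOLS:
            raise ValueError(f"unknown tool: {name}")
        results.setdefault(name, []).extend(TOOLS[name](**call["arguments"]))
    return results


# Calls an agent might emit for the CEO / layoffs / whiteboard query.
plan = [
    {"name": "search_faces", "arguments": {"person": "CEO"}},
    {"name": "search_transcripts", "arguments": {"text": "layoffs"}},
    {"name": "search_visual", "arguments": {"description": "whiteboard"}},
]
tool_results = run_tool_calls(plan)
```

Rejecting unknown tool names keeps a confused agent from silently searching nothing, which helps offset some of the lost determinism.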

Score Fusion: Combining Results from Multiple Indexes

When you search three indexes and get three ranked lists, you need to combine them into one. This is the score fusion problem, and getting it right is the difference between a search system that works and one that frustrates users.

The Score Incompatibility Problem

Scores from different indexes are not comparable:

CLIP cosine similarity: ranges from -1 to 1, typically 0.15-0.40

BM25 scores: unbounded positive numbers, highly variable

Face distance: typically L2 distance, lower is better

Object detection confidence: 0 to 1

You cannot average these directly. A BM25 score of 25 is not "better" than a cosine similarity of 0.35.

Reciprocal Rank Fusion (RRF)

RRF sidesteps the score normalization problem entirely by using only the rank position from each list:

RRF_score(doc) = sum( 1 / (k + rank_i(doc)) ) for each list i

Where k is a constant (typically 60) that dampens the advantage of top-ranked positions, so that a single #1 placement cannot dominate the fused list. With k=60, a document ranked #1 in two lists scores 2/61 ≈ 0.0328, while a document ranked #1 in one list and #100 in another scores 1/61 + 1/160 ≈ 0.0226.
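The formula translates almost directly into code. The sketch below assumes each ranked list is a sequence of document IDs ordered best first, with ranks starting at 1; the function name and the sample lists are made up for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs using only rank positions."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            # A document absent from a list simply receives no contribution from it.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


visual_hits = ["a", "b", "c"]
transcript_hits = ["b", "a", "d"]
audio_hits = ["b", "e"]
fused = reciprocal_rank_fusion([visual_hits, transcript_hits, audio_hits])
```

Document "b" comes out on top because it appears near the head of all three lists, even though it is not first in the visual list.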

Why RRF works well in practice:

No score normalization needed

Little or no hyperparameter tuning (k=60 is a common default that works well across many domains)

Robust to missing results (a document absent from one list simply gets no contribution from that list)

Simple to implement

Where RRF falls short:

Treats all indexes as equally important

Ignores the actual scores: a document at rank #2 with a score of 0.99 is treated the same as one at rank #2 with a score of 0.51

Cannot learn domain-specific weighting

Weighted Linear Combination

Normalize scores from each index to [0, 1], then compute a weighted sum:

final_score(doc) = w_visual * norm(visual_score)
                 + w_transcript * norm(transcript_score)
                 + w_audio * norm(audio_score)

Normalization options:

Min-max: (score - min) / (max - min) over the result set

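Z-score: (score - mean) / std, then map to [0, 1], for example with a sigmoid. Clipping raw z-scores to [0, 1] would zero out every below-average result, because z-scores are centered on 0.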

Rank-based: convert to percentile rank

The weights w_visual, w_transcript, w_audio encode how important each modality is for your domain. In a podcast search, transcript weight dominates. In a fashion catalog, visual weight dominates.

Tuning weights: Start with equal weights, then adjust based on relevance judgments. Even a small labeled set (50-100 queries with relevance labels) is enough to tune weights via grid search.
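A minimal sketch of weighted fusion with min-max normalization follows. It assumes every stream's score is "higher is better" (a distance such as L2 would need to be inverted first) and that a document missing from a stream contributes 0 for that stream; the function names, scores, and weights are invented for illustration.

```python
def min_max(scores):
    """Rescale one stream's scores to [0, 1] over the current result set."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a full match for this stream.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(stream_scores, weights):
    fused = {}
    for stream, scores in stream_scores.items():
        for doc, value in min_max(scores).items():
            # Documents missing from a stream implicitly contribute 0 for it.
            fused[doc] = fused.get(doc, 0.0) + weights[stream] * value
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


stream_scores = {
    "visual": {"a": 0.40, "b": 0.15, "c": 0.30},  # cosine similarities
    "transcript": {"a": 5.0, "b": 25.0},          # BM25 scores
}
ranked = weighted_fusion(stream_scores, {"visual": 0.3, "transcript": 0.7})
```

With the transcript weighted at 0.7, document "b" wins despite having the weakest visual score. A grid search over weights would wrap `weighted_fusion` in a loop and score each candidate weighting against the labeled queries.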

Learned Fusion

Train a model to predict relevance from the raw scores of each index:

Input features: [visual_score, visual_rank, transcript_score,
                 transcript_rank, audio_score, audio_rank,
                 face_match, object_count, ...]
Output: relevance score

This is typically a gradient-boosted tree (XGBoost/LightGBM) trained on click logs or human relevance judgments. It can learn non-linear interactions: "high visual score + high transcript score together is more relevant than either alone."

When to use learned fusion: When you have enough training data (1000+ labeled query-document pairs) and the domain is stable enough that a trained model generalizes.

Production Considerations

Latency Budget

Multi-index search adds fan-out latency. If you search 4 indexes in parallel, the total latency is bounded by the slowest index plus fusion overhead:

total_latency = max(visual_latency, transcript_latency,
                    audio_latency, face_latency)
              + fusion_time
              + overhead

Practical targets:

First-stage retrieval per index: 5-20ms (ANN search or BM25)

Fan-out overhead: 1-3ms

Score fusion: <1ms (RRF) or 2-5ms (learned)

Total p99: 30-50ms for 4 indexes searched in parallel
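The "slowest index wins" behavior is easy to see with a small asyncio sketch. The index calls here are simulated with sleeps and the latencies are made-up values; in practice each coroutine would wrap a real client call to the corresponding index.

```python
import asyncio
import time


async def search_index(name, latency_s):
    # Stand-in for a real index query; the sleep simulates network plus search time.
    await asyncio.sleep(latency_s)
    return name, [f"{name}-hit"]


async def fan_out(simulated_latencies):
    start = time.perf_counter()
    # All searches run concurrently, so wall time tracks the slowest one.
    pairs = await asyncio.gather(
        *(search_index(name, latency) for name, latency in simulated_latencies.items())
    )
    elapsed = time.perf_counter() - start
    return dict(pairs), elapsed


def run_fan_out(simulated_latencies):
    return asyncio.run(fan_out(simulated_latencies))
```

Calling `run_fan_out({"visual": 0.020, "transcript": 0.010, "audio": 0.015, "face": 0.005})` should take roughly as long as the 20 ms visual search rather than the 50 ms sum, plus scheduling overhead. Adding a per-index timeout (for example with `asyncio.wait_for`) is a common way to keep one slow index from blowing the whole budget.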

Storage Cost

Multiple embeddings per asset multiply storage linearly. A video that stores 5 feature streams at 768 dimensions each uses 5x the vector storage of a single-embedding system. At billion-document scale, this matters.

Mitigation strategies:

Quantization: relative to float32 vectors, INT8 quantization reduces storage 4x per stream and binary quantization reduces it 32x, usually at some cost in recall

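Matryoshka dimensions: Use 256-dim instead of 768-dim for streams where precision matters less. This only works with embedding models trained so that a truncated prefix of the vector remains a usable embedding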

Selective indexing: Not every asset needs every stream. Index audio only for assets that have audio
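The storage arithmetic behind these numbers can be checked with a short calculation. The helper below is a hypothetical sketch that counts raw vector payload only and ignores index overhead such as HNSW graph links, IDs, and metadata.

```python
def vector_storage_bytes(num_assets, stream_dims, bits_per_dim=32):
    """Raw vector payload in bytes, assuming one vector per stream per asset."""
    total_dims = sum(stream_dims.values())
    return num_assets * total_dims * bits_per_dim // 8


one_billion = 1_000_000_000
five_streams = {f"stream_{i}": 768 for i in range(5)}

single = vector_storage_bytes(one_billion, {"visual": 768})           # float32
multi_fp32 = vector_storage_bytes(one_billion, five_streams)          # float32
multi_int8 = vector_storage_bytes(one_billion, five_streams, bits_per_dim=8)
multi_binary = vector_storage_bytes(one_billion, five_streams, bits_per_dim=1)
```

At one billion assets, five float32 streams of 768 dimensions come to about 15.4 TB of raw vectors versus about 3.1 TB for a single stream. INT8 brings the five-stream total down to about 3.8 TB and binary to under 0.5 TB.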

Consistency

When an asset is deleted or updated, all indexes must be updated atomically. A stale face embedding pointing to a deleted video produces ghost results.

Solutions:

Soft delete with TTL: Mark the asset as deleted, let background cleanup remove index entries

Asset version column: Include a version number in each index entry; filter stale versions at query time

Transactional writes: Use a system that supports multi-index transactional updates
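The asset-version approach can be sketched as a query-time filter. The sketch assumes a small authoritative map from asset ID to its current version (deleted assets are simply absent) and that every index entry carries the version it was written for; the function name and field names are hypothetical.

```python
def filter_stale(hits, current_versions):
    """Drop hits for deleted assets or for entries written against an older version."""
    return [
        hit for hit in hits
        if current_versions.get(hit["asset_id"]) == hit["version"]
    ]


# Authoritative state: video_1 was re-extracted (now v2), video_2 was deleted.
current_versions = {"video_1": 2, "video_3": 1}

face_hits = [
    {"asset_id": "video_1", "version": 1, "score": 0.91},  # stale entry
    {"asset_id": "video_1", "version": 2, "score": 0.88},  # current entry
    {"asset_id": "video_2", "version": 1, "score": 0.85},  # ghost of a deleted asset
    {"asset_id": "video_3", "version": 1, "score": 0.80},
]
live_hits = filter_stale(face_hits, current_versions)
```

Because filtering happens after retrieval, it is worth over-fetching slightly from each index so that dropped stale hits do not leave the final list short.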

How This Works on Mixpeek

Mixpeek's pipeline architecture maps directly to the multi-index pattern. When you configure a pipeline with multiple extractors, each extractor produces a separate feature stream that gets stored in its own index:

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Each extractor below creates a separate searchable index
client.pipelines.create(
    alias="media-search",
    extractors=[
        {
            "extractor": "mixpeek://image_extractor@v1/openai_clip_large_v1",
            "output_key": "visual_embedding"
        },
        {
            "extractor": "mixpeek://transcription@v1/openai_whisper_large_v3",
            "output_key": "transcript"
        },
        {
            "extractor": "mixpeek://image_extractor@v1/facebook_dinov2_large_v1",
            "output_key": "scene_features"
        }
    ]
)

Mixpeek's multi-stage retrievers handle query routing and score fusion automatically. A retriever can search across multiple feature indexes in a single call, using RRF or weighted fusion:

results = client.retrievers.execute(
    retriever_id="media-search-retriever",
    query="person explaining a chart in a meeting room",
    pipeline=[
        {
            "stage_type": "search",
            "stage_id": "visual_search",
            "model": "mixpeek://image_extractor@v1/openai_clip_large_v1",
            "limit": 100
        },
        {
            "stage_type": "search",
            "stage_id": "transcript_search",
            "model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
            "limit": 100
        },
        {
            "stage_type": "fusion",
            "stage_id": "rrf",
            "method": "reciprocal_rank_fusion",
            "k": 60,
            "limit": 20
        }
    ]
)

The pipeline-level decomposition during ingestion and the multi-stage retriever during search are two sides of the same architecture: decompose on write, fuse on read.

Decision Framework

Question	Single Index	Multi-Index
Assets have one dominant modality?	Yes	Overkill
Queries span multiple modalities?	Struggles	Designed for this
Need to upgrade models independently?	Full reindex	Per-stream reindex
Storage budget is tight?	Lower cost	3-5x more vectors
Need sub-10ms latency?	Easier	Requires parallel fan-out
Team has search engineering expertise?	Not needed	Helpful for tuning

Start simple, and add indexes when recall demands it. Begin with a fused multimodal embedding (CLIP, Qwen3-VL-Embedding) for v1. When users report missed results ("I know this video exists but search doesn't find it"), add a modality-specific index for the feature type they are searching for. Each new index is an incremental improvement, not a rewrite.