    Data Infrastructure
    18 min read
    Updated 2026-03-28

    Multimodal Data Warehouse Architecture Deep Dive

    Technical architecture of a multimodal data warehouse — from the inference engine (Ray Serve) to tiered storage (S3 Vectors + Qdrant) to composable retrieval pipelines.


    Architecture Overview



    A multimodal data warehouse is organized into five layers, each responsible for a distinct part of the data lifecycle:

    1. API Layer — FastAPI-based REST API that handles all client interactions: object ingestion, retrieval pipeline execution, taxonomy management, and namespace administration.
    2. Task Queue Layer — Celery workers backed by Redis that orchestrate asynchronous processing: batch ingestion, feature extraction coordination, taxonomy application, and storage lifecycle transitions.
    3. Inference Engine — Ray Serve cluster running 14+ model endpoints for distributed feature extraction. This is the computational core that transforms raw objects into queryable features.
    4. Storage Layer — Tiered storage across Qdrant (hot vector index), S3 Vectors (warm canonical store), S3 (cold object storage), and MongoDB (metadata and configuration).
    5. Query Layer — Multi-stage retrieval engine that executes composable pipelines across the storage layer.

    Each layer scales independently. The API layer scales horizontally behind a load balancer. The task queue scales by adding workers. The inference engine autoscales GPU and CPU nodes based on queue depth. Storage tiers scale by capacity.

    The Inference Engine



    The inference engine is a Ray Serve cluster that hosts specialized model endpoints for every type of feature extraction the warehouse supports. Ray Serve provides:

  - Autoscaling — Replicas scale up under load and scale down during idle periods, optimizing GPU utilization.
  - Batching — Requests are batched transparently to maximize throughput on GPU hardware.
  - Model composition — Pipelines chain multiple models (e.g., scene detection followed by face extraction followed by embedding generation) without network round-trips.
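The transparent batching idea can be illustrated with a small asyncio micro-batcher — a conceptual sketch, not Ray Serve's actual API: individual requests queue up briefly, then run through the model as one batch, amortizing per-call GPU overhead.

```python
import asyncio

class MicroBatcher:
    """Collect individual requests for a short window, then run them as one batch."""

    def __init__(self, batch_handler, max_batch_size=8, wait_s=0.01):
        self.batch_handler = batch_handler  # processes a list of inputs in one call
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s
        self._pending = []  # list of (input, Future) awaiting a flush

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self.max_batch_size:
            self._flush()  # batch is full: run it immediately
        else:
            loop.call_later(self.wait_s, self._flush)  # otherwise flush after the wait window
        return await fut

    def _flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        outputs = self.batch_handler([item for item, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            if not fut.done():
                fut.set_result(out)

async def main():
    # Stand-in "model" that doubles every element of the batch in one call.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    return await asyncio.gather(*(batcher.submit(i) for i in range(10)))

print(asyncio.run(main()))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

In Ray Serve this queue-and-flush logic is handled for you; the sketch only shows why batching raises throughput without callers having to coordinate.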


    Model Endpoints



    The engine currently runs 14+ model endpoints:

    | Model | Type | Output | Use Case |
    |---|---|---|---|
    | ArcFace | Face embedding | 512d vector | Face detection and recognition |
    | SigLIP | Vision-language | 768d vector | Logo and brand detection, visual search |
    | CLAP | Audio-language | 512d vector | Audio classification and search |
    | Whisper | Speech-to-text | Text transcript | Transcription and spoken content search |
    | YOLO | Object detection | Bounding boxes + labels | Logo, object, and scene element detection |
    | AST | Audio embedding | 768d vector | Audio fingerprinting |
    | PANNs | Audio embedding | 2048d vector | Environmental sound classification |
    | LLMs | Text generation | Structured text | Taxonomy classification, summarization |
    | Scene detection | Temporal segmentation | Timestamps | Video scene boundary detection |
    | OCR | Text extraction | Text + bounding boxes | On-screen text detection |
    | Preprocessing | Format conversion | Normalized media | Resampling, format conversion, normalization |

    Each model endpoint is independently versioned and can be updated without affecting other endpoints. Feature URIs encode the model version, ensuring reproducibility.

    Feature Extraction Pipeline



    When an object enters the warehouse, it goes through a multi-stage extraction pipeline:

    1. Preprocessing



    Raw objects are normalized into formats suitable for model inference:

  - Video — Decoded into frames at a configurable rate (default: 1 fps for analysis, scene-boundary-triggered for keyframes). Audio track is extracted separately.
  - Audio — Resampled to 16kHz mono, normalized to consistent volume levels.
  - Images — Resized and normalized to model-specific input dimensions.
  - Documents — Text extracted via OCR or direct parsing (PDF, DOCX).
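The audio step above can be sketched in a few lines — a minimal illustration of downmixing and peak normalization only. The 16kHz resampling itself (typically delegated to ffmpeg or a DSP library) is omitted, and the `peak` parameter is an illustrative knob, not a documented setting.

```python
def preprocess_audio(left, right, peak=0.9):
    """Downmix a stereo pair to mono and peak-normalize the result.

    `left`/`right` are equal-length lists of float samples in [-1, 1].
    """
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]  # average the two channels
    max_abs = max((abs(s) for s in mono), default=0.0)
    if max_abs == 0.0:
        return mono  # silence: nothing to scale
    scale = peak / max_abs  # bring the loudest sample to the target peak
    return [s * scale for s in mono]

samples = preprocess_audio([0.5, -0.25, 0.0], [0.5, -0.25, 0.0])
print(samples)  # loudest sample now sits at ±0.9
```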


    2. Scene Detection (Video)



    For video objects, scene detection runs first to identify visual boundaries. This splits a continuous video into coherent segments, each of which is processed independently. Scene detection uses a combination of histogram analysis and learned models to identify transitions.
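The histogram half of that approach can be sketched as follows — a toy cut detector over grayscale frames (here, flat lists of 0–255 pixel values). The bin count and threshold are illustrative defaults; the learned-model half is out of scope for a short example.

```python
def histogram(frame, bins=8):
    """Coarse intensity histogram of a grayscale frame (pixel values 0-255)."""
    counts = [0] * bins
    for px in frame:
        counts[min(px * bins // 256, bins - 1)] += 1
    total = float(len(frame))
    return [c / total for c in counts]

def scene_boundaries(frames, threshold=0.5):
    """Flag frame indices where the histogram L1-distance jumps past `threshold`."""
    cuts = []
    prev = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = histogram(frame)
        dist = sum(abs(a - b) for a, b in zip(prev, cur))  # 0 = identical, 2 = disjoint
        if dist > threshold:
            cuts.append(i)  # large jump: treat as a scene boundary
        prev = cur
    return cuts

dark = [10] * 100    # a uniformly dark frame
bright = [240] * 100  # a uniformly bright frame
print(scene_boundaries([dark, dark, bright, bright]))  # → [2]
```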

    3. Parallel Feature Extraction



    Once an object is preprocessed (and optionally scene-split), feature extraction runs in parallel across all configured extractors. A single video might simultaneously have:

  - Face detection running on keyframes
  - Logo detection running on all frames
  - Audio fingerprinting running on the audio track
  - Whisper transcription running on the audio track
  - Visual embedding generation running on scene keyframes


    Ray Serve's model composition handles the parallelism. Features from each extractor are collected, tagged with feature URIs, and written to the storage layer.
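The fan-out pattern looks roughly like this — a local sketch using a thread pool in place of Ray Serve, with stub extractor functions (`detect_faces`, `detect_logos`, `transcribe` are hypothetical names standing in for real endpoints):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub extractors standing in for remote model endpoints.
def detect_faces(obj): return {"extractor": "face", "source": obj}
def detect_logos(obj): return {"extractor": "logo", "source": obj}
def transcribe(obj):   return {"extractor": "whisper", "source": obj}

EXTRACTORS = [detect_faces, detect_logos, transcribe]

def extract_all(obj):
    """Fan one preprocessed object out to every configured extractor in parallel."""
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = [pool.submit(fn, obj) for fn in EXTRACTORS]
        return [f.result() for f in futures]  # collected in extractor order

features = extract_all("video_123")
print([f["extractor"] for f in features])  # → ['face', 'logo', 'whisper']
```

In the real system each extractor is a network-free composition step inside the Ray cluster, but the shape — one input, many concurrent feature streams, one collected result — is the same.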

    4. Feature URI Assignment



    Every extracted feature receives a URI that encodes its provenance:

    mixpeek://arcface@v2.1/face_embedding?source=video_123&frame=1500&bbox=100,200,300,400
    


    This URI tells you:
  - Which extractor produced the feature (arcface)
  - Which version of the extractor (v2.1)
  - What type of output (face_embedding)
  - The source object (video_123)
  - The exact location within the source (frame 1500, bounding box coordinates)


    Feature URIs enable full lineage tracking: from a search result, you can trace back to the exact frame, timestamp, or audio segment in the original object.
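Because the URI follows standard URL syntax, the provenance fields fall out of `urllib.parse` directly. A sketch (the dict keys are my naming, not an official schema):

```python
from urllib.parse import urlparse, parse_qs

def parse_feature_uri(uri):
    """Decompose a feature URI into its provenance fields."""
    u = urlparse(uri)
    return {
        "extractor": u.username,            # e.g. arcface
        "version": u.hostname,              # e.g. v2.1
        "feature_type": u.path.lstrip("/"), # e.g. face_embedding
        "params": {k: v[0] for k, v in parse_qs(u.query).items()},
    }

info = parse_feature_uri(
    "mixpeek://arcface@v2.1/face_embedding"
    "?source=video_123&frame=1500&bbox=100,200,300,400"
)
print(info["extractor"], info["version"], info["params"]["frame"])
# → arcface v2.1 1500
```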

    Tiered Storage



    The storage layer manages features across four tiers, each optimized for different access patterns:

    Hot Tier: Qdrant



    Qdrant serves as the hot vector index for real-time retrieval. Features that need sub-millisecond search latency live here. Qdrant provides:

  - HNSW index for approximate nearest neighbor search
  - Filtered search with payload-based conditions
  - Multi-vector support for objects with multiple feature types


    Hot storage is the most expensive per GB but delivers the performance required for production search workloads.

    Warm Tier: S3 Vectors



    S3 Vectors is the canonical store for all features. Every feature extracted by the inference engine is written to S3 Vectors, regardless of whether it is also indexed in Qdrant. S3 Vectors provides:

  - Durable, versioned storage for all feature vectors
  - Batch retrieval for analytics and re-indexing workloads
  - Cost-effective storage at a fraction of Qdrant's per-GB cost


    When a collection transitions from hot to warm, its features are removed from Qdrant but remain in S3 Vectors. They can be rehydrated to hot storage if query patterns change.

    Cold Tier: S3



    Raw source objects and infrequently accessed features are stored in S3. This tier provides:

  - Lowest cost per GB for bulk storage
  - Lifecycle policies for automatic transition to Glacier or Deep Archive
  - Source object retention for re-extraction if models are updated


    Archive Tier: Metadata Only



    Archived collections retain only their metadata (feature URIs, timestamps, provenance) in MongoDB. The actual vectors and source objects can be re-extracted from cold storage if needed. This tier is used for long-term compliance retention where the data must be recoverable but is never queried.

    Lifecycle Management



    Collections transition between tiers automatically based on configurable policies:

    hot (0-30 days) -> warm (30-90 days) -> cold (90-365 days) -> archive (365+ days)
    


    Transition triggers can be time-based, query-volume-based, or manual. The system tracks query patterns per collection and can recommend tier transitions based on actual usage.
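A time-based policy like the one above reduces to a small decision function. This sketch encodes the default age thresholds and adds a hypothetical query-volume override (`queries_last_30d` and `hot_min_queries` are illustrative knob names, not real configuration keys):

```python
def recommend_tier(age_days, queries_last_30d=None, hot_min_queries=100):
    """Map a collection's age onto the default hot/warm/cold/archive policy.

    A heavily queried collection stays hot regardless of age.
    """
    if queries_last_30d is not None and queries_last_30d >= hot_min_queries:
        return "hot"  # query-volume trigger overrides the time-based policy
    if age_days < 30:
        return "hot"
    if age_days < 90:
        return "warm"
    if age_days < 365:
        return "cold"
    return "archive"

print(recommend_tier(45))                        # → warm
print(recommend_tier(45, queries_last_30d=500))  # → hot (kept by usage)
print(recommend_tier(400))                       # → archive
```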

    Multi-Stage Retrieval Engine



    The retrieval engine executes composable pipelines defined as sequences of typed stages. Each stage takes a set of documents as input and produces a (potentially modified) set of documents as output.

    Stage Types



    | Stage Type | Purpose | Example |
    |---|---|---|
    | `filter` | Narrow the candidate set | Feature search, metadata filter, date range |
    | `sort` | Rank candidates | Linear score combination, recency weighting |
    | `reduce` | Collapse or sample results | Top-K, deduplication, clustering-based sampling |
    | `enrich` | Augment with external data | Semantic join, metadata lookup, API call |
    | `apply` | Transform results | Taxonomy classification, LLM summarization |

    Data Flow



    Pipeline execution follows a strict data flow:

    1. The first stage (typically a `filter`) retrieves an initial candidate set from the storage layer.
    2. Each subsequent stage receives the output of the previous stage as input.
    3. Stages can add, remove, or modify documents in the result set.
    4. The final stage's output is returned to the client.

    This design ensures that expensive operations (like LLM calls in an `apply` stage) only run on a small, pre-filtered result set rather than the entire corpus.
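The stage model above reduces to a very small executor: each stage is a function from a document list to a document list, and an empty result short-circuits the rest. The stages and corpus below are hypothetical, purely to show the `filter` → `sort` → `reduce` flow:

```python
def run_pipeline(stages, initial_docs):
    """Run a composable pipeline: each stage maps docs to docs, in order."""
    docs = initial_docs
    for stage in stages:
        docs = stage(docs)
        if not docs:
            break  # nothing left to sort/enrich/apply: stop early
    return docs

# Toy stages: find "cat"-tagged documents, keep the top 2 by score.
filter_stage = lambda docs: [d for d in docs if "cat" in d["tags"]]
sort_stage = lambda docs: sorted(docs, key=lambda d: d["score"], reverse=True)
reduce_stage = lambda docs: docs[:2]

corpus = [
    {"id": 1, "tags": ["cat"], "score": 0.9},
    {"id": 2, "tags": ["dog"], "score": 0.8},
    {"id": 3, "tags": ["cat"], "score": 0.4},
    {"id": 4, "tags": ["cat"], "score": 0.7},
]
result = run_pipeline([filter_stage, sort_stage, reduce_stage], corpus)
print([d["id"] for d in result])  # → [1, 4]
```

Because the cheap `filter` runs first, a later expensive stage (say, an LLM-backed `apply`) only ever sees the two surviving documents.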

    Optimization



    The retrieval engine applies several optimizations:

  - Early termination — If a `reduce` stage reduces the result set to zero, subsequent stages are skipped.
  - Parallel execution — Independent stages (e.g., multiple `filter` stages that search different collections) can execute in parallel.
  - Caching — Intermediate results are cached for repeated queries with the same parameters.
  - Pushdown — Filter conditions are pushed down to the storage layer (Qdrant payload filters) to minimize data transfer.


    The Semantic Join



    The semantic join is implemented as the `document_enrich` stage. It connects documents from one collection with related documents from another collection using vector similarity.

    How It Works



    1. For each document in the current result set, the enrich stage extracts a feature vector.
    2. That vector is used to search a secondary collection (the "source" collection).
    3. Matching documents from the source collection are attached to the original document as enrichment metadata.
    4. The enriched result set passes to the next stage.
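Those four steps can be sketched with plain cosine similarity — an in-memory illustration, not the production stage, with made-up two-dimensional vectors and illustrative `k`/`min_score` parameters:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_join(docs, source_collection, vector_key="vector", k=1, min_score=0.5):
    """Attach the top-k most similar source documents to each input document."""
    for doc in docs:
        scored = sorted(
            ((cosine(doc[vector_key], src[vector_key]), src) for src in source_collection),
            key=lambda pair: pair[0],
            reverse=True,
        )
        doc["enrichment"] = [
            {"id": src["id"], "score": round(score, 3)}
            for score, src in scored[:k] if score >= min_score
        ]
    return docs

faces = [{"id": "face_1", "vector": [1.0, 0.0]}]
logos = [{"id": "logo_a", "vector": [0.9, 0.1]}, {"id": "logo_b", "vector": [0.0, 1.0]}]
print(semantic_join(faces, logos)[0]["enrichment"])  # → [{'id': 'logo_a', 'score': 0.994}]
```

In the warehouse, the inner search hits the vector index rather than scanning the source collection, but the join semantics — similarity instead of key equality — are exactly this.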

    Example: IP Safety Cross-Reference



    A retrieval pipeline for IP safety might:

    1. Filter — Search the face collection for faces similar to a query image.
    2. Enrich — For each matched face, search the logo collection for logos that appear in the same source video.
    3. Apply — Run an LLM to summarize the combined face + logo detections into a risk assessment.

    This produces a report that says "Video X contains a face matching Celebrity Y at timestamp 00:15, and a branded logo matching Trademark Z at timestamp 00:18" — all from a single pipeline execution.

    Taxonomy Engine



    The taxonomy engine provides three modes of classification:

    Materialized Taxonomies



    Classification runs as part of the ingestion pipeline. When a new object is processed, the taxonomy engine:

    1. Extracts the relevant feature (e.g., text transcript, visual embedding).
    2. Runs classification against the defined categories (using an LLM or embedding-based classifier).
    3. Stores the classification as metadata on the document.

    Materialized taxonomies are fast at query time (the classification is pre-computed) but require re-processing when categories change.
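The embedding-based variant of step 2 can be sketched as nearest-centroid classification. The category names and centroid vectors here are invented for illustration; in practice each centroid would be an embedding of the category's description or exemplars:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical category centroids in a shared embedding space.
CATEGORIES = {
    "sports": [0.9, 0.1, 0.0],
    "music":  [0.0, 0.9, 0.1],
    "news":   [0.1, 0.0, 0.9],
}

def classify(doc_vector):
    """Pick the category whose centroid is nearest to the document embedding."""
    label, score = max(
        ((name, cosine(doc_vector, centroid)) for name, centroid in CATEGORIES.items()),
        key=lambda pair: pair[1],
    )
    return {"label": label, "confidence": round(score, 3)}

print(classify([0.8, 0.2, 0.0]))  # a "sports"-like embedding
```

This is what makes materialized mode cheap at query time: the `classify` call runs once at ingestion and the label is thereafter just metadata.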

    On-Demand Taxonomies



    Classification runs at query time as an `apply` stage in a retrieval pipeline. The taxonomy engine:

    1. Receives documents from the previous pipeline stage.
    2. Runs classification on each document.
    3. Attaches the classification and passes documents to the next stage.

    On-demand taxonomies are flexible (no pre-processing required) but add latency to queries.

    Retroactive Taxonomies



    Batch classification over historical data. When a new taxonomy is created with `mode=retroactive`, the system:

    1. Identifies all existing documents in the target collection.
    2. Schedules batch classification jobs through the task queue.
    3. Updates document metadata with new classifications as jobs complete.

    This lets you evolve your classification scheme without re-ingesting source objects.

    Scaling



    Ray Serve Autoscaling



    The inference engine autoscales based on:

  - Queue depth — When pending requests exceed a threshold, new replicas are launched.
  - GPU utilization — Replicas are added when GPU utilization exceeds 80% and removed when it drops below 20%.
  - Model-specific scaling — Each model endpoint scales independently. A spike in face detection requests does not affect Whisper transcription capacity.
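The two triggers combine into a per-endpoint decision function along these lines. The 80%/20% GPU thresholds come from the text; `queue_threshold` and the replica bounds are illustrative defaults, not documented settings:

```python
def scaling_decision(pending_requests, gpu_util, replicas,
                     queue_threshold=100, min_replicas=1, max_replicas=8):
    """Decide the next replica count for one model endpoint."""
    # Scale up on either trigger: deep queue or hot GPUs.
    if replicas < max_replicas and (pending_requests > queue_threshold or gpu_util > 0.8):
        return replicas + 1
    # Scale down only when the endpoint is clearly idle.
    if replicas > min_replicas and pending_requests == 0 and gpu_util < 0.2:
        return replicas - 1
    return replicas

print(scaling_decision(250, 0.85, replicas=2))  # → 3 (scale up)
print(scaling_decision(0, 0.05, replicas=2))    # → 1 (scale down)
print(scaling_decision(50, 0.5, replicas=2))    # → 2 (hold)
```

Because the function is evaluated per endpoint, a face-detection spike raises only ArcFace's replica count, matching the model-specific scaling described above.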


    Batch Processing



    Large ingestion jobs are processed as batches:

    1. Files are uploaded to a bucket.
    2. The task queue splits the upload into chunks.
    3. Each chunk is processed in parallel across available inference engine replicas.
    4. Progress is tracked per-batch with percentage completion.
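Steps 2 and 4 amount to simple chunking and progress arithmetic. A sketch with a made-up chunk size and file manifest:

```python
def chunk(items, size):
    """Split an upload manifest into fixed-size chunks for parallel workers."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def batch_progress(total_chunks, completed_chunks):
    """Per-batch percentage completion, as tracked in step 4."""
    return round(100.0 * completed_chunks / total_chunks, 1)

files = [f"s3://bucket/video_{i:03}.mp4" for i in range(10)]
chunks = chunk(files, 4)
print(len(chunks), [len(c) for c in chunks])  # → 3 [4, 4, 2]
print(batch_progress(len(chunks), 2))         # → 66.7
```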

    Distributed Inference



    For large models or high-throughput workloads, Ray Serve supports:

  - Model parallelism — Large models are split across multiple GPUs.
  - Pipeline parallelism — Different stages of the extraction pipeline run on different nodes.
  - Data parallelism — Multiple replicas of the same model process different inputs simultaneously.


    Comparison to Traditional Data Warehouses



    | Aspect | Traditional (Snowflake/BigQuery) | Multimodal (Mixpeek) |
    |---|---|---|
    | Data type | Structured (rows, columns) | Unstructured (video, audio, images, documents) |
    | Ingestion | ETL / ELT | Object decomposition via inference engine |
    | Schema | Defined upfront | Emergent via feature extraction and taxonomies |
    | Query language | SQL | Multi-stage retrieval pipelines |
    | Joins | Foreign key joins | Semantic joins (vector similarity) |
    | Storage | Columnar | Tiered (hot vectors, warm S3 Vectors, cold S3) |
    | Compute | Query engine (CPU) | Inference engine (GPU + CPU) |
    | Scaling | Warehouse size (S/M/L/XL) | Per-model autoscaling |
    | Schema evolution | ALTER TABLE / migrations | Retroactive taxonomies |

    The analogy is precise: a multimodal data warehouse is to unstructured data what Snowflake is to structured data. The primitives are different (features instead of columns, retrieval pipelines instead of SQL, semantic joins instead of foreign key joins), but the system-level goals are the same: ingest, store, and query data at scale with governance and cost management.

    Learn more about Mixpeek's architecture at mixpeek.com/docs or explore the multimodal data warehouse page.

    Related Resources



  - What Is a Multimodal Data Warehouse? — comprehensive overview of the category
  - How to Build a Multimodal Data Warehouse — step-by-step tutorial with Python SDK code
  - Multimodal Data Warehouse vs. Vector Database — full comparison
  - Multimodal Data Warehouse vs. Multimodal Database — system vs component
  - Best AI Data Warehouses (2026) — 7 platforms evaluated
  - Glossary: Multimodal Data Warehouse — technical definition
  - IP Safety Solution — see the warehouse powering pre-publication copyright detection