Architecture Overview
A multimodal data warehouse is organized into five layers, each responsible for a distinct part of the data lifecycle:
1. API Layer — FastAPI-based REST API that handles all client interactions: object ingestion, retrieval pipeline execution, taxonomy management, and namespace administration.
2. Task Queue Layer — Celery workers backed by Redis that orchestrate asynchronous processing: batch ingestion, feature extraction coordination, taxonomy application, and storage lifecycle transitions.
3. Inference Engine — Ray Serve cluster running 14+ model endpoints for distributed feature extraction. This is the computational core that transforms raw objects into queryable features.
4. Storage Layer — Tiered storage across Qdrant (hot vector index), S3 Vectors (warm canonical store), S3 (cold object storage), and MongoDB (metadata and configuration).
5. Query Layer — Multi-stage retrieval engine that executes composable pipelines across the storage layer.
Each layer scales independently. The API layer scales horizontally behind a load balancer. The task queue scales by adding workers. The inference engine autoscales GPU and CPU nodes based on queue depth. Storage tiers scale by capacity.
The Inference Engine
The inference engine is a Ray Serve cluster that hosts specialized model endpoints for every type of feature extraction the warehouse supports. Ray Serve provides model composition across endpoints, per-endpoint autoscaling, and shared GPU scheduling across the cluster.
Model Endpoints
The engine currently runs 14+ model endpoints:
| Model | Type | Output | Use Case |
| --- | --- | --- | --- |
| ArcFace | Face embedding | 512d vector | Face detection and recognition |
| SigLIP | Vision-language | 768d vector | Logo and brand detection, visual search |
| CLAP | Audio-language | 512d vector | Audio classification and search |
| Whisper | Speech-to-text | Text transcript | Transcription and spoken content search |
| YOLO | Object detection | Bounding boxes + labels | Logo, object, and scene element detection |
| AST | Audio embedding | 768d vector | Audio fingerprinting |
| PANNs | Audio embedding | 2048d vector | Environmental sound classification |
| LLMs | Text generation | Structured text | Taxonomy classification, summarization |
| Scene detection | Temporal segmentation | Timestamps | Video scene boundary detection |
| OCR | Text extraction | Text + bounding boxes | On-screen text detection |
| Preprocessing | Format conversion | Normalized media | Resampling, format conversion, normalization |
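The table above implies a registry that routes each object to the extractors that apply to its modality. A minimal sketch of that routing, with names and output dimensions taken from the table (the registry structure itself is an illustrative assumption, not the actual implementation):

```python
# Hypothetical endpoint registry; model names and output dims come from
# the table above, the routing logic is illustrative only.
MODEL_ENDPOINTS = {
    "arcface": {"modality": "image", "output": ("vector", 512)},
    "siglip":  {"modality": "image", "output": ("vector", 768)},
    "clap":    {"modality": "audio", "output": ("vector", 512)},
    "whisper": {"modality": "audio", "output": ("text", None)},
    "ast":     {"modality": "audio", "output": ("vector", 768)},
    "panns":   {"modality": "audio", "output": ("vector", 2048)},
}

def extractors_for(modality: str) -> list[str]:
    """Return the endpoint names that apply to an object of this modality."""
    return sorted(name for name, spec in MODEL_ENDPOINTS.items()
                  if spec["modality"] == modality)
```

A video object would typically fan out to both the image and audio extractors, one set per extracted frame or audio track.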
Feature Extraction Pipeline
When an object enters the warehouse, it goes through a multi-stage extraction pipeline:
1. Preprocessing
Raw objects are normalized into formats suitable for model inference: media is resampled, converted between formats, and normalized to whatever each downstream model expects.
2. Scene Detection (Video)
For video objects, scene detection runs first to identify visual boundaries. This splits a continuous video into coherent segments, each of which is processed independently. Scene detection uses a combination of histogram analysis and learned models to identify transitions.
3. Parallel Feature Extraction
Once an object is preprocessed (and optionally scene-split), feature extraction runs in parallel across all configured extractors. A single video might simultaneously have face embeddings extracted by ArcFace, a transcript produced by Whisper, audio embeddings computed by CLAP, and objects detected by YOLO.
Ray Serve's model composition handles the parallelism. Features from each extractor are collected, tagged with feature URIs, and written to the storage layer.
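The fan-out described above can be sketched with a plain thread pool standing in for Ray Serve's model composition (the extractor callables here are stand-ins, not real model clients):

```python
from concurrent.futures import ThreadPoolExecutor

def run_extractors(obj, extractors):
    """Fan an object out to all configured extractors in parallel and
    collect results keyed by extractor name. In production Ray Serve's
    model composition plays this role; a thread pool stands in here."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, obj) for name, fn in extractors.items()}
        return {name: f.result() for name, f in futures.items()}

# Illustrative stand-in extractors:
extractors = {
    "arcface": lambda obj: {"type": "face_embedding", "source": obj},
    "whisper": lambda obj: {"type": "transcript", "source": obj},
}
```

Each result would then be tagged with a feature URI and written to the storage layer.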
4. Feature URI Assignment
Every extracted feature receives a URI that encodes its provenance:
mixpeek://[email protected]/face_embedding?source=video_123&frame=1500&bbox=100,200,300,400
This URI tells you which model and version produced the feature (facenet v2.1), the feature type (face_embedding), the source object it was extracted from (video_123), and the exact frame and bounding box within that object.
Feature URIs enable full lineage tracking: from a search result, you can trace back to the exact frame, timestamp, or audio segment in the original object.
Tiered Storage
The storage layer manages features across four tiers, each optimized for different access patterns:
Hot Tier: Qdrant
Qdrant serves as the hot vector index for real-time retrieval. Features that need sub-millisecond search latency live here. Qdrant provides approximate nearest-neighbor search over HNSW indexes, payload-based filtering, and horizontal scaling through sharding.
Hot storage is the most expensive per GB but delivers the performance required for production search workloads.
Warm Tier: S3 Vectors
S3 Vectors is the canonical store for all features. Every feature extracted by the inference engine is written to S3 Vectors, regardless of whether it is also indexed in Qdrant. S3 Vectors provides durable, low-cost vector storage with similarity queries at higher latency than the hot tier.
When a collection transitions from hot to warm, its features are removed from Qdrant but remain in S3 Vectors. They can be rehydrated to hot storage if query patterns change.
Cold Tier: S3
Raw source objects and infrequently accessed features are stored in S3. This tier provides the lowest cost per GB, high durability, and retrieval latencies measured in seconds rather than milliseconds.
Archive Tier: Metadata Only
Archived collections retain only their metadata (feature URIs, timestamps, provenance) in MongoDB. The actual vectors and source objects can be re-extracted from cold storage if needed. This tier is used for long-term compliance retention where the data must be recoverable but is never queried.
Lifecycle Management
Collections transition between tiers automatically based on configurable policies:
hot (0-30 days) -> warm (30-90 days) -> cold (90-365 days) -> archive (365+ days)
Transition triggers can be time-based, query-volume-based, or manual. The system tracks query patterns per collection and can recommend tier transitions based on actual usage.
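The default time-based policy above can be expressed as a small lookup. A sketch, assuming a purely time-based trigger (query-volume-based and manual triggers would layer on top):

```python
# Hypothetical encoding of the default windows:
# hot (0-30 days) -> warm (30-90) -> cold (90-365) -> archive (365+)
POLICY = [("hot", 30), ("warm", 90), ("cold", 365)]

def tier_for_age(age_days: int) -> str:
    """Return the storage tier a collection belongs in, by age alone."""
    for tier, upper_bound in POLICY:
        if age_days < upper_bound:
            return tier
    return "archive"
```

A scheduler would evaluate this per collection and enqueue the corresponding transition task when the tier changes.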
Multi-Stage Retrieval Engine
The retrieval engine executes composable pipelines defined as sequences of typed stages. Each stage takes a set of documents as input and produces a (potentially modified) set of documents as output.
Stage Types
| Stage Type | Purpose | Example |
| --- | --- | --- |
| `filter` | Narrow the candidate set | Feature search, metadata filter, date range |
| `sort` | Rank candidates | Linear score combination, recency weighting |
| `reduce` | Collapse or sample results | Top-K, deduplication, clustering-based sampling |
| `enrich` | Augment with external data | Semantic join, metadata lookup, API call |
| `apply` | Transform results | Taxonomy classification, LLM summarization |
Data Flow
Pipeline execution follows a strict data flow:
1. The first stage (typically a `filter`) retrieves an initial candidate set from the storage layer.
2. Each subsequent stage receives the output of the previous stage as input.
3. Stages can add, remove, or modify documents in the result set.
4. The final stage's output is returned to the client.
This design ensures that expensive operations (like LLM calls in an `apply` stage) only run on a small, pre-filtered result set rather than the entire corpus.
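The strict stage-to-stage data flow reduces to a fold over the stage list. A minimal sketch of the executor and a toy pipeline (the stage callables are stand-ins for real storage-backed stages):

```python
def execute_pipeline(stages, initial_candidates):
    """Run typed stages in order; each stage maps a list of documents
    to a new list, and the last stage's output is the response."""
    docs = initial_candidates
    for stage_type, fn in stages:
        docs = fn(docs)
    return docs

# Toy stages illustrating filter -> sort -> reduce:
candidates = [{"id": 1, "score": 0.9},
              {"id": 2, "score": 0.4},
              {"id": 3, "score": 0.7}]
pipeline = [
    ("filter", lambda ds: [d for d in ds if d["score"] > 0.5]),
    ("sort",   lambda ds: sorted(ds, key=lambda d: d["score"], reverse=True)),
    ("reduce", lambda ds: ds[:1]),  # top-K with K = 1
]
```

Because the `filter` runs first, a later `apply` stage would only ever see the single surviving document, not the full corpus.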
Optimization
The retrieval engine optimizes execution primarily through stage ordering: cheap, selective stages run against indexed storage first so that expensive stages (such as LLM-backed `apply` operations) only touch the already-narrowed result set.
The Semantic Join
The semantic join is implemented as the `document_enrich` stage. It connects documents from one collection with related documents from another collection using vector similarity.
How It Works
1. For each document in the current result set, the enrich stage extracts a feature vector.
2. That vector is used to search a secondary collection (the "source" collection).
3. Matching documents from the source collection are attached to the original document as enrichment metadata.
4. The enriched result set passes to the next stage.
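The four steps above can be sketched with brute-force cosine similarity standing in for the vector search against the source collection (in production that search would hit the storage layer, not an in-memory list):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def document_enrich(docs, source_collection, top_k=1):
    """For each document, attach the top_k most similar documents
    from the source collection as enrichment metadata."""
    for doc in docs:
        ranked = sorted(source_collection,
                        key=lambda s: cosine(doc["vector"], s["vector"]),
                        reverse=True)
        doc["enrichment"] = ranked[:top_k]
    return docs
```

The enriched documents then flow into the next stage unchanged except for the added metadata.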
Example: IP Safety Cross-Reference
A retrieval pipeline for IP safety might:
1. Filter — Search the face collection for faces similar to a query image.
2. Enrich — For each matched face, search the logo collection for logos that appear in the same source video.
3. Apply — Run an LLM to summarize the combined face + logo detections into a risk assessment.
This produces a report that says "Video X contains a face matching Celebrity Y at timestamp 00:15, and a branded logo matching Trademark Z at timestamp 00:18" — all from a single pipeline execution.
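Declaratively, such a pipeline might be expressed as data. All field names below are illustrative assumptions, not the actual API schema:

```python
# Hypothetical pipeline specification for the IP-safety example;
# every field name here is an assumption for illustration.
ip_safety_pipeline = {
    "stages": [
        {"type": "filter",
         "collection": "faces",
         "query": {"feature": "face_embedding", "image": "query.jpg"}},
        {"type": "enrich",
         "source_collection": "logos",
         "join_on": "source_video"},
        {"type": "apply",
         "operation": "llm_summarize",
         "prompt": "Summarize detections into an IP risk assessment."},
    ]
}
```

The stage order mirrors the cost gradient: cheap vector filtering first, the LLM call last, over only the enriched survivors.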
Taxonomy Engine
The taxonomy engine provides three modes of classification:
Materialized Taxonomies
Classification runs as part of the ingestion pipeline. When a new object is processed, the taxonomy engine:
1. Extracts the relevant feature (e.g., text transcript, visual embedding).
2. Runs classification against the defined categories (using an LLM or embedding-based classifier).
3. Stores the classification as metadata on the document.
Materialized taxonomies are fast at query time (the classification is pre-computed) but require re-processing when categories change.
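An embedding-based classifier of the kind mentioned in step 2 can be as simple as nearest-prototype matching. A sketch, assuming each category is represented by a prototype embedding (the representation is an assumption; the doc does not specify one):

```python
def classify(embedding, prototypes):
    """Embedding-based classification: return the category whose
    prototype vector has the highest dot product with the document
    embedding. `prototypes` maps category name -> prototype vector."""
    return max(prototypes,
               key=lambda c: sum(a * b for a, b in zip(embedding, prototypes[c])))
```

For a materialized taxonomy, this result would be written onto the document at ingestion time; for on-demand mode, it would run inside an `apply` stage instead.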
On-Demand Taxonomies
Classification runs at query time as an `apply` stage in a retrieval pipeline. The taxonomy engine:
1. Receives documents from the previous pipeline stage.
2. Runs classification on each document.
3. Attaches the classification and passes documents to the next stage.
On-demand taxonomies are flexible (no pre-processing required) but add latency to queries.
Retroactive Taxonomies
Batch classification over historical data. When a new taxonomy is created with `mode=retroactive`, the system:
1. Identifies all existing documents in the target collection.
2. Schedules batch classification jobs through the task queue.
3. Updates document metadata with new classifications as jobs complete.
This enables evolving your classification scheme without re-ingesting source objects.
Scaling
Ray Serve Autoscaling
The inference engine autoscales GPU and CPU replicas per endpoint based on queue depth: as requests back up behind an endpoint, replicas are added, and idle replicas are scaled back down.
Batch Processing
Large ingestion jobs are processed as batches:
1. Files are uploaded to a bucket.
2. The task queue splits the upload into chunks.
3. Each chunk is processed in parallel across available inference engine replicas.
4. Progress is tracked per-batch with percentage completion.
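The chunking and per-batch progress tracking described above can be sketched as follows (sequential here for clarity; in production each chunk would run in parallel on a separate replica):

```python
def split_into_chunks(files, chunk_size):
    """Split an uploaded batch into fixed-size chunks."""
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]

def process_batch(files, chunk_size, process_chunk):
    """Process each chunk and record percentage completion after it
    finishes. `process_chunk` stands in for dispatch to a replica."""
    chunks = split_into_chunks(files, chunk_size)
    done, progress = 0, []
    for chunk in chunks:
        process_chunk(chunk)
        done += len(chunk)
        progress.append(round(100 * done / len(files)))
    return progress
```

The progress list is what a per-batch status endpoint would report back to the client as chunks complete.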
Distributed Inference
For large models or high-throughput workloads, Ray Serve supports replicating a model across multiple nodes, fractional GPU allocation so several replicas can share one device, and dynamic request batching.
Comparison to Traditional Data Warehouses
| Aspect | Traditional (Snowflake/BigQuery) | Multimodal (Mixpeek) |
| --- | --- | --- |
| Data type | Structured (rows, columns) | Unstructured (video, audio, images, documents) |
| Ingestion | ETL / ELT | Object decomposition via inference engine |
| Schema | Defined upfront | Emergent via feature extraction and taxonomies |
| Query language | SQL | Multi-stage retrieval pipelines |
| Joins | Foreign key joins | Semantic joins (vector similarity) |
| Storage | Columnar | Tiered (hot vectors, warm S3 Vectors, cold S3) |
| Compute | Query engine (CPU) | Inference engine (GPU + CPU) |
| Scaling | Warehouse size (S/M/L/XL) | Per-model autoscaling |
| Schema evolution | ALTER TABLE / migrations | Retroactive taxonomies |
Learn more about Mixpeek's architecture at mixpeek.com/docs or explore the multimodal data warehouse page.
