    Text, Image, Video & Audio Vectors

    Multimodal Embeddings API

    Embeddings are the atoms of the multimodal data warehouse: the features that multi-stage retrieval pipelines query across. Each file is decomposed into dense vectors, and pipelines compose filter, search, rerank, and enrich stages on top to deliver precise results.

    What Are Multimodal Embeddings?

    Embeddings are dense vector representations that capture the semantic meaning of content. Multimodal embeddings extend this concept across data types, mapping text, images, video, and audio into a shared vector space where similarity reflects meaning.
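
    Concretely, "similarity reflects meaning" usually means comparing vectors with cosine similarity: nearby directions in the shared space indicate related content. A minimal sketch with toy 4-dimensional vectors (real models emit hundreds of dimensions; the values here are illustrative, not actual model outputs):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vectors' magnitudes; ranges from -1 (opposite) to 1 (identical direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for real model outputs.
dog_photo = [0.9, 0.1, 0.3, 0.0]
dog_text  = [0.8, 0.2, 0.4, 0.1]
invoice   = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(dog_photo, dog_text))  # high: related content
print(cosine_similarity(dog_photo, invoice))   # low: unrelated content
```

    A photo of a dog and the caption "a dog" land close together; an unrelated document lands far away, which is what makes cross-modal search possible.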

    Text Embeddings

    Generate dense vector representations of text using models like E5, BGE, and multilingual transformers. Capture semantic meaning for search, classification, and clustering.

    E5-Large-v2
    BGE-M3
    Multilingual-E5

    Image Embeddings

    Encode images into vectors using vision models like CLIP, SigLIP, and domain-specific vision transformers. Enable visual search and image-text matching.

    CLIP ViT-L/14
    SigLIP SO400M
    DINOv2

    Video Embeddings

    Generate embeddings at the frame and scene level from video content. Search within videos by visual content, spoken words, and on-screen text.

    CLIP per-frame
    Scene-level pooling
    Temporal encoders
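
    Scene-level pooling can be as simple as averaging per-frame vectors and re-normalizing, which yields one vector per scene while keeping it comparable under cosine similarity. A sketch with toy frame vectors (not real CLIP outputs):

```python
import math

def mean_pool(frame_vectors):
    # Average per-frame embeddings into one scene-level vector, then
    # L2-normalize so cosine similarity reduces to a dot product.
    dim = len(frame_vectors[0])
    pooled = [sum(v[i] for v in frame_vectors) / len(frame_vectors)
              for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Toy per-frame vectors from one scene of a video.
frames = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.3, 0.3]]
scene = mean_pool(frames)
print(scene)  # a single unit-length scene vector
```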

    Audio Embeddings

    Embed audio content including speech, music, and environmental sounds. Combine with transcript embeddings for comprehensive audio understanding.

    CLAP
    Whisper-based
    Speaker encoders

    Supported Embedding Models

    Choose from a curated set of production-grade models, or bring your own.

    Model          | Modalities  | Dimensions | Best For
    CLIP ViT-L/14  | Image, Text | 768        | Cross-modal image-text retrieval
    SigLIP SO400M  | Image, Text | 1152       | High-accuracy visual search and classification
    E5-Large-v2    | Text        | 1024       | English text retrieval and semantic search
    BGE-M3         | Text        | 1024       | Multilingual text with dense + sparse vectors
    DINOv2         | Image       | 768        | Visual feature extraction without text alignment
    Whisper + E5   | Audio       | 1024       | Speech content retrieval via transcription embedding

    How It Works

    From raw content to searchable embeddings in four steps.

    1

    Choose Models

    Select embedding models for each modality from Mixpeek's model library, or register your own custom models as feature extractors.

    2

    Ingest Content

    Upload files to an S3-compatible bucket or send them through the API. Mixpeek automatically routes each file to the appropriate embedding model.

    3

    Generate Vectors

    Models run on distributed GPU infrastructure, producing embedding vectors for each piece of content with automatic batching and error handling.

    4

    Index & Search

    Embeddings are stored in Qdrant for fast approximate nearest-neighbor search. Build retrieval pipelines that query across all modalities.
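
    At its core, the search step ranks stored vectors by similarity to the query vector; an approximate nearest-neighbor index like Qdrant's computes the same result much faster at scale. An exact brute-force sketch of what the four steps produce (toy vectors, not a Mixpeek API):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def search(query, index, k=2):
    # Rank every stored embedding by cosine similarity to the query and
    # return the top-k ids. ANN indexes (e.g. HNSW) approximate this scan.
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc-1": [0.9, 0.1, 0.0],
    "doc-2": [0.1, 0.9, 0.1],
    "doc-3": [0.8, 0.2, 0.1],
}
print(search([1.0, 0.0, 0.0], index, k=2))  # ['doc-1', 'doc-3']
```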

    Use Cases

    Embeddings power a wide range of AI applications across industries.

    Semantic Search

    Search by meaning rather than keywords. Find relevant content even when the query uses different terminology than the source material.

    Cross-Modal Retrieval

    Query in one modality and retrieve results in another. Search video with text, find images with audio descriptions, or match documents to visual content.

    Duplicate Detection

    Identify near-duplicate content across your corpus by comparing embedding similarity. Works across modalities and detects semantic duplicates, not just pixel-identical copies.
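
    A minimal version of this idea: compare every pair of embeddings and flag those whose cosine similarity exceeds a threshold. The 0.95 cutoff and the toy vectors below are illustrative assumptions, not Mixpeek defaults:

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def near_duplicates(embeddings, threshold=0.95):
    # Return id pairs whose embeddings point in nearly the same direction.
    return [
        (i, j)
        for (i, a), (j, b) in combinations(embeddings.items(), 2)
        if cosine(a, b) >= threshold
    ]

embeddings = {
    "photo-a": [0.70, 0.69, 0.10],
    "photo-b": [0.71, 0.68, 0.11],  # a re-encode of photo-a: nearly identical
    "photo-c": [0.05, 0.10, 0.99],
}
print(near_duplicates(embeddings))  # [('photo-a', 'photo-b')]
```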

    Content Classification

    Classify content into categories using embedding similarity to reference examples. Enable zero-shot classification without collecting labeled training data for each category.
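
    Zero-shot classification by reference similarity can be sketched as: embed a few labeled references per category, average each category into a centroid, and assign new items to the nearest centroid. The category names and vectors below are toy examples:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def centroid(vectors):
    # Element-wise mean of the reference embeddings for one category.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(item, references):
    # references maps category -> list of reference embeddings.
    centroids = {cat: centroid(vs) for cat, vs in references.items()}
    return max(centroids, key=lambda cat: cosine(item, centroids[cat]))

references = {
    "animals":  [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "vehicles": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
print(classify([0.85, 0.15, 0.05], references))  # animals
```

    Adding a category is just adding reference examples; no model retraining is involved.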

    Recommendation Systems

    Build content recommendations by finding embeddings similar to user interaction history. Works across content types for multimodal recommendation.
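
    One common recipe for this: mean-pool the embeddings of items a user interacted with into a "taste" vector, then rank catalog items by similarity to it. A stdlib-only sketch with hypothetical item ids and toy vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def recommend(history, catalog, k=2):
    # Average the user's interaction embeddings into one taste vector,
    # then rank catalog items by similarity to that vector.
    dim = len(history[0])
    taste = [sum(v[i] for v in history) / len(history) for i in range(dim)]
    ranked = sorted(catalog.items(), key=lambda kv: cosine(taste, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

history = [[0.9, 0.1, 0.1], [0.8, 0.0, 0.2]]  # items the user engaged with
catalog = {
    "video-7": [0.85, 0.05, 0.15],
    "video-8": [0.10, 0.90, 0.05],
    "song-3":  [0.70, 0.20, 0.10],
}
print(recommend(history, catalog))  # ['video-7', 'song-3']
```

    Because everything lives in one vector space, a history of watched videos can surface similar songs or articles, which is the multimodal part.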

    RAG Applications

    Power retrieval-augmented generation by embedding your knowledge base and retrieving relevant context for LLM prompts across text, images, and documents.

    Simple API Integration

    Generate and search embeddings across modalities with a few lines of code.

    embeddings_example.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Generate embeddings for a text query
    text_embedding = client.embed.text(
        model="e5-large-v2",
        input="quarterly revenue growth in emerging markets"
    )
    
    # Generate embeddings for an image
    image_embedding = client.embed.image(
        model="clip-vit-l-14",
        input="s3://product-images/catalog/item-4821.jpg"
    )
    
    # Search across all modalities with text
    results = client.retrievers.search(
        retriever_id="multimodal-index",
        queries=[
            {
                "type": "text",
                "value": "product packaging with sustainability labels",
                "modalities": ["text", "image", "video"]
            }
        ],
        limit=20
    )
    
    # Compare embedding similarity directly
    similarity = client.embed.compare(
        embedding_a=text_embedding.vector,
        embedding_b=image_embedding.vector,
        metric="cosine"
    )
    print(f"Cross-modal similarity: {similarity:.4f}")

    Frequently Asked Questions

    What are multimodal embeddings?

    Multimodal embeddings are vector representations that capture the semantic meaning of content across different data types -- text, images, video, and audio. By mapping diverse content into a shared vector space, embeddings enable similarity search, cross-modal retrieval, and AI applications that understand meaning rather than just matching keywords or pixels.

    What embedding models does Mixpeek support?

    Mixpeek supports a range of embedding models for different modalities: CLIP and SigLIP for vision-language alignment, E5 and BGE for text embedding, DINOv2 for visual features, and Whisper-based pipelines for audio content. You can also register custom models through the plugin system to use proprietary or fine-tuned models.

    Can I use my own embedding models with Mixpeek?

    Yes. Mixpeek's plugin system lets you register custom feature extractors that call any model endpoint. Define the input/output schema including vector dimensions, and Mixpeek handles orchestration, batching, retries, and indexing. This works with HuggingFace models, custom PyTorch endpoints, or any HTTP-based inference service.

    How do cross-modal embeddings work?

    Cross-modal embeddings are produced by models trained with contrastive learning objectives that align representations from different modalities. For example, CLIP and SigLIP learn to place matching image-text pairs close together in the same vector space. This means a text query vector can be compared directly against image vectors to find visually matching content, enabling cross-modal retrieval.
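
    The alignment described above can be illustrated with CLIP-style scoring: L2-normalize both modalities' vectors, take dot products (which are then cosine similarities), and apply a temperature-scaled softmax over the candidates. The vectors below are toy values, and the 0.07 temperature is the value commonly cited for CLIP, used here purely for illustration:

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def match_probabilities(text_vec, image_vecs, temperature=0.07):
    # Dot products of unit vectors are cosine similarities; dividing by a
    # small temperature sharpens the softmax over candidate images.
    t = normalize(text_vec)
    logits = [sum(a * b for a, b in zip(t, normalize(img))) / temperature
              for img in image_vecs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

text = [0.9, 0.1, 0.2]                       # e.g. embedding of "a dog"
images = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]]  # dog photo vs. unrelated photo
probs = match_probabilities(text, images)
print(probs)  # the matching image gets nearly all the probability mass
```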

    What vector dimensions does Mixpeek support?

    Mixpeek supports arbitrary embedding dimensions -- whatever your models produce. Common dimensions include 384, 512, 768, 1024, and 1152 depending on the model. The system stores vectors in Qdrant, which supports dense, sparse, and multi-vector representations with configurable distance metrics.

    How does Mixpeek handle embedding generation at scale?

    Mixpeek uses Ray for distributed model inference across GPU workers. When you trigger batch processing, the engine distributes embedding generation across available compute with automatic batching, load balancing, and fault recovery. This handles millions of documents with progress tracking and configurable concurrency limits.

    Can I store multiple embedding types per document?

    Yes. Mixpeek supports named vectors in Qdrant, allowing you to store multiple embedding representations per document. For example, a document can have both a text embedding and a visual embedding (from a document page image), and your retrieval pipeline can query either or both.

    How do I choose the right embedding model for my use case?

    The choice depends on your modalities and use case. For text-only search, E5 or BGE models offer strong performance. For cross-modal image-text retrieval, CLIP or SigLIP is recommended. For visual-only similarity, DINOv2 provides excellent features. Mixpeek makes it easy to test multiple models by creating separate collections with different extractors and comparing retrieval quality.

    Start Generating Multimodal Embeddings

    One API for text, image, video, and audio embeddings. Get started with our free tier or talk to us about enterprise deployment.