    Text, Image, Video & Audio Vectors

    Multimodal Embeddings API

    Embeddings are the atoms of the multimodal data warehouse: the features that multi-stage retrieval pipelines query across. Each file is decomposed into dense vectors, and pipelines compose filter, search, rerank, and enrich stages on top to deliver precise results.

    What Are Multimodal Embeddings?

    Embeddings are dense vector representations that capture the semantic meaning of content. Multimodal embeddings extend this concept across data types, mapping text, images, video, and audio into a shared vector space where similarity reflects meaning.
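
    Concretely, "similarity reflects meaning" usually means comparing vectors with cosine similarity: nearby directions in the shared space indicate related content. A minimal sketch with toy 4-dimensional vectors (real models emit hundreds of dimensions; the values here are illustrative, not actual model outputs):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vectors' magnitudes; ranges from -1 (opposite) to 1 (identical direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for real model outputs.
dog_photo = [0.9, 0.1, 0.3, 0.0]
dog_text  = [0.8, 0.2, 0.4, 0.1]
invoice   = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(dog_photo, dog_text))  # high: related content
print(cosine_similarity(dog_photo, invoice))   # low: unrelated content
```

    A photo of a dog and the caption "a dog" land close together; an unrelated document lands far away, which is what makes cross-modal search possible.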

    Text Embeddings

    Generate dense vector representations of text using models like E5, BGE, and multilingual transformers. Capture semantic meaning for search, classification, and clustering.

    E5-Large-v2
    BGE-M3
    Multilingual-E5

    Image Embeddings

    Encode images into vectors using vision models like CLIP, SigLIP, and domain-specific vision transformers. Enable visual search and image-text matching.

    CLIP ViT-L/14
    SigLIP SO400M
    DINOv2

    Video Embeddings

    Generate embeddings at the frame and scene level from video content. Search within videos by visual content, spoken words, and on-screen text.

    CLIP per-frame
    Scene-level pooling
    Temporal encoders
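
    Scene-level pooling can be as simple as averaging per-frame vectors and re-normalizing, which yields one vector per scene while keeping it comparable under cosine similarity. A sketch with toy frame vectors (not real CLIP outputs):

```python
import math

def mean_pool(frame_vectors):
    # Average per-frame embeddings into one scene-level vector, then
    # L2-normalize so cosine similarity reduces to a dot product.
    dim = len(frame_vectors[0])
    pooled = [sum(v[i] for v in frame_vectors) / len(frame_vectors)
              for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Toy per-frame vectors from one scene of a video.
frames = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.3, 0.3]]
scene = mean_pool(frames)
print(scene)  # a single unit-length scene vector
```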

    Audio Embeddings

    Embed audio content including speech, music, and environmental sounds. Combine with transcript embeddings for comprehensive audio understanding.

    CLAP
    Whisper-based
    Speaker encoders

    Supported Embedding Models

    Choose from a curated set of production-grade models, or bring your own.

    Model          | Modalities  | Dimensions | Best For
    CLIP ViT-L/14  | Image, Text | 768        | Cross-modal image-text retrieval
    SigLIP SO400M  | Image, Text | 1152       | High-accuracy visual search and classification
    E5-Large-v2    | Text        | 1024       | English text retrieval and semantic search
    BGE-M3         | Text        | 1024       | Multilingual text with dense + sparse vectors
    DINOv2         | Image       | 768        | Visual feature extraction without text alignment
    Whisper + E5   | Audio       | 1024       | Speech content retrieval via transcription embedding

    How It Works

    From raw content to searchable embeddings in four steps.

    1

    Choose Models

    Select embedding models for each modality from Mixpeek's model library, or register your own custom models as feature extractors.

    2

    Ingest Content

    Upload files to an S3-compatible bucket or send them through the API. Mixpeek automatically routes each file to the appropriate embedding model.

    3

    Generate Vectors

    Models run on distributed GPU infrastructure, producing embedding vectors for each piece of content with automatic batching and error handling.

    4

    Index & Search

    Embeddings are stored in Qdrant for fast approximate nearest-neighbor search. Build retrieval pipelines that query across all modalities.
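
    At its core, the search step ranks stored vectors by similarity to the query vector; an approximate nearest-neighbor index like Qdrant's computes the same result much faster at scale. An exact brute-force sketch of what the four steps produce (toy vectors, not a Mixpeek API):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def search(query, index, k=2):
    # Rank every stored embedding by cosine similarity to the query and
    # return the top-k ids. ANN indexes (e.g. HNSW) approximate this scan.
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc-1": [0.9, 0.1, 0.0],
    "doc-2": [0.1, 0.9, 0.1],
    "doc-3": [0.8, 0.2, 0.1],
}
print(search([1.0, 0.0, 0.0], index, k=2))  # ['doc-1', 'doc-3']
```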

    Use Cases

    Embeddings power a wide range of AI applications across industries.

    Semantic Search

    Search by meaning rather than keywords. Find relevant content even when the query uses different terminology than the source material.

    Cross-Modal Retrieval

    Query in one modality and retrieve results in another. Search video with text, find images with audio descriptions, or match documents to visual content.

    Duplicate Detection

    Identify near-duplicate content across your corpus by comparing embedding similarity. Works across modalities and detects semantic duplicates, not just pixel-identical copies.
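
    A minimal version of this idea: compare every pair of embeddings and flag those whose cosine similarity exceeds a threshold. The 0.95 cutoff and the toy vectors below are illustrative assumptions, not Mixpeek defaults:

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def near_duplicates(embeddings, threshold=0.95):
    # Return id pairs whose embeddings point in nearly the same direction.
    return [
        (i, j)
        for (i, a), (j, b) in combinations(embeddings.items(), 2)
        if cosine(a, b) >= threshold
    ]

embeddings = {
    "photo-a": [0.70, 0.69, 0.10],
    "photo-b": [0.71, 0.68, 0.11],  # a re-encode of photo-a: nearly identical
    "photo-c": [0.05, 0.10, 0.99],
}
print(near_duplicates(embeddings))  # [('photo-a', 'photo-b')]
```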

    Content Classification

    Classify content into categories using embedding similarity to reference examples. Enable zero-shot classification without collecting labeled training data for each category.
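
    Zero-shot classification by reference similarity can be sketched as: embed a few labeled references per category, average each category into a centroid, and assign new items to the nearest centroid. The category names and vectors below are toy examples:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def centroid(vectors):
    # Element-wise mean of the reference embeddings for one category.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(item, references):
    # references maps category -> list of reference embeddings.
    centroids = {cat: centroid(vs) for cat, vs in references.items()}
    return max(centroids, key=lambda cat: cosine(item, centroids[cat]))

references = {
    "animals":  [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "vehicles": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
print(classify([0.85, 0.15, 0.05], references))  # animals
```

    Adding a category is just adding reference examples; no model retraining is involved.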

    Recommendation Systems

    Build content recommendations by finding embeddings similar to user interaction history. Works across content types for multimodal recommendation.
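
    One common recipe for this: mean-pool the embeddings of items a user interacted with into a "taste" vector, then rank catalog items by similarity to it. A stdlib-only sketch with hypothetical item ids and toy vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def recommend(history, catalog, k=2):
    # Average the user's interaction embeddings into one taste vector,
    # then rank catalog items by similarity to that vector.
    dim = len(history[0])
    taste = [sum(v[i] for v in history) / len(history) for i in range(dim)]
    ranked = sorted(catalog.items(), key=lambda kv: cosine(taste, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

history = [[0.9, 0.1, 0.1], [0.8, 0.0, 0.2]]  # items the user engaged with
catalog = {
    "video-7": [0.85, 0.05, 0.15],
    "video-8": [0.10, 0.90, 0.05],
    "song-3":  [0.70, 0.20, 0.10],
}
print(recommend(history, catalog))  # ['video-7', 'song-3']
```

    Because everything lives in one vector space, a history of watched videos can surface similar songs or articles, which is the multimodal part.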

    RAG Applications

    Power retrieval-augmented generation by embedding your knowledge base and retrieving relevant context for LLM prompts across text, images, and documents.

    Simple API Integration

    Generate and search embeddings across modalities with a few lines of code.

    embeddings_example.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Generate embeddings for a text query
    text_embedding = client.embed.text(
        model="e5-large-v2",
        input="quarterly revenue growth in emerging markets"
    )
    
    # Generate embeddings for an image
    image_embedding = client.embed.image(
        model="clip-vit-l-14",
        input="s3://product-images/catalog/item-4821.jpg"
    )
    
    # Search across all modalities with text
    results = client.retrievers.search(
        retriever_id="multimodal-index",
        queries=[
            {
                "type": "text",
                "value": "product packaging with sustainability labels",
                "modalities": ["text", "image", "video"]
            }
        ],
        limit=20
    )
    
    # Compare embedding similarity directly
    similarity = client.embed.compare(
        embedding_a=text_embedding.vector,
        embedding_b=image_embedding.vector,
        metric="cosine"
    )
    print(f"Cross-modal similarity: {similarity:.4f}")

    Frequently Asked Questions

    What are multimodal embeddings?

    Multimodal embeddings are vector representations that capture the semantic meaning of content across different data types -- text, images, video, and audio. By mapping diverse content into a shared vector space, embeddings enable similarity search, cross-modal retrieval, and AI applications that understand meaning rather than just matching keywords or pixels.

    What embedding models does Mixpeek support?

    Mixpeek supports a range of embedding models for different modalities: CLIP and SigLIP for vision-language alignment, E5 and BGE for text embedding, DINOv2 for visual features, and Whisper-based pipelines for audio content. You can also register custom models through the plugin system to use proprietary or fine-tuned models.

    Can I use my own embedding models with Mixpeek?

    Yes. Mixpeek's plugin system lets you register custom feature extractors that call any model endpoint. Define the input/output schema including vector dimensions, and Mixpeek handles orchestration, batching, retries, and indexing. This works with HuggingFace models, custom PyTorch endpoints, or any HTTP-based inference service.

    How do cross-modal embeddings work?

    Cross-modal embeddings are produced by models trained with contrastive learning objectives that align representations from different modalities. For example, CLIP and SigLIP learn to place matching image-text pairs close together in the same vector space. This means a text query vector can be compared directly against image vectors to find visually matching content, enabling cross-modal retrieval.
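
    The alignment described above can be illustrated with CLIP-style scoring: L2-normalize both modalities' vectors, take dot products (which are then cosine similarities), and apply a temperature-scaled softmax over the candidates. The vectors below are toy values, and the 0.07 temperature is the value commonly cited for CLIP, used here purely for illustration:

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def match_probabilities(text_vec, image_vecs, temperature=0.07):
    # Dot products of unit vectors are cosine similarities; dividing by a
    # small temperature sharpens the softmax over candidate images.
    t = normalize(text_vec)
    logits = [sum(a * b for a, b in zip(t, normalize(img))) / temperature
              for img in image_vecs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

text = [0.9, 0.1, 0.2]                       # e.g. embedding of "a dog"
images = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]]  # dog photo vs. unrelated photo
probs = match_probabilities(text, images)
print(probs)  # the matching image gets nearly all the probability mass
```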

    What vector dimensions does Mixpeek support?

    Mixpeek supports arbitrary embedding dimensions -- whatever your models produce. Common dimensions include 384, 512, 768, 1024, and 1152 depending on the model. The system stores vectors in Qdrant, which supports dense, sparse, and multi-vector representations with configurable distance metrics.

    How does Mixpeek handle embedding generation at scale?

    Mixpeek uses Ray for distributed model inference across GPU workers. When you trigger batch processing, the engine distributes embedding generation across available compute with automatic batching, load balancing, and fault recovery. This handles millions of documents with progress tracking and configurable concurrency limits.

    Can I store multiple embedding types per document?

    Yes. Mixpeek supports named vectors in Qdrant, allowing you to store multiple embedding representations per document. For example, a document can have both a text embedding and a visual embedding (from a document page image), and your retrieval pipeline can query either or both.

    How do I choose the right embedding model for my use case?

    The choice depends on your modalities and use case. For text-only search, E5 or BGE models offer strong performance. For cross-modal image-text retrieval, CLIP or SigLIP is recommended. For visual-only similarity, DINOv2 provides excellent features. Mixpeek makes it easy to test multiple models by creating separate collections with different extractors and comparing retrieval quality.

    Start Generating Multimodal Embeddings

    One API for text, image, video, and audio embeddings. Get started with our free tier or talk to us about enterprise deployment.