
Mixed to Embeddings Converter

Generate unified vector embeddings from mixed-modality inputs: text, images, audio, and video combined. The result enables cross-modal search, where any modality can query any other modality in a single vector space.

    Max file size: 5 GB
    Estimated: 1-15 sec depending on modality
    8 input formats

    How It Works

1. Provide one or more inputs of any modality.
2. Each input is processed through its modality-specific encoder.
3. Modality embeddings are projected into a shared vector space.
4. A fused embedding is produced that represents the combined input.
5. The unified embedding enables cross-modal similarity search.
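The steps above can be sketched in a few lines of NumPy. Everything here is a stand-in for illustration only: the encoders are random functions (a real system would use models such as CLIP), the 768-dimension shared space and the projection matrices are arbitrary choices, and none of it reflects Mixpeek's actual internals.

```python
import numpy as np

DIM = 768  # illustrative shared-space dimensionality
rng = np.random.default_rng(0)

# Hypothetical modality-specific encoders (step 2). Real encoders map
# raw text/pixels to semantically meaningful vectors; these are random.
def encode_text(text: str) -> np.ndarray:
    return rng.standard_normal(512)

def encode_image(pixels: np.ndarray) -> np.ndarray:
    return rng.standard_normal(1024)

# Per-modality projections into the shared space (step 3).
proj_text = rng.standard_normal((512, DIM)) / np.sqrt(512)
proj_image = rng.standard_normal((1024, DIM)) / np.sqrt(1024)

def embed(inputs):
    """Encode, project, and fuse a list of (modality, data) pairs (steps 2-4)."""
    projected = []
    for modality, data in inputs:
        if modality == "text":
            v = encode_text(data) @ proj_text
        elif modality == "image":
            v = encode_image(data) @ proj_image
        projected.append(v / np.linalg.norm(v))  # unit-normalize per modality
    fused = np.mean(projected, axis=0)           # simple average fusion
    return fused / np.linalg.norm(fused)         # unit vector, ready for cosine search

emb = embed([("text", "red sports car"), ("image", np.zeros((8, 8, 3)))])
print(emb.shape)  # (768,)
```

Normalizing both before and after fusion keeps every embedding on the unit sphere, so step 5's similarity search reduces to a dot product.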

    Code Examples

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    result = client.convert(
        sources=[
            {"type": "text", "content": "A red sports car on a mountain road"},
            {"type": "image", "url": "https://example.com/car.jpg"}
        ],
        from_format="multimodal",
        to_format="embeddings",
        options={
            "model": "clip-vit-l-14",
            "fusion_strategy": "weighted_average",
            "weights": {"text": 0.4, "image": 0.6}
        }
    )

    print(f"Fused embedding dim: {len(result.embedding)}")
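The `weighted_average` strategy with `{"text": 0.4, "image": 0.6}` can be illustrated directly. The three-dimensional vectors below are toy values, not real model outputs, and the renormalization step is an assumption about how such a strategy would typically be implemented:

```python
import numpy as np

# Hypothetical per-modality embeddings, already projected into the shared space.
text_emb = np.array([1.0, 0.0, 0.0])
image_emb = np.array([0.0, 1.0, 0.0])

weights = {"text": 0.4, "image": 0.6}

# Weighted average, then renormalize so cosine similarity stays well-defined.
fused = weights["text"] * text_emb + weights["image"] * image_emb
fused = fused / np.linalg.norm(fused)
print(fused)
```

Raising the image weight pulls the fused vector toward the image embedding, biasing downstream search toward visual similarity.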

    Use Cases

    Search videos using text queries and vice versa
    Build unified search across documents, images, and audio
    Create recommendation systems that span content types
    Enable 'find similar' features across an entire media library
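Once every item in a library shares one vector space, a 'find similar' feature is just nearest-neighbor search over unit vectors. A minimal sketch, using random 64-dimensional placeholders for the real embeddings (the file names and dimensionality are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy media library: unified embeddings for items of different modalities.
library = {
    "beach_video.mp4": rng.standard_normal(64),
    "podcast_ep1.mp3": rng.standard_normal(64),
    "sunset.jpg": rng.standard_normal(64),
}
# Unit-normalize so a dot product equals cosine similarity.
library = {k: v / np.linalg.norm(v) for k, v in library.items()}

def search(query_emb, top_k=2):
    """Rank library items by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    scored = sorted(library.items(), key=lambda kv: -float(q @ kv[1]))
    return [name for name, _ in scored[:top_k]]

# A query embedded into the same space can rank images, audio, and video
# together; here the query is a lightly perturbed copy of one library item.
query = library["sunset.jpg"] + 0.1 * rng.standard_normal(64)
print(search(query))
```

At production scale the sorted scan would be replaced by an approximate nearest-neighbor index, but the similarity computation is the same.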

    Supported Input Formats

    JPEG
    PNG
    MP4
    MP3
    WAV
    TXT
    PDF
    JSON

    Quick Info

    Category: embedding
    Max File Size: 5 GB
    Est. Time: 1-15 sec depending on modality

    Try This Conversion

    Get started with the Mixpeek API and convert your first file in minutes.


    Ready to convert mixed inputs to embeddings?

    Start using the Mixpeek Multimodal to Embeddings converter in minutes. Sign up for a free API key and follow the documentation to get started.