Which embedding models are available?

Mixpeek supports CLIP ViT-L/14, SigLIP, and multilingual E5-large-instruct. Select one with the `model` parameter. The default is CLIP ViT-L/14, which produces 768-dimensional vectors.

How are long videos handled?

Long videos are automatically segmented at scene boundaries. Each segment produces its own embedding. You can also request a single pooled embedding for the entire video by setting `pool_strategy` to 'mean' or 'max'.
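To make the difference between the two pooled options concrete, here is a minimal sketch of mean and max pooling over per-segment vectors. It only illustrates the arithmetic and is not Mixpeek's implementation. The function name `pool_embeddings` and the tiny 3-dimensional vectors are assumptions chosen for readability.

```python
from typing import List, Sequence


def pool_embeddings(segments: Sequence[Sequence[float]], strategy: str = "mean") -> List[float]:
    """Collapse per-segment vectors into one vector, element by element.

    Hypothetical helper for illustration; not part of the Mixpeek SDK.
    """
    if not segments:
        raise ValueError("need at least one segment embedding")
    dim = len(segments[0])
    if any(len(vec) != dim for vec in segments):
        raise ValueError("all segment embeddings must share one dimension")

    # zip(*segments) walks the vectors dimension by dimension.
    columns = list(zip(*segments))
    if strategy == "mean":
        return [sum(col) / len(segments) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Two toy segment embeddings (real ones would have e.g. 768 dimensions).
segments = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
mean_vec = pool_embeddings(segments, "mean")  # [2.0, 2.0, 1.0]
max_vec = pool_embeddings(segments, "max")    # [3.0, 4.0, 2.0]
```

Mean pooling averages every segment into the result, while max pooling keeps the strongest activation per dimension, so a short but distinctive scene has more influence under 'max'.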

Can I use these embeddings with my own vector database?

Absolutely. Embeddings are returned as standard float arrays that work with any vector database including Qdrant, Pinecone, Weaviate, and Milvus. Mixpeek can also store them directly in your namespace.
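Because the embeddings are plain float arrays, the core operation a vector database performs on them can be sketched in a few lines. The example below is an in-memory stand-in that ranks stored vectors by cosine similarity. It does not use any particular database client, and the keys, vectors, and the helper names `cosine_similarity` and `top_k` are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours: score every stored vector, keep the best k."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical index keyed by "file@segment start"; vectors are toy 3-d values.
index = {
    "demo.mp4@0s": [1.0, 0.0, 0.0],
    "demo.mp4@12s": [0.0, 1.0, 0.0],
    "demo.mp4@30s": [0.6, 0.8, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], index, k=2)
for key, score in hits:
    print(f"{key}: {score:.2f}")
```

A dedicated vector database replaces the brute-force scan with an approximate nearest-neighbour index, but the inputs and the ranking idea are the same.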

Video to Embeddings Converter

Generate dense vector embeddings for video content using multimodal models. Embeddings capture visual, audio, and temporal features, enabling semantic search and similarity matching across video collections.

Max file size: 5 GB

Estimated processing time: 3-12 min per hour of video

5 input formats

How It Works

Upload your video or provide a URL.

The video is segmented into clips based on scene boundaries.

Each clip is processed through a multimodal embedding model (CLIP, SigLIP, or E5).

Audio and visual features are fused into a single embedding per segment.

Embeddings are returned as float arrays ready for vector indexing.
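The segmentation step above can be illustrated with a toy scene-boundary detector. This is a sketch under stated assumptions, not the detector Mixpeek uses: each frame is assumed to be summarized by a small feature vector (for example a normalized color histogram), a cut is declared when consecutive frames differ by more than a threshold, and the function name `split_at_scene_boundaries` and the threshold value are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_at_scene_boundaries(
    frames: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges, cutting where frames change sharply.

    Hypothetical helper for illustration; the real pipeline may work differently.
    """
    if not frames:
        return []
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(frames)):
        # L1 distance between the feature vectors of consecutive frames.
        diff = sum(abs(a - b) for a, b in zip(frames[i - 1], frames[i]))
        if diff > threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(frames)))
    return segments


# Five toy frames: the first two look alike, then the picture changes.
frames = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.2, 0.8]]
clips = split_at_scene_boundaries(frames, threshold=0.5)
print(clips)
```

Each resulting range would then be embedded on its own, which is why the API response contains one vector per segment.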

Code Examples

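from mixpeek import Mixpeek

# Authenticate with your Mixpeek API key.
client = Mixpeek(api_key="YOUR_API_KEY")

# Convert a video referenced by URL into embeddings.
result = client.convert(
    source="https://example.com/product-demo.mp4",
    from_format="video",
    to_format="embeddings",
    options={
        "model": "clip-vit-l-14",       # the default model (768-dimensional vectors)
        "pool_strategy": "per_segment"  # keep a separate embedding for each segment
    }
)

# Print the start time and vector dimensionality of every segment.
for segment in result.embeddings:
    print(f"[{segment.start_time}s] dim={len(segment.vector)}")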

Use Cases

Build semantic video search engines

Detect near-duplicate or pirated video content

Cluster similar videos for recommendation systems

Enable cross-modal retrieval (search videos with text queries)

Supported Input Formats

MP4

MOV

AVI

MKV

WebM

Quick Info

Category: media

Max File Size: 5 GB

Est. Time: 3-12 min per hour of video

Extractor: video-descriptor



Related Converters

Video to Text

Extract spoken dialogue, on-screen text, and scene descriptions from video files using multimodal AI. Produces time-stamped transcripts with speaker diarization and OCR-detected overlays.

Video to Keyframes

Automatically detect scene changes and extract representative keyframes from any video. Each keyframe includes a timestamp, scene label, and optional caption generated by a vision model.

Image to Embeddings

Convert images into dense vector representations using state-of-the-art vision models. Embeddings capture semantic visual features and can be used for similarity search, clustering, and cross-modal retrieval.
