Image to Embeddings
Convert images into dense vector representations using state-of-the-art vision models. Embeddings capture semantic visual features and can be used for similarity search, clustering, and cross-modal retrieval.
How It Works
Upload an image or provide a URL.
The image is resized and normalized for the selected model.
The vision encoder produces a dense embedding vector.
The vector is returned as a float array with model metadata.
Optionally, the embedding is stored directly in your Mixpeek namespace.
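The resize-and-normalize step above can be sketched in plain Python. The per-channel mean and standard deviation values shown are the ones commonly used for CLIP-family encoders, and the helper name is illustrative, not part of the Mixpeek API:

```python
# Illustrative sketch of the normalization a CLIP-style encoder expects.
# Pixel values arrive as 0-255 integers; the encoder wants standardized floats.

# Per-channel mean/std commonly used for CLIP-family models (assumption).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Scale an (R, G, B) byte triple to [0, 1], then standardize per channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )

# A mid-gray pixel lands near zero in every channel after normalization.
print(normalize_pixel((124, 117, 104)))
```

In the hosted pipeline this happens server-side for the selected model, so you never need to preprocess images yourself.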
Code Examples
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

result = client.convert(
    source="https://example.com/product.jpg",
    from_format="image",
    to_format="embeddings",
    options={"model": "clip-vit-l-14"}
)

print(f"Dimensions: {len(result.embedding)}")
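Once embedding vectors come back, similarity search reduces to comparing them, and cosine similarity is the usual metric for CLIP-style embeddings. This sketch uses plain Python on toy 4-dimensional vectors (real clip-vit-l-14 embeddings have many more dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only.
query = [0.1, 0.8, 0.3, 0.4]
near = [0.12, 0.79, 0.28, 0.41]   # a visually similar image
far = [0.9, -0.2, 0.1, -0.5]      # an unrelated image

print(cosine_similarity(query, near))  # close to 1.0
print(cosine_similarity(query, far))   # much lower
```

For production collections you would store the vectors in a vector index (for example, a Mixpeek namespace, as noted above) rather than comparing them pairwise in Python.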
Use Cases
Supported Input Formats
Quick Info
Try This Conversion
Get started with the Mixpeek API and convert your first file in minutes.
Frequently Asked Questions
Related Converters
Video to Embeddings
Generate dense vector embeddings for video content using multimodal models. Embeddings capture visual, audio, and temporal features, enabling semantic search and similarity matching across video collections.
Image to Text
Extract all readable text from images using advanced OCR combined with a vision-language model. Handles printed text, handwriting, complex layouts, receipts, signs, and multi-language documents.
Image to Caption
Generate natural-language captions for images using a vision-language model. Produces concise, descriptive sentences suitable for alt text, content indexing, and accessibility compliance.
Multimodal to Embeddings
Generate unified vector embeddings from mixed-modality inputs -- text, images, audio, and video combined. Enables cross-modal search where any modality can query any other modality in a single vector space.
Ready to convert image to embeddings?
Start using the Mixpeek Image to Embeddings converter in minutes. Sign up for a free API key and follow the documentation to get started.
