Multimodal Extractor

Unified embeddings for video, audio, image, and text: scene/silence chunking, Whisper transcription, thumbnails, and Gemini vision.

412K runs

Note: This playground provides simulated output to showcase functionality. No input data is processed or stored on our servers. Use this demo to explore the feature extractor's capabilities before integrating it into your application.

Input

File URL (string): a URL to a video file

Output

{}

Ready to run Multimodal Extractor on your data? Spin it up in Studio: no infra to host.

Already have embeddings? Skip extraction: search your own vectors with MVS, from $25/mo for up to 1M vectors.

Recent updates

May 9, 2026

Multimodal Extractor v2 with Gemini Embedding 2

New multimodal extractor generates 3072-dimensional embeddings using Gemini Embedding 2, enabling richer cross-modal search across text, images, and video.
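
Cross-modal search works because text, image, and video items land in the same vector space, so a text query can be compared directly against non-text items. The sketch below is a minimal illustration of that idea using cosine similarity, not the extractor's or MVS's actual implementation. The item IDs, the `search` helper, and the toy 4-dimensional vectors are all hypothetical; real vectors from the extractor would have 3072 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=3):
    """Rank (item_id, vector) pairs by similarity to the query, best first."""
    scored = [(item_id, cosine_similarity(query_vec, vec)) for item_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical index mixing modalities; toy 4-dim stand-ins for 3072-dim vectors.
index = [
    ("video_scene_01", [0.9, 0.1, 0.0, 0.1]),
    ("image_042", [0.1, 0.8, 0.2, 0.0]),
    ("audio_clip_07", [0.0, 0.2, 0.9, 0.1]),
]

# Hypothetical embedding of a text query in the same space.
text_query = [0.8, 0.2, 0.1, 0.0]

results = search(text_query, index, top_k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

A brute-force scan like this is fine for a handful of items; at larger scale a vector index would typically replace the linear loop.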