Video & Audio Embedding
Extract visual frames and audio transcriptions from video with 1408D multimodal and 1024D text embeddings
Why do anything?
Video content contains visual and audio information. Without multimodal extraction, you can't search video by what you see or hear.
Why now?
Video is the dominant content format. Users expect to search within videos, not just metadata.
Why this feature?
Intelligent frame extraction, audio transcription with speaker diarization, and multimodal embeddings (1408D visual + 1024D transcription).
How It Works
The multimodal extractor decomposes a video into visual frames and an audio transcription, then generates embeddings for both modalities.
1. Frame Extraction: scene detection or fixed-interval sampling
2. Visual Embedding: 1408D multimodal embeddings per frame
3. Audio Transcription: Whisper transcription with timestamps
4. Text Embedding: 1024D E5-Large embeddings for transcriptions
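The first step above can be sketched in a few lines. This is an illustrative example only: the function names (`fixed_interval_sample`, `scene_detect_sample`) and the pixel-difference threshold are assumptions for the sketch, not part of the extractor's API; production scene detection typically uses more robust signals than raw frame differences.

```python
import numpy as np

def fixed_interval_sample(num_frames, fps, interval_s=2.0):
    """Return frame indices sampled every `interval_s` seconds."""
    step = max(1, int(round(fps * interval_s)))
    return list(range(0, num_frames, step))

def scene_detect_sample(frames, threshold=30.0):
    """Return frame 0 plus every index where the mean absolute pixel
    difference to the previous frame exceeds `threshold` (a cut)."""
    keep = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32)
                      - frames[i - 1].astype(np.float32)).mean()
        if diff > threshold:
            keep.append(i)
    return keep
```

Scene detection keeps one frame per visual segment, so a static shot contributes a single embedding instead of dozens of near-duplicates; fixed-interval sampling trades that efficiency for predictable coverage.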
Why This Approach
Scene detection captures semantic changes. Multimodal embeddings enable cross-modal search (text query → video frame).
Where This Is Used
Integration
client.collections.create(feature_extractor={"feature_extractor_name": "multimodal_extractor", "version": "v1"})
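Expanding the snippet above into context: the extractor is selected by name and version in the collection's configuration. The `client` object and any calls beyond `collections.create` are assumed from the snippet, not verified against a real SDK; treat this as a sketch of the configuration shape.

```python
# Configuration payload taken from the snippet above; the surrounding
# client calls are assumptions for illustration.
feature_extractor = {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
}

# Hypothetical usage, following the documented call:
# client.collections.create(feature_extractor=feature_extractor)
```

Pinning `version` keeps embeddings in a collection consistent: mixing vectors from different extractor versions in one index would make similarity scores incomparable.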
