
    Video & Audio Embedding

    Extract visual frames and audio transcriptions from video with 1408D multimodal and 1024D text embeddings

    Why do anything?

    Video content contains visual and audio information. Without multimodal extraction, you can't search video by what you see or hear.

    Why now?

    Video is the dominant content format. Users expect to search within videos, not just metadata.

    Why this feature?

    Intelligent frame extraction, audio transcription with speaker diarization, and multimodal embeddings (1408D visual + 1024D transcription).

    How It Works

    The multimodal extractor decomposes video into visual frames and an audio transcription, generating embeddings for both modalities.

    1. Frame Extraction: scene detection or fixed-interval sampling

    2. Visual Embedding: 1408D multimodal embeddings per frame

    3. Audio Transcription: Whisper transcription with timestamps

    4. Text Embedding: 1024D E5-Large embeddings for transcriptions
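    The two frame-selection strategies in step 1 can be sketched in a few lines. Everything below is illustrative: the fixed-interval sampler and the brightness-difference scene detector are toy stand-ins, not Mixpeek's implementation.

```python
# Toy sketches of the two frame-extraction strategies.
# Not Mixpeek's implementation; for illustration only.

def sample_timestamps(duration_s, interval_s):
    """Fixed-interval sampling: one frame timestamp every interval_s seconds."""
    t, out = 0.0, []
    while t < duration_s:
        out.append(round(t, 3))
        t += interval_s
    return out

def scene_changes(frame_brightness, threshold=0.3):
    """Toy scene detection: keep frames whose mean brightness jumps
    relative to the previous frame (a stand-in for a real detector)."""
    keep = [0]  # always keep the first frame
    for i in range(1, len(frame_brightness)):
        if abs(frame_brightness[i] - frame_brightness[i - 1]) > threshold:
            keep.append(i)
    return keep

print(sample_timestamps(10.0, 2.5))                 # → [0.0, 2.5, 5.0, 7.5]
print(scene_changes([0.1, 0.12, 0.8, 0.82, 0.2]))   # → [0, 2, 4]
```

    Scene detection keeps fewer, more informative frames for talking-head or static footage; fixed-interval sampling gives predictable coverage regardless of content.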

    Why This Approach

    Scene detection captures semantic changes. Multimodal embeddings enable cross-modal search (text query → video frame).
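    Cross-modal search reduces to nearest-neighbor ranking in a shared embedding space: embed the text query, then rank frame embeddings by cosine similarity. A minimal sketch follows; the 4D vectors and frame names are made up for readability (real visual embeddings are 1408D), and the "red car" query vector is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical per-frame embeddings (4D toys standing in for 1408D vectors).
frames = {
    "frame_000": [0.9, 0.1, 0.0, 0.1],
    "frame_045": [0.1, 0.8, 0.2, 0.0],
    "frame_090": [0.2, 0.1, 0.9, 0.1],
}

# Hypothetical embedding of the text query "red car" in the same space.
query = [0.85, 0.15, 0.05, 0.1]

# Rank frames by similarity to the query; the top hit answers
# "which moment in the video matches this text?"
ranked = sorted(frames, key=lambda f: cosine(query, frames[f]), reverse=True)
print(ranked[0])  # → frame_000
```

    The same ranking applies to the 1024D transcription embeddings, so a query can match either what was shown or what was said.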

    Integration

    client.collections.create(
        feature_extractor={
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
        }
    )