# Multimodal Extractor

> Unified embeddings for video, image, audio, text, and GIF with transcription, OCR, thumbnails, and structured extraction

Runnable reference for this extractor: inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.

Multimodal extractor pipeline showing video splitting, parallel processing with Whisper and embedding models, and output features

The multimodal extractor processes **video, audio, image, text, and GIF content** through a unified pipeline. Videos and audio are decomposed into segments with transcription (Whisper), visual embeddings, OCR, and descriptions. Images and text are embedded directly without decomposition.

Two versions are available:

| Version | Embedding Model | Dimensions | Key Difference |
| ------- | --------------- | ---------- | -------------- |
| **v1** | Vertex Multimodal Embedding | 1408 | Established, lower dimensionality |
| **v2** | Gemini Embedding 2 | 3072 (configurable: 1536, 768) | Higher dimensionality, Matryoshka support, native multimodal |

Both versions share the same pipeline (FFmpeg chunking, Whisper, thumbnails, Gemini vision) and differ only in the multimodal embedding step.

View extractor details at [api.mixpeek.com/v1/collections/features/extractors/multimodal\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v1) or [multimodal\_extractor\_v2](https://api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v2). You can also fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.

## Pipeline Steps

1. **Filter Dataset** (if collection\_id provided)
   * Filter to specified collection
2. **Apply Input Mappings**
3. **Detect Content Types** (sample 100 rows)
   * Identify: video, audio, image, text, or mixed
4. **Content Routing**
   * **Video:** FFmpeg chunking (time/scene/silence) → Steps 5-10
   * **Audio:** FFmpeg audio chunking (time/silence) → Steps 5-8
   * **Image:** Skip to Step 7 (embedded directly, no chunking or transcription)
   * **Text:** Skip to Step 7 (embedded directly, no chunking or transcription)
   * **Mixed:** Branch by type, process separately, union results
5. **Transcription** (conditional: if `run_transcription=true`, video/audio only)
   * Whisper API or Local GPU speech-to-text
6. **Transcription Embeddings** (conditional: if `run_transcription_embedding=true`)
   * E5-Large text embeddings (1024D) from transcribed audio
7. **Multimodal Embeddings** (conditional: if `run_multimodal_embedding=true`)
   * **v1:** Vertex AI embeddings (1408D)
   * **v2:** Gemini Embedding 2 (3072D, configurable)
   * Unified embedding space enables cross-modal search
8. **Thumbnail Generation** (conditional: if `enable_thumbnails=true`, visual content only)
   * 640px width at 85% quality, S3 upload with optional CDN
9. **Visual Analysis** (conditional: if `run_video_description` OR `run_ocr=true`, visual content only)
   * Gemini-based descriptions and/or OCR text extraction
10. **Output**
    * Segment/document records with embeddings, transcriptions, descriptions, OCR, thumbnails

## When to Use

| Use Case | Description |
| -------- | ----------- |
| **Video content libraries** | Search and navigate video segments by content |
| **Media platforms** | Search across spoken and visual content |
| **Educational content** | Find moments in lectures and tutorials |
| **Surveillance/security** | Event detection in footage |
| **Social media** | Process user-generated video content |
| **Broadcasting/streaming** | Large video catalog management |
| **Marketing analytics** | Analyze video campaigns |
| **Cross-modal search** | Find videos/images using text queries |

## When NOT to Use

| Scenario | Recommended Alternative |
| -------- | ----------------------- |
| Static image collections only | `image_extractor` |
| Audio-only content | `audio_extractor` |
| Very short videos (\< 5 seconds) | Processing overhead not worth it |
| Real-time live streams | Specialized streaming extractors |
| 8K+ resolution video | Consider downsampling first |
| Embed all files in one object as one vector | `gemini_multifile_extractor` |

## Supported Input Types

| Input | Type | Description | Processing |
| ----- | ---- | ----------- | ---------- |
| `video` | string | URL or S3 path | Decomposed into segments |
| `image` | string | URL or S3 path | Direct embedding (no decomposition) |
| `text` | string | Plain text content | Direct embedding |
| `gif` | string | URL or S3 path | Treated as video, frame-by-frame |

**Supported formats:**

* **Video**: MP4, MOV, AVI, MKV, WebM, FLV
* **Image**: JPG, PNG, WebP, BMP
* **GIF**: Animated GIF

## Input Schema

Provide **one** of the following inputs:

```json
{
  "video": "s3://bucket/videos/lecture.mp4"
}
```

```json
{
  "image": "https://cdn.example.com/products/laptop.jpg"
}
```

```json
{
  "text": "High-performance laptop with M3 chip, perfect for developers"
}
```

| Field | Type | Description |
| ----- | ---- | ----------- |
| `video` | string | URL/S3 path to video file. Recommended: 720p-1080p, \< 2 hours |
| `image` | string | URL/S3 path to image file. Recommended: \< 10MB |
| `text` | string | Plain text for cross-modal embedding |
| `gif` | string | URL/S3 path to GIF file |
| `custom_thumbnail` | string | Optional custom thumbnail URL instead of auto-generated |

## Output Schema

Each video segment produces one document. Images and text produce one document each without segmentation.
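To make the one-document-per-segment rule concrete, here is a minimal sketch that estimates how many documents a single input yields under `split_method="time"`. The function name `expected_documents` is hypothetical and not part of the Mixpeek API, and rounding up to count a trailing partial segment is an assumption; the page itself only gives `video_duration / interval` as the segment count.

```python
import math


def expected_documents(input_type: str,
                       duration_s: float = 0.0,
                       time_split_interval: int = 10) -> int:
    """Estimate the documents produced for one input (hypothetical helper).

    Assumes time-based splitting and that a trailing partial segment
    still becomes its own document (hence the ceiling).
    """
    if input_type in ("image", "text"):
        # Images and text are embedded directly: one document, no segments.
        return 1
    if input_type == "video":
        if duration_s <= 0 or time_split_interval <= 0:
            raise ValueError("duration_s and time_split_interval must be positive")
        return math.ceil(duration_s / time_split_interval)
    raise ValueError(f"unsupported input type for this sketch: {input_type!r}")


if __name__ == "__main__":
    print(expected_documents("video", duration_s=120.5, time_split_interval=10))
    print(expected_documents("image"))
```

Scene and silence splitting produce a variable number of segments, so this estimate only applies to fixed-interval splitting.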
### Segment & Timing Fields

| Field | Type | Description |
| ----- | ---- | ----------- |
| `start_time` | number | Segment start time in seconds |
| `end_time` | number | Segment end time in seconds |
| `start_frame` | integer | Start frame number (`start_time × fps`) |
| `end_frame` | integer | End frame number (`end_time × fps`) |
| `fps` | number | Frame rate of the preprocessed video used for chunking |
| `source_fps` | number | Original source video frame rate before any preprocessing (e.g. 29.97, 30, 23.976). Use this for precise frame-level calculations against the source video |
| `duration` | number | Total duration of the entire source video in seconds (not the segment duration) |

### Content Fields

| Field | Type | Description |
| ----- | ---- | ----------- |
| `transcription` | string | Transcribed audio content (requires `run_transcription`) |
| `description` | string | AI-generated segment description (requires `run_video_description`) |
| `ocr_text` | string | Text extracted from video frames (requires `run_ocr`) |

### URL Fields

| Field | Type | Description |
| ----- | ---- | ----------- |
| `thumbnail_url` | string | S3/CDN URL of the thumbnail image |
| `source_video_url` | string | URL of the original source video |
| `video_segment_url` | string | S3 URL of this specific segment file. Enables [collection-to-collection decomposition](#collection-to-collection-pipelines) |

### Embedding Fields

**v1:**

| Field | Type | Description |
| ----- | ---- | ----------- |
| `multimodal_extractor_v1_multimodal_embedding` | float\[1408] | Vertex AI multimodal embedding |
| `multimodal_extractor_v1_transcription_embedding` | float\[1024] | E5-Large transcription embedding |

**v2:**

| Field | Type | Description |
| ----- | ---- | ----------- |
| `multimodal_extractor_v2_multimodal_embedding` | float\[3072] | Gemini Embedding 2 multimodal embedding (configurable: 1536, 768) |
| `multimodal_extractor_v2_transcription_embedding` | float\[1024] | E5-Large transcription embedding |

### Example Output

**v1:**

```json
{
  "start_time": 10.0,
  "end_time": 20.0,
  "start_frame": 20,
  "end_frame": 40,
  "fps": 2.0,
  "source_fps": 29.97,
  "duration": 120.5,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
  "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, "...1408 floats"],
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}
```

**v2:**

```json
{
  "start_time": 10.0,
  "end_time": 20.0,
  "start_frame": 20,
  "end_frame": 40,
  "fps": 2.0,
  "source_fps": 29.97,
  "duration": 120.5,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
  "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
  "multimodal_extractor_v2_multimodal_embedding": [0.015, -0.038, "...3072 floats"],
  "multimodal_extractor_v2_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}
```

`fps` reflects the preprocessed video frame rate (e.g. 2.0 fps after downsampling). `source_fps` is the original video's native frame rate (e.g. 29.97). Use `source_fps` when you need to map timestamps back to exact frame numbers in the original source file.

## Parameters

### Video Splitting

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `split_method` | string | `"time"` | Primary video splitting strategy: `time`, `scene`, or `silence` |
| `max_segment_duration` | float | `30.0` | Maximum seconds per segment. Scene/silence segments longer than this are subdivided. Set to `null` to disable |

#### Split Methods

**Fixed interval splitting** - Splits video into segments of equal duration.

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `time_split_interval` | integer | `10` | Interval in seconds for each segment |

**Characteristics:**

* Predictable segment count: `video_duration / interval`
* Consistent chunk sizes for uniform processing
* May cut mid-sentence or mid-scene

**Best for:** General purpose, consistent chunking, when you need predictable segment counts

```json
{
  "split_method": "time",
  "time_split_interval": 10
}
```

**Visual change detection** - Splits video when significant visual changes occur (shot changes, transitions).
| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `scene_detection_threshold` | float | `0.5` | Sensitivity threshold (0.0-1.0) |

**Threshold guide:**

* `0.3` - High sensitivity, detects subtle changes (more segments)
* `0.5` - Balanced (default)
* `0.7` - Low sensitivity, only major scene changes (fewer segments)

**Characteristics:**

* Variable segment count (typically 2-20 per minute)
* Segments align with visual content boundaries
* Better for content with distinct shots/scenes

**Best for:** Movies, dynamic content, shot changes, music videos, advertisements

```json
{
  "split_method": "scene",
  "scene_detection_threshold": 0.5
}
```

**Audio pause detection** - Splits video at moments of silence or low audio.

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `silence_db_threshold` | integer | `-40` | Decibel level below which audio is considered silent |

**Threshold guide:**

* `-50` dB - Strict: only very quiet moments count as silence (fewer segments)
* `-40` dB - Balanced (default)
* `-30` dB - Permissive: quiet passages and low background audio also count as silence (more segments)

**Characteristics:**

* Variable segment count (typically 5-30 per minute)
* Segments align with natural speech pauses
* Preserves complete sentences/thoughts

**Best for:** Lectures, presentations, conversations, podcasts, interviews

```json
{
  "split_method": "silence",
  "silence_db_threshold": -40
}
```

#### Split Methods Comparison

| Method | Segments/Min | Predictability | Best For |
| ------ | ------------ | -------------- | -------- |
| `time` | 60 / interval\_sec | High | General purpose, batch processing |
| `scene` | Variable (2-20) | Low | Movies, ads, dynamic visual content |
| `silence` | Variable (5-30) | Medium | Lectures, podcasts, spoken content |

### Feature Extraction Parameters

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `run_transcription` | boolean | `true` (v1) / `false` (v2) | Run Whisper transcription on audio |
| `transcription_language` | string | `"en"` | Language for transcription |
| `run_transcription_embedding` | boolean | `true` (v1) / `false` (v2) | Generate E5 embeddings for transcriptions |
| `run_multimodal_embedding` | boolean | `true` | Generate multimodal embeddings |
| `run_video_description` | boolean | `false` | Generate AI descriptions (adds 1-2s per segment) |
| `run_ocr` | boolean | `false` | Extract text from video frames |

### Thumbnail Parameters

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `enable_thumbnails` | boolean | `true` | Generate thumbnail images |
| `use_cdn` | boolean | `false` | Use CloudFront CDN for thumbnails |

**CDN benefits**: Faster global delivery, permanent URLs, reduced bandwidth costs.

### v2-Only Parameters

These parameters are only available on `multimodal_extractor` v2:

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `output_dimensionality` | integer | `3072` | Embedding dimensions. Gemini Embedding 2 supports Matryoshka reduction: `3072` (full), `1536`, or `768` |
| `task_type` | string | `"RETRIEVAL_DOCUMENT"` | Embedding task hint: `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`, `SEMANTIC_SIMILARITY`, `CLASSIFICATION` |

At query time, Mixpeek automatically uses `RETRIEVAL_QUERY`, so you only need to set `task_type` at index time. The default `RETRIEVAL_DOCUMENT` is correct for most use cases.

### Embedding Task

When `run_transcription_embedding` is enabled, the E5 model generates text embeddings from transcribed audio.
By default, these use `retrieval_document` for asymmetric search.

Set `embedding_task` at the **collection level**, not on the extractor. See [Collection Embedding Task](/platform/processing#embedding-task) for full details and examples.

This only affects the E5 transcription embeddings. Vertex AI multimodal embeddings (v1) and Gemini Embedding 2 (v2) are not instruction-aware and ignore this parameter.

### Description Generation Parameters

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `description_prompt` | string | `"Describe the video segment in detail."` | Prompt for Gemini |
| `generation_config.temperature` | float | `0.7` | Randomness (higher = more creative) |
| `generation_config.max_output_tokens` | integer | `1024` | Maximum description length |
| `generation_config.top_p` | float | `0.8` | Nucleus sampling |

### LLM Structured Extraction

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `response_shape` | string \| object | `null` | Custom structured output schema |

**Natural Language Mode:**

```json
{
  "response_shape": "Extract product names, colors, materials, and aesthetic style labels from this fashion segment"
}
```

**JSON Schema Mode:**

```json
{
  "response_shape": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "category": { "type": "string" },
            "visibility_percentage": { "type": "integer", "minimum": 0, "maximum": 100 }
          }
        }
      },
      "aesthetic": { "type": "string" }
    }
  }
}
```

## Configuration Examples

**v1: Video with Time-Based Splitting**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "video_url" },
    "field_passthrough": [
      { "source_path": "metadata.video_id" }
    ],
    "parameters": {
      "split_method": "time",
      "time_split_interval": 10,
      "run_transcription": true,
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}
```

**v1: Video with Scene Detection**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "video_url" },
    "parameters": {
      "split_method": "scene",
      "scene_detection_threshold": 0.5,
      "run_transcription": true,
      "run_video_description": true,
      "enable_thumbnails": true
    }
  }
}
```

**v1: Lecture Video with Silence Splitting**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "lecture_url" },
    "parameters": {
      "split_method": "silence",
      "silence_db_threshold": -40,
      "run_transcription": true,
      "transcription_language": "en",
      "run_ocr": true,
      "enable_thumbnails": true
    }
  }
}
```

**v2: Video with Gemini Embedding 2**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v2",
    "input_mappings": { "video": "video_url" },
    "parameters": {
      "split_method": "time",
      "time_split_interval": 10,
      "run_multimodal_embedding": true,
      "output_dimensionality": 3072,
      "enable_thumbnails": true
    }
  }
}
```

**v2: Compact Embeddings (768D)**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v2",
    "input_mappings": { "video": "video_url" },
    "parameters": {
      "split_method": "scene",
      "scene_detection_threshold": 0.5,
      "run_multimodal_embedding": true,
      "output_dimensionality": 768,
      "run_transcription": true,
      "run_transcription_embedding": true,
      "enable_thumbnails": true
    }
  }
}
```

**Image Embedding (v1 or v2)**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": [
      { "source_path": "metadata.product_id" }
    ],
    "parameters": {
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}
```

**Text Embedding (Cross-Modal Search)**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "text": "product_description" },
    "parameters": {
      "run_multimodal_embedding": true
    }
  }
}
```

**v1: Full Extraction with All Features**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "video_url" },
    "parameters": {
      "split_method": "scene",
      "scene_detection_threshold": 0.5,
      "run_transcription": true,
      "run_transcription_embedding": true,
      "run_multimodal_embedding": true,
      "run_video_description": true,
      "run_ocr": true,
      "enable_thumbnails": true,
      "use_cdn": true,
      "description_prompt": "Describe what is happening in this video segment, including any visible products, people, and actions.",
      "generation_config": {
        "temperature": 0.7,
        "max_output_tokens": 1024
      }
    }
  }
}
```

**Fashion/E-commerce with Structured Extraction**

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "fashion_video_url" },
    "parameters": {
      "split_method": "scene",
      "run_multimodal_embedding": true,
      "run_video_description": true,
      "response_shape": {
        "type": "object",
        "properties": {
          "products": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "category": { "type": "string" },
                "color": { "type": "string" },
                "visibility_percentage": { "type": "integer" }
              }
            }
          },
          "aesthetic": { "type": "string" },
          "setting": { "type": "string" }
        }
      }
    }
  }
}
```

## Performance & Costs

### Processing Speed

| Content Type | Speed |
| ------------ | ----- |
| Video | 0.5-2x realtime (depends on features enabled) |
| Image | \< 1 second |
| Text | \< 100ms |

**Example**: 10-minute video → 5-20 minutes processing time

| Feature | Latency per Segment |
| ------- | ------------------- |
| Transcription | \~200ms per second of audio |
| Visual embedding | \~50ms |
| OCR | \~300ms |
| Description | \~2s |

### Cost Estimates (per minute of video)

| Configuration | Cost |
| ------------- | ---- |
| **Minimal** (transcription + embeddings) | \$0.01 |
| **Standard** (+ OCR) | \$0.05 |
| **Full** (+ descriptions) | \$0.15 |

**Images**: \$0.001 per image

**Text**: \$0.0001 per query

## Vector Indexes

### Multimodal Embedding (v1)

| Property | Value |
| -------- | ----- |
| **Feature URI** | `mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding` |
| **Index name** | `multimodal_extractor_v1_multimodal_embedding` |
| **Dimensions** | 1408 |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `vertex_multimodal_embedding` |
| **Supported inputs** | video, text, image |

### Transcription Embedding (v1)

| Property | Value |
| -------- | ----- |
| **Feature URI** | `mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1` |
| **Index name** | `multimodal_extractor_v1_transcription_embedding` |
| **Dimensions** | 1024 |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `multilingual_e5_large_instruct_v1` |
| **Supported inputs** | text, string |

In retrievers, reference these by their **Feature URI**: the output name is the model name, **not** the `multimodal_extractor_v1_*` index name.
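The difference between a Feature URI and an index name can be checked mechanically. The sketch below splits a Feature URI into its parts, assuming the layout `mixpeek://<extractor>@<version>/<output>` inferred from the two v1 URIs listed above. `FeatureURI` and `parse_feature_uri` are hypothetical names, not part of any Mixpeek SDK.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    output: str  # the model/output name, not the index name


# Assumed layout, inferred from the URIs on this page:
#   mixpeek://<extractor>@<version>/<output>
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[A-Za-z0-9_]+)@(?P<version>v\d+)/(?P<output>[A-Za-z0-9_]+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a Feature URI into extractor, version, and output name."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


if __name__ == "__main__":
    parsed = parse_feature_uri(
        "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"
    )
    # The output segment is the inference model name, which differs from
    # the index name multimodal_extractor_v1_multimodal_embedding.
    print(parsed.extractor, parsed.version, parsed.output)
```

Passing an index name such as `multimodal_extractor_v1_multimodal_embedding` raises `ValueError`, which is a cheap guard against the mix-up described above.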
### Multimodal Embedding (v2)

| Property | Value |
| -------- | ----- |
| **Index name** | `multimodal_extractor_v2_multimodal_embedding` |
| **Dimensions** | 3072 (configurable: 1536, 768) |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `google/gemini-embedding-2` |
| **Supported inputs** | video, text, image, audio |

### Transcription Embedding (v2)

| Property | Value |
| -------- | ----- |
| **Index name** | `multimodal_extractor_v2_transcription_embedding` |
| **Dimensions** | 1024 |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `intfloat/multilingual-e5-large-instruct` |
| **Supported inputs** | text, string |

## Choosing v1 vs v2

| Consideration | v1 | v2 |
| ------------- | -- | -- |
| **Embedding quality** | Good | Better (natively multimodal) |
| **Dimensions** | 1408 (fixed) | 3072, 1536, or 768 (configurable) |
| **Storage per vector** | 5.5 KB | 12 KB (3072D), 6 KB (1536D), 3 KB (768D) |
| **Audio input support** | Via transcription only | Native audio embedding |
| **Matryoshka support** | No | Yes: reduce dimensions without reindexing |
| **Stability** | Production-proven | Newer |

**Recommendation:** Use **v2** for new projects. Use **v1** if you have existing collections and don't need higher dimensions or native audio embedding.
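The storage figures in the comparison table match what you get by assuming each dimension is stored as a 4-byte float32 and ignoring index overhead. Both are assumptions of this sketch, not something the page states. The helper names below are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def vector_storage_kb(dimensions: int) -> float:
    """Raw size of one dense vector in KB (1 KB = 1024 bytes)."""
    return dimensions * BYTES_PER_FLOAT32 / 1024


def collection_storage_mb(dimensions: int, n_vectors: int) -> float:
    """Raw vector storage for a collection in MB, excluding index overhead."""
    return vector_storage_kb(dimensions) * n_vectors / 1024


if __name__ == "__main__":
    for dims in (1408, 3072, 1536, 768):
        print(f"{dims}D: {vector_storage_kb(dims)} KB per vector")
    # One million segments at full v2 dimensionality vs. the 768D option.
    print(collection_storage_mb(3072, 1_000_000), "MB at 3072D")
    print(collection_storage_mb(768, 1_000_000), "MB at 768D")
```

Under these assumptions, dropping v2 from 3072 to 768 dimensions cuts raw vector storage by a factor of four, which is the main trade-off to weigh against embedding quality.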
## Limitations

* **Video duration**: Recommend \< 2 hours for optimal processing
* **Resolution**: 8K+ videos should be downsampled
* **Real-time**: Not suitable for live streaming
* **Short videos**: \< 5 second videos have disproportionate overhead
* **Audio quality**: Transcription accuracy depends on audio clarity
* **OCR/Description**: Add significant processing time, enable only when needed

## Collection-to-Collection Pipelines

The `video_segment_url` output enables decomposition chains:

1. **Initial collection**: Time-based segments (5s intervals)
2. **Downstream collection**: Scene detection within each segment
3. **Final collection**: Enhanced processing with different models

```json
{
  "input_mappings": {
    "video": "video_segment_url"
  }
}
```

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Gemini Multifile Extractor](/processing/extractors/gemini-multifile): Embed multiple files per object into one vector
* [Passthrough Extractor](/processing/extractors/passthrough)
* [Text Extractor](/processing/extractors/text)