# Multimodal Extractor

> Unified embeddings for video, image, audio, text, and GIF with transcription, OCR, thumbnails, and structured extraction

<Card title="View on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors/blob/main/extractors/multimodal_extractor/README.md" horizontal>
  Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
</Card>

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/extractors/multimodal.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=8f068e849242497def22baeb67314d46" alt="Multimodal extractor pipeline showing video splitting, parallel processing with Whisper and embedding models, and output features" width="1200" height="520" data-path="assets/extractors/multimodal.svg" />
</Frame>

The multimodal extractor processes **video, audio, image, text, and GIF content** through a unified pipeline. Videos and audio are decomposed into segments with transcription (Whisper), visual embeddings, OCR, and descriptions. Images and text are embedded directly without decomposition.

Two versions are available:

| Version | Embedding Model             | Dimensions                     | Key Difference                                               |
| ------- | --------------------------- | ------------------------------ | ------------------------------------------------------------ |
| **v1**  | Vertex Multimodal Embedding | 1408                           | Established, lower dimensionality                            |
| **v2**  | Gemini Embedding 2          | 3072 (configurable: 1536, 768) | Higher dimensionality, Matryoshka support, native multimodal |

Both versions share the same pipeline (FFmpeg chunking, Whisper, thumbnails, Gemini vision) and differ only in the multimodal embedding step.

<Note>
  View extractor details at [api.mixpeek.com/v1/collections/features/extractors/multimodal\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v1) or [multimodal\_extractor\_v2](https://api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v2). You can also fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
</Note>
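
As a minimal sketch of the programmatic fetch mentioned above, the snippet below builds the extractor URL and issues the `GET` request with the Python standard library. The `Authorization: Bearer` header and the shape of the JSON response are assumptions; check your account's authentication settings before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://api.mixpeek.com/v1/collections/features/extractors"


def extractor_url(feature_extractor_id: str) -> str:
    """Build the details URL for an extractor such as multimodal_extractor_v2."""
    return f"{BASE_URL}/{feature_extractor_id}"


def fetch_extractor(feature_extractor_id: str, api_key: str) -> dict:
    """Fetch extractor details. The bearer-token header is an assumed auth scheme."""
    request = urllib.request.Request(
        extractor_url(feature_extractor_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


print(extractor_url("multimodal_extractor_v2"))
```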

## Pipeline Steps

1. **Filter Dataset** (if `collection_id` is provided)
   * Filter to specified collection
2. **Apply Input Mappings**
3. **Detect Content Types** (sample 100 rows)
   * Identify: video, audio, image, text, or mixed
4. **Content Routing**
   * **Video:** FFmpeg chunking (time/scene/silence) → Steps 5-10
   * **Audio:** FFmpeg audio chunking (time/silence) → Steps 5-7
   * **Image:** Skip to Step 7
   * **Text:** Skip to Step 7
   * **Mixed:** Branch by type, process separately, union results
5. **Transcription** (conditional: if `run_transcription=true`, video/audio only)
   * Whisper API or local GPU speech-to-text
6. **Transcription Embeddings** (conditional: if `run_transcription_embedding=true`)
   * E5-Large text embeddings (1024D) from transcribed audio
7. **Multimodal Embeddings** (conditional: if `run_multimodal_embedding=true`)
   * **v1:** Vertex AI embeddings (1408D)
   * **v2:** Gemini Embedding 2 (3072D, configurable)
   * Unified embedding space enables cross-modal search
8. **Thumbnail Generation** (conditional: if `enable_thumbnails=true`, visual content only)
   * 640px width at 85% quality, S3 upload with optional CDN
9. **Visual Analysis** (conditional: if `run_video_description=true` or `run_ocr=true`, visual content only)
   * Gemini-based descriptions and/or OCR text extraction
10. **Output**
    * Segment/document records with embeddings, transcriptions, descriptions, OCR, thumbnails

## When to Use

| Use Case                    | Description                                   |
| --------------------------- | --------------------------------------------- |
| **Video content libraries** | Search and navigate video segments by content |
| **Media platforms**         | Search across spoken and visual content       |
| **Educational content**     | Find moments in lectures and tutorials        |
| **Surveillance/security**   | Event detection in footage                    |
| **Social media**            | Process user-generated video content          |
| **Broadcasting/streaming**  | Large video catalog management                |
| **Marketing analytics**     | Analyze video campaigns                       |
| **Cross-modal search**      | Find videos/images using text queries         |

## When NOT to Use

| Scenario                                    | Recommended Alternative          |
| ------------------------------------------- | -------------------------------- |
| Static image collections only               | `image_extractor`                |
| Audio-only content                          | `audio_extractor`                |
| Very short videos (\< 5 seconds)            | Processing overhead not worth it |
| Real-time live streams                      | Specialized streaming extractors |
| 8K+ resolution video                        | Consider downsampling first      |
| Embed all files in one object as one vector | `gemini_multifile_extractor`     |

## Supported Input Types

| Input   | Type   | Description        | Processing                          |
| ------- | ------ | ------------------ | ----------------------------------- |
| `video` | string | URL or S3 path     | Decomposed into segments            |
| `image` | string | URL or S3 path     | Direct embedding (no decomposition) |
| `text`  | string | Plain text content | Direct embedding                    |
| `gif`   | string | URL or S3 path     | Treated as video, frame-by-frame    |

**Supported formats:**

* **Video**: MP4, MOV, AVI, MKV, WebM, FLV
* **Image**: JPG, PNG, WebP, BMP
* **GIF**: Animated GIF

## Input Schema

Provide **one** of the following inputs:

```json theme={null}
{
  "video": "s3://bucket/videos/lecture.mp4"
}
```

```json theme={null}
{
  "image": "https://cdn.example.com/products/laptop.jpg"
}
```

```json theme={null}
{
  "text": "High-performance laptop with M3 chip, perfect for developers"
}
```

| Field              | Type   | Description                                                    |
| ------------------ | ------ | -------------------------------------------------------------- |
| `video`            | string | URL/S3 path to video file. Recommended: 720p-1080p, \< 2 hours |
| `image`            | string | URL/S3 path to image file. Recommended: \< 10MB                |
| `text`             | string | Plain text for cross-modal embedding                           |
| `gif`              | string | URL/S3 path to GIF file                                        |
| `custom_thumbnail` | string | Optional custom thumbnail URL instead of auto-generated        |

## Output Schema

Each video segment produces one document. Images and text produce one document each without segmentation.

### Segment & Timing Fields

| Field         | Type    | Description                                                                                                                                                |
| ------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `start_time`  | number  | Segment start time in seconds                                                                                                                              |
| `end_time`    | number  | Segment end time in seconds                                                                                                                                |
| `start_frame` | integer | Start frame number (`start_time × fps`)                                                                                                                    |
| `end_frame`   | integer | End frame number (`end_time × fps`)                                                                                                                        |
| `fps`         | number  | Frame rate of the preprocessed video used for chunking                                                                                                     |
| `source_fps`  | number  | Original source video frame rate before any preprocessing (e.g. 29.97, 30, 23.976). Use this for precise frame-level calculations against the source video |
| `duration`    | number  | Total duration of the entire source video in seconds (not the segment duration)                                                                            |

### Content Fields

| Field           | Type   | Description                                                         |
| --------------- | ------ | ------------------------------------------------------------------- |
| `transcription` | string | Transcribed audio content (requires `run_transcription`)            |
| `description`   | string | AI-generated segment description (requires `run_video_description`) |
| `ocr_text`      | string | Text extracted from video frames (requires `run_ocr`)               |

### URL Fields

| Field               | Type   | Description                                                                                                                 |
| ------------------- | ------ | --------------------------------------------------------------------------------------------------------------------------- |
| `thumbnail_url`     | string | S3/CDN URL of the thumbnail image                                                                                           |
| `source_video_url`  | string | URL of the original source video                                                                                            |
| `video_segment_url` | string | S3 URL of this specific segment file. Enables [collection-to-collection decomposition](#collection-to-collection-pipelines) |

### Embedding Fields

<Tabs>
  <Tab title="v1">
    | Field                                             | Type         | Description                      |
    | ------------------------------------------------- | ------------ | -------------------------------- |
    | `multimodal_extractor_v1_multimodal_embedding`    | float\[1408] | Vertex AI multimodal embedding   |
    | `multimodal_extractor_v1_transcription_embedding` | float\[1024] | E5-Large transcription embedding |
  </Tab>

  <Tab title="v2">
    | Field                                             | Type         | Description                                                       |
    | ------------------------------------------------- | ------------ | ----------------------------------------------------------------- |
    | `multimodal_extractor_v2_multimodal_embedding`    | float\[3072] | Gemini Embedding 2 multimodal embedding (configurable: 1536, 768) |
    | `multimodal_extractor_v2_transcription_embedding` | float\[1024] | E5-Large transcription embedding                                  |
  </Tab>
</Tabs>

### Example Output

<Tabs>
  <Tab title="v1">
    ```json theme={null}
    {
      "start_time": 10.0,
      "end_time": 20.0,
      "start_frame": 20,
      "end_frame": 40,
      "fps": 2.0,
      "source_fps": 29.97,
      "duration": 120.5,
      "transcription": "Welcome to today's lecture on machine learning fundamentals...",
      "description": "Instructor standing at whiteboard, introducing ML concepts",
      "ocr_text": "Machine Learning 101",
      "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
      "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
      "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
      "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, "...1408 floats"],
      "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, "...1024 floats"]
    }
    ```
  </Tab>

  <Tab title="v2">
    ```json theme={null}
    {
      "start_time": 10.0,
      "end_time": 20.0,
      "start_frame": 20,
      "end_frame": 40,
      "fps": 2.0,
      "source_fps": 29.97,
      "duration": 120.5,
      "transcription": "Welcome to today's lecture on machine learning fundamentals...",
      "description": "Instructor standing at whiteboard, introducing ML concepts",
      "ocr_text": "Machine Learning 101",
      "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
      "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
      "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
      "multimodal_extractor_v2_multimodal_embedding": [0.015, -0.038, "...3072 floats"],
      "multimodal_extractor_v2_transcription_embedding": [0.018, -0.032, "...1024 floats"]
    }
    ```
  </Tab>
</Tabs>

<Info>
  `fps` reflects the preprocessed video frame rate (e.g. 2.0 fps after downsampling). `source_fps` is the original video's native frame rate (e.g. 29.97). Use `source_fps` when you need to map timestamps back to exact frame numbers in the original source file.
</Info>
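
To make the distinction concrete, here is a small sketch that maps a segment's timestamps to frame numbers in both the preprocessed and the original video. The helper name is hypothetical, and rounding to the nearest frame is an assumption; use floor or ceiling if your player expects a different convention.

```python
def frame_range(segment: dict, fps_field: str = "source_fps") -> tuple[int, int]:
    """Convert start_time/end_time (seconds) to frame numbers using the chosen fps field."""
    fps = segment[fps_field]
    return round(segment["start_time"] * fps), round(segment["end_time"] * fps)


segment = {"start_time": 10.0, "end_time": 20.0, "fps": 2.0, "source_fps": 29.97}

print(frame_range(segment, "fps"))         # (20, 40), matches start_frame / end_frame
print(frame_range(segment, "source_fps"))  # (300, 599), frames in the original file
```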

## Parameters

### Video Splitting

| Parameter              | Type   | Default  | Description                                                                                                   |
| ---------------------- | ------ | -------- | ------------------------------------------------------------------------------------------------------------- |
| `split_method`         | string | `"time"` | Primary video splitting strategy: `time`, `scene`, or `silence`                                               |
| `max_segment_duration` | float  | `30.0`   | Maximum seconds per segment. Scene/silence segments longer than this are subdivided. Set to `null` to disable |

#### Split Methods

<Tabs>
  <Tab title="time">
    **Fixed interval splitting** - Splits video into segments of equal duration.

    | Parameter             | Type    | Default | Description                          |
    | --------------------- | ------- | ------- | ------------------------------------ |
    | `time_split_interval` | integer | `10`    | Interval in seconds for each segment |

    **Characteristics:**

    * Predictable segment count: `video_duration / interval`
    * Consistent chunk sizes for uniform processing
    * May cut mid-sentence or mid-scene

    **Best for:** General purpose, consistent chunking, when you need predictable segment counts

    ```json theme={null}
    {
      "split_method": "time",
      "time_split_interval": 10
    }
    ```
  </Tab>

  <Tab title="scene">
    **Visual change detection** - Splits video when significant visual changes occur (shot changes, transitions).

    | Parameter                   | Type  | Default | Description                     |
    | --------------------------- | ----- | ------- | ------------------------------- |
    | `scene_detection_threshold` | float | `0.5`   | Sensitivity threshold (0.0-1.0) |

    **Threshold guide:**

    * `0.3` - High sensitivity, detects subtle changes (more segments)
    * `0.5` - Balanced (default)
    * `0.7` - Low sensitivity, only major scene changes (fewer segments)

    **Characteristics:**

    * Variable segment count (typically 2-20 per minute)
    * Segments align with visual content boundaries
    * Better for content with distinct shots/scenes

    **Best for:** Movies, dynamic content, shot changes, music videos, advertisements

    ```json theme={null}
    {
      "split_method": "scene",
      "scene_detection_threshold": 0.5
    }
    ```
  </Tab>

  <Tab title="silence">
    **Audio pause detection** - Splits video at moments of silence or low audio.

    | Parameter              | Type    | Default | Description                                          |
    | ---------------------- | ------- | ------- | ---------------------------------------------------- |
    | `silence_db_threshold` | integer | `-40`   | Decibel level below which audio is considered silent |

    **Threshold guide:**

    * `-50` dB - Strict: only very quiet audio counts as silence (fewer segments)
    * `-40` dB - Balanced (default)
    * `-30` dB - Lenient: soft background audio also counts as silence (more segments)

    **Characteristics:**

    * Variable segment count (typically 5-30 per minute)
    * Segments align with natural speech pauses
    * Preserves complete sentences/thoughts

    **Best for:** Lectures, presentations, conversations, podcasts, interviews

    ```json theme={null}
    {
      "split_method": "silence",
      "silence_db_threshold": -40
    }
    ```
  </Tab>
</Tabs>
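
The sketch below illustrates how fixed-interval splitting and the `max_segment_duration` cap could combine. It is not the extractor's implementation: the function names are hypothetical, and both the handling of a trailing partial segment and the way long segments are subdivided (fixed-length pieces plus a remainder) are assumptions.

```python
import math


def time_segments(duration: float, interval: float) -> list[tuple[float, float]]:
    """Split [0, duration] into fixed intervals; the last segment may be shorter."""
    count = math.ceil(duration / interval)
    return [(i * interval, min((i + 1) * interval, duration)) for i in range(count)]


def cap_segments(segments, max_segment_duration):
    """Subdivide segments longer than max_segment_duration; None disables the cap."""
    if max_segment_duration is None:
        return list(segments)
    capped = []
    for start, end in segments:
        while end - start > max_segment_duration:
            capped.append((start, start + max_segment_duration))
            start += max_segment_duration
        capped.append((start, end))
    return capped


print(len(time_segments(120.5, 10)))        # 13 segments for a 120.5 s video
print(cap_segments([(0.0, 75.0)], 30.0))    # one long scene split into three pieces
```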

#### Split Methods Comparison

| Method    | Segments/Min       | Predictability | Best For                            |
| --------- | ------------------ | -------------- | ----------------------------------- |
| `time`    | 60 / interval\_sec | High           | General purpose, batch processing   |
| `scene`   | Variable (2-20)    | Low            | Movies, ads, dynamic visual content |
| `silence` | Variable (5-30)    | Medium         | Lectures, podcasts, spoken content  |

### Feature Extraction Parameters

| Parameter                     | Type    | Default                    | Description                                      |
| ----------------------------- | ------- | -------------------------- | ------------------------------------------------ |
| `run_transcription`           | boolean | `true` (v1) / `false` (v2) | Run Whisper transcription on audio               |
| `transcription_language`      | string  | `"en"`                     | Language for transcription                       |
| `run_transcription_embedding` | boolean | `true` (v1) / `false` (v2) | Generate E5 embeddings for transcriptions        |
| `run_multimodal_embedding`    | boolean | `true`                     | Generate multimodal embeddings                   |
| `run_video_description`       | boolean | `false`                    | Generate AI descriptions (adds 1-2s per segment) |
| `run_ocr`                     | boolean | `false`                    | Extract text from video frames                   |

### Thumbnail Parameters

| Parameter           | Type    | Default | Description                       |
| ------------------- | ------- | ------- | --------------------------------- |
| `enable_thumbnails` | boolean | `true`  | Generate thumbnail images         |
| `use_cdn`           | boolean | `false` | Use CloudFront CDN for thumbnails |

**CDN benefits**: Faster global delivery, permanent URLs, reduced bandwidth costs.

### v2-Only Parameters

These parameters are only available on `multimodal_extractor` v2:

| Parameter               | Type    | Default                | Description                                                                                             |
| ----------------------- | ------- | ---------------------- | ------------------------------------------------------------------------------------------------------- |
| `output_dimensionality` | integer | `3072`                 | Embedding dimensions. Gemini Embedding 2 supports Matryoshka reduction: `3072` (full), `1536`, or `768` |
| `task_type`             | string  | `"RETRIEVAL_DOCUMENT"` | Embedding task hint: `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`, `SEMANTIC_SIMILARITY`, `CLASSIFICATION`   |

<Info>
  At query time, Mixpeek automatically uses `RETRIEVAL_QUERY` — you only need to set `task_type` at index time. The default `RETRIEVAL_DOCUMENT` is correct for most use cases.
</Info>
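
A client-side check can catch unsupported values before a collection is created. The helper below is a hypothetical sketch that only encodes the allowed values from the table above; whether the API itself rejects other values, and with what error, is not covered here.

```python
ALLOWED_DIMENSIONS = {3072, 1536, 768}
ALLOWED_TASK_TYPES = {
    "RETRIEVAL_DOCUMENT",
    "RETRIEVAL_QUERY",
    "SEMANTIC_SIMILARITY",
    "CLASSIFICATION",
}


def v2_parameters(output_dimensionality: int = 3072,
                  task_type: str = "RETRIEVAL_DOCUMENT") -> dict:
    """Validate and return the v2-only parameters (hypothetical helper)."""
    if output_dimensionality not in ALLOWED_DIMENSIONS:
        raise ValueError(f"output_dimensionality must be one of {sorted(ALLOWED_DIMENSIONS)}")
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"task_type must be one of {sorted(ALLOWED_TASK_TYPES)}")
    return {"output_dimensionality": output_dimensionality, "task_type": task_type}


print(v2_parameters(768))
```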

### Embedding Task

When `run_transcription_embedding` is enabled, the E5 model generates text embeddings from transcribed audio. By default, these use `retrieval_document` for asymmetric search.

<Note>
  Set `embedding_task` at the **collection level**, not on the extractor. See [Collection Embedding Task](/platform/processing#embedding-task) for full details and examples.
</Note>

<Info>
  This only affects the E5 transcription embeddings. Vertex AI multimodal embeddings (v1) and Gemini Embedding 2 (v2) are not instruction-aware and ignore this parameter.
</Info>

### Description Generation Parameters

| Parameter                             | Type    | Default                                   | Description                         |
| ------------------------------------- | ------- | ----------------------------------------- | ----------------------------------- |
| `description_prompt`                  | string  | `"Describe the video segment in detail."` | Prompt for Gemini                   |
| `generation_config.temperature`       | float   | `0.7`                                     | Randomness (higher = more creative) |
| `generation_config.max_output_tokens` | integer | `1024`                                    | Maximum description length          |
| `generation_config.top_p`             | float   | `0.8`                                     | Nucleus sampling                    |

### LLM Structured Extraction

| Parameter        | Type             | Default | Description                     |
| ---------------- | ---------------- | ------- | ------------------------------- |
| `response_shape` | string \| object | `null`  | Custom structured output schema |

**Natural Language Mode:**

```json theme={null}
{
  "response_shape": "Extract product names, colors, materials, and aesthetic style labels from this fashion segment"
}
```

**JSON Schema Mode:**

```json theme={null}
{
  "response_shape": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "category": { "type": "string" },
            "visibility_percentage": { "type": "integer", "minimum": 0, "maximum": 100 }
          }
        }
      },
      "aesthetic": { "type": "string" }
    }
  }
}
```

## Configuration Examples

<CodeGroup>
  ```json v1 — Video with Time-Based Splitting theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "video_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.video_id" }
      ],
      "parameters": {
        "split_method": "time",
        "time_split_interval": 10,
        "run_transcription": true,
        "run_multimodal_embedding": true,
        "enable_thumbnails": true
      }
    }
  }
  ```

  ```json v1 — Video with Scene Detection theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "video_url"
      },
      "parameters": {
        "split_method": "scene",
        "scene_detection_threshold": 0.5,
        "run_transcription": true,
        "run_video_description": true,
        "enable_thumbnails": true
      }
    }
  }
  ```

  ```json v1 — Lecture Video with Silence Splitting theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "lecture_url"
      },
      "parameters": {
        "split_method": "silence",
        "silence_db_threshold": -40,
        "run_transcription": true,
        "transcription_language": "en",
        "run_ocr": true,
        "enable_thumbnails": true
      }
    }
  }
  ```

  ```json v2 — Video with Gemini Embedding 2 theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v2",
      "input_mappings": {
        "video": "video_url"
      },
      "parameters": {
        "split_method": "time",
        "time_split_interval": 10,
        "run_multimodal_embedding": true,
        "output_dimensionality": 3072,
        "enable_thumbnails": true
      }
    }
  }
  ```

  ```json v2 — Compact Embeddings (768D) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v2",
      "input_mappings": {
        "video": "video_url"
      },
      "parameters": {
        "split_method": "scene",
        "scene_detection_threshold": 0.5,
        "run_multimodal_embedding": true,
        "output_dimensionality": 768,
        "run_transcription": true,
        "run_transcription_embedding": true,
        "enable_thumbnails": true
      }
    }
  }
  ```

  ```json Image Embedding (v1 or v2) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "image": "image_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.product_id" }
      ],
      "parameters": {
        "run_multimodal_embedding": true,
        "enable_thumbnails": true
      }
    }
  }
  ```

  ```json Text Embedding (Cross-Modal Search) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "product_description"
      },
      "parameters": {
        "run_multimodal_embedding": true
      }
    }
  }
  ```

  ```json v1 — Full Extraction with All Features theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "video_url"
      },
      "parameters": {
        "split_method": "scene",
        "scene_detection_threshold": 0.5,
        "run_transcription": true,
        "run_transcription_embedding": true,
        "run_multimodal_embedding": true,
        "run_video_description": true,
        "run_ocr": true,
        "enable_thumbnails": true,
        "use_cdn": true,
        "description_prompt": "Describe what is happening in this video segment, including any visible products, people, and actions.",
        "generation_config": {
          "temperature": 0.7,
          "max_output_tokens": 1024
        }
      }
    }
  }
  ```

  ```json Fashion/E-commerce with Structured Extraction theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "video": "fashion_video_url"
      },
      "parameters": {
        "split_method": "scene",
        "run_multimodal_embedding": true,
        "run_video_description": true,
        "response_shape": {
          "type": "object",
          "properties": {
            "products": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": { "type": "string" },
                  "category": { "type": "string" },
                  "color": { "type": "string" },
                  "visibility_percentage": { "type": "integer" }
                }
              }
            },
            "aesthetic": { "type": "string" },
            "setting": { "type": "string" }
          }
        }
      }
    }
  }
  ```
</CodeGroup>
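
If you generate these configurations from code, a small builder keeps the structure consistent. The function below is a hypothetical convenience, not part of any Mixpeek SDK; it only reproduces the JSON shape used in the examples above and restricts inputs to the fields listed under Input Schema.

```python
import json

INPUT_FIELDS = {"video", "image", "text", "gif"}


def multimodal_extractor_config(version, input_field, source_field,
                                parameters=None, passthrough=None) -> dict:
    """Build a feature_extractor block shaped like the examples above."""
    if version not in {"v1", "v2"}:
        raise ValueError("version must be 'v1' or 'v2'")
    if input_field not in INPUT_FIELDS:
        raise ValueError(f"input_field must be one of {sorted(INPUT_FIELDS)}")
    config = {
        "feature_extractor_name": "multimodal_extractor",
        "version": version,
        "input_mappings": {input_field: source_field},
    }
    if passthrough:
        config["field_passthrough"] = [{"source_path": path} for path in passthrough]
    config["parameters"] = dict(parameters or {})
    return {"feature_extractor": config}


config = multimodal_extractor_config(
    "v2", "video", "video_url",
    parameters={"split_method": "time", "time_split_interval": 10,
                "output_dimensionality": 768},
)
print(json.dumps(config, indent=2))
```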

## Performance & Costs

### Processing Speed

| Content Type | Speed                                                   |
| ------------ | ------------------------------------------------------- |
| Video        | 0.5-2x the video duration (depends on features enabled) |
| Image        | \< 1 second                                             |
| Text         | \< 100ms                                                |

**Example**: 10-minute video → 5-20 minutes processing time

| Feature          | Latency per Segment         |
| ---------------- | --------------------------- |
| Transcription    | \~200ms per second of audio |
| Visual embedding | \~50ms                      |
| OCR              | \~300ms                     |
| Description      | \~2s                        |

### Cost Estimates (per minute of video)

| Configuration                            | Cost   |
| ---------------------------------------- | ------ |
| **Minimal** (transcription + embeddings) | \$0.01 |
| **Standard** (+ OCR)                     | \$0.05 |
| **Full** (+ descriptions)                | \$0.15 |

**Images**: \$0.001 per image

**Text**: \$0.0001 per query
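
For rough budgeting, the figures in this section can be turned into a quick estimate. The helper below is a hypothetical sketch that simply multiplies the per-minute rates and the 0.5-2x processing range quoted above; real bills and run times depend on your plan and workload.

```python
COST_PER_MINUTE_USD = {"minimal": 0.01, "standard": 0.05, "full": 0.15}


def estimate_video(duration_minutes: float, configuration: str = "minimal") -> dict:
    """Estimate cost and processing time for one video from the tables above."""
    rate = COST_PER_MINUTE_USD[configuration]
    return {
        "cost_usd": round(duration_minutes * rate, 4),
        # Processing takes roughly 0.5x to 2x the video duration.
        "processing_minutes": (0.5 * duration_minutes, 2.0 * duration_minutes),
    }


print(estimate_video(10, "full"))
```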

## Vector Indexes

<Tabs>
  <Tab title="v1">
    ### Multimodal Embedding

    | Property             | Value                                                           |
    | -------------------- | --------------------------------------------------------------- |
    | **Feature URI**      | `mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding` |
    | **Index name**       | `multimodal_extractor_v1_multimodal_embedding`                  |
    | **Dimensions**       | 1408                                                            |
    | **Type**             | Dense                                                           |
    | **Distance metric**  | Cosine                                                          |
    | **Inference model**  | `vertex_multimodal_embedding`                                   |
    | **Supported inputs** | video, text, image                                              |

    ### Transcription Embedding

    | Property             | Value                                                                 |
    | -------------------- | --------------------------------------------------------------------- |
    | **Feature URI**      | `mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1` |
    | **Index name**       | `multimodal_extractor_v1_transcription_embedding`                     |
    | **Dimensions**       | 1024                                                                  |
    | **Type**             | Dense                                                                 |
    | **Distance metric**  | Cosine                                                                |
    | **Inference model**  | `multilingual_e5_large_instruct_v1`                                   |
    | **Supported inputs** | text, string                                                          |

    <Note>In retrievers, reference these by their **Feature URI** — the output name is the model name, **not** the `multimodal_extractor_v1_*` index name.</Note>
  </Tab>

  <Tab title="v2">
    ### Multimodal Embedding

    | Property             | Value                                          |
    | -------------------- | ---------------------------------------------- |
    | **Index name**       | `multimodal_extractor_v2_multimodal_embedding` |
    | **Dimensions**       | 3072 (configurable: 1536, 768)                 |
    | **Type**             | Dense                                          |
    | **Distance metric**  | Cosine                                         |
    | **Inference model**  | `google/gemini-embedding-2`                    |
    | **Supported inputs** | video, text, image, audio                      |

    ### Transcription Embedding

    | Property             | Value                                             |
    | -------------------- | ------------------------------------------------- |
    | **Index name**       | `multimodal_extractor_v2_transcription_embedding` |
    | **Dimensions**       | 1024                                              |
    | **Type**             | Dense                                             |
    | **Distance metric**  | Cosine                                            |
    | **Inference model**  | `intfloat/multilingual-e5-large-instruct`         |
    | **Supported inputs** | text, string                                      |
  </Tab>
</Tabs>
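
Both indexes use cosine distance. The sketch below shows how cosine similarity is computed and how a Matryoshka-style embedding can be shortened by keeping its leading dimensions and renormalizing. Doing the truncation on the client is an assumption made for illustration; in practice you would normally set `output_dimensionality` on the v2 extractor instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def truncate_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale the result to unit length."""
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


short = truncate_embedding([3.0, 4.0, 12.0], 2)
print(short, cosine_similarity(short, [3.0, 4.0]))
```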

## Choosing v1 vs v2

| Consideration           | v1                     | v2                                         |
| ----------------------- | ---------------------- | ------------------------------------------ |
| **Embedding quality**   | Good                   | Better (natively multimodal)               |
| **Dimensions**          | 1408 (fixed)           | 3072, 1536, or 768 (configurable)          |
| **Storage per vector**  | 5.5 KB                 | 12 KB (3072D), 6 KB (1536D), 3 KB (768D)   |
| **Audio input support** | Via transcription only | Native audio embedding                     |
| **Matryoshka support**  | No                     | Yes — reduce dimensions without reindexing |
| **Stability**           | Production-proven      | Newer                                      |

**Recommendation:** Use **v2** for new projects. Use **v1** if you have existing collections and don't need higher dimensions or native audio embedding.
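
The storage figures in the table follow from simple arithmetic if each dimension is stored as a 4-byte float. The 32-bit assumption is inferred from the numbers rather than stated by the source, and index overhead is not included.

```python
def vector_storage_kb(dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size of one dense vector in KB, assuming 4-byte (float32) values."""
    return dimensions * bytes_per_value / 1024


for dims in (1408, 3072, 1536, 768):
    print(dims, vector_storage_kb(dims))   # 5.5, 12.0, 6.0, 3.0 KB

# Raw embedding storage for one million v2 segments at 768 dimensions, in GB.
print(1_000_000 * vector_storage_kb(768) / (1024 * 1024))
```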

## Limitations

* **Video duration**: Recommended \< 2 hours for optimal processing
* **Resolution**: 8K+ videos should be downsampled
* **Real-time**: Not suitable for live streaming
* **Short videos**: \< 5 second videos have disproportionate overhead
* **Audio quality**: Transcription accuracy depends on audio clarity
* **OCR/Description**: Add significant processing time; enable only when needed

## Collection-to-Collection Pipelines

The `video_segment_url` output enables decomposition chains:

1. **Initial collection**: Time-based segments (5s intervals)
2. **Downstream collection**: Scene detection within each segment
3. **Final collection**: Enhanced processing with different models

```json theme={null}
{
  "input_mappings": {
    "video": "video_segment_url"
  }
}
```
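
As a sketch of the first two stages of such a chain, the snippet below defines one extractor configuration per collection. The source field name `video_url`, the choice of `v1`, and the variable names are illustrative assumptions; how the downstream collection is pointed at the initial collection is configured elsewhere and is not shown here.

```python
import json

# Stage 1: coarse time-based segments read from the original video field.
initial_extractor = {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {"video": "video_url"},
    "parameters": {"split_method": "time", "time_split_interval": 5},
}

# Stage 2: scene detection inside each stage-1 segment, read from video_segment_url.
downstream_extractor = {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {"video": "video_segment_url"},
    "parameters": {"split_method": "scene", "scene_detection_threshold": 0.5},
}

print(json.dumps({"initial": initial_extractor, "downstream": downstream_extractor}, indent=2))
```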

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Gemini Multifile Extractor](/processing/extractors/gemini-multifile) — Embed multiple files per object into one vector
* [Passthrough Extractor](/processing/extractors/passthrough)
* [Text Extractor](/processing/extractors/text)
