
# Universal Extractor

> All-in-one multimodal extractor — image, video, audio, and documents — producing 3072-d Gemini embeddings plus text descriptions, OCR, and transcription

<Card title="View on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors/blob/main/extractors/universal_extractor/README.md" horizontal>
  Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
</Card>

The universal extractor is an all-in-one feature extractor that handles **image, video, audio, and documents** through Google's Gemini APIs. For each output document (one per object, or one per segment or page for chunked content) it produces a single Gemini Embedding 2 vector, 3072-dimensional by default, alongside rich text extraction: AI-generated descriptions, OCR for images and documents, and transcription for audio and video. It runs on Celery rather than Ray, so there is no cluster-startup latency, which makes it a fast path for mixed-modality corpora.

<Note>
  View extractor details at [api.mixpeek.com/v1/collections/features/extractors/universal\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/universal_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
</Note>
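
A minimal sketch of building that request with Python's standard library. The bearer-token `Authorization` header and the `api_key` argument are assumptions that this page does not confirm, so check your account's authentication scheme before relying on them.

```python
import urllib.request

API_BASE = "https://api.mixpeek.com"


def build_extractor_request(feature_extractor_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for an extractor's details."""
    url = f"{API_BASE}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token auth. This page does not specify the header.
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


req = build_extractor_request("universal_extractor_v1", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be inspected without network access.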

## Pipeline Steps

1. **Resolve input** — apply `input_mappings` to get the file URL/path from the source object (`content` field).
2. **Detect modality** — classify the object as image, video, audio, or document.
3. **Segment (if needed)** — video is processed in up to `max_video_segments` 30s segments; documents up to `max_document_pages` pages.
4. **Gemini embedding** — generate a 3072-d Gemini Embedding 2 vector (`output_dimensionality` configurable 256–3072).
5. **Text extraction** (if `extract_text`) — OCR for images/documents, transcription for audio/video.
6. **Description** (if `generate_description`) — Gemini vision/understanding produces a natural-language description.
7. **Output** — one document per object (or per segment/page for chunked content).
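
The segmentation step can be pictured with a small planning function. This is only a sketch: it assumes contiguous, non-overlapping 30-second windows starting at zero and 0-based segment indices, neither of which this page states, and `plan_video_segments` is a hypothetical helper rather than part of the extractor.

```python
import math

SEGMENT_SECONDS = 30.0  # segment length stated in the pipeline steps


def plan_video_segments(duration_s: float, max_video_segments: int = 10):
    """Return ([(segment_index, start_time_s, end_time_s), ...], truncated).

    Assumption: contiguous windows from t=0 with 0-based indices.
    """
    if duration_s <= 0:
        return [], False
    needed = math.ceil(duration_s / SEGMENT_SECONDS)
    kept = min(needed, max_video_segments)
    segments = [
        (i, i * SEGMENT_SECONDS, min((i + 1) * SEGMENT_SECONDS, duration_s))
        for i in range(kept)
    ]
    # truncated is True when the cap dropped at least one segment
    return segments, needed > kept


segments, truncated = plan_video_segments(95.0, max_video_segments=3)
print(segments, truncated)
```

Under these assumptions a 95-second video needs four windows, so a cap of 3 leaves the last five seconds unprocessed.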

## When to Use

| Use Case                   | Description                                                                                      |
| -------------------------- | ------------------------------------------------------------------------------------------------ |
| **Mixed-modality corpora** | A single bucket containing images, video, audio, and PDFs you want searchable with one extractor |
| **Fast onboarding**        | Celery fast-path avoids Ray cluster startup, so small batches return quickly                     |
| **Cross-modal search**     | One shared 3072-d embedding space across all four modalities                                     |
| **Rich metadata**          | Need descriptions, OCR text, and transcription alongside the vector                              |

## When NOT to Use

| Scenario                                   | Recommended Alternative                                           |
| ------------------------------------------ | ----------------------------------------------------------------- |
| High-volume single-modality at lowest cost | Modality-specific extractor (`text_extractor`, `image_extractor`) |
| Audio fingerprinting / sound-mark matching | `audio_fingerprint_extractor`                                     |
| Spatial/layout document analysis           | `document_graph_extractor`                                        |
| Self-hosted, no external API calls         | `text_extractor` / `image_extractor`                              |

## Input Schema

| Field     | Type   | Required | Description                                                          |
| --------- | ------ | -------- | -------------------------------------------------------------------- |
| `content` | string | **Yes**  | URL or path to the file to process. Populated from `input_mappings`. |

```json theme={null}
{
  "content": "s3://my-bucket/assets/report.pdf"
}
```
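
How `input_mappings` fills `content` can be sketched as a plain dictionary lookup. The `resolve_inputs` helper and the flat, top-level field lookup are assumptions for illustration; the actual resolver may behave differently (for example, by supporting nested paths).

```python
def resolve_inputs(source_object: dict, input_mappings: dict) -> dict:
    """Map each extractor input name to the value of its mapped source field."""
    resolved = {}
    for input_name, source_field in input_mappings.items():
        if source_field not in source_object:
            raise KeyError(
                f"source object has no field {source_field!r} for input {input_name!r}"
            )
        resolved[input_name] = source_object[source_field]
    return resolved


source_object = {"file_url": "s3://my-bucket/assets/report.pdf", "title": "Q3 report"}
extractor_input = resolve_inputs(source_object, {"content": "file_url"})
print(extractor_input)
```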

Supported input types: **IMAGE, VIDEO, AUDIO, PDF, TEXT, STRING**.

## Output Schema

| Field                              | Type            | Description                                                 |
| ---------------------------------- | --------------- | ----------------------------------------------------------- |
| `universal_extractor_v1_embedding` | float\[3072]    | Gemini Embedding 2 vector for the content                   |
| `modality`                         | string          | Detected modality: `image`, `video`, `audio`, or `document` |
| `text`                             | string \| null  | Extracted text (OCR, transcription, or document text)       |
| `description`                      | string \| null  | AI-generated description of the content                     |
| `segment_index`                    | integer \| null | Segment index (chunked video/audio/documents)               |
| `segment_total`                    | integer \| null | Total segments for this source object                       |
| `page_number`                      | integer \| null | Page number (documents only)                                |
| `start_time_s` / `end_time_s`      | float \| null   | Segment start/end time in seconds (video/audio)             |
| `duration_s`                       | float \| null   | Total file duration in seconds (video/audio)                |

```json theme={null}
{
  "universal_extractor_v1_embedding": [0.012, -0.034, 0.008, ...],
  "modality": "document",
  "text": "Quarterly revenue grew 12% year over year...",
  "description": "A financial report page with a revenue bar chart",
  "page_number": 1,
  "segment_total": 12
}
```
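
Because most location fields are nullable, consumers usually need to branch on which ones are present. The sketch below assumes that documents carry `page_number`, chunked video/audio carries `start_time_s` and `end_time_s`, and everything else describes the whole object; `describe_location` is a hypothetical helper.

```python
def describe_location(doc: dict) -> str:
    """Summarize where in the source object an output document came from."""
    modality = doc.get("modality", "unknown")
    page = doc.get("page_number")
    start, end = doc.get("start_time_s"), doc.get("end_time_s")
    if page is not None:
        return f"{modality} page {page}"
    if start is not None and end is not None:
        return f"{modality} {start:.1f}s-{end:.1f}s"
    # Assumption: no page or time range means the document covers the whole object.
    return f"{modality} (whole object)"


page_doc = {"modality": "document", "page_number": 1, "segment_total": 12}
clip_doc = {"modality": "video", "start_time_s": 30.0, "end_time_s": 60.0}
print(describe_location(page_doc))
print(describe_location(clip_doc))
```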

## Parameters

| Parameter               | Type    | Default                | Range    | Description                                                                                                            |
| ----------------------- | ------- | ---------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------- |
| `output_dimensionality` | integer | `3072`                 | 256–3072 | Output embedding dimensions (Gemini Embedding 2 supports 256–3072)                                                     |
| `task_type`             | string  | `"RETRIEVAL_DOCUMENT"` | —        | Embedding intent for Gemini Embedding 2. Common values: `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`, `SEMANTIC_SIMILARITY` |
| `generate_description`  | boolean | `true`                 | —        | Generate a text description via Gemini vision/understanding                                                            |
| `extract_text`          | boolean | `true`                 | —        | Extract text (OCR for images/docs, transcription for audio/video)                                                      |
| `max_video_segments`    | integer | `10`                   | 1–50     | Maximum number of 30s segments to process for video files                                                              |
| `max_document_pages`    | integer | `50`                   | 1–200    | Maximum number of pages to process for document files                                                                  |
| `max_file_download_mb`  | integer | `500`                  | 1–1024   | Maximum file download size (MB) for Celery fast-path processing                                                        |
| `max_concurrency`       | integer | `4`                    | 1–32     | Maximum per-task object concurrency for Celery fast-path processing                                                    |

<Warning>
  Vector dimensions are fixed when a namespace is created. Changing `output_dimensionality` on an existing namespace requires a migration, because the vector index dimensionality cannot be changed afterward.
</Warning>
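
A client-side check can catch out-of-range values and dimension mismatches before a request is sent. The ranges below are copied from the parameters table; `validate_parameters` and its `namespace_dimensions` argument are hypothetical, and the service's own validation remains authoritative.

```python
from typing import Optional

# Integer ranges copied from the Parameters table.
PARAM_RANGES = {
    "output_dimensionality": (256, 3072),
    "max_video_segments": (1, 50),
    "max_document_pages": (1, 200),
    "max_file_download_mb": (1, 1024),
    "max_concurrency": (1, 32),
}
DEFAULT_DIMENSIONS = 3072


def validate_parameters(params: dict, namespace_dimensions: Optional[int] = None) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, (low, high) in PARAM_RANGES.items():
        if name not in params:
            continue
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name} must be an integer")
        elif not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    dims = params.get("output_dimensionality", DEFAULT_DIMENSIONS)
    if namespace_dimensions is not None and dims != namespace_dimensions:
        problems.append(
            f"output_dimensionality={dims} does not match the namespace index ({namespace_dimensions})"
        )
    return problems


print(validate_parameters({"output_dimensionality": 768}, namespace_dimensions=3072))
```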

## Configuration Examples

<CodeGroup>
  ```json All-in-One Defaults theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "universal_extractor",
      "version": "v1",
      "input_mappings": {
        "content": "file_url"
      },
      "parameters": {}
    }
  }
  ```

  ```json Embeddings Only (No Text/Description) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "universal_extractor",
      "version": "v1",
      "input_mappings": {
        "content": "file_url"
      },
      "parameters": {
        "generate_description": false,
        "extract_text": false
      }
    }
  }
  ```

  ```json Compact Embeddings for Clustering theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "universal_extractor",
      "version": "v1",
      "input_mappings": {
        "content": "file_url"
      },
      "parameters": {
        "output_dimensionality": 768,
        "task_type": "CLUSTERING"
      }
    }
  }
  ```

  ```json Long-Form Video & Documents theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "universal_extractor",
      "version": "v1",
      "input_mappings": {
        "content": "file_url"
      },
      "parameters": {
        "max_video_segments": 30,
        "max_document_pages": 200
      }
    }
  }
  ```
</CodeGroup>

## Performance & Costs

| Metric           | Value                                                                                          |
| ---------------- | ---------------------------------------------------------------------------------------------- |
| **Compute**      | Celery fast-path (no Ray cluster startup)                                                      |
| **Cost**         | 15 credits per object (covers all Gemini API calls: embedding, description, OCR/transcription) |
| **External API** | Google Gemini (embedding + vision/understanding)                                               |
| **Max download** | 500 MB per object by default (configurable from 1 to 1024 MB via `max_file_download_mb`)      |
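
As a rough budgeting aid, the per-object price in the table translates into a one-line estimate. The sketch assumes the 15 credits apply once per source object regardless of how many segments or pages it yields, which this page does not spell out.

```python
CREDITS_PER_OBJECT = 15  # from the Cost row of the table


def estimate_credits(object_count: int) -> int:
    """Estimate total credits for a batch of source objects."""
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    # Assumption: billing is per source object, not per segment or page.
    return object_count * CREDITS_PER_OBJECT


print(estimate_credits(2000))
```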

## Vector Index

| Property            | Value                              |
| ------------------- | ---------------------------------- |
| **Index name**      | `universal_extractor_v1_embedding` |
| **Dimensions**      | 3072 (configurable 256–3072)       |
| **Type**            | Dense                              |
| **Distance metric** | Cosine                             |
| **Inference model** | `google/gemini-embedding-2`        |
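
For reference, the cosine metric used by the index can be computed as below. This is a plain-Python illustration of the standard formula, not the index's internal implementation, and the short vectors stand in for real 3072-d embeddings.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


query = [0.1, 0.3, 0.5]
candidate = [0.2, 0.6, 1.0]  # same direction as query, different magnitude
print(round(cosine_similarity(query, candidate), 6))
```

Cosine similarity ignores vector magnitude, so two embeddings pointing in the same direction score 1.0 even when their lengths differ.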

## Limitations

* **External dependency**: Requires Google Gemini API availability; subject to its rate limits.
* **Per-object cost**: Higher per-object cost than self-hosted single-modality extractors.
* **Segment/page caps**: Video beyond `max_video_segments` and documents beyond `max_document_pages` are truncated.
* **Download ceiling**: Files larger than `max_file_download_mb` are skipped on the Celery fast-path.
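
The download ceiling and page cap can be checked up front before submitting a batch. The sketch below is a hypothetical `preflight` helper that uses the documented defaults; it assumes you already know each file's size and page count, and it does not say which pages survive truncation because this page does not specify that.

```python
def preflight(size_mb: float, page_count=None,
              max_file_download_mb: int = 500, max_document_pages: int = 50) -> list:
    """Return notes describing how the documented limits would affect a file."""
    notes = []
    if size_mb > max_file_download_mb:
        notes.append(f"skipped: {size_mb} MB exceeds max_file_download_mb={max_file_download_mb}")
        return notes  # a skipped file is never segmented, so stop here
    if page_count is not None and page_count > max_document_pages:
        dropped = page_count - max_document_pages
        notes.append(f"truncated: {dropped} of {page_count} pages will not be processed")
    return notes


print(preflight(size_mb=12.5, page_count=80))
print(preflight(size_mb=750))
```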

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Multimodal Extractor](/processing/extractors/multimodal)
* [Gemini Multifile Extractor](/processing/extractors/gemini-multifile)
* [Audio Fingerprint Extractor](/processing/extractors/audio-fingerprint)
