Universal Extractor

Built-in extractor names are a deprecated alias — collections are now created by picking features. This pipeline is selected with features: ["video_search"]. Existing feature_extractor configs keep working; see the migration guide.


Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.

The universal extractor is an all-in-one feature extractor that handles images, video, audio, and documents through Google’s Gemini APIs. It produces one 3072-dimensional embedding (Gemini Embedding 2) per object, or per segment or page for chunked content, alongside rich text extraction: AI-generated descriptions, OCR for images and documents, and transcription for audio and video. It runs on Celery rather than Ray, so there is no cluster-startup latency, which makes it a fast path for mixed-modality corpora.

View extractor details at api.mixpeek.com/v1/collections/features/extractors/universal_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
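As a minimal sketch of the programmatic route, the snippet below builds the GET request with the Python standard library. The `https://` scheme, the bearer-token `Authorization` header, and the helper names are assumptions made for illustration; only the host and path come from the text above.

```python
import json
import urllib.request

# Host taken from the URL quoted above; the https scheme is an assumption.
BASE_URL = "https://api.mixpeek.com"


def extractor_details_request(feature_extractor_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token auth. Check the API reference for the real scheme.
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_extractor_details(feature_extractor_id: str, api_key: str) -> dict:
    """Send the request and decode the JSON response body."""
    request = extractor_details_request(feature_extractor_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_extractor_details("universal_extractor_v1", api_key)` would then return the decoded response; its exact shape is not described on this page.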

Pipeline Steps

1. Resolve input: apply input_mappings to get the file URL/path from the source object (content field).
2. Detect modality: classify the object as image, video, audio, or document.
3. Segment (if needed): video is split into 30-second segments, of which at most max_video_segments are processed; documents are processed page by page, up to max_document_pages pages.
4. Gemini embedding: generate a 3072-d Gemini Embedding 2 vector (output_dimensionality configurable 256–3072).
5. Text extraction (if extract_text): OCR for images/documents, transcription for audio/video.
6. Description (if generate_description): Gemini vision/understanding produces a natural-language description.
7. Output: one document per object (or per segment/page for chunked content).
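The segmentation step can be pictured with a short sketch. It assumes fixed 30-second windows starting at zero, a shorter final window at the end of the file, and that anything past the cap is simply dropped; the function name is hypothetical and the extractor's real segmentation logic may differ.

```python
from typing import List, Tuple


def plan_video_segments(
    duration_s: float,
    max_video_segments: int = 10,
    segment_len_s: float = 30.0,
) -> List[Tuple[float, float]]:
    """Return (start_time_s, end_time_s) windows, capped at max_video_segments."""
    segments = []
    start = 0.0
    while start < duration_s and len(segments) < max_video_segments:
        # The last window is cut at the end of the file.
        end = min(start + segment_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


print(plan_video_segments(75.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

With the default cap of 10, a 400-second video yields ten windows covering the first 300 seconds, and the remaining 100 seconds are not processed.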

When to Use

Use Case	Description
Mixed-modality corpora	A single bucket containing images, video, audio, and PDFs you want searchable with one extractor
Fast onboarding	Celery fast-path avoids Ray cluster startup, so small batches return quickly
Cross-modal search	One shared 3072-d embedding space across all four modalities
Rich metadata	Need descriptions, OCR text, and transcription alongside the vector

When NOT to Use

Scenario	Recommended Alternative
High-volume single-modality at lowest cost	Modality-specific extractor (`text_extractor`, `image_extractor`)
Audio fingerprinting / sound-mark matching	`audio_fingerprint_extractor`
Spatial/layout document analysis	`document_graph_extractor`
Self-hosted, no external API calls	`text_extractor` / `image_extractor`

Input Schema

Field	Type	Required	Description
`content`	string	Yes	URL or path to the file to process. Populated from `input_mappings`.

{
  "content": "s3://my-bucket/assets/report.pdf"
}

Supported input types: IMAGE, VIDEO, AUDIO, PDF, TEXT, STRING.
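To estimate ahead of time which modality a `content` value is likely to be detected as, a client-side guess from the file extension can help. The sketch below is only an illustration: the extension map and the `guess_modality` name are assumptions, and the extractor's own detection is not documented here and may inspect the file contents instead.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Illustrative extension map; NOT the extractor's real detection table.
EXTENSION_TO_MODALITY = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp4": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio",
    ".pdf": "document",
}


def guess_modality(content: str) -> Optional[str]:
    """Guess a modality from the extension of a URL or path; None if unknown."""
    path = urlparse(content).path  # drops scheme, bucket/host, and query string
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_MODALITY.get(suffix)


print(guess_modality("s3://my-bucket/assets/report.pdf"))  # document
```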

Output Schema

Field	Type	Description
`universal_extractor_v1_embedding`	float[3072]	Gemini Embedding 2 vector for the content
`modality`	string	Detected modality: `image`, `video`, `audio`, or `document`
`text`	string | null	Extracted text (OCR, transcription, or document text)
`description`	string | null	AI-generated description of the content
`segment_index`	integer | null	Segment index (chunked video/audio/documents)
`segment_total`	integer | null	Total segments for this source object
`page_number`	integer | null	Page number (documents only)
`start_time_s` / `end_time_s`	float | null	Segment start/end time in seconds (video/audio)
`duration_s`	float | null	Total file duration in seconds (video/audio)

{
  "universal_extractor_v1_embedding": [0.012, -0.034, 0.008, ...],
  "modality": "document",
  "text": "Quarterly revenue grew 12% year over year...",
  "description": "A financial report page with a revenue bar chart",
  "page_number": 1,
  "segment_total": 12
}
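Because most output fields are nullable, client code usually has to branch on which ones are present. The following sketch models the documented fields as a dataclass and derives a readable position string; the class and helper names are hypothetical, and unknown keys are ignored on the assumption that a response may carry more fields than this page lists.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ExtractedDocument:
    """Typed view of one output document; field names follow the output schema."""
    modality: str
    universal_extractor_v1_embedding: List[float] = field(default_factory=list)
    text: Optional[str] = None
    description: Optional[str] = None
    segment_index: Optional[int] = None
    segment_total: Optional[int] = None
    page_number: Optional[int] = None
    start_time_s: Optional[float] = None
    end_time_s: Optional[float] = None
    duration_s: Optional[float] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "ExtractedDocument":
        # Ignore keys this sketch does not model instead of failing on them.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


def describe_position(doc: ExtractedDocument) -> str:
    """Say where this document sits inside its source object."""
    if doc.page_number is not None:
        return f"page {doc.page_number}"
    if doc.start_time_s is not None and doc.end_time_s is not None:
        return f"{doc.start_time_s:.0f}s-{doc.end_time_s:.0f}s"
    return "whole object"


doc = ExtractedDocument.from_dict({"modality": "document", "page_number": 1, "segment_total": 12})
print(describe_position(doc))  # page 1
```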

Parameters

Parameter	Type	Default	Range	Description
`output_dimensionality`	integer	`3072`	256–3072	Output embedding dimensions (Gemini Embedding 2 supports 256–3072)
`task_type`	string	`"RETRIEVAL_DOCUMENT"`	—	Embedding intent for Gemini Embedding 2. Common values: `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`, `SEMANTIC_SIMILARITY`
`generate_description`	boolean	`true`	—	Generate a text description via Gemini vision/understanding
`extract_text`	boolean	`true`	—	Extract text (OCR for images/docs, transcription for audio/video)
`max_video_segments`	integer	`10`	1–50	Maximum number of 30s segments to process for video files
`max_document_pages`	integer	`50`	1–200	Maximum number of pages to process for document files
`max_file_download_mb`	integer	`500`	1–1024	Maximum file download size (MB) for Celery fast-path processing
`max_concurrency`	integer	`4`	1–32	Maximum per-task object concurrency for Celery fast-path processing

Dimensions are locked at namespace creation. Switching output_dimensionality on an existing namespace requires a migration since the vector index dimensionality is fixed.
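The documented ranges can be checked client-side before a config is submitted. This is a sketch only: the `validate_parameters` helper is hypothetical, the ranges are copied from the parameter table, and the API's own validation and error messages may differ. `task_type` is checked for type only, since the table lists common values rather than an exhaustive set.

```python
from typing import List

# Inclusive ranges copied from the parameter table.
PARAMETER_RANGES = {
    "output_dimensionality": (256, 3072),
    "max_video_segments": (1, 50),
    "max_document_pages": (1, 200),
    "max_file_download_mb": (1, 1024),
    "max_concurrency": (1, 32),
}
BOOLEAN_PARAMETERS = {"generate_description", "extract_text"}


def validate_parameters(params: dict) -> List[str]:
    """Return a list of problems; an empty list means the values look valid."""
    errors = []
    for name, value in params.items():
        if name in PARAMETER_RANGES:
            low, high = PARAMETER_RANGES[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                errors.append(f"{name}: expected an integer")
            elif not low <= value <= high:
                errors.append(f"{name}: {value} outside {low}-{high}")
        elif name in BOOLEAN_PARAMETERS:
            if not isinstance(value, bool):
                errors.append(f"{name}: expected a boolean")
        elif name == "task_type":
            if not isinstance(value, str):
                errors.append("task_type: expected a string")
        else:
            errors.append(f"{name}: unknown parameter")
    return errors


print(validate_parameters({"max_video_segments": 51}))
# ['max_video_segments: 51 outside 1-50']
```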

Configuration Examples

Defaults (all parameters left at their default values):

{
  "feature_extractor": {
    "feature_extractor_name": "universal_extractor",
    "version": "v1",
    "input_mappings": {
      "content": "file_url"
    },
    "parameters": {}
  }
}

Embedding only (description and text extraction disabled):

{
  "feature_extractor": {
    "feature_extractor_name": "universal_extractor",
    "version": "v1",
    "input_mappings": {
      "content": "file_url"
    },
    "parameters": {
      "generate_description": false,
      "extract_text": false
    }
  }
}

Reduced dimensionality with a non-default task type:

{
  "feature_extractor": {
    "feature_extractor_name": "universal_extractor",
    "version": "v1",
    "input_mappings": {
      "content": "file_url"
    },
    "parameters": {
      "output_dimensionality": 768,
      "task_type": "CLUSTERING"
    }
  }
}

Raised caps for longer videos and larger documents:

{
  "feature_extractor": {
    "feature_extractor_name": "universal_extractor",
    "version": "v1",
    "input_mappings": {
      "content": "file_url"
    },
    "parameters": {
      "max_video_segments": 30,
      "max_document_pages": 200
    }
  }
}
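The configurations above differ only in their `parameters` object, so they can be generated from one small helper. The sketch below assumes the source field is called `file_url`, as in the examples; `build_extractor_config` is a hypothetical name, not part of any SDK.

```python
import json


def build_extractor_config(source_field: str, **parameters) -> dict:
    """Assemble a feature_extractor config in the shape shown above."""
    return {
        "feature_extractor": {
            "feature_extractor_name": "universal_extractor",
            "version": "v1",
            # Map the extractor's `content` input to a field on the source object.
            "input_mappings": {"content": source_field},
            "parameters": parameters,
        }
    }


embedding_only = build_extractor_config(
    "file_url", generate_description=False, extract_text=False
)
print(json.dumps(embedding_only, indent=2))
```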

Performance & Costs

Metric	Value
Compute	Celery fast-path (no Ray cluster startup)
Cost	See Billing & Pricing — rates come from `GET /v1/billing/pricing`. One charge covers all Gemini API calls (embedding, description, OCR/transcription)
External API	Google Gemini (embedding + vision/understanding)
Max download	500 MB per object (configurable to 1024 MB)

Vector Index

Property	Value
Index name	`universal_extractor_v1_embedding`
Dimensions	3072 (configurable 256–3072)
Type	Dense
Distance metric	Cosine
Inference model	`google/gemini-embedding-2`
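For reference, the cosine metric used by this index compares the direction of two vectors and ignores their length. A plain-Python sketch follows; in practice the vector index computes this for you, and the function here exists only to illustrate the metric.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The length check mirrors the constraint stated under Parameters: a query vector must have the same dimensionality as the vectors stored in the index.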

Limitations

External dependency: Requires Google Gemini API availability; subject to its rate limits.
Per-object cost: Higher per-object cost than self-hosted single-modality extractors.
Segment/page caps: Video beyond max_video_segments and documents beyond max_document_pages are truncated.
Download ceiling: Files larger than max_file_download_mb are skipped on the Celery fast-path.
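The download ceiling and page cap can be anticipated before upload. The sketch below assumes you already know the file size and page count; `precheck_object` is a hypothetical helper whose defaults are copied from the parameter table, and it only predicts the documented behaviour rather than querying the service.

```python
from typing import List, Optional


def precheck_object(
    size_mb: float,
    page_count: Optional[int] = None,
    *,
    max_file_download_mb: int = 500,
    max_document_pages: int = 50,
) -> List[str]:
    """Return warnings for limits this object would hit; empty if none."""
    warnings = []
    if size_mb > max_file_download_mb:
        warnings.append(
            f"skipped: {size_mb:g} MB exceeds max_file_download_mb={max_file_download_mb}"
        )
    if page_count is not None and page_count > max_document_pages:
        dropped = page_count - max_document_pages
        warnings.append(
            f"truncated: last {dropped} pages exceed max_document_pages={max_document_pages}"
        )
    return warnings


print(precheck_object(120, page_count=80))
# ['truncated: last 30 pages exceed max_document_pages=50']
```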
