Gemini Multifile Extractor

Built-in extractor names are deprecated aliases; collections are now created by picking features. This pipeline is selected with features: ["multimodal_understanding"]. Existing feature_extractor configs keep working; see the migration guide.


Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.

The Gemini Multifile Extractor uses Gemini Embedding 2 (gemini-embedding-exp-03-07, 3072-d) to embed all files of an object (images, PDFs, video, audio, and text) into one unified vector per object in a single API call. Unlike other extractors, which produce one document per file, it collapses all of an object’s blobs into a single embedding that represents the object as a whole.

View extractor details at api.mixpeek.com/v1/collections/features/extractors/gemini_multifile_extractor_v1

When to Use

Use Case	Example
Product catalogs	Embed product photo + spec sheet + description together
Medical records	Embed scan + report + clinical notes as one object
Legal documents	Embed contract + exhibits + summary together
E-commerce	Embed product image + manual + label as one searchable unit
Object-level search	Find objects similar to a combination of inputs

When NOT to Use

Scenario	Recommended Alternative
Per-file granularity needed (search within a PDF)	`document_extractor` or `multimodal_extractor`
Single-file objects	`image_extractor` or `text_extractor`
Video frame-level search	`multimodal_extractor`

How It Works

Standard extractors produce one document per blob (file). The Gemini Multifile Extractor uses array input_mappings to collect multiple blob fields from one object into a single list, then sends all of them to Gemini Embedding 2 in one API call:

Object (hero_image + spec_sheet + description)
         ↓  array input_mappings
Single Gemini API call with all 3 parts
         ↓
One 3072-d embedding → One document per object
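
The contrast between the two behaviors can be sketched in a few lines of Python. This is only an illustration of the grouping logic described above, not Mixpeek's implementation; the helper names `per_file_units` and `multifile_unit` are hypothetical, and each "unit" stands for the set of parts that would be embedded in one call and stored as one document.

```python
from typing import Any


def per_file_units(obj: dict[str, Any], fields: list[str]) -> list[list[Any]]:
    """One unit per blob: how a standard extractor would split the object."""
    return [[obj[field]] for field in fields]


def multifile_unit(obj: dict[str, Any], fields: list[str]) -> list[list[Any]]:
    """One unit holding every mapped blob: the multifile behavior."""
    return [[obj[field] for field in fields]]


# Example object with the same fields used throughout this page.
product = {
    "product_id": "SKU-001",
    "hero_image": "s3://my-bucket/products/sku-001/hero.jpg",
    "spec_sheet": "s3://my-bucket/products/sku-001/spec.pdf",
    "description": "Lightweight carbon-fiber trail running shoe with Vibram outsole",
}
mapped = ["hero_image", "spec_sheet", "description"]

print(len(per_file_units(product, mapped)))  # documents when split per file
print(len(multifile_unit(product, mapped)))  # documents when grouped per object
```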

Array `input_mappings`

The key difference from other extractors is the input_mappings value: instead of mapping one field, you map a list of fields. All listed fields are collected from each object and embedded together.

{
  "input_mappings": {
    "files": ["hero_image", "spec_sheet", "description"]
  }
}

The key ("files") is the extractor’s input name. The value is a list of blob field names from your bucket schema. Each field can be an image, PDF, video, audio, or text blob.
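
Before creating a collection, it can help to check the mapping against the bucket schema on the client side. The sketch below is a hypothetical helper, not part of any Mixpeek SDK. It assumes the schema shape used in the bucket example on this page, and it assumes the blob type names are `image`, `pdf`, `video`, `audio`, and `text` (only `image`, `pdf`, and `text` appear in the examples here).

```python
# Assumed blob type names; "video" and "audio" are guesses based on the prose.
BLOB_TYPES = {"image", "pdf", "video", "audio", "text"}


def check_input_mappings(bucket_schema: dict, input_mappings: dict) -> list[str]:
    """Return a list of problems; an empty list means the mapping looks usable."""
    problems = []
    properties = bucket_schema.get("properties", {})
    for input_name, fields in input_mappings.items():
        # The multifile behavior requires a list, not a single string.
        if not isinstance(fields, list):
            problems.append(f"{input_name!r} must map to a list of field names")
            continue
        for field in fields:
            if field not in properties:
                problems.append(f"{field!r} is not in the bucket schema")
            elif properties[field].get("type") not in BLOB_TYPES:
                problems.append(f"{field!r} is not a blob field")
    return problems


schema = {
    "properties": {
        "product_id": {"type": "string"},
        "hero_image": {"type": "image"},
        "spec_sheet": {"type": "pdf"},
        "description": {"type": "text"},
    }
}
print(check_input_mappings(schema, {"files": ["hero_image", "spec_sheet", "description"]}))
```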

Output

Field	Type	Description
`gemini_multifile_extractor_v1_embedding`	`float[]` (3072-d)	Unified embedding for all input files
`source_blob_count`	`int`	Number of blobs embedded together
`source_blob_properties`	`string[]`	Names of the blob fields that contributed
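
A quick sanity check on a returned document might look like the sketch below. It assumes the document is available as a plain dict keyed by the field names in the table, and it assumes `source_blob_count` equals the length of `source_blob_properties`, which the table suggests but does not state. The function name `check_output_document` is hypothetical.

```python
def check_output_document(doc: dict, expected_dims: int = 3072) -> None:
    """Raise ValueError if the document does not match the expected output shape."""
    embedding = doc["gemini_multifile_extractor_v1_embedding"]
    if len(embedding) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(embedding)}")
    # Assumption: one contributing field name is listed per embedded blob.
    if doc["source_blob_count"] != len(doc["source_blob_properties"]):
        raise ValueError("source_blob_count does not match source_blob_properties")


# Placeholder document built by hand for illustration; not real extractor output.
example_doc = {
    "gemini_multifile_extractor_v1_embedding": [0.0] * 3072,
    "source_blob_count": 3,
    "source_blob_properties": ["hero_image", "spec_sheet", "description"],
}
check_output_document(example_doc)
```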

Parameters

Parameter	Type	Default	Description
`output_dimensionality`	integer	`3072`	Embedding dimensions: `3072`, `768`, or `256`
`task_type`	string	`RETRIEVAL_DOCUMENT`	Embedding task hint for Gemini Embedding 2. Values: `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`, `SEMANTIC_SIMILARITY`, `CLASSIFICATION`, `CLUSTERING`
`input_key`	string	`files`	Must match the key in `input_mappings`

At query time, Mixpeek automatically uses RETRIEVAL_QUERY — you only need to set task_type at index time. The default RETRIEVAL_DOCUMENT is correct for most use cases. See Text Extractor embedding task docs for details on each value.
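
The constraints in the parameter table can be checked client-side before sending a config. The helper below is a hypothetical sketch, not a Mixpeek SDK function; the allowed values are copied from the table above.

```python
ALLOWED_DIMENSIONS = (3072, 768, 256)
ALLOWED_TASK_TYPES = (
    "RETRIEVAL_DOCUMENT",
    "RETRIEVAL_QUERY",
    "SEMANTIC_SIMILARITY",
    "CLASSIFICATION",
    "CLUSTERING",
)


def build_params(
    input_mappings: dict,
    output_dimensionality: int = 3072,
    task_type: str = "RETRIEVAL_DOCUMENT",
    input_key: str = "files",
) -> dict:
    """Validate the extractor parameters and return them as a params dict."""
    if output_dimensionality not in ALLOWED_DIMENSIONS:
        raise ValueError(f"output_dimensionality must be one of {ALLOWED_DIMENSIONS}")
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"task_type must be one of {ALLOWED_TASK_TYPES}")
    # input_key must match the key used in input_mappings.
    if input_key not in input_mappings:
        raise ValueError(f"input_key {input_key!r} is not a key in input_mappings")
    return {
        "output_dimensionality": output_dimensionality,
        "task_type": task_type,
        "input_key": input_key,
    }


mappings = {"files": ["hero_image", "spec_sheet", "description"]}
print(build_params(mappings, output_dimensionality=768))
```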

Complete Collection Setup

Define your bucket schema with multiple blob fields

cURL

curl -X POST "https://api.mixpeek.com/v1/buckets" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "product-catalog",
    "bucket_schema": {
      "properties": {
        "product_id":   {"type": "string"},
        "hero_image":   {"type": "image"},
        "spec_sheet":   {"type": "pdf"},
        "description":  {"type": "text"}
      }
    }
  }'
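
For readers who prefer Python, the same request can be prepared with the standard library. This is a minimal sketch that mirrors the cURL call above: the URL, headers, and payload come from that example, while the helper name `build_post` is hypothetical. The request is only constructed here, not sent.

```python
import json
import os
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"


def build_post(path: str, payload: dict, api_key: str, namespace_id: str) -> urllib.request.Request:
    """Build a JSON POST request with the headers used in the cURL examples."""
    return urllib.request.Request(
        url=f"{API_BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )


bucket_request = build_post(
    "/buckets",
    {
        "bucket_name": "product-catalog",
        "bucket_schema": {
            "properties": {
                "product_id": {"type": "string"},
                "hero_image": {"type": "image"},
                "spec_sheet": {"type": "pdf"},
                "description": {"type": "text"},
            }
        },
    },
    api_key=os.environ.get("API_KEY", ""),
    namespace_id=os.environ.get("NAMESPACE_ID", ""),
)
# To send it: urllib.request.urlopen(bucket_request)
```

The same helper can be reused for the other POST endpoints on this page by changing the path and payload.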

Create a collection with array input_mappings

cURL

curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-embeddings",
    "source": { "type": "bucket", "bucket_ids": ["bkt_abc123"] },
    "feature_extractor": {
      "feature_extractor_name": "gemini_multifile_extractor",
      "version": "v1",
      "input_mappings": {
        "files": ["hero_image", "spec_sheet", "description"]
      }
    }
  }'

The value of "files" is a list of field names, not a single string. This is what triggers the multi-file embedding behavior. All three fields are embedded together into one vector.

Upload objects to the bucket

Each object must have all the mapped fields populated:

cURL

curl -X POST "https://api.mixpeek.com/v1/buckets/$BUCKET_ID/objects" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "blobs": [
      { "property": "hero_image", "type": "image", "data": "s3://my-bucket/products/sku-001/hero.jpg" },
      { "property": "spec_sheet", "type": "pdf", "data": "s3://my-bucket/products/sku-001/spec.pdf" },
      { "property": "description", "type": "text", "data": "Lightweight carbon-fiber trail running shoe with Vibram outsole" }
    ],
    "metadata": { "product_id": "SKU-001" }
  }'
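
Because every mapped field must be populated, a client-side check before upload can catch incomplete objects early. The sketch below assumes the payload shape shown in the cURL example (a `blobs` list whose entries carry `property` and `data`); the helper name `missing_mapped_fields` is hypothetical.

```python
def missing_mapped_fields(object_payload: dict, mapped_fields: list[str]) -> list[str]:
    """Return the mapped fields that have no populated blob in the payload."""
    provided = {
        blob["property"]
        for blob in object_payload.get("blobs", [])
        if blob.get("data")  # treat empty or absent data as not populated
    }
    return [field for field in mapped_fields if field not in provided]


payload = {
    "blobs": [
        {"property": "hero_image", "type": "image", "data": "s3://my-bucket/products/sku-001/hero.jpg"},
        {"property": "description", "type": "text", "data": "Lightweight carbon-fiber trail running shoe with Vibram outsole"},
    ],
    "metadata": {"product_id": "SKU-001"},
}
# This example payload deliberately omits spec_sheet.
print(missing_mapped_fields(payload, ["hero_image", "spec_sheet", "description"]))
```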

Process the batch

cURL

curl -X POST "https://api.mixpeek.com/v1/batches" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_id": "bkt_abc123",
    "collection_ids": ["col_xyz789"]
  }'

Each object in the bucket produces one document containing one embedding (3072-d by default) that represents all of its files combined.

Creating a Retriever

After indexing, create a retriever to search the embedded objects. For this extractor, the feature_search stage accepts queries in two forms: a single item (text or content mode) or several files together (multi_content mode).

Single-item query (`text` or `content` mode)

Query with a single URL or text string — Gemini embeds it as-is and searches:

cURL

curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "product-object-search",
    "collection_identifiers": ["col_xyz789"],
    "input_schema": {
      "query": {"type": "text", "description": "Search query", "required": true}
    },
    "stages": [
      {
        "stage_name": "Object Search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
                "query": {
                  "input_mode": "text",
                  "value": "{{INPUT.query}}"
                },
                "top_k": 20
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'

Multi-file query (`multi_content` mode)

Query with multiple files at once — Gemini embeds all of them together, matching how objects were indexed. This produces the most accurate similarity scores because the query vector is built the same way as the index vectors.

cURL

curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "product-multifile-search",
    "collection_identifiers": ["col_xyz789"],
    "input_schema": {
      "image_url":   {"type": "text", "description": "Product image URL"},
      "description": {"type": "text", "description": "Product description text"}
    },
    "stages": [
      {
        "stage_name": "Multi-file Object Search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
                "query": {
                  "input_mode": "multi_content",
                  "values": [
                    "{{INPUT.image_url}}",
                    "{{INPUT.description}}"
                  ]
                },
                "top_k": 20
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'

Execute it by passing multiple files in inputs:

cURL

curl -X POST "https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "image_url": "https://example.com/query-shoe.jpg",
      "description": "trail running shoe lightweight"
    }
  }'
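
The relationship between the `{{INPUT.*}}` placeholders in the retriever and the `inputs` object at execution time can be illustrated with a small substitution sketch. Mixpeek's actual template resolution is not documented on this page, so this only shows the idea; the `resolve` helper and its regular expression are assumptions.

```python
import re

# Matches placeholders of the form {{INPUT.name}}.
PLACEHOLDER = re.compile(r"\{\{INPUT\.(\w+)\}\}")


def resolve(template: str, inputs: dict[str, str]) -> str:
    """Replace each {{INPUT.name}} with the matching value from inputs."""
    return PLACEHOLDER.sub(lambda match: inputs[match.group(1)], template)


values_template = ["{{INPUT.image_url}}", "{{INPUT.description}}"]
inputs = {
    "image_url": "https://example.com/query-shoe.jpg",
    "description": "trail running shoe lightweight",
}
resolved = [resolve(value, inputs) for value in values_template]
print(resolved)
```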

multi_content is only valid for feature URIs backed by gemini_multifile_extractor. Using it with any other extractor returns a 400 error.
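
To avoid that 400 error, a client can refuse to build a `multi_content` query for other feature URIs. The sketch below is a hypothetical guard: it assumes a multifile-backed URI can be recognized by the `mixpeek://gemini_multifile_extractor@` prefix seen in the examples above, which may not hold for every deployment.

```python
# Assumed prefix, taken from the feature_uri values used on this page.
MULTIFILE_URI_PREFIX = "mixpeek://gemini_multifile_extractor@"


def multi_content_query(feature_uri: str, values: list[str]) -> dict:
    """Build a multi_content query block, rejecting unsupported feature URIs."""
    if not feature_uri.startswith(MULTIFILE_URI_PREFIX):
        raise ValueError("multi_content requires a gemini_multifile_extractor feature URI")
    if not values:
        raise ValueError("multi_content needs at least one value")
    return {"input_mode": "multi_content", "values": list(values)}


query = multi_content_query(
    "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
    ["{{INPUT.image_url}}", "{{INPUT.description}}"],
)
print(query)
```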

Dimensionality Reduction

Gemini Embedding 2 supports truncated output dimensions, which reduces storage cost. Set output_dimensionality in params:

{
  "feature_extractor": {
    "feature_extractor_name": "gemini_multifile_extractor",
    "version": "v1",
    "input_mappings": {
      "files": ["hero_image", "spec_sheet", "description"]
    },
    "params": {
      "output_dimensionality": 768
    }
  }
}

Dimensions	Storage per vector	Quality
`3072` (default)	12 KB	Best
`768`	3 KB	Near-identical
`256`	1 KB	Good — ~2% quality loss
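
The storage figures in the table are consistent with 4 bytes per dimension, which is an assumption (32-bit floats) rather than something stated on this page. Under that assumption, the arithmetic can be reproduced and scaled to a collection size; the function names are hypothetical.

```python
BYTES_PER_DIMENSION = 4  # assumption: vectors are stored as 32-bit floats


def vector_kb(dims: int) -> float:
    """Storage for one vector in KB (1 KB = 1024 bytes)."""
    return dims * BYTES_PER_DIMENSION / 1024


def collection_mb(dims: int, n_objects: int) -> float:
    """Raw vector storage in MB for n_objects, ignoring index overhead."""
    return vector_kb(dims) * n_objects / 1024


for dims in (3072, 768, 256):
    print(dims, vector_kb(dims), collection_mb(dims, 1_000_000))
```

Because the extractor stores one vector per object, the total scales with the number of objects, not the number of files.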

Pricing

See Billing & Pricing; rates come from GET /v1/billing/pricing. Each object incurs a single charge that covers all N of its files, rather than one charge per file.

Related

Feature Search Stage: Search with multi_content query mode
Image Extractor: Per-image embeddings (one document per image)
Multimodal Extractor: Per-segment video/audio embeddings
Text Extractor: Per-chunk text embeddings
