The Gemini Multifile Extractor uses Gemini Embedding 2 (gemini-embedding-exp-03-07, 3072-d) to embed all files of an object — images, PDFs, video, audio, and text — into one unified vector per object in a single API call. This is fundamentally different from other extractors: instead of producing one document per file, it collapses all of an object’s blobs into a single embedding representing the object as a whole.

When to Use

| Use Case | Example |
| --- | --- |
| Product catalogs | Embed product photo + spec sheet + description together |
| Medical records | Embed scan + report + clinical notes as one object |
| Legal documents | Embed contract + exhibits + summary together |
| E-commerce | Embed product image + manual + label as one searchable unit |
| Object-level search | Find objects similar to a combination of inputs |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Per-file granularity needed (search within a PDF) | document_extractor or multimodal_extractor |
| Single-file objects | image_extractor or text_extractor |
| Video frame-level search | multimodal_extractor |

How It Works

Standard extractors produce one document per blob (file). The Gemini Multifile Extractor uses array input_mappings to collect multiple blob fields from one object into a single list, then sends all of them to Gemini Embedding 2 in one API call:
Object (hero_image + spec_sheet + description)
         ↓  array input_mappings
Single Gemini API call with all 3 parts
         ↓
One 3072-d embedding → One document per object

Array input_mappings

The key difference from other extractors is the input_mappings value: instead of mapping one field, you map a list of fields. All listed fields are collected from each object and embedded together.
{
  "input_mappings": {
    "files": ["hero_image", "spec_sheet", "description"]
  }
}
The key ("files") is the extractor’s input name. The value is a list of blob field names from your bucket schema. Each field can be an image, PDF, video, audio, or text blob.
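A minimal Python sketch of assembling this mapping client-side. The helper name `build_multifile_extractor` is hypothetical and not part of any SDK; it only produces the JSON structure shown above and rejects a bare string where a list of field names is expected.

```python
import json

def build_multifile_extractor(fields, input_key="files"):
    """Return a feature_extractor config with an array input mapping.

    Hypothetical helper: it only assembles the JSON shown in this section.
    """
    # A bare string would be the single-field form used by other extractors,
    # so insist on a real list (or other iterable) of field names.
    if isinstance(fields, str):
        raise TypeError("fields must be a list of blob field names, not a string")
    fields = list(fields)
    if not fields:
        raise ValueError("at least one blob field is required")
    return {
        "feature_extractor_name": "gemini_multifile_extractor",
        "version": "v1",
        "input_mappings": {input_key: fields},
    }

config = build_multifile_extractor(["hero_image", "spec_sheet", "description"])
print(json.dumps(config, indent=2))
```

Failing early on a string is a design choice for this sketch: a single string is easy to pass by accident and would no longer describe a multi-file mapping.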

Output

| Field | Type | Description |
| --- | --- | --- |
| gemini_multifile_extractor_v1_embedding | float[] (3072-d) | Unified embedding for all input files |
| source_blob_count | int | Number of blobs embedded together |
| source_blob_properties | string[] | Names of the blob fields that contributed |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| output_dimensionality | integer | 3072 | Embedding dimensions: 3072, 768, or 256 |
| task_type | string | RETRIEVAL_DOCUMENT | Gemini task type. Use RETRIEVAL_DOCUMENT for indexing |
| input_key | string | files | Must match the key in input_mappings |
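The constraints in the parameter table can be checked before sending a request. The sketch below is a hypothetical client-side check, not the service's own validation; it assumes parameters sit under a `params` key inside `feature_extractor`, as in the Dimensionality Reduction example on this page.

```python
ALLOWED_DIMENSIONS = {3072, 768, 256}

def validate_extractor(extractor):
    """Check a feature_extractor dict against the documented parameters.

    Hypothetical client-side check; the service may validate differently.
    Returns the effective params with the documented defaults filled in.
    """
    params = {
        "output_dimensionality": 3072,
        "task_type": "RETRIEVAL_DOCUMENT",
        "input_key": "files",
    }
    params.update(extractor.get("params", {}))

    if params["output_dimensionality"] not in ALLOWED_DIMENSIONS:
        raise ValueError("output_dimensionality must be 3072, 768, or 256")
    # input_key has to name the key used in input_mappings.
    if params["input_key"] not in extractor.get("input_mappings", {}):
        raise ValueError("input_key must match a key in input_mappings")
    return params

extractor = {
    "feature_extractor_name": "gemini_multifile_extractor",
    "version": "v1",
    "input_mappings": {"files": ["hero_image", "spec_sheet", "description"]},
    "params": {"output_dimensionality": 768},
}
print(validate_extractor(extractor))
```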

Complete Collection Setup

1. Define your bucket schema with multiple blob fields

cURL
curl -X POST "https://api.mixpeek.com/v1/buckets" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "product-catalog",
    "schema": {
      "properties": {
        "product_id":   {"type": "string"},
        "hero_image":   {"type": "image"},
        "spec_sheet":   {"type": "pdf"},
        "description":  {"type": "text"}
      }
    }
  }'
2. Create a collection with array input_mappings

cURL
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-embeddings",
    "bucket_id": "bkt_abc123",
    "feature_extractor": {
      "feature_extractor_name": "gemini_multifile_extractor",
      "version": "v1",
      "input_mappings": {
        "files": ["hero_image", "spec_sheet", "description"]
      }
    }
  }'
The value of "files" is a list of field names, not a single string. This is what triggers the multi-file embedding behavior. All three fields are embedded together into one vector.
3. Upload objects to the bucket

Each object must have all the mapped fields populated:
cURL
curl -X POST "https://api.mixpeek.com/v1/buckets/$BUCKET_ID/objects" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "product_id": "SKU-001",
      "hero_image": "s3://my-bucket/products/sku-001/hero.jpg",
      "spec_sheet": "s3://my-bucket/products/sku-001/spec.pdf",
      "description": "Lightweight carbon-fiber trail running shoe with Vibram outsole"
    }
  }'
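Because every mapped field must be populated, it can help to check objects before uploading them. The following is a small sketch with a hypothetical helper, `missing_mapped_fields`; the second object and its path are made up for illustration.

```python
def missing_mapped_fields(data, mapped_fields):
    """Return mapped field names that are absent or empty in an object's data."""
    return [name for name in mapped_fields if not data.get(name)]

# Field names listed under "files" in the collection's input_mappings.
mapped = ["hero_image", "spec_sheet", "description"]

complete = {
    "product_id": "SKU-001",
    "hero_image": "s3://my-bucket/products/sku-001/hero.jpg",
    "spec_sheet": "s3://my-bucket/products/sku-001/spec.pdf",
    "description": "Lightweight carbon-fiber trail running shoe with Vibram outsole",
}
# Illustrative object that lacks two of the mapped fields.
incomplete = {
    "product_id": "SKU-002",
    "hero_image": "s3://my-bucket/products/sku-002/hero.jpg",
}

print(missing_mapped_fields(complete, mapped))    # → []
print(missing_mapped_fields(incomplete, mapped))  # → ['spec_sheet', 'description']
```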
4. Process the batch

cURL
curl -X POST "https://api.mixpeek.com/v1/batches" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_id": "bkt_abc123",
    "collection_ids": ["col_xyz789"]
  }'
Each object in the bucket produces one document containing one 3072-d embedding representing all its files combined.

Creating a Retriever

After indexing, create a retriever to search the embedded objects. The feature_search stage supports two kinds of query for this extractor: a single-item query (text or content mode) and a multi-file query (multi_content mode).

Single-item query (text or content mode)

Query with a single URL or text string — Gemini embeds it as-is and searches:
cURL
curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "product-object-search",
    "collection_identifiers": ["col_xyz789"],
    "input_schema": {
      "query": {"type": "text", "description": "Search query", "required": true}
    },
    "stages": [
      {
        "stage_name": "Object Search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
                "query": {
                  "input_mode": "text",
                  "value": "{{INPUT.query}}"
                },
                "top_k": 20
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'

Multi-file query (multi_content mode)

Query with multiple files at once — Gemini embeds all of them together, matching how objects were indexed. This produces the most accurate similarity scores because the query vector is built the same way as the index vectors.
cURL
curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "product-multifile-search",
    "collection_identifiers": ["col_xyz789"],
    "input_schema": {
      "image_url":   {"type": "text", "description": "Product image URL"},
      "description": {"type": "text", "description": "Product description text"}
    },
    "stages": [
      {
        "stage_name": "Multi-file Object Search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
                "query": {
                  "input_mode": "multi_content",
                  "values": [
                    "{{INPUT.image_url}}",
                    "{{INPUT.description}}"
                  ]
                },
                "top_k": 20
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'
Execute it by passing multiple files in inputs:
cURL
curl -X POST "https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "image_url": "https://example.com/query-shoe.jpg",
      "description": "trail running shoe lightweight"
    }
  }'
multi_content is only valid for feature URIs backed by gemini_multifile_extractor. Using it with any other extractor returns a 400 error.
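The two query shapes can be generated from one helper. This is a sketch under stated assumptions: `build_search` is a hypothetical name, a single value is always sent in text mode (the shape of a content-mode query is not shown on this page, so it is left out), and the URI prefix check simply mirrors the rule above so the mistake is caught before the request is sent.

```python
MULTIFILE_PREFIX = "mixpeek://gemini_multifile_extractor@"

def build_search(feature_uri, values, top_k=20):
    """Build one entry for the "searches" list of a feature_search stage.

    Hypothetical helper: one value becomes a text-mode query, several
    values become a multi_content query.
    """
    values = list(values)
    if not values:
        raise ValueError("at least one query value is required")
    if len(values) == 1:
        query = {"input_mode": "text", "value": values[0]}
    else:
        # multi_content is only valid for this extractor's feature URIs.
        if not feature_uri.startswith(MULTIFILE_PREFIX):
            raise ValueError(
                "multi_content requires a gemini_multifile_extractor feature URI"
            )
        query = {"input_mode": "multi_content", "values": values}
    return {"feature_uri": feature_uri, "query": query, "top_k": top_k}

uri = "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07"
single = build_search(uri, ["{{INPUT.query}}"])
multi = build_search(uri, ["{{INPUT.image_url}}", "{{INPUT.description}}"])
print(single["query"]["input_mode"], multi["query"]["input_mode"])
```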

Dimensionality Reduction

Gemini Embedding 2 supports truncated dimensions for storage cost reduction:
{
  "feature_extractor": {
    "feature_extractor_name": "gemini_multifile_extractor",
    "version": "v1",
    "input_mappings": {
      "files": ["hero_image", "spec_sheet", "description"]
    },
    "params": {
      "output_dimensionality": 768
    }
  }
}
| Dimensions | Storage per vector | Quality |
| --- | --- | --- |
| 3072 (default) | 12 KB | Best |
| 768 | 3 KB | Near-identical |
| 256 | 1 KB | Good (~2% quality loss) |
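The storage figures are consistent with 4 bytes per dimension, which is the size of a 32-bit float. That storage format is an assumption here, since this page does not state it. A short sketch of the arithmetic, with a hypothetical helper name:

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def storage_kib(dimensions, num_vectors=1):
    """Raw vector storage in KiB, ignoring index and metadata overhead."""
    return dimensions * BYTES_PER_FLOAT32 * num_vectors / 1024

for dims in (3072, 768, 256):
    print(dims, storage_kib(dims))
# → 3072 12.0
# → 768 3.0
# → 256 1.0
```

Passing `num_vectors` gives a rough total for a whole collection, since each object contributes exactly one vector.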

Pricing

| Tier | Credits per object |
| --- | --- |
| ADVANCED | 10 credits |
The charge is applied once per object and covers all N files within it; it is not multiplied by the number of files.
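As a worked example of per-object pricing, the sketch below estimates credits for a batch. The helper name and the sample file counts are illustrative only; the rate of 10 credits comes from the table above.

```python
CREDITS_PER_OBJECT = 10  # ADVANCED tier, from the pricing table

def estimate_credits(files_per_object):
    """Estimate credits for a batch given each object's file count.

    The file counts do not affect the total: the charge is per object.
    """
    return len(files_per_object) * CREDITS_PER_OBJECT

# Three objects holding 3, 5, and 2 files respectively.
print(estimate_credits([3, 5, 2]))  # → 30
```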