Feature Search

Feature Search stage showing multi-vector semantic search with fusion

The Feature Search stage is the primary search stage for retrieval pipelines. It performs vector similarity search across one or more embedding features, supporting single-modal, multimodal, and hybrid search patterns. Results from multiple searches are fused using configurable strategies (RRF, DBSF, weighted, max, or learned).

Stage Category: FILTER (Retrieves documents)

Transformation: 0 documents → N documents (retrieves from collection based on vector similarity)

When to Use

Use Case	Description
Semantic search	Find documents similar in meaning to a query
Image search	Search by image embeddings
Video search	Search by video frame embeddings
Multimodal search	Combine text + image + video in one query
Hybrid search	Fuse results from multiple embedding types
Decompose/recompose	Group results by parent document
Faceted search	Get result counts by field values

When NOT to Use

Scenario	Recommended Alternative
Exact field matching	`attribute_filter`
Full-text keyword search	Combine with text features
No embeddings in collection	`attribute_filter`
Post-search filtering only	Use after `feature_search`

Core Concepts

Feature URIs

Feature URIs identify which embedding index to search. They follow the pattern:

mixpeek://{extractor_name}@{version}/{output_name}

Examples:

mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding - Multimodal text/image/video embeddings
mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1 - Text-only embeddings
mixpeek://image_extractor@v1/google_siglip_base_v1 - Image embeddings
mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1 - Speech/transcript embeddings (from video/audio)
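
To make the URI pattern concrete, here is a minimal sketch that splits a feature URI into its three parts. The regular expression, the `FeatureURI` type, and the `parse_feature_uri` helper are illustrative names, not part of any Mixpeek SDK; only the URI shape comes from the pattern above.

```python
import re
from typing import NamedTuple

# Shape taken from the documented pattern:
#   mixpeek://{extractor_name}@{version}/{output_name}
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


class FeatureURI(NamedTuple):
    extractor_name: str
    version: str
    output_name: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into its parts, or raise ValueError if it does not fit."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(match["extractor"], match["version"], match["output"])


parsed = parse_feature_uri(
    "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
)
print(parsed.extractor_name, parsed.version, parsed.output_name)
```

A check like this can catch malformed URIs client-side, but it cannot tell whether the index actually exists in your namespace.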

Fusion Strategies

When searching multiple features, results are combined using fusion:

Strategy	Description	Best For
`rrf`	Reciprocal Rank Fusion	General purpose, balanced results
`dbsf`	Distribution-Based Score Fusion	When scores have different distributions
`weighted`	Weighted combination	When you know relative importance
`max`	Maximum score wins	When any match is sufficient
`learned`	Thompson Sampling bandit	Automatically adapts per-user from interaction data
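
As a rough illustration of the default `rrf` strategy, the sketch below implements textbook Reciprocal Rank Fusion over ranked lists of document IDs. The constant `k=60` is a commonly used default and is an assumption here; the value the stage uses internally is not stated on this page, and `rrf_fuse` is a hypothetical helper name.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, final_top_k=25):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:final_top_k]


text_hits = ["doc_a", "doc_b", "doc_c"]   # ranked results from one search
image_hits = ["doc_b", "doc_d", "doc_a"]  # ranked results from another search
for doc_id, score in rrf_fuse([text_hits, image_hits]):
    print(doc_id, round(score, 5))
```

Because only ranks enter the formula, a document that places well in both lists outranks one that tops a single list, regardless of how each search scales its raw scores.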

Learned Fusion Configuration

When fusion is set to "learned", you can provide a learning_config object to control how the bandit adapts. See Auto-Tune for a full walkthrough.

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
          "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
          "top_k": 100
        },
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
          "top_k": 100
        }
      ],
      "fusion": "learned",
      "learning_config": {
        "context_features": ["INPUT.user_id"],
        "demographic_features": ["INPUT.user_segment"],
        "reward_map": {
          "click": 1.0,
          "purchase": 3.0,
          "add_to_cart": 2.0,
          "bookmark": 1.5,
          "positive_feedback": 2.0,
          "negative_feedback": -2.0,
          "skip": -0.5
        },
        "min_interactions": 5,
        "exploration_bonus": 1.0,
        "exploration_decay": 0.99,
        "exploration_floor": 0.1,
        "decay_factor": 0.995,
        "decay_window_days": 365,
        "min_weight": 0.05,
        "max_weight": 0.95,
        "rollout_pct": 100.0,
        "shadow_mode": false
      },
      "final_top_k": 25
    }
  }
}

learning_config Fields

Field	Type	Default	Description
`context_features`	`string[]`	`["INPUT.user_id"]`	Input fields for personal-level learning. References `INPUT.*` fields from the retriever’s `input_schema`.
`demographic_features`	`string[]`	`[]`	Input fields for segment-level fallback (e.g., `"INPUT.user_segment"`).
`reward_signal`	`string`	`"click"`	Deprecated. Use `reward_map` instead.
`reward_map`	`object`	See defaults	Maps interaction types to reward magnitudes. Positive values reinforce the associated feature; negative values penalize it.
`min_interactions`	`integer`	`5`	Minimum interactions before personal-level weights are used. Below this, falls back to demographic or global.
`exploration_bonus`	`float`	`1.0`	Initial multiplier for weight distribution variance.
`exploration_decay`	`float`	`0.99`	Per-interaction decay of `exploration_bonus`.
`exploration_floor`	`float`	`0.1`	Minimum exploration bonus (prevents full exploitation).
`decay_factor`	`float`	`0.995`	Per-day exponential decay on older interactions. `1.0` = no decay.
`decay_window_days`	`integer`	`365`	Interactions older than this are excluded entirely.
`min_weight`	`float`	`0.05`	Floor for any feature’s weight after sampling.
`max_weight`	`float`	`0.95`	Ceiling for any feature’s weight after sampling.
`rollout_pct`	`float`	`100.0`	Percentage of requests using learned weights (0-100).
`shadow_mode`	`boolean`	`false`	Compute learned weights but serve static results.
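
The decay and clamping fields are easier to reason about with numbers. The sketch below is one plausible reading of the field descriptions in the table, not the service's actual bandit implementation; all function names are hypothetical, and the reward map is a subset of the example configuration.

```python
# Subset of the reward_map from the example configuration.
REWARD_MAP = {"click": 1.0, "purchase": 3.0, "skip": -0.5}


def exploration_bonus(n_interactions, bonus=1.0, decay=0.99, floor=0.1):
    """Bonus shrinks by `decay` per interaction but never drops below `floor`."""
    return max(floor, bonus * decay ** n_interactions)


def interaction_weight(age_days, decay_factor=0.995, decay_window_days=365):
    """Per-day exponential decay; interactions older than the window count as 0."""
    if age_days > decay_window_days:
        return 0.0
    return decay_factor ** age_days


def decayed_reward(interactions, reward_map=REWARD_MAP):
    """Sum time-decayed rewards for (interaction_type, age_days) pairs."""
    return sum(reward_map[kind] * interaction_weight(age) for kind, age in interactions)


def clamp_weight(weight, min_weight=0.05, max_weight=0.95):
    """Keep a sampled feature weight inside [min_weight, max_weight]."""
    return min(max_weight, max(min_weight, weight))


print(exploration_bonus(100))                           # bonus after 100 interactions
print(decayed_reward([("purchase", 30), ("skip", 2)]))  # older purchase, recent skip
print(clamp_weight(0.99))                               # capped at max_weight
```

With the defaults, an interaction from 139 days ago still counts for about half of a fresh one, since 0.995 raised to the 139th power is roughly 0.5.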

See Auto-Tune for the full concept overview, Reward Signals for reward map customization, and Rollout & Safety for traffic splitting and kill switch details.

Parameters

Parameter	Type	Default	Description
`searches`	array	Required	Array of search configurations
`final_top_k`	integer	`25`	Total results to return after fusion
`fusion`	string	`rrf`	Fusion strategy for multi-search
`group_by`	object	`null`	Group results by field
`facets`	array	`null`	Fields to compute facet counts

Search Object Parameters

Each item in the searches array supports:

Parameter	Type	Default	Description
`feature_uri`	string	Required	Embedding index to search
`query`	string/object	Required	Query text or embedding
`top_k`	integer	`100`	Candidates per search
`filters`	object	`null`	Pre-filter conditions
`weight`	number	`1.0`	Weight for fusion (weighted strategy)
`lexical`	boolean	`false`	Run this search as keyword/BM25 instead of vector (see Lexical (BM25) Search)
`query_preprocessing`	object	`null`	Large file decomposition config

Query Input Modes

The query field on each search object accepts either a plain string (shorthand for text mode) or an object with an explicit input_mode:

Mode	`input_mode`	Value	Supported by
Text	`"text"`	Plain text string	All text-capable extractors
Content	`"content"`	Single URL or base64 data URI	All multimodal extractors
Document	`"document"`	Reference to an existing document	All extractors
Vector	`"vector"`	Pre-computed embedding (list of floats)	All extractors
Multi-content	`"multi_content"`	List of URLs and/or text strings	`gemini_multifile_extractor` only

Text — embed a string and search:

{"input_mode": "text", "value": "{{INPUT.query}}"}

Content — fetch a URL and embed it:

{"input_mode": "content", "value": "{{INPUT.image_url}}"}

Vector — use a pre-computed embedding directly (no inference at query time):

{"input_mode": "vector", "value": "{{INPUT.embedding}}"}

Multi-content — embed multiple files together in one API call. Only valid when the feature_uri points to an extractor whose vector index has supports_multi_query=True (currently: gemini_multifile_extractor). Attempting this with any other feature URI returns a 400 error.

{
  "input_mode": "multi_content",
  "values": ["{{INPUT.image_url}}", "{{INPUT.description}}"]
}

Each item in values is auto-detected: URLs (http://, https://, s3://) are fetched and embedded as files; all other strings are embedded as text. All items are passed to the underlying model in one call, producing a single query vector that mirrors how objects were indexed.
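
The string shorthand and the multi-content auto-detection rule can be sketched client-side as follows. This only mirrors the behavior described above, under the assumption that detection is a plain prefix check; `normalize_query` and `classify_multi_content` are hypothetical helpers, not API calls.

```python
# URL prefixes listed in the auto-detection rule above.
URL_PREFIXES = ("http://", "https://", "s3://")


def normalize_query(query):
    """Expand the plain-string shorthand into an explicit text-mode object."""
    if isinstance(query, str):
        return {"input_mode": "text", "value": query}
    return query


def classify_multi_content(values):
    """Label each item as 'file' (URL prefix) or 'text' (everything else)."""
    return [("file" if v.startswith(URL_PREFIXES) else "text", v) for v in values]


print(normalize_query("red running shoes"))
print(classify_multi_content([
    "https://example.com/shoe.jpg",
    "s3://bucket/spec.pdf",
    "lightweight trail shoe",
]))
```

Note that under this rule a URL passed through the string shorthand is embedded as text, so use `input_mode: "content"` when the value should be fetched.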

Lexical (BM25) Search

Set lexical: true on a search to run keyword/BM25 matching instead of vector similarity. The query text is matched against the namespace’s full-text index — it is not embedded into a vector. BM25 catches exact tokens that dense embeddings routinely miss: brand names, SKUs, prices like $9.99, promo codes, error strings, and CTAs.

Behavior	Detail
Input	Must be text (`input_mode: "text"`). The query string is used verbatim.
Matching	Across all `text`-indexed string payload fields — not a single field.
`feature_uri`	Used only for collection scoping; no vector index is queried.
Prerequisite	A `text` payload index must exist (see Text Indexes (BM25)).

Searching only one field (e.g. OCR text). BM25 matches across all text-indexed string fields — it can’t be scoped to a single field like ocr_text. To make one field independently searchable, give it its own dense index by running a text_extractor over it (map the extractor’s input to ocr_text), then feature_search that feature URI directly. For coarse exact-substring filtering on a single field, an attribute_filter with the contains operator works but is not relevance-ranked.

The real power is hybrid retrieval — fuse a dense (vector) search with a lexical (BM25) search under rrf so semantic recall and exact-keyword precision reinforce each other:

Dense + Lexical Hybrid (RRF)

{
  "stage_name": "hybrid_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "fusion": "rrf",
      "final_top_k": 25,
      "searches": [
        {
          "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
          "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
          "top_k": 100
        },
        {
          "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
          "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
          "lexical": true,
          "top_k": 100
        }
      ]
    }
  }
}

Use rrf fusion for dense+lexical hybrid — it ranks by position, so it is immune to the score-scale mismatch between cosine similarity and BM25. Avoid weighted/max here unless you have a specific reason.

Query Preprocessing

When searching with large files (videos, PDFs, long documents) as input, query_preprocessing decomposes the file into chunks using the same extractor pipeline that indexed your data, runs parallel searches for each chunk, and fuses the results. This is ingestion applied to the query — same decomposition and embedding, but vectors are used for search instead of storage.

Parameter	Type	Default	Description
`feature_uri`	string	`null`	Extractor pipeline for decomposition (inherits from search `feature_uri` if not set)
`params`	object	`null`	Extractor parameters — identical schema to the collection’s extractor config for that `feature_uri`
`max_chunks`	integer	`20`	Max chunks to search (1-100). Each chunk runs its own search and adds query cost — reads are metered (see Billing)
`aggregation`	string	`rrf`	Fusion strategy: `rrf`, `max`, or `avg`
`dedup_field`	string	`null`	Field to deduplicate results by

params uses the extractor’s own parameter schema. Whatever parameters the extractor accepts during ingestion (e.g. split_method, time_split_interval for video; chunk_size, chunk_overlap for text) are the same parameters you pass here. There is no separate preprocessing-specific schema — the extractor drives the decomposition exactly as it would during collection processing. Refer to the extractor’s own documentation for valid parameter names.

You can also set query_preprocessing at the stage level (on parameters) to apply it to all searches as a default. Per-search settings override the stage default.

Aggregation strategies:

Strategy	Best For	How It Works
`rrf`	General purpose (recommended)	Rank-based fusion, immune to score magnitude differences
`max`	“Find this exact moment”	Keeps highest score per document across chunks
`avg`	“Find similar overall content”	Averages scores, so consistent matches win
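
To see why `max` and `avg` favor different results, consider the sketch below. It assumes each query chunk yields a `{document_id: score}` map and that `avg` divides by the number of chunks searched (so a document absent from a chunk contributes 0); that averaging rule is an assumption, as is the helper name `aggregate_chunks`.

```python
from collections import defaultdict


def aggregate_chunks(chunk_results, strategy="max"):
    """Combine per-chunk {doc_id: score} maps into one ranked list."""
    per_doc = defaultdict(list)
    for hits in chunk_results:
        for doc_id, score in hits.items():
            per_doc[doc_id].append(score)
    n_chunks = len(chunk_results)
    if strategy == "max":
        fused = {doc_id: max(scores) for doc_id, scores in per_doc.items()}
    elif strategy == "avg":
        # Assumption: chunks where the document is absent contribute 0.
        fused = {doc_id: sum(scores) / n_chunks for doc_id, scores in per_doc.items()}
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


chunks = [
    {"doc_a": 0.95, "doc_b": 0.70},  # chunk 0: doc_a matches one moment strongly
    {"doc_b": 0.72},                 # chunk 1
    {"doc_b": 0.68},                 # chunk 2: doc_b matches every chunk
]
print(aggregate_chunks(chunks, "max"))
print(aggregate_chunks(chunks, "avg"))
```

Here `max` ranks `doc_a` first because of its single strong match, while `avg` ranks `doc_b` first because it matches every chunk.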

Configuration Examples

Single-feature text search:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 100
        }
      ],
      "final_top_k": 25
    }
  }
}

Image search by URL:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
          "query": {"input_mode": "content", "value": "{{INPUT.image_url}}"},
          "top_k": 50
        }
      ],
      "final_top_k": 20
    }
  }
}

Text + image hybrid search with RRF fusion:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 100
        },
        {
          "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
          "query": {"input_mode": "content", "value": "{{INPUT.image_url}}"},
          "top_k": 100
        }
      ],
      "fusion": "rrf",
      "final_top_k": 25
    }
  }
}

Weighted fusion (text 0.7, image 0.3):

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
          "query": "{{INPUT.query}}",
          "top_k": 100,
          "weight": 0.7
        },
        {
          "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
          "query": {"input_mode": "content", "value": "{{INPUT.image_url}}"},
          "top_k": 100,
          "weight": 0.3
        }
      ],
      "fusion": "weighted",
      "final_top_k": 20
    }
  }
}

Search with pre-filters:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 100,
          "filters": {
            "AND": [
              {"field": "metadata.status", "operator": "eq", "value": "published"},
              {"field": "metadata.category", "operator": "in", "value": ["tech", "science"]}
            ]
          }
        }
      ],
      "final_top_k": 25
    }
  }
}

Grouped results (decompose/recompose):

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 200
        }
      ],
      "group_by": {
        "field": "metadata.parent_id",
        "limit": 10,
        "group_size": 3
      },
      "final_top_k": 30
    }
  }
}

Search with facets:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 100
        }
      ],
      "facets": ["metadata.category", "metadata.author", "metadata.year"],
      "final_top_k": 25
    }
  }
}

Multi-content query (gemini_multifile_extractor only):

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
          "query": {
            "input_mode": "multi_content",
            "values": [
              "{{INPUT.image_url}}",
              "{{INPUT.spec_sheet_url}}",
              "{{INPUT.description}}"
            ]
          },
          "top_k": 20
        }
      ],
      "final_top_k": 10
    }
  }
}

Query Preprocessing Examples

Search with a large video — decompose it into 10-second segments, search each, and fuse:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": {"input_mode": "content", "value": "{{INPUT.video}}"},
          "top_k": 100,
          "query_preprocessing": {
            "params": {"split_method": "time", "time_split_interval": 10},
            "max_chunks": 20,
            "aggregation": "max"
          }
        }
      ],
      "final_top_k": 25
    }
  }
}

Search with a PDF document. No params are given, so extractor defaults drive the decomposition:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
          "query": {"input_mode": "content", "value": "{{INPUT.pdf_document}}"},
          "top_k": 100,
          "query_preprocessing": {
            "max_chunks": 50,
            "aggregation": "rrf"
          }
        }
      ],
      "final_top_k": 25
    }
  }
}

Hybrid search that combines a preprocessed video query with a plain text query:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": {"input_mode": "content", "value": "{{INPUT.video}}"},
          "query_preprocessing": {
            "params": {"time_split_interval": 10},
            "aggregation": "max"
          }
        },
        {
          "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
          "query": {"input_mode": "text", "value": "{{INPUT.text_query}}"}
        }
      ],
      "final_top_k": 25
    }
  }
}

Stage-level query_preprocessing applied as the default for every search:

{
  "stage_name": "feature_search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "query_preprocessing": {
        "max_chunks": 15,
        "aggregation": "rrf"
      },
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": {"input_mode": "content", "value": "{{INPUT.video1}}"}
        },
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": {"input_mode": "content", "value": "{{INPUT.video2}}"}
        }
      ],
      "final_top_k": 25
    }
  }
}

Preprocessing uses the same extractor pipeline that indexed your data. The params accept the same fields you configured on your collection’s feature extractor (e.g., split_method, chunk_size). If you don’t specify params, extractor defaults are used.

The response includes preprocessing metadata showing what happened:

{
  "metadata": {
    "preprocessing": {
      "content_type": "video/mp4",
      "extractor": "multimodal_extractor@v1",
      "chunks_generated": 18,
      "chunks_searched": 18,
      "aggregation": "rrf",
      "preprocessing_ms": 12450
    }
  }
}

Each result also includes query_chunks showing which parts of your query matched:

{
  "document_id": "doc_abc123",
  "score": 0.89,
  "query_chunks": [
    {"chunk_index": 0, "start_ms": 0, "end_ms": 10000, "score": 0.92},
    {"chunk_index": 2, "start_ms": 20000, "end_ms": 30000, "score": 0.87}
  ]
}

Grouping (Decompose/Recompose)

When documents are decomposed into chunks (e.g., video frames, document pages), use group_by to recompose results by parent:

{
  "group_by": {
    "field": "metadata.parent_id",
    "limit": 10,
    "group_size": 3
  }
}

Parameter	Description
`field`	Field to group by (e.g., parent document ID)
`limit`	Maximum number of groups to return
`group_size`	Maximum documents per group

Use cases:

Video search: Group frames by video, return top 3 frames per video
Document search: Group chunks by document, return best chunks per doc
Product search: Group variants by product family
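
The interaction between `limit` and `group_size` can be sketched as below. This is a client-side approximation that assumes results arrive sorted by descending score and that documents missing the group field are skipped; both are assumptions, and `lookup` and `group_results` are hypothetical helpers.

```python
def lookup(doc, path):
    """Resolve a dotted path such as 'metadata.video_id' in a nested dict."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def group_results(results, field, limit, group_size):
    """Keep at most `limit` groups and at most `group_size` documents per group."""
    groups = {}
    for doc in results:  # assumed sorted by descending score
        key = lookup(doc, field)
        if key is None:
            continue  # assumption: documents without the field are skipped
        if key not in groups:
            if len(groups) >= limit:
                continue  # group budget used up
            groups[key] = []
        if len(groups[key]) < group_size:
            groups[key].append(doc)
    return groups


frames = [
    {"document_id": "f1", "score": 0.95, "metadata": {"video_id": "vid_a"}},
    {"document_id": "f2", "score": 0.93, "metadata": {"video_id": "vid_a"}},
    {"document_id": "f3", "score": 0.90, "metadata": {"video_id": "vid_b"}},
    {"document_id": "f4", "score": 0.88, "metadata": {"video_id": "vid_a"}},
    {"document_id": "f5", "score": 0.80, "metadata": {"video_id": "vid_c"}},
]
grouped = group_results(frames, "metadata.video_id", limit=2, group_size=2)
for video_id, docs in grouped.items():
    print(video_id, [d["document_id"] for d in docs])
```

Since groups are built from the per-search candidates, a larger `top_k` gives each parent more chunks to draw from.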

Faceted Search

Get counts of results by field values for building filter UIs:

{
  "facets": ["metadata.category", "metadata.brand", "metadata.price_range"]
}

Response includes:

{
  "facets": {
    "metadata.category": [
      {"value": "electronics", "count": 45},
      {"value": "clothing", "count": 23}
    ],
    "metadata.brand": [
      {"value": "Apple", "count": 12},
      {"value": "Samsung", "count": 8}
    ]
  }
}
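
The facet response shape can be reproduced from a list of results with a simple counter. The sketch below assumes counts are taken over the documents passed in; whether the service counts over the returned page or over all matches is not stated here. `lookup` and `compute_facets` are hypothetical helpers.

```python
from collections import Counter


def lookup(doc, path):
    """Resolve a dotted path such as 'metadata.category' in a nested dict."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def compute_facets(results, fields):
    """Count distinct values per field, most frequent first."""
    facets = {}
    for field in fields:
        values = (lookup(doc, field) for doc in results)
        counts = Counter(value for value in values if value is not None)
        facets[field] = [
            {"value": value, "count": count} for value, count in counts.most_common()
        ]
    return facets


results = [
    {"metadata": {"category": "electronics", "brand": "Apple"}},
    {"metadata": {"category": "electronics", "brand": "Samsung"}},
    {"metadata": {"category": "clothing"}},
]
print(compute_facets(results, ["metadata.category", "metadata.brand"]))
```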

Filter Syntax

Filtered fields must have payload indexes on your namespace. Without indexes, filtering is slow and the response includes warnings about unindexed fields.

Pre-filters use boolean logic with AND/OR/NOT:

{
  "filters": {
    "AND": [
      {"field": "metadata.status", "operator": "eq", "value": "active"},
      {
        "OR": [
          {"field": "metadata.category", "operator": "eq", "value": "tech"},
          {"field": "metadata.category", "operator": "eq", "value": "science"}
        ]
      }
    ]
  }
}

Supported Operators

Operator	Description	Example
`eq`	Equals	`{"field": "status", "operator": "eq", "value": "active"}`
`ne`	Not equals	`{"field": "status", "operator": "ne", "value": "deleted"}`
`gt`	Greater than	`{"field": "price", "operator": "gt", "value": 100}`
`gte`	Greater than or equal	`{"field": "rating", "operator": "gte", "value": 4}`
`lt`	Less than	`{"field": "age", "operator": "lt", "value": 30}`
`lte`	Less than or equal	`{"field": "count", "operator": "lte", "value": 10}`
`in`	In array	`{"field": "category", "operator": "in", "value": ["a", "b"]}`
`nin`	Not in array	`{"field": "status", "operator": "nin", "value": ["deleted", "archived"]}`
`contains`	Contains substring	`{"field": "title", "operator": "contains", "value": "guide"}`
`exists`	Field exists	`{"field": "metadata.optional", "operator": "exists", "value": true}`
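
For testing filter expressions before sending them, a small local evaluator can help. The sketch below interprets the operators in the table against an in-memory document. It assumes `NOT` takes a list of conditions and that a missing field fails every operator except `exists`; neither detail is specified on this page, and `lookup` and `matches` are hypothetical helpers.

```python
import operator

_MISSING = object()


def lookup(doc, path):
    """Resolve a dotted path such as 'metadata.status' in a nested dict."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


_COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "in": lambda actual, expected: actual in expected,
    "nin": lambda actual, expected: actual not in expected,
    "contains": lambda actual, expected: isinstance(actual, str) and expected in actual,
}


def matches(doc, node):
    """Return True if doc satisfies a filter node (condition or AND/OR/NOT group)."""
    if "AND" in node:
        return all(matches(doc, child) for child in node["AND"])
    if "OR" in node:
        return any(matches(doc, child) for child in node["OR"])
    if "NOT" in node:
        # Assumption: NOT takes a list and passes when no child matches.
        return not any(matches(doc, child) for child in node["NOT"])
    actual = lookup(doc, node["field"])
    if node["operator"] == "exists":
        return (actual is not _MISSING) == bool(node["value"])
    if actual is _MISSING:
        return False  # Assumption: a missing field fails every other operator.
    return _COMPARATORS[node["operator"]](actual, node["value"])


active_tech_or_science = {
    "AND": [
        {"field": "metadata.status", "operator": "eq", "value": "active"},
        {"OR": [
            {"field": "metadata.category", "operator": "eq", "value": "tech"},
            {"field": "metadata.category", "operator": "eq", "value": "science"},
        ]},
    ]
}
doc = {"metadata": {"status": "active", "category": "science"}}
print(matches(doc, active_tech_or_science))
```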

Performance

Metric	Value
Latency (single search)	10-50ms
Latency (multi-search with fusion)	20-80ms
Optimal top_k	100-500 per search
Maximum top_k	10,000 per search
Fusion overhead	< 5ms

For best performance, use pre-filters to reduce the search space. Filtering at the vector index level is much faster than post-filtering in later stages.

Common Pipeline Patterns

Basic Search + Rerank

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": "{{INPUT.query}}",
            "top_k": 100
          }
        ],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "top_k": 10
      }
    }
  }
]

Multimodal Search + Filter + Limit

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": "{{INPUT.query}}",
            "top_k": 100
          },
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "query": {"input_mode": "content", "value": "{{INPUT.image}}"},
            "top_k": 100
          }
        ],
        "fusion": "rrf",
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "attribute_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "field": "metadata.in_stock",
        "operator": "eq",
        "value": true
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "count": 20
      }
    }
  }
]

Video Search with Frame Grouping

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": "{{INPUT.query}}",
            "top_k": 500
          }
        ],
        "group_by": {
          "field": "metadata.video_id",
          "limit": 10,
          "group_size": 5
        },
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Summarize why these video segments match the query"
      }
    }
  }
]

E-commerce Search with Facets

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": "{{INPUT.query}}",
            "top_k": 200,
            "filters": {
              "AND": [
                {"field": "metadata.in_stock", "operator": "eq", "value": true},
                {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
              ]
            }
          }
        ],
        "facets": ["metadata.category", "metadata.brand", "metadata.color"],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "sort_attribute",
    "stage_type": "sort",
    "config": {
      "stage_id": "sort_attribute",
      "parameters": {
        "field": "{{INPUT.sort_by}}",
        "direction": "{{INPUT.sort_order}}"
      }
    }
  }
]

Output Schema

Each result includes:

Field	Type	Description
`document_id`	string	Unique document identifier
`score`	float	Combined similarity score
`content`	string	Document content
`metadata`	object	Document metadata
`features`	object	Feature data and scores per search

Example output:

{
  "document_id": "doc_abc123",
  "score": 0.892,
  "content": "Document content here...",
  "metadata": {
    "title": "Example Document",
    "category": "tech",
    "created_at": "2024-01-15T10:30:00Z"
  },
  "features": {
    "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding": {
      "score": 0.91
    },
    "mixpeek://image_extractor@v1/google_siglip_base_v1": {
      "score": 0.87
    }
  }
}

Comparison: feature_search vs attribute_filter

Aspect	feature_search	attribute_filter
Purpose	Semantic similarity	Exact matching
Input	Query text/embedding	Field conditions
Scoring	Vector similarity	Binary match
Speed	10-50ms	5-20ms
Use when	Finding similar content	Filtering by metadata

Error Handling

Error	Behavior
Invalid feature_uri	Stage fails with error
Empty query	Returns empty results
Filter syntax error	Stage fails with error
No matching documents	Returns empty results

Creating a Retriever with feature_search

The following is a complete working example of creating a retriever that uses the feature_search stage, then executing it. Pay close attention to the field names — several are easy to confuse.

Common mistakes:

Use collection_identifiers (not collection_ids) in the retriever body.
input_schema is a flat map keyed by field name ({"query": {"type": "text"}}) — do not wrap it in a JSON Schema object ({"properties": {...}, "type": "object"}).
Use type: "text" (not "string") in input_schema values.
stage_type at the outer level must be "filter" — passing stage_type: "feature_search" is rejected (feature_search is a stage_id, not a stage_type).
stage_id: "feature_search" lives inside the config object, not at the outer stage_id.
Inside each search, the query value uses {"input_mode": "text", "value": "..."} — the value key, not a bare text key.
final_top_k lives inside config.parameters, not at the top level.
If a feature_uri is wrong, the error lists the available_feature_uris for your target collections — copy the exact URI (e.g. multilingual_e5_large_instruct_v1, not embedding).
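
Most of the mistakes listed above can be caught before the request is sent. The following sketch is a hypothetical client-side lint, not an official validator: `check_retriever_body` simply encodes the rules from the list and returns human-readable problems.

```python
def check_retriever_body(body):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if "collection_identifiers" not in body:
        problems.append("use collection_identifiers (not collection_ids)")
    if "final_top_k" in body:
        problems.append("final_top_k belongs inside config.parameters")
    schema = body.get("input_schema", {})
    if "properties" in schema:
        problems.append("input_schema must be a flat map, not a JSON Schema object")
    for name, spec in schema.items():
        if isinstance(spec, dict) and spec.get("type") == "string":
            problems.append(f'input_schema.{name}: use type "text", not "string"')
    for index, stage in enumerate(body.get("stages", [])):
        where = f"stages[{index}]"
        config = stage.get("config", {})
        if stage.get("stage_type") == "feature_search":
            problems.append(f'{where}: stage_type must be "filter"')
        if "stage_id" in stage:
            problems.append(f"{where}: stage_id belongs inside config")
        if "final_top_k" in stage or "final_top_k" in config:
            problems.append(f"{where}: final_top_k belongs inside config.parameters")
        for search in config.get("parameters", {}).get("searches", []):
            query = search.get("query")
            if isinstance(query, dict) and "text" in query:
                problems.append(f'{where}: query uses the "value" key, not "text"')
    return problems


bad_body = {
    "collection_ids": ["col_abc123"],
    "input_schema": {"query": {"type": "string"}},
    "stages": [
        {
            "stage_type": "feature_search",
            "stage_id": "feature_search",
            "final_top_k": 10,
            "config": {
                "parameters": {
                    "searches": [{"query": {"input_mode": "text", "text": "hello"}}]
                }
            },
        }
    ],
}
for problem in check_retriever_body(bad_body):
    print(problem)
```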

Step 1 — Create the Retriever

curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "my-semantic-retriever",

    "collection_identifiers": ["col_abc123"],

    "input_schema": {
      "query": {
        "type": "text",
        "description": "Search query",
        "required": true
      }
    },

    "stages": [
      {
        "stage_name": "Semantic Search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": {
                  "input_mode": "text",
                  "value": "{{INPUT.query}}"
                },
                "top_k": 50
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'

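import os

import httpx

# Assumes the same API_KEY and NAMESPACE_ID environment variables as the curl example.
api_key = os.environ["API_KEY"]
namespace_id = os.environ["NAMESPACE_ID"]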

response = httpx.post(
    "https://api.mixpeek.com/v1/retrievers",
    headers={
        "Authorization": f"Bearer {api_key}",
        "X-Namespace": namespace_id,
    },
    json={
        "retriever_name": "my-semantic-retriever",
        # ✅ correct field: collection_identifiers
        "collection_identifiers": ["col_abc123"],
        "input_schema": {
            "query": {
                "type": "text",       # ✅ "text", not "string"
                "description": "Search query",
                "required": True,
            }
        },
        "stages": [
            {
                "stage_name": "Semantic Search",
                "stage_type": "filter",   # ✅ required at outer stage
                "config": {
                    "stage_id": "feature_search",  # ✅ inside config
                    "parameters": {
                        "searches": [
                            {
                                "feature_uri": (
                                    "mixpeek://text_extractor@v1/"
                                    "multilingual_e5_large_instruct_v1"
                                ),
                                "query": {
                                    "input_mode": "text",
                                    "value": "{{INPUT.query}}",
                                },
                                "top_k": 50,
                            }
                        ],
                        "final_top_k": 10,  # ✅ inside parameters
                    },
                },
            }
        ],
    },
)
response.raise_for_status()  # surface 4xx/5xx errors instead of a KeyError below
data = response.json()
retriever_id = data["retriever_id"]  # retriever_id is top-level on the response

Step 2 — Execute the Retriever

curl -X POST "https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "query": "machine learning for fashion brand compliance"
    }
  }'

response = httpx.post(
    f"https://api.mixpeek.com/v1/retrievers/{retriever_id}/execute",
    headers={
        "Authorization": f"Bearer {api_key}",
        "X-Namespace": namespace_id,
    },
    json={
        "inputs": {"query": "machine learning for fashion brand compliance"}
    },
)
results = response.json()

Finding the Right feature_uri

The feature_uri must match an embedding index that exists in your namespace. To discover available feature URIs, list the vector indexes in a collection:

curl "https://api.mixpeek.com/v1/collections/$COLLECTION_ID" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID"

Each item in the vector_indexes array has a feature_uri field — use that value directly in your retriever stage.
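
Once the collection response is parsed into a dict, pulling out the URIs is a one-liner. The sketch below relies only on the `vector_indexes` array and its `feature_uri` field described above; the sample payload and the `list_feature_uris` helper are made up for illustration, and real responses will contain more fields.

```python
def list_feature_uris(collection):
    """Collect the feature_uri of every vector index on a collection response."""
    return [
        index["feature_uri"]
        for index in collection.get("vector_indexes", [])
        if "feature_uri" in index
    ]


# Hypothetical, trimmed-down response body for illustration only.
sample_collection = {
    "vector_indexes": [
        {"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"},
        {"feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1"},
    ]
}
for uri in list_feature_uris(sample_collection):
    print(uri)
```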

Embedding Task Conditioning

Feature search automatically applies task-aware embedding conditioning to instruction-aware models (E5, Gemini) at query time. This means query embeddings are optimized for asymmetric retrieval without any configuration.

How it works:

Index time: Extractors embed documents with retrieval_document task (configurable via embedding_task on the extractor — see Text Extractor)
Query time: Feature search automatically uses retrieval_query task for all query embeddings

This asymmetric pairing (document vs. query) improves retrieval quality by ~10% for instruction-aware models like E5-Large. The embedding_task used is included in the stage response metadata:

{
  "metadata": {
    "embedding_task": "retrieval_query",
    "num_features": 1,
    "fusion_strategy": "rrf",
    "total_results": 25
  }
}

Task-aware models:

Model	Task Support	Used By
E5-Large (`intfloat_e5_large_instruct_v1`)	Prefix-based (`"query: "` / `"passage: "`)	`text_extractor`, `multimodal_extractor` transcription
Gemini Embedding 2	Instruction-based	`universal_extractor`
Vertex Multimodal	Not task-aware (ignored)	`multimodal_extractor` visual
SigLIP / CLIP	Not task-aware (ignored)	`image_extractor`
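
For prefix-based models such as E5, conditioning amounts to prepending a task-specific string before embedding. The sketch below illustrates the idea using the prefixes from the table. The mapping of `retrieval_query` to `"query: "` and `retrieval_document` to `"passage: "` is an inference from this page, and `condition_text` is a hypothetical helper; the platform applies this step for you.

```python
# Prefixes from the table above; the task-to-prefix mapping is an assumption.
TASK_PREFIXES = {
    "retrieval_query": "query: ",
    "retrieval_document": "passage: ",
}


def condition_text(text, task, task_aware=True):
    """Prepend the task prefix for task-aware models; leave other input untouched."""
    if not task_aware:
        return text  # e.g. Vertex Multimodal or SigLIP, where the task is ignored
    return TASK_PREFIXES[task] + text


print(condition_text("waterproof hiking boots", "retrieval_query"))
print(condition_text("These boots use a sealed membrane...", "retrieval_document"))
print(condition_text("waterproof hiking boots", "retrieval_query", task_aware=False))
```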

Related Stages

Attribute Filter - Metadata-based filtering
Rerank - Neural re-ranking
Query Expand - Query expansion before search
MMR - Diversity-optimized selection
