Unified semantic and hybrid search across multiple embedding features with configurable fusion strategies
The Feature Search stage is the primary search stage for retrieval pipelines. It performs vector similarity search across one or more embedding features, supporting single-modal, multimodal, and hybrid search patterns. Results from multiple searches are fused using configurable strategies (RRF, DBSF, weighted, max, or learned).
Stage Category: FILTER (retrieves documents)
Transformation: 0 documents → N documents (retrieves from the collection based on vector similarity)
Multi-content — embed multiple files together in one API call. Only valid when the feature_uri points to an extractor whose vector index has supports_multi_query=True (currently: gemini_multifile_extractor). Attempting this with any other feature URI returns a 400 error.
Each item in values is auto-detected: URLs (http://, https://, s3://) are fetched and embedded as files; all other strings are embedded as text. All items are passed to the underlying model in one call, producing a single query vector that mirrors how objects were indexed.
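The auto-detection rule can be sketched as follows. This is an illustrative reproduction of the documented behavior, not the platform's actual code:

```python
def detect_kind(item: str) -> str:
    """Classify a query item the way the docs describe:
    URL schemes are fetched and embedded as files; everything else is text."""
    return "file" if item.startswith(("http://", "https://", "s3://")) else "text"

# A mixed multi-content query: one file plus one text prompt,
# embedded together into a single query vector.
values = ["s3://my-bucket/report.pdf", "summarize quarterly revenue"]
kinds = [detect_kind(v) for v in values]  # ["file", "text"]
```

Because all items go to the model in one call, the resulting query vector reflects the combined content, mirroring how multi-file objects were indexed.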
When searching with a large file (video, PDF, long document) as input, query_preprocessing decomposes the file into chunks using the same extractor pipeline that indexed your data, runs parallel searches for each chunk, and fuses the results. This is ingestion applied to the query: the same decomposition and embedding, but the vectors are used for search instead of storage.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `feature_uri` | string | null | Extractor pipeline for decomposition (inherits from the search `feature_uri` if not set) |
| `params` | object | null | Extractor parameters (identical schema to the collection’s extractor config for that `feature_uri`) |
| `max_chunks` | integer | 20 | Maximum chunks to search (1-100). Each chunk = 1 credit |
| `aggregation` | string | `rrf` | Fusion strategy: `rrf`, `max`, or `avg` |
| `dedup_field` | string | null | Field to deduplicate results by |
params uses the extractor’s own parameter schema. Whatever parameters the extractor accepts during ingestion (e.g. split_method, time_split_interval for video; chunk_size, chunk_overlap for text) are the same parameters you pass here. There is no separate preprocessing-specific schema — the extractor drives the decomposition exactly as it would during collection processing. Refer to the extractor’s own documentation for valid parameter names.
You can also set query_preprocessing at the stage level (on parameters) to apply it to all searches as a default. Per-search settings override the stage default.
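As a sketch, a per-search `query_preprocessing` block for a video feature might look like this. The parameter values are illustrative and the `dedup_field` value is hypothetical; valid `params` keys come from the extractor's own schema:

```json
{
  "query_preprocessing": {
    "params": {
      "split_method": "time",
      "time_split_interval": 30
    },
    "max_chunks": 20,
    "aggregation": "rrf",
    "dedup_field": "document_id"
  }
}
```

The same block placed on the stage's `parameters` becomes the default for all searches, with per-search settings taking precedence.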
Aggregation strategies:

| Strategy | Best For | How It Works |
|----------|----------|--------------|
| `rrf` | General purpose (recommended) | Rank-based fusion, immune to score magnitude differences |
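The rank-based fusion behind `rrf` can be sketched with standard Reciprocal Rank Fusion. This uses the conventional `k = 60` constant; the platform's exact implementation may differ:

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over every result list it appears in; higher is better.
    Only ranks matter, so score-magnitude differences between lists are irrelevant."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well across several chunk searches outranks one that
# appears in only a single search.
fused = rrf_fuse([["a", "b"], ["a", "c"], ["b", "a"]])
```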
If you don’t specify `params`, the extractor’s defaults are used.
The response includes preprocessing metadata showing what happened:
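A hypothetical example of what that metadata might contain (field names here are illustrative, not guaranteed; check an actual response for the exact shape):

```json
{
  "preprocessing": {
    "chunks_searched": 12,
    "max_chunks": 20,
    "aggregation": "rrf"
  }
}
```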
For best performance, use pre-filters to reduce the search space. Filtering at the vector index level is much faster than post-filtering in later stages.
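As an illustrative sketch only (the filter syntax and the `filters` key are hypothetical; consult the stage's parameter reference for the actual operators), a pre-filter narrows the candidate set before the vector search runs:

```json
{
  "stage_id": "feature_search",
  "parameters": {
    "feature_uri": "my_namespace/text_extractor",
    "filters": { "field": "category", "operator": "eq", "value": "reports" }
  }
}
```

Applying the condition inside the vector index means fewer vectors are scored, rather than discarding results after an unrestricted search.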
The following is a complete working example of creating a retriever that uses the feature_search stage, then executing it. Pay close attention to the field names — several are easy to confuse.
Common mistakes:

- Use `collection_identifiers` (not `collection_ids`) in the retriever body.
- Use `type: "text"` (not `"string"`) in `input_schema` values.
- `stage_type` at the outer level must be `"filter"`.
- `stage_id: "feature_search"` lives inside the `config` object, not as a top-level `stage_id`.
- `final_top_k` lives inside `config.parameters`, not at the top level.
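Putting the field names above together, a minimal retriever body might look like this. The overall shape is a sketch: the highlighted field names and their placements are the ones called out in the list above, while `retriever_name`, the feature URI, and the collection name are placeholders:

```json
{
  "retriever_name": "my_feature_search_retriever",
  "collection_identifiers": ["my_collection"],
  "input_schema": {
    "query": { "type": "text" }
  },
  "stages": [
    {
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "feature_uri": "my_namespace/text_extractor",
          "final_top_k": 10
        }
      }
    }
  ]
}
```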
The feature_uri must match an embedding index that exists in your namespace. To discover available feature URIs, list the vector indexes in a collection:
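A sketch of that listing call (the route, variables, and response shape are hypothetical; substitute the actual endpoint from the API reference):

```shell
# Hypothetical route: list the vector indexes on a collection to find
# the feature URIs available for feature_search.
curl -s -H "Authorization: Bearer $API_KEY" \
  "$BASE_URL/v1/collections/my_collection/vector-indexes"
```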