[Image: Feature Search stage showing multi-vector semantic search with fusion]
The Feature Search stage is the primary search stage for retrieval pipelines. It performs vector similarity search across one or more embedding features, supporting single-modal, multimodal, and hybrid search patterns. Results from multiple searches are fused using configurable strategies (RRF, DBSF, weighted, max, or learned).
Stage Category: FILTER (retrieves documents)
Transformation: 0 documents → N documents (retrieves from collection based on vector similarity)

When to Use

| Use Case | Description |
| --- | --- |
| Semantic search | Find documents similar in meaning to a query |
| Image search | Search by image embeddings |
| Video search | Search by video frame embeddings |
| Multimodal search | Combine text + image + video in one query |
| Hybrid search | Fuse results from multiple embedding types |
| Decompose/recompose | Group results by parent document |
| Faceted search | Get result counts by field values |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Exact field matching | attribute_filter |
| Full-text keyword search | Combine with text features |
| No embeddings in collection | attribute_filter |
| Post-search filtering only | Use after feature_search |

Core Concepts

Feature URIs

Feature URIs identify which embedding index to search. They follow the pattern:
mixpeek://{extractor_name}@{version}/{output_name}
Examples:
  • mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding - Multimodal text/image embeddings
  • mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1 - Text-only embeddings
  • mixpeek://image_extractor@v1/embedding - Image embeddings
  • mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding - Video frame embeddings
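The three-part pattern can be pulled apart with a small regex. A minimal sketch, assuming the character classes shown (the helper and group names are illustrative, not an SDK API):

```python
import re

# Parse mixpeek://{extractor_name}@{version}/{output_name} into its parts.
# The allowed character classes are an assumption based on the examples above.
FEATURE_URI_RE = re.compile(
    r"^mixpeek://(?P<extractor>[\w-]+)@(?P<version>[\w.-]+)/(?P<output>[\w-]+)$"
)

def parse_feature_uri(uri: str) -> dict:
    m = FEATURE_URI_RE.match(uri)
    if m is None:
        raise ValueError(f"not a valid feature URI: {uri}")
    return m.groupdict()

print(parse_feature_uri(
    "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
))
# → {'extractor': 'text_extractor', 'version': 'v1',
#    'output': 'multilingual_e5_large_instruct_v1'}
```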

Fusion Strategies

When searching multiple features, results are combined using fusion:
| Strategy | Description | Best For |
| --- | --- | --- |
| rrf | Reciprocal Rank Fusion | General purpose, balanced results |
| dbsf | Distribution-Based Score Fusion | When scores have different distributions |
| weighted | Weighted combination | When you know relative importance |
| max | Maximum score wins | When any match is sufficient |
| learned | ML-based fusion | Optimized from interaction data |
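To build intuition for the default strategy, here is a minimal sketch of Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per document, so documents ranked well by several searches rise to the top. The constant k=60 comes from the original RRF paper and is an assumption here, not a documented Mixpeek default.

```python
# Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

text_hits  = ["doc_a", "doc_b", "doc_c"]   # ranking from a text search
image_hits = ["doc_b", "doc_d", "doc_a"]   # ranking from an image search
print(rrf_fuse([text_hits, image_hits]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF only looks at ranks, it is immune to the two searches producing scores on different scales, which is why it is the safe general-purpose default.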

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searches | array | Required | Array of search configurations |
| final_top_k | integer | 25 | Total results to return after fusion |
| fusion | string | rrf | Fusion strategy for multi-search |
| group_by | object | null | Group results by field |
| facets | array | null | Fields to compute facet counts |

Search Object Parameters

Each item in the searches array supports:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| feature_uri | string | Required | Embedding index to search |
| query | string/object | Required | Query text or embedding |
| top_k | integer | 100 | Candidates per search |
| filters | object | null | Pre-filter conditions |
| weight | number | 1.0 | Weight for fusion (weighted strategy) |
| query_preprocessing | object | null | Large file decomposition config |

Query Input Modes

The query field on each search object accepts either a plain string (shorthand for text mode) or an object with an explicit input_mode:
| Mode | input_mode | Value | Supported by |
| --- | --- | --- | --- |
| Text | "text" | Plain text string | All text-capable extractors |
| Content | "content" | Single URL or base64 data URI | All multimodal extractors |
| Document | "document" | Reference to an existing document | All extractors |
| Vector | "vector" | Pre-computed embedding (list of floats) | All extractors |
| Multi-content | "multi_content" | List of URLs and/or text strings | gemini_multifile_extractor only |
Text — embed a string and search:
{"input_mode": "text", "value": "{{INPUT.query}}"}
Content — fetch a URL and embed it:
{"input_mode": "content", "value": "{{INPUT.image_url}}"}
Vector — use a pre-computed embedding directly (no inference at query time):
{"input_mode": "vector", "value": "{{INPUT.embedding}}"}
Multi-content — embed multiple files together in one API call. Only valid when the feature_uri points to an extractor whose vector index has supports_multi_query=True (currently: gemini_multifile_extractor). Attempting this with any other feature URI returns a 400 error.
{
  "input_mode": "multi_content",
  "values": ["{{INPUT.image_url}}", "{{INPUT.description}}"]
}
Each item in values is auto-detected: URLs (http://, https://, s3://) are fetched and embedded as files; all other strings are embedded as text. All items are passed to the underlying model in one call, producing a single query vector that mirrors how objects were indexed.
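The auto-detection rule above is simple enough to sketch: items with a URL scheme are treated as files, everything else as text. The function name and return shape are illustrative, not part of the API:

```python
# Classify each multi_content value the way the docs describe:
# URL-prefixed strings become file inputs, the rest become text inputs.
URL_PREFIXES = ("http://", "https://", "s3://")

def classify_multi_content(values):
    return [
        ("file" if v.startswith(URL_PREFIXES) else "text", v)
        for v in values
    ]

print(classify_multi_content([
    "https://example.com/shoe.jpg",
    "red running shoe with white sole",
]))
# → [('file', 'https://example.com/shoe.jpg'),
#    ('text', 'red running shoe with white sole')]
```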

Query Preprocessing

When searching with large files (videos, PDFs, long documents) as input, query_preprocessing decomposes the file into chunks using the same extractor pipeline that indexed your data, runs parallel searches for each chunk, and fuses the results. This is ingestion applied to the query — same decomposition and embedding, but vectors are used for search instead of storage.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| feature_uri | string | null | Extractor pipeline for decomposition (inherits from search feature_uri if not set) |
| params | object | null | Extractor parameters; identical schema to the collection's extractor config for that feature_uri |
| max_chunks | integer | 20 | Max chunks to search (1-100). Each chunk = 1 credit |
| aggregation | string | rrf | Fusion strategy: rrf, max, or avg |
| dedup_field | string | null | Field to deduplicate results by |
params uses the extractor’s own parameter schema. Whatever parameters the extractor accepts during ingestion (e.g. split_method, time_split_interval for video; chunk_size, chunk_overlap for text) are the same parameters you pass here. There is no separate preprocessing-specific schema — the extractor drives the decomposition exactly as it would during collection processing. Refer to the extractor’s own documentation for valid parameter names.
You can also set query_preprocessing at the stage level (on parameters) to apply it to all searches as a default. Per-search settings override the stage default.
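The override rule is a simple precedence check: a per-search query_preprocessing block wins; otherwise the stage-level default applies. A minimal sketch (resolve_preprocessing is a hypothetical helper, not an SDK function):

```python
# Resolve the effective query_preprocessing config for one search object.
# Per-search settings override the stage-level default.
def resolve_preprocessing(stage_params, search):
    return search.get("query_preprocessing") or stage_params.get("query_preprocessing")

stage = {"query_preprocessing": {"max_chunks": 20, "aggregation": "rrf"}}
s1 = {"feature_uri": "...", "query": "..."}  # no override: inherits stage default
s2 = {"feature_uri": "...", "query": "...",
      "query_preprocessing": {"max_chunks": 5, "aggregation": "max"}}

print(resolve_preprocessing(stage, s1))  # → {'max_chunks': 20, 'aggregation': 'rrf'}
print(resolve_preprocessing(stage, s2))  # → {'max_chunks': 5, 'aggregation': 'max'}
```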
Aggregation strategies:
| Strategy | Best For | How It Works |
| --- | --- | --- |
| rrf | General purpose (recommended) | Rank-based fusion, immune to score magnitude differences |
| max | "Find this exact moment" | Keeps highest score per document across chunks |
| avg | "Find similar overall content" | Averages scores; consistent matches win |
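The max and avg strategies can be sketched directly: collect each document's scores across the per-chunk hit lists, then reduce. This is an illustration of the semantics, not the server implementation:

```python
# Combine per-chunk hit lists into one score per document.
# chunk_hits: one list of (doc_id, score) pairs per query chunk.
def aggregate_chunk_scores(chunk_hits, strategy="max"):
    per_doc = {}
    for hits in chunk_hits:
        for doc_id, score in hits:
            per_doc.setdefault(doc_id, []).append(score)
    if strategy == "max":   # "find this exact moment"
        return {d: max(s) for d, s in per_doc.items()}
    if strategy == "avg":   # "find similar overall content"
        return {d: sum(s) / len(s) for d, s in per_doc.items()}
    raise ValueError(f"unknown strategy: {strategy}")

chunks = [[("doc_a", 0.9), ("doc_b", 0.5)],   # hits for chunk 0
          [("doc_a", 0.3), ("doc_b", 0.6)]]   # hits for chunk 1
print(aggregate_chunk_scores(chunks, "max"))  # doc_a: 0.9, doc_b: 0.6
print(aggregate_chunk_scores(chunks, "avg"))  # doc_a: ~0.6, doc_b: ~0.55
```

Note how the two strategies disagree: max prefers doc_a (one strong chunk match), while avg narrows the gap because doc_b matches more consistently.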

Configuration Examples

{
  "stage_type": "filter",
  "stage_id": "feature_search",
  "parameters": {
    "searches": [
      {
        "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
        "query": "{{INPUT.query}}",
        "top_k": 100
      }
    ],
    "final_top_k": 25
  }
}

Query Preprocessing Examples

Search with a large video — decompose it into 10-second segments, search each, and fuse:
{
  "stage_type": "filter",
  "stage_id": "feature_search",
  "parameters": {
    "searches": [
      {
        "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
        "query": {"input_mode": "content", "value": "{{INPUT.video}}"},
        "top_k": 100,
        "query_preprocessing": {
          "params": {"split_method": "time", "time_split_interval": 10},
          "max_chunks": 20,
          "aggregation": "max"
        }
      }
    ],
    "final_top_k": 25
  }
}
Preprocessing uses the same extractor pipeline that indexed your data. The params accept the same fields you configured on your collection’s feature extractor (e.g., split_method, chunk_size). If you don’t specify params, extractor defaults are used.
The response includes preprocessing metadata showing what happened:
{
  "metadata": {
    "preprocessing": {
      "content_type": "video/mp4",
      "extractor": "multimodal_extractor@v1",
      "chunks_generated": 18,
      "chunks_searched": 18,
      "aggregation": "rrf",
      "preprocessing_ms": 12450
    }
  }
}
Each result also includes query_chunks showing which parts of your query matched:
{
  "document_id": "doc_abc123",
  "score": 0.89,
  "query_chunks": [
    {"chunk_index": 0, "start_ms": 0, "end_ms": 10000, "score": 0.92},
    {"chunk_index": 2, "start_ms": 20000, "end_ms": 30000, "score": 0.87}
  ]
}

Grouping (Decompose/Recompose)

When documents are decomposed into chunks (e.g., video frames, document pages), use group_by to recompose results by parent:
{
  "group_by": {
    "field": "metadata.parent_id",
    "limit": 10,
    "group_size": 3
  }
}
| Parameter | Description |
| --- | --- |
| field | Field to group by (e.g., parent document ID) |
| limit | Maximum number of groups to return |
| group_size | Maximum documents per group |
Use cases:
  • Video search: Group frames by video, return top 3 frames per video
  • Document search: Group chunks by document, return best chunks per doc
  • Product search: Group variants by product family
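The recompose step amounts to bucketing a score-ordered flat hit list by the parent field, capping each bucket at group_size and keeping the first limit groups. A minimal sketch, assuming hits arrive sorted by score descending (so the first-seen groups are the best ones); this illustrates the behavior, not the server implementation:

```python
# Group flat chunk hits by a parent field, honoring limit and group_size.
def group_results(hits, field, limit, group_size):
    groups = {}
    for hit in hits:                          # hits already score-descending
        bucket = groups.setdefault(hit[field], [])
        if len(bucket) < group_size:
            bucket.append(hit)
    # Keep only the first `limit` groups (best-first by insertion order).
    return dict(list(groups.items())[:limit])

hits = [
    {"doc": "f1", "video_id": "vid_a", "score": 0.95},
    {"doc": "f2", "video_id": "vid_b", "score": 0.90},
    {"doc": "f3", "video_id": "vid_a", "score": 0.88},
    {"doc": "f4", "video_id": "vid_a", "score": 0.80},
]
print(group_results(hits, "video_id", limit=2, group_size=2))
# vid_a keeps its top 2 frames (f1, f3); vid_b keeps f2
```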

Facets

Get counts of results by field values for building filter UIs:
{
  "facets": ["metadata.category", "metadata.brand", "metadata.price_range"]
}
Response includes:
{
  "facets": {
    "metadata.category": [
      {"value": "electronics", "count": 45},
      {"value": "clothing", "count": 23}
    ],
    "metadata.brand": [
      {"value": "Apple", "count": 12},
      {"value": "Samsung", "count": 8}
    ]
  }
}
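Facet counting is a tally of distinct values per requested field over the result set. A client-side sketch that mirrors the response shape above (compute_facets is a hypothetical helper, not an SDK function):

```python
from collections import Counter

# Tally distinct values per facet field, most common first,
# matching the {"value": ..., "count": ...} response shape.
def compute_facets(results, facet_fields):
    facets = {}
    for field in facet_fields:
        key = field.split(".", 1)[-1]  # "metadata.category" → "category"
        counts = Counter(r["metadata"].get(key) for r in results)
        facets[field] = [{"value": v, "count": c}
                         for v, c in counts.most_common() if v is not None]
    return facets

results = [
    {"metadata": {"category": "electronics", "brand": "Apple"}},
    {"metadata": {"category": "electronics", "brand": "Samsung"}},
    {"metadata": {"category": "clothing"}},
]
print(compute_facets(results, ["metadata.category"]))
# → {'metadata.category': [{'value': 'electronics', 'count': 2},
#                          {'value': 'clothing', 'count': 1}]}
```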

Filter Syntax

Pre-filters use boolean logic with AND/OR/NOT:
{
  "filters": {
    "AND": [
      {"field": "metadata.status", "operator": "eq", "value": "active"},
      {
        "OR": [
          {"field": "metadata.category", "operator": "eq", "value": "tech"},
          {"field": "metadata.category", "operator": "eq", "value": "science"}
        ]
      }
    ]
  }
}

Supported Operators

| Operator | Description | Example |
| --- | --- | --- |
| eq | Equals | {"field": "status", "operator": "eq", "value": "active"} |
| ne | Not equals | {"field": "status", "operator": "ne", "value": "deleted"} |
| gt | Greater than | {"field": "price", "operator": "gt", "value": 100} |
| gte | Greater than or equal | {"field": "rating", "operator": "gte", "value": 4} |
| lt | Less than | {"field": "age", "operator": "lt", "value": 30} |
| lte | Less than or equal | {"field": "count", "operator": "lte", "value": 10} |
| in | In array | {"field": "category", "operator": "in", "value": ["a", "b"]} |
| nin | Not in array | {"field": "status", "operator": "nin", "value": ["deleted", "archived"]} |
| contains | Contains substring | {"field": "title", "operator": "contains", "value": "guide"} |
| exists | Field exists | {"field": "metadata.optional", "operator": "exists", "value": true} |
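To make the AND/OR/NOT semantics concrete, here is a client-side evaluator for the filter tree using the operators above. The server applies these at the vector index; this sketch only illustrates the semantics (the NOT handling assumes a single child node, which is an assumption, not documented behavior):

```python
import operator

# Map operator names from the table above to Python predicates.
OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
    "in": lambda v, arr: v in arr,
    "nin": lambda v, arr: v not in arr,
    "contains": lambda v, sub: sub in (v or ""),
    "exists": lambda v, want: (v is not None) == want,
}

def matches(doc, node):
    # Boolean combinators recurse; leaf nodes apply an operator.
    if "AND" in node:
        return all(matches(doc, n) for n in node["AND"])
    if "OR" in node:
        return any(matches(doc, n) for n in node["OR"])
    if "NOT" in node:
        return not matches(doc, node["NOT"])
    return OPS[node["operator"]](doc.get(node["field"]), node["value"])

doc = {"metadata.status": "active", "metadata.category": "science"}
f = {"AND": [
    {"field": "metadata.status", "operator": "eq", "value": "active"},
    {"OR": [
        {"field": "metadata.category", "operator": "eq", "value": "tech"},
        {"field": "metadata.category", "operator": "eq", "value": "science"},
    ]},
]}
print(matches(doc, f))  # → True
```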

Performance

| Metric | Value |
| --- | --- |
| Latency (single search) | 10-50ms |
| Latency (multi-search with fusion) | 20-80ms |
| Optimal top_k | 100-500 per search |
| Maximum top_k | 10,000 per search |
| Fusion overhead | < 5ms |
For best performance, use pre-filters to reduce the search space. Filtering at the vector index level is much faster than post-filtering in later stages.

Common Pipeline Patterns

Basic Search + Rerank

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 100
        }
      ],
      "final_top_k": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  }
]

Multimodal Search + Filter + Limit

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 100
        },
        {
          "feature_uri": "mixpeek://image_extractor@v1/embedding",
          "query": "{{INPUT.image}}",
          "top_k": 100
        }
      ],
      "fusion": "rrf",
      "final_top_k": 50
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
      "field": "metadata.in_stock",
      "operator": "eq",
      "value": true
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "limit": 20
    }
  }
]

Video Search with Frame Grouping

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 500
        }
      ],
      "group_by": {
        "field": "metadata.video_id",
        "limit": 10,
        "group_size": 5
      },
      "final_top_k": 50
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Summarize why these video segments match the query"
    }
  }
]

E-commerce Search with Facets

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 200,
          "filters": {
            "AND": [
              {"field": "metadata.in_stock", "operator": "eq", "value": true},
              {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
            ]
          }
        }
      ],
      "facets": ["metadata.category", "metadata.brand", "metadata.color"],
      "final_top_k": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "sort_attribute",
    "parameters": {
      "sort_field": "{{INPUT.sort_by}}",
      "order": "{{INPUT.sort_order}}"
    }
  }
]

Output Schema

Each result includes:
| Field | Type | Description |
| --- | --- | --- |
| document_id | string | Unique document identifier |
| score | float | Combined similarity score |
| content | string | Document content |
| metadata | object | Document metadata |
| features | object | Feature data and scores per search |
Example output:
{
  "document_id": "doc_abc123",
  "score": 0.892,
  "content": "Document content here...",
  "metadata": {
    "title": "Example Document",
    "category": "tech",
    "created_at": "2024-01-15T10:30:00Z"
  },
  "features": {
    "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding": {
      "score": 0.91
    },
    "mixpeek://image_extractor@v1/embedding": {
      "score": 0.87
    }
  }
}

Comparison: feature_search vs attribute_filter

| Aspect | feature_search | attribute_filter |
| --- | --- | --- |
| Purpose | Semantic similarity | Exact matching |
| Input | Query text/embedding | Field conditions |
| Scoring | Vector similarity | Binary match |
| Speed | 10-50ms | 5-20ms |
| Use when | Finding similar content | Filtering by metadata |

Error Handling

| Error | Behavior |
| --- | --- |
| Invalid feature_uri | Stage fails with error |
| Empty query | Returns empty results |
| Filter syntax error | Stage fails with error |
| No matching documents | Returns empty results |

Complete Example

The following is a complete working example of creating a retriever that uses the feature_search stage, then executing it. Pay close attention to the field names; several are easy to confuse.
Common mistakes:
  • Use collection_identifiers (not collection_ids) in the retriever body.
  • Use type: "text" (not "string") in input_schema values.
  • stage_type at the outer level must be "filter".
  • stage_id: "feature_search" lives inside the config object, not at the outer stage_id.
  • final_top_k lives inside config.parameters, not at the top level.

Step 1 — Create the Retriever

curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "my-semantic-retriever",

    "collection_identifiers": ["col_abc123"],

    "input_schema": {
      "query": {
        "type": "text",
        "description": "Search query",
        "required": true
      }
    },

    "stages": [
      {
        "stage_name": "Semantic Search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": {
                  "input_mode": "text",
                  "value": "{{INPUT.query}}"
                },
                "top_k": 50
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'

Step 2 — Execute the Retriever

curl -X POST "https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "query": "machine learning for fashion brand compliance"
    }
  }'
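The execute call can also be made from Python with only the standard library. The endpoint path and headers come from the curl example above; the helper names, the retriever ID, and the env-var usage are illustrative:

```python
import json
import os
import urllib.request

# Build the URL and JSON body for POST /v1/retrievers/{id}/execute.
def build_execute_request(retriever_id, inputs):
    url = f"https://api.mixpeek.com/v1/retrievers/{retriever_id}/execute"
    return url, {"inputs": inputs}

# Send the request with the same auth headers as the curl example.
def execute_retriever(retriever_id, inputs, api_key, namespace_id):
    url, body = build_execute_request(retriever_id, inputs)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return json.load(resp)
```

Usage would look like execute_retriever("ret_123", {"query": "machine learning for fashion brand compliance"}, os.environ["API_KEY"], os.environ["NAMESPACE_ID"]), where "ret_123" stands in for your real retriever ID.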

Finding the Right feature_uri

The feature_uri must match an embedding index that exists in your namespace. To discover available feature URIs, list the vector indexes in a collection:
curl "https://api.mixpeek.com/v1/collections/$COLLECTION_ID" \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID"
Each item in the vector_indexes array has a feature_uri field — use that value directly in your retriever stage.