Taxonomies

Taxonomies let you attach structured metadata to documents by matching them against a reference collection. They are implemented as retriever-powered joins and can run on demand or be materialized into collections. Taxonomies are warehouse-native enrichment: the multimodal equivalent of a SQL JOIN, linking documents to canonical entities via embedding similarity rather than key equality.

Taxonomy Types

Type	Structure	When to Use
Flat	Single-level reference collection	Face enrollment, entity linking, simple lookups
Hierarchical	Parent/child nodes with inheritance	Org charts, product categories, multi-level labeling

Each node references a collection, retriever, and list of enrichment fields. Child nodes inherit parent properties automatically.

Flat Taxonomy: Product Catalog Recognition

Figure: Flat taxonomy showing multimodal documents matched against a product catalog reference collection.

In a flat taxonomy, documents from any modality (video, image, audio, text) are matched against a single reference collection. Each document uses its appropriate feature embedding (CLIP for visual, text embeddings for audio transcripts) to find the best match. Enrichment fields (SKU, category, price) are attached when similarity exceeds the threshold.
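
The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not Mixpeek's implementation: the `cosine` and `enrich_flat` helpers, the two-dimensional embeddings, the catalog rows, and the 0.8 threshold are all assumptions made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def enrich_flat(doc, reference, fields, threshold=0.8):
    """Copy `fields` from the best reference match onto a copy of `doc`.

    Hypothetical helper: the document is returned unchanged when no
    reference row scores above `threshold`.
    """
    enriched = dict(doc)
    scored = [(cosine(doc["embedding"], ref["embedding"]), ref) for ref in reference]
    if not scored:
        return enriched
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score <= threshold:
        return enriched
    for name in fields:
        if name in best:
            enriched[name] = best[name]
    return enriched

# Toy reference collection standing in for a product catalog.
catalog = [
    {"embedding": [1.0, 0.0], "sku": "SHOE-001", "category": "footwear", "price": 120},
    {"embedding": [0.0, 1.0], "sku": "BAG-002", "category": "bags", "price": 80},
]

# Close to the first catalog row, so it is enriched.
matched = enrich_flat(
    {"document_id": "doc_1", "embedding": [0.9, 0.1]},
    catalog,
    ["sku", "category", "price"],
)

# Equally far from both rows and below the threshold, so it is left as-is.
unmatched = enrich_flat(
    {"document_id": "doc_2", "embedding": [0.7, 0.7]},
    catalog,
    ["sku", "category", "price"],
)
```

In a real deployment the similarity search would run inside the taxonomy's retriever against stored embeddings; the linear scan here only shows the thresholding and field-copy logic.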

Hierarchical Taxonomy: Media Content Classification

In a hierarchical taxonomy, documents pass through multiple levels of progressive refinement: from a broad brand classification (1 node), through content category (2 nodes), sport or style type (4 nodes), and audience segmentation (5 nodes), down to specific campaigns (6 nodes). Each level narrows the classification using different multimodal features: CLIP for brand detection, scene classification for categories, activity detection for sport types, demographic models for audiences, and campaign-specific patterns at the final level. Documents inherit all properties from parent nodes as they move down the tree.

Execution Modes

Mode	Description	Use Case
`on_demand`	Enrich documents at query time inside a retriever (`taxonomy_enrich` stage)	Exploratory workflows, testing, dynamic reference data
`materialize`	Batch enrichment after extraction; results persisted in the collection	Production search, low-latency retrieval, analytics
`retroactive`	Apply taxonomy to existing documents in a collection	Backfilling, taxonomy updates, schema migrations

Configure the execution mode in a collection’s taxonomy_applications array, or add a taxonomy_enrich stage to a retriever for on-demand enrichment.

How Hierarchical Taxonomies Execute

Hierarchical taxonomies execute like chained Common Table Expressions (CTEs) in SQL: each level builds on the results of the previous level, forming an evaluation chain from root to leaf nodes.

Level 1 (Root)     →  Match against brand collection
    ↓ passes matched docs
Level 2            →  Match against category collection (filtered by L1 result)
    ↓ passes matched docs
Level 3            →  Match against subcategory collection (filtered by L2 result)
    ↓ passes matched docs
Level N (Leaf)     →  Final enrichment fields attached

At each level:

Documents that matched the parent node are passed down
The child node’s retriever executes against its reference collection
Enrichment fields from matching nodes are accumulated
Only documents exceeding the similarity threshold continue to child nodes

This CTE-style execution ensures that a document classified as “Nike → Athletic → Running” inherits enrichment fields from all three levels, not just the leaf node.
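
The level-by-level evaluation above can be sketched as a small tree walk. This is an illustrative model only: the `Node` class, the tag-based `has_tag` scorer (a stand-in for each node's retriever), the 0.5 threshold, and the rule of following only the best-scoring child are assumptions, not the platform's actual behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Node:
    node_id: str
    parent_id: Optional[str]
    scorer: Callable[[dict], float]  # stand-in for the node's retriever
    fields: Dict[str, str]           # enrichment fields contributed by this node

def has_tag(tag: str) -> Callable[[dict], float]:
    """Toy scorer: 1.0 if the document carries the tag, else 0.0."""
    return lambda doc: 1.0 if tag in doc["tags"] else 0.0

def traverse(doc: dict, nodes: List[Node], threshold: float = 0.5) -> Tuple[List[str], Dict[str, str]]:
    """Walk from the root level down, accumulating fields from each matched node."""
    path: List[str] = []
    enriched: Dict[str, str] = {}
    parent: Optional[str] = None
    while True:
        # Only children of the node matched at the previous level are candidates.
        candidates = [n for n in nodes if n.parent_id == parent]
        scored = [(n.scorer(doc), n) for n in candidates]
        passing = [(s, n) for s, n in scored if s > threshold]
        if not passing:
            break  # nothing exceeds the threshold, so descent stops here
        _, best = max(passing, key=lambda pair: pair[0])
        enriched.update(best.fields)
        path.append(best.node_id)
        parent = best.node_id
    return path, enriched

NODES = [
    Node("nike", None, has_tag("nike"), {"brand": "Nike"}),
    Node("athletic", "nike", has_tag("athletic"), {"category": "Athletic"}),
    Node("lifestyle", "nike", has_tag("lifestyle"), {"category": "Lifestyle"}),
    Node("running", "athletic", has_tag("running"), {"sport": "Running"}),
]

full_path, full_fields = traverse({"tags": {"nike", "athletic", "running"}}, NODES)
partial_path, partial_fields = traverse({"tags": {"nike"}}, NODES)
```

The first document reaches the leaf and carries fields from all three levels; the second stops at the root because no child exceeds the threshold, so it keeps only the brand field.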

Application Methods

Hierarchical taxonomies can be applied through three methods:

Method	When It Runs	Use Case
On-demand	Query time, as a retriever stage	Dynamic classification, A/B testing taxonomy versions, low-volume queries
Materialized	During collection processing (post-extraction)	Production search requiring low latency, analytics dashboards
Retroactive	Manually triggered via API	Backfilling existing documents, applying updated taxonomy versions

On-demand enrichment adds a taxonomy_enrich stage to your retriever pipeline:

{
  "stage_type": "enrich",
  "stage_id": "taxonomy_enrich",
  "parameters": {
    "taxonomy_id": "tax_org_roles",
    "input_mappings": {
      "query_embedding": "{{DOC.mixpeek://face_detector@v2/face_embedding}}"
    }
  }
}

Materialized enrichment runs automatically after document extraction completes. Configure it in the collection’s taxonomy_applications:

{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_product_hierarchy",
      "execution_mode": "materialize"
    }
  ]
}

Retroactive enrichment applies a taxonomy to documents already in the collection:

curl -sS -X POST "$MP_API_URL/v1/collections/<collection_id>/apply-taxonomy" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_id": "tax_product_hierarchy",
    "filter": {
      "field": "metadata.needs_reclassification",
      "operator": "eq",
      "value": true
    }
  }'

Use retroactive application when:

You’ve updated a taxonomy and need to reclassify existing documents
You’re migrating from a flat taxonomy to a hierarchical one
You’ve added new reference data to taxonomy collections
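
For scripted backfills, the same request can be assembled in Python with the standard library. The sketch below mirrors the curl call above and only builds the request object; the `build_apply_taxonomy_request` helper, the `https://api.example.com` base URL, and the `col_products` collection ID are placeholders introduced for illustration.

```python
import json
import urllib.request
from typing import Optional

def build_apply_taxonomy_request(
    api_url: str,
    api_key: str,
    namespace: str,
    collection_id: str,
    taxonomy_id: str,
    doc_filter: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST request for retroactive enrichment."""
    payload = {"taxonomy_id": taxonomy_id}
    if doc_filter is not None:
        payload["filter"] = doc_filter
    return urllib.request.Request(
        url=f"{api_url}/v1/collections/{collection_id}/apply-taxonomy",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values; substitute your own API URL, key, namespace, and IDs.
req = build_apply_taxonomy_request(
    api_url="https://api.example.com",
    api_key="example-key",
    namespace="example-namespace",
    collection_id="col_products",
    taxonomy_id="tax_product_hierarchy",
    doc_filter={
        "field": "metadata.needs_reclassification",
        "operator": "eq",
        "value": True,
    },
)
```

Passing `req` to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect or log before a large backfill is triggered.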

Internals: JOIN Stage

Taxonomies reuse the join@v1 stage under the hood:

Direct join – key-based match (join_type: "direct").
Retriever join – similarity match using a nested retriever (join_type: "retriever").
Join strategies – replace, enrich, left, or append control how fields merge.

Parallel execution (asyncio.gather) makes retriever joins 10–50× faster than sequential lookups, because the per-document lookups run concurrently instead of one after another.
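
To show why concurrent lookups help, here is a small self-contained comparison using `asyncio.gather`. It is a sketch of the general pattern, not the join stage's code: the `lookup` coroutine, its simulated 50 ms latency, and the document IDs are invented for the example, and the measured speedup depends entirely on those made-up numbers.

```python
import asyncio
import time

async def lookup(doc_id: str, delay: float = 0.05) -> dict:
    """Hypothetical per-document retriever call; the sleep simulates I/O latency."""
    await asyncio.sleep(delay)
    return {"document_id": doc_id, "match": f"ref_for_{doc_id}"}

async def sequential(doc_ids):
    # Each lookup waits for the previous one to finish.
    return [await lookup(d) for d in doc_ids]

async def parallel(doc_ids):
    # All lookups are scheduled at once; gather preserves input order.
    return await asyncio.gather(*(lookup(d) for d in doc_ids))

def timed(coro_fn, doc_ids):
    """Run a coroutine function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(doc_ids))
    return results, time.perf_counter() - start

doc_ids = [f"doc_{i}" for i in range(10)]
seq_results, seq_time = timed(sequential, doc_ids)
par_results, par_time = timed(parallel, doc_ids)
```

With I/O-bound lookups like these, the sequential run takes roughly the sum of the delays while the parallel run takes roughly the longest single delay, and both return results in the same order.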

Create a Flat Taxonomy

curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "employee_faces",
    "taxonomy_type": "flat",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_detector@v2/face_embedding"
    },
    "source_collection": {
      "collection_id": "col_employee_embeddings",
      "enrichment_fields": [
        { "field_path": "metadata.name", "merge_mode": "enrich" },
        { "field_path": "metadata.department", "merge_mode": "enrich" }
      ]
    }
  }'

Create a Hierarchical Taxonomy

curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "org_roles",
    "taxonomy_type": "hierarchical",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_detector@v2/face_embedding"
    },
    "hierarchy": [
      {
        "node_id": "employees",
        "collection_id": "col_employee_embeddings",
        "enrichment_fields": ["metadata.employee_id", "metadata.department"]
      },
      {
        "node_id": "executives",
        "collection_id": "col_executives",
        "retriever_id": "ret_executive_face",
        "parent_node_id": "employees",
        "enrichment_fields": ["metadata.executive_level", "metadata.budget_authority"]
      }
    ]
  }'

Hierarchical nodes inherit parent enrichment properties; children can override or extend them.
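
One way to picture inheritance with override is to resolve each node's effective configuration by walking up its parent chain. The sketch below uses the hierarchy from the request above; the `resolve_node` helper and its exact rules (nearest `retriever_id` wins, enrichment fields are merged root-first without duplicates) are assumptions about how such resolution could work, not documented behavior.

```python
def resolve_node(node_id, hierarchy, default_retriever_id):
    """Return the effective retriever and enrichment fields for a node.

    Assumed rules: the nearest retriever_id on the path to the root wins,
    falling back to the taxonomy-level default; enrichment fields are
    merged from root to leaf, skipping duplicates.
    """
    by_id = {node["node_id"]: node for node in hierarchy}

    # Collect the chain from the requested node up to the root.
    chain = []
    current = by_id[node_id]
    while current is not None:
        chain.append(current)
        parent_id = current.get("parent_node_id")
        current = by_id[parent_id] if parent_id else None

    # Override: the first retriever_id found walking upward wins.
    retriever_id = next(
        (n["retriever_id"] for n in chain if "retriever_id" in n),
        default_retriever_id,
    )

    # Extend: parent fields first, then the child's additions.
    fields = []
    for node in reversed(chain):
        for path in node.get("enrichment_fields", []):
            if path not in fields:
                fields.append(path)

    return {"retriever_id": retriever_id, "enrichment_fields": fields}

hierarchy = [
    {
        "node_id": "employees",
        "collection_id": "col_employee_embeddings",
        "enrichment_fields": ["metadata.employee_id", "metadata.department"],
    },
    {
        "node_id": "executives",
        "collection_id": "col_executives",
        "retriever_id": "ret_executive_face",
        "parent_node_id": "employees",
        "enrichment_fields": ["metadata.executive_level", "metadata.budget_authority"],
    },
]

employees = resolve_node("employees", hierarchy, "ret_face_matcher")
executives = resolve_node("executives", hierarchy, "ret_face_matcher")
```

Under these assumptions, `executives` uses its own retriever and ends up with four enrichment fields, while `employees` falls back to the taxonomy-level retriever and keeps its two.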

Attach to a Collection

{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_employee_faces",
      "execution_mode": "materialize"
    },
    {
      "taxonomy_id": "tax_org_roles",
      "execution_mode": "on_demand"
    }
  ]
}

Materialized enrichment updates documents ~30 seconds after ingestion completes (debounced to avoid thrashing).
On-demand enrichment keeps documents untouched; retrievers call the taxonomy join at query time.

Test On Demand

curl -sS -X POST "$MP_API_URL/v1/taxonomies/<taxonomy_id>/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "source_documents": [
      {
        "document_id": "doc_scene_123",
        "mixpeek://face_detector@v2/face_embedding": [0.12, 0.34, ...]
      }
    ],
    "mode": "on_demand"
  }'

Inference Strategies

Manual – Define nodes explicitly (IDs, collections, retrievers).
Schema-based – Infer nodes from existing collection schemas (planned).
Cluster-based – Create nodes from clustering output.
LLM-based – Generate hierarchical structure from sample documents.

Combining strategies is encouraged: bootstrap via inference, then fine-tune manually.

Monitoring

List taxonomies: POST /v1/taxonomies/list
Inspect hierarchy and node metadata: GET /v1/taxonomies/{id}?expand_nodes=true
Track materialized enrichment progress via webhook events (collection.documents.written)
Use retriever analytics to ensure taxonomy stages don’t dominate latency.

Best Practices

Start flat for quick wins; layer hierarchies once value is proven.
Keep enrichment minimal—copy only fields needed at query time.
Cache taxonomy stages in retrievers when reference collections rarely change.
Version taxonomies (via snapshots) before major structural changes.
Combine with clusters to discover candidate nodes and measure coverage.

Taxonomies let you inject domain knowledge into multimodal search—link documents to canonical entities without relying on brittle key joins.
