> ## Documentation Index
> Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Taxonomies

> Enrich documents with similarity-based joins

Taxonomies let you attach structured metadata to documents by matching them against a reference collection. They are implemented as retriever-powered joins and can run on demand or be materialized into collections. Taxonomies are warehouse-native enrichment: the multimodal equivalent of a SQL JOIN, linking documents to canonical entities via embedding similarity rather than key equality.

## Taxonomy Types

| Type         | Structure                           | When to Use                                          |
| ------------ | ----------------------------------- | ---------------------------------------------------- |
| Flat         | Single-level reference collection   | Face enrollment, entity linking, simple lookups      |
| Hierarchical | Parent/child nodes with inheritance | Org charts, product categories, multi-level labeling |

Each node references a collection, retriever, and list of enrichment fields. Child nodes inherit parent properties automatically.

### Flat Taxonomy: Product Catalog Recognition

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/mixpeek-flat-taxonomy.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=9902f166d2881ee94901487218cd4674" alt="Flat taxonomy showing multimodal documents matched against a product catalog reference collection" width="1200" height="750" data-path="assets/mixpeek-flat-taxonomy.svg" />
</Frame>

In a flat taxonomy, documents from any modality (video, image, audio, text) are matched against a single reference collection. Each document uses its appropriate feature embedding (CLIP for visual, text embeddings for audio transcripts) to find the best match. Enrichment fields (SKU, category, price) are attached when similarity exceeds the threshold.

### Hierarchical Taxonomy: Media Content Classification

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/mixpeek-hierarchical-taxonomy.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=35fa9d2cf3a194faaf4b59a62a4dc655" alt="Hierarchical taxonomy showing media content classification with multiple levels from brand to campaign" width="1200" height="920" data-path="assets/mixpeek-hierarchical-taxonomy.svg" />
</Frame>

In a hierarchical taxonomy, documents traverse multiple levels of progressive refinement. Starting from a broad brand classification (1 node), through content category (2 nodes), sport/style type (4 nodes), audience segmentation (5 nodes), to specific campaigns (6 nodes). Each level narrows the classification using different multimodal features—CLIP for brand detection, scene classification for categories, activity detection for sport types, demographic models for audiences, and campaign-specific patterns at the final level. Documents inherit all properties from parent nodes as they traverse down the tree.

## Execution Modes

| Mode          | Description                                                                 | Use Case                                               |
| ------------- | --------------------------------------------------------------------------- | ------------------------------------------------------ |
| `on_demand`   | Enrich documents at query time inside a retriever (`taxonomy_enrich` stage) | Exploratory workflows, testing, dynamic reference data |
| `materialize` | Batch enrichment after extraction; results persisted in the collection      | Production search, low-latency retrieval, analytics    |
| `retroactive` | Apply taxonomy to existing documents in a collection                        | Backfilling, taxonomy updates, schema migrations       |

Configure execution mode via a collection's `taxonomy_applications` array or by adding a taxonomy stage to a [retriever](/retrieval/retrievers).

### How Hierarchical Taxonomies Execute

Hierarchical taxonomies are executed like **Common Table Expressions (CTEs)** in SQL—each level builds on the results of the previous level, creating a recursive evaluation chain from root to leaf nodes.

```
Level 1 (Root)     →  Match against brand collection
    ↓ passes matched docs
Level 2            →  Match against category collection (filtered by L1 result)
    ↓ passes matched docs
Level 3            →  Match against subcategory collection (filtered by L2 result)
    ↓ passes matched docs
Level N (Leaf)     →  Final enrichment fields attached
```

At each level:

1. Documents that matched the parent node are passed down
2. The child node's retriever executes against its reference collection
3. Enrichment fields from matching nodes are accumulated
4. Only documents exceeding the similarity threshold continue to child nodes

This CTE-style execution ensures that a document classified as "Nike → Athletic → Running" inherits enrichment fields from all three levels, not just the leaf node.

### Application Methods

Hierarchical taxonomies can be applied through three methods:

| Method           | When It Runs                                                           | Use Case                                                                  |
| ---------------- | ---------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| **On-demand**    | Query time, as a [retriever stage](/retrieval/retrievers#apply-stages) | Dynamic classification, A/B testing taxonomy versions, low-volume queries |
| **Materialized** | During collection processing (post-extraction)                         | Production search requiring low latency, analytics dashboards             |
| **Retroactive**  | Manually triggered via API                                             | Backfilling existing documents, applying updated taxonomy versions        |

**On-demand** enrichment adds a `taxonomy_enrich` stage to your retriever pipeline:

```json theme={null}
{
  "stage_name": "taxonomy_enrich",
  "stage_type": "enrich",
  "config": {
    "stage_id": "taxonomy_enrich",
    "parameters": {
      "taxonomy_id": "tax_org_roles",
      "min_score": 0.4
    }
  }
}
```

**Materialized** enrichment runs automatically after document extraction completes. Configure it in the collection's `taxonomy_applications`:

```json theme={null}
{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_product_hierarchy",
      "execution_mode": "materialize"
    }
  ]
}
```

**Retroactive** enrichment applies a taxonomy to documents already in the collection:

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/collections/<collection_id>/apply-taxonomy" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_id": "tax_product_hierarchy",
    "filter": {
      "field": "metadata.needs_reclassification",
      "operator": "eq",
      "value": true
    }
  }'
```

Use retroactive application when:

* You've updated a taxonomy and need to reclassify existing documents
* You're migrating from a flat taxonomy to a hierarchical one
* You've added new reference data to taxonomy collections

## Internals: JOIN Stage

Taxonomies reuse the `join@v1` stage under the hood:

* **Direct join** – key-based match (`join_type: "direct"`).
* **Retriever join** – similarity match using a nested retriever (`join_type: "retriever"`).
* **Join strategies** – `replace`, `enrich`, `left`, or `append` control how fields merge.

Parallel execution (`asyncio.gather`) makes retrieval joins 10–50× faster than sequential lookups.

## Create a Flat Taxonomy

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "employee_faces",
    "taxonomy_type": "flat",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_identity_extractor@v1/insightface__arcface"
    },
    "source_collection": {
      "collection_id": "col_employee_embeddings",
      "enrichment_fields": [
        { "field_path": "metadata.name", "merge_mode": "enrich" },
        { "field_path": "metadata.department", "merge_mode": "enrich" }
      ]
    }
  }'
```

## Create a Hierarchical Taxonomy

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "org_roles",
    "taxonomy_type": "hierarchical",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_identity_extractor@v1/insightface__arcface"
    },
    "hierarchy": [
      {
        "node_id": "employees",
        "collection_id": "col_employee_embeddings",
        "enrichment_fields": ["metadata.employee_id", "metadata.department"]
      },
      {
        "node_id": "executives",
        "collection_id": "col_executives",
        "retriever_id": "ret_executive_face",
        "parent_node_id": "employees",
        "enrichment_fields": ["metadata.executive_level", "metadata.budget_authority"]
      }
    ]
  }'
```

Hierarchical nodes inherit parent enrichment properties; children can override or extend them.

## Attach to a Collection

```json theme={null}
{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_employee_faces",
      "execution_mode": "materialize"
    },
    {
      "taxonomy_id": "tax_org_roles",
      "execution_mode": "on_demand"
    }
  ]
}
```

* Materialized enrichment updates documents \~30 seconds after ingestion completes (debounced to avoid thrashing).
* On-demand enrichment keeps documents untouched; retrievers call the taxonomy join at query time.

## Test On Demand

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/taxonomies/<taxonomy_id>/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "source_documents": [
      {
        "document_id": "doc_scene_123",
        "mixpeek://face_identity_extractor@v1/insightface__arcface": [0.12, 0.34, ...]
      }
    ],
    "mode": "on_demand"
  }'
```

## Inference Strategies

* **Manual** – Define nodes explicitly (IDs, collections, retrievers).
* **Schema-based** – Infer nodes from existing collection schemas (planned).
* **Cluster-based** – Create nodes from clustering output.
* **LLM-based** – Generate hierarchical structure from sample documents.

Combining strategies is encouraged: bootstrap via inference, then fine-tune manually.

## Monitoring

* List taxonomies: `POST /v1/taxonomies/list`
* Inspect hierarchy and node metadata: `GET /v1/taxonomies/{id}?expand_nodes=true`
* Track materialized enrichment progress via webhook events (`collection.documents.written`)
* Use retriever analytics to ensure taxonomy stages don’t dominate latency.

## Best Practices

1. **Start flat** for quick wins; layer hierarchies once value is proven.
2. **Keep enrichment minimal**—copy only fields needed at query time.
3. **Cache taxonomy stages** in retrievers when reference collections rarely change.
4. **Version taxonomies** (via snapshots) before major structural changes.
5. **Combine with clusters** to discover candidate nodes and measure coverage.

Taxonomies let you inject domain knowledge into multimodal search—link documents to canonical entities without relying on brittle key joins.
