Classify Content

For full configuration details, parameters, and advanced options, see the Taxonomies reference.

Taxonomies

Auto-classify documents by matching them against reference collections. There are two types:

- Flat: match each document against a single reference collection. When similarity exceeds the threshold, enrichment fields (SKU, category, label) are attached.
- Hierarchical: parent/child nodes with inheritance. Documents traverse levels of refinement (brand → category → subcategory) using different features at each level.

curl -X POST "https://api.mixpeek.com/v1/taxonomies" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "product-categories",
    "type": "flat",
    "reference_collection_id": "'$REF_COLLECTION_ID'",
    "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
    "similarity_threshold": 0.75,
    "enrichment_fields": ["category", "subcategory", "brand"]
  }'
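
The flat matching rule described above can be illustrated locally. This is a minimal sketch, not Mixpeek's implementation: it assumes cosine similarity over embeddings, a strict "greater than threshold" comparison, and hypothetical reference records that carry an `embedding` key alongside the enrichment fields.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_flat(doc_vec, references, threshold=0.75,
                  enrichment_fields=("category", "subcategory", "brand")):
    """Copy enrichment fields from the best-scoring reference, or return {}.

    Assumption: a match requires the score to strictly exceed the threshold.
    """
    best, best_score = None, -1.0
    for ref in references:
        score = cosine(doc_vec, ref["embedding"])
        if score > best_score:
            best, best_score = ref, score
    if best is None or best_score <= threshold:
        return {}
    return {f: best[f] for f in enrichment_fields if f in best}

# Hypothetical two-entry reference collection with toy 2-D embeddings.
references = [
    {"embedding": [1.0, 0.0], "category": "shoes", "subcategory": "sneakers", "brand": "Acme"},
    {"embedding": [0.0, 1.0], "category": "bags", "subcategory": "totes", "brand": "Globex"},
]

match = classify_flat([0.9, 0.1], references)     # close to the first reference
no_match = classify_flat([0.7, 0.7], references)  # equally far from both, below threshold
```

Raising `threshold` trades coverage for precision: fewer documents receive enrichment fields, but those that do sit closer to their reference entry.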

When to Run

Mode	Runs	Use case
`on_demand`	At query time as a retriever stage	Dynamic classification, A/B testing
`materialize`	After extraction, persists to collection	Stable labels, fast queries
`retroactive`	Reapplies when taxonomy updates	Backfill when reference data improves

Taxonomy API →

Retriever Enrichments

Attach a retriever pipeline to a collection so it runs on every new document. The retriever executes, and selected result fields are written back to the document.

curl -X PATCH "https://api.mixpeek.com/v1/collections/$COLLECTION_ID" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_enrichments": [{
      "retriever_id": "'$RETRIEVER_ID'",
      "input_mappings": { "query_text": { "source": "payload", "path": "description" } },
      "write_back_fields": { "category": { "mode": "first", "path": "results[0].metadata.category" } }
    }]
  }'
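
To make the `write_back_fields` path in the request above concrete, here is a minimal sketch of how a path such as `results[0].metadata.category` could be resolved against a retriever response. The helper names and the response shape are assumptions for illustration, and the `mode` value is carried in the spec but not interpreted here.

```python
import re

# Matches either a plain key (metadata) or a bracketed list index ([0]).
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")

def resolve_path(data, path):
    """Walk a dotted/indexed path; return None if any step is missing."""
    current = data
    for key, index in _TOKEN.findall(path):
        try:
            current = current[int(index)] if index else current[key]
        except (KeyError, IndexError, TypeError):
            return None
    return current

def apply_write_back(document, retriever_output, write_back_fields):
    """Return a copy of the document with every resolvable field added."""
    enriched = dict(document)
    for field_name, spec in write_back_fields.items():
        value = resolve_path(retriever_output, spec["path"])
        if value is not None:
            enriched[field_name] = value
    return enriched

# Hypothetical retriever response; the real response shape may differ.
output = {"results": [{"metadata": {"category": "footwear"}},
                      {"metadata": {"category": "apparel"}}]}
spec = {"category": {"mode": "first", "path": "results[0].metadata.category"}}
enriched = apply_write_back({"description": "red running shoe"}, output, spec)
```

In this sketch, unresolvable paths are skipped rather than written as nulls, so a retriever that returns no results leaves the document unchanged.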

Use cases: auto-classify via LLM, cross-collection joins, label propagation from seed documents.

Collection update API →

Annotations

Annotations are explicit human decisions recorded with full provenance. They serve as the ground truth layer for compliance, review workflows, and improving retrieval quality over time.

curl -X POST "https://api.mixpeek.com/v1/annotations" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_abc",
    "collection_id": "col_xyz",
    "retriever_id": "ret_123",
    "execution_id": "exec_789",
    "stage_name": "feature_search",
    "label": "approved",
    "confidence": 0.95,
    "reasoning": "Matches reference product exactly",
    "payload": { "sku": "SKU-001", "action": "keep" },
    "actor_id": "user_456",
    "actor_type": "human"
  }'

What Each Annotation Captures

Field	Purpose
`document_id`, `collection_id`	What was reviewed
`retriever_id`, `execution_id`, `stage_name`	How it was surfaced
`label`, `confidence`, `reasoning`	The decision
`payload`	Structured workflow-specific data (SKU, action, notes)
`actor_id`, `actor_type`	Who decided (human or model)

Annotations are stored independently from documents — they never modify the source data. Use them to build review queues, audit trails, and curated ground truth datasets.
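
When annotations are created from application code, it can help to assemble the request body in one place. The sketch below is an assumption-heavy illustration: it treats `document_id`, `collection_id`, and `label` as the only required fields (because the bulk example uses just those) and assumes `confidence` lies between 0 and 1. Neither assumption is confirmed by this page.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class Annotation:
    # What was reviewed, and the decision (assumed required).
    document_id: str
    collection_id: str
    label: str
    # How it was surfaced (assumed optional).
    retriever_id: Optional[str] = None
    execution_id: Optional[str] = None
    stage_name: Optional[str] = None
    # Decision details and workflow data (assumed optional).
    confidence: Optional[float] = None
    reasoning: Optional[str] = None
    payload: dict = field(default_factory=dict)
    # Who decided.
    actor_id: Optional[str] = None
    actor_type: Optional[str] = None

    def to_request_body(self):
        """Return a JSON-ready dict, omitting unset fields."""
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return {k: v for k, v in asdict(self).items() if v not in (None, {})}

body = Annotation(
    document_id="doc_abc",
    collection_id="col_xyz",
    label="approved",
    confidence=0.95,
    actor_id="user_456",
    actor_type="human",
).to_request_body()
```

The resulting dict can be serialized with `json.dumps` and sent as the `-d` body of the request shown earlier.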

Bulk Operations

Process review queues at scale with the bulk API:

curl -X POST "https://api.mixpeek.com/v1/annotations/bulk" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "create": [
      { "document_id": "doc_1", "collection_id": "col_xyz", "label": "approved" },
      { "document_id": "doc_2", "collection_id": "col_xyz", "label": "rejected", "reasoning": "Low quality match" }
    ],
    "update": [],
    "delete": []
  }'
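
For long review queues, the decisions usually need to be split across several bulk requests. The following sketch builds request bodies in the `create`/`update`/`delete` shape shown above. The `batch_size` default is a hypothetical value, not a documented API limit.

```python
def bulk_payloads(decisions, batch_size=100):
    """Yield bulk request bodies with at most batch_size create entries each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(decisions), batch_size):
        yield {
            "create": decisions[start:start + batch_size],
            "update": [],
            "delete": [],
        }

# Five hypothetical review decisions, split into batches of two.
decisions = [
    {"document_id": f"doc_{i}", "collection_id": "col_xyz", "label": "approved"}
    for i in range(1, 6)
]
batches = list(bulk_payloads(decisions, batch_size=2))
```

Each yielded dict corresponds to one POST to the bulk endpoint, so a failed batch can be retried without resending the whole queue.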

The Feedback Loop

Annotations feed directly into the platform’s learning cycle:

- Annotations provide explicit ground truth for edge cases
- Learned fusion uses annotations to auto-tune retriever stage weights
- Approved annotations can be piped into reference collections, expanding your taxonomy’s coverage
- Retroactive taxonomy application reclassifies existing documents when annotations improve the reference set
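
The step of piping approved annotations into a reference collection might look like the sketch below. It is illustrative only: the `min_confidence` cutoff, the `source_document_id` key, and the choice to copy the annotation `payload` into the reference record are all assumptions, and writing the records to a collection is left out.

```python
def to_reference_records(annotations, min_confidence=0.9):
    """Select approved, high-confidence annotations as candidate reference entries."""
    records = []
    for ann in annotations:
        if ann.get("label") != "approved":
            continue  # only approved decisions become reference data
        if ann.get("confidence", 0.0) < min_confidence:
            continue  # skip low-confidence or unscored approvals
        records.append({"source_document_id": ann["document_id"],
                        **ann.get("payload", {})})
    return records

# Hypothetical annotations in the shape of the create request shown earlier.
annotations = [
    {"document_id": "doc_1", "label": "approved", "confidence": 0.95, "payload": {"sku": "SKU-001"}},
    {"document_id": "doc_2", "label": "rejected", "confidence": 0.99, "payload": {"sku": "SKU-002"}},
    {"document_id": "doc_3", "label": "approved", "confidence": 0.60, "payload": {"sku": "SKU-003"}},
]
records = to_reference_records(annotations)
```

Filtering on both label and confidence keeps uncertain approvals out of the reference set, which matters because retroactive application would propagate any bad entry to existing documents.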

Annotation API → · Bulk API →

Choosing an Approach

Goal	Use
Auto-label with a reference catalog	Flat taxonomy (materialize mode)
Hierarchical classification (brand → category → SKU)	Hierarchical taxonomy
Auto-classify via LLM at ingest	Retriever enrichment with `llm_enrich` stage
Cross-collection joins (enrich from another dataset)	Retriever enrichment with `document_enrich` stage
Human review with audit trail	Annotations
Backfill when labels improve	Retroactive taxonomy application
