Skip to main content

Taxonomies

Auto-classify documents by matching them against reference collections. Two types: Flat — match each document against a single reference collection. When similarity exceeds the threshold, enrichment fields (SKU, category, label) are attached. Hierarchical — parent/child nodes with inheritance. Documents traverse levels of refinement (brand → category → subcategory) using different features at each level.
curl -X POST "https://api.mixpeek.com/v1/taxonomies" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "product-categories",
    "type": "flat",
    "reference_collection_id": "'$REF_COLLECTION_ID'",
    "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
    "similarity_threshold": 0.75,
    "enrichment_fields": ["category", "subcategory", "brand"]
  }'

When to Run

ModeRunsUse case
on_demandAt query time as a retriever stageDynamic classification, A/B testing
materializeAfter extraction, persists to collectionStable labels, fast queries
retroactiveReapplies when taxonomy updatesBackfill when reference data improves
Taxonomy API →

Retriever Enrichments

Attach a retriever pipeline to a collection so it runs on every new document. The retriever executes, and selected result fields are written back to the document.
curl -X PATCH "https://api.mixpeek.com/v1/collections/$COLLECTION_ID" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_enrichments": [{
      "retriever_id": "'$RETRIEVER_ID'",
      "input_mappings": { "query_text": { "source": "payload", "path": "description" } },
      "write_back_fields": { "category": { "mode": "first", "path": "results[0].metadata.category" } }
    }]
  }'
Use cases: auto-classify via LLM, cross-collection joins, label propagation from seed documents. Collection update API →

Annotations

Explicit human decisions with full provenance — the ground truth layer for compliance and review workflows.
curl -X POST "https://api.mixpeek.com/v1/annotations" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_abc",
    "collection_id": "col_xyz",
    "retriever_id": "ret_123",
    "label": "approved",
    "confidence": 0.95,
    "reasoning": "Matches reference product exactly",
    "actor_id": "user_456",
    "actor_type": "human"
  }'
Each annotation tracks what was reviewed, how it was surfaced (retriever + execution + stage), and who decided. Bulk APIs support review queue workflows. Annotations feed into learned fusion to improve retrieval over time. Annotation API →

Choosing an Approach

GoalUse
Auto-label with a reference catalogFlat taxonomy (materialize mode)
Hierarchical classification (brand → category → SKU)Hierarchical taxonomy
Auto-classify via LLM at ingestRetriever enrichment with llm_enrich stage
Cross-collection joins (enrich from another dataset)Retriever enrichment with document_enrich stage
Human review with audit trailAnnotations
Backfill when labels improveRetroactive taxonomy application