Build a labeled dataset from scratch and auto-classify new data using taxonomy-based matching.
Auto-labeling uses the warehouse’s enrichment layer (taxonomies) to classify documents at query time, the multimodal equivalent of a SQL JOIN.
This tutorial shows how to:
- Start with unlabeled data
- Use feature extraction to find relevant items
- Manually label a small reference set
- Automatically classify new items based on the reference set
- Create a self-improving system that gets better over time
Overview
This tutorial demonstrates two approaches to building an auto-labeling system:
- Option A: Unified Approach (Recommended) - Single bucket/collection that grows smarter over time
- Option B: Separate Approach - Dedicated reference set with production data separated
Both approaches follow the same core workflow:
- Upload unlabeled data with feature extraction
- Manually label a small reference set (10-20 examples per category)
- Configure taxonomy to auto-label new items based on similarity
- Review and label unknowns to continuously improve
Use Cases
- Product Recognition: Label product images, auto-tag new inventory
- People Identification: Build a face recognition system from photos
- Document Classification: Categorize documents by type or topic
- Object Detection: Label objects in images for training data
Option A: Unified Approach (Recommended)
The unified approach uses a single bucket and collection that references itself. As you label items, they immediately become part of the reference set for future matches.
Step 1: Create Bucket and Collection
Create a bucket, a retriever, and a collection whose taxonomy points back at the collection itself:
# Create bucket
POST /v1/buckets
{
"bucket_name": "products_unified",
"bucket_schema": {
"properties": {
"product_label": { "type": "text" },
"image_url": { "type": "text" }
}
}
}
# Create retriever (do this first, before collection)
POST /v1/retrievers
{
"retriever_name": "products_unified_classifier",
"collection_identifiers": ["products_unified"],
"stages": [
{
"stage_name": "labeled_filter",
"stage_type": "filter",
"config": {
"stage_id": "attribute_filter",
"parameters": {
"conditions": {
"AND": [
{ "field": "product_label", "operator": "exists", "value": true }
]
}
}
}
},
{
"stage_name": "image_match",
"stage_type": "filter",
"config": {
"stage_id": "feature_search",
"parameters": {
"searches": [
{
"feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
"query": { "input_mode": "content", "value": "{{INPUT.query_image}}" },
"top_k": 1,
"min_score": 0.30
}
]
}
}
}
]
}
# Create collection that references itself
POST /v1/collections
{
"collection_name": "products_unified",
"source": {
"type": "bucket",
"bucket_ids": ["bkt_products_unified"]
},
"feature_extractor": {
"feature_extractor_name": "image_extractor",
"version": "v1",
"input_mappings": { "image": "image_url" },
"field_passthrough": ["product_label"]
},
"taxonomy": {
"retriever_id": "ret_products_unified_classifier",
"field_to_enrich": "product_label",
"confidence_threshold": 0.30
}
}
Step 2: Upload Initial Unlabeled Data
POST /v1/buckets/{bucket_id}/objects
{
"key_prefix": "/bootstrap",
"metadata": {
"product_label": null
},
"blobs": [{
"property": "image_url",
"type": "image",
"data": {
"url": "s3://my-bucket/products/shoe-001.jpg"
}
}]
}
Upload 50-100 images. Feature extraction happens automatically, but no auto-labeling occurs yet (no labeled examples to match against).
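When bootstrapping with 50-100 images, the request body above can be generated once per image. The following is a minimal sketch that only builds the JSON bodies, assuming the body shape shown in this step. The helper name `build_upload_body` and the second image URL are hypothetical, and sending the requests (authentication, HTTP client) is left out.

```python
import json


def build_upload_body(image_url, key_prefix="/bootstrap", label=None):
    """Build the JSON body for POST /v1/buckets/{bucket_id}/objects.

    Mirrors the request shown above; `label` stays None for unlabeled data.
    """
    return {
        "key_prefix": key_prefix,
        "metadata": {"product_label": label},
        "blobs": [
            {
                "property": "image_url",
                "type": "image",
                "data": {"url": image_url},
            }
        ],
    }


# Hypothetical image locations, for illustration only.
image_urls = [
    "s3://my-bucket/products/shoe-001.jpg",
    "s3://my-bucket/products/shoe-002.jpg",
]

bodies = [build_upload_body(url) for url in image_urls]
print(json.dumps(bodies[0], indent=2))
```

Keeping the label as an explicit `None` (serialized as `null`) matches the bootstrap request and makes unlabeled items easy to find later.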
Step 3: Manually Label Reference Set
Query documents and label them:
# Get documents
GET /v1/collections/{collection_id}/documents?return_presigned_urls=true
# Label via bucket (syncs to collection automatically)
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{
"metadata": {
"product_label": "Red Running Shoes"
}
}
Labeling tips:
- Label 10-20 examples per category minimum
- Include diverse examples (angles, lighting, backgrounds)
- Use consistent naming conventions
Step 4: Upload New Items - Auto-Labeling Works!
Now that you have labeled examples, new uploads are labeled automatically:
POST /v1/buckets/{bucket_id}/objects
{
"key_prefix": "/new-arrivals",
"blobs": [{
"property": "image_url",
"type": "image",
"data": {
"url": "s3://my-bucket/new-arrivals/shoe-new.jpg"
}
}]
}
What happens automatically:
- Feature extraction runs on the new image
- Taxonomy searches your labeled items for similar matches
- If similarity > 0.30 → Auto-labels (e.g., "Red Running Shoes")
- If similarity < 0.30 → Leaves as null for manual review
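The decision rule in the list above can be illustrated with a toy nearest-neighbor classifier. This is a sketch of the idea only, not the service's implementation: it assumes cosine similarity, uses made-up 3-dimensional vectors in place of real image embeddings, and treats a score exactly equal to the threshold as a match (the list above leaves that case unspecified).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(query, references, threshold=0.30):
    """Return (label or None, best score) using the single closest reference.

    `references` is a list of (vector, label) pairs. Using only the best
    match mirrors top_k=1 in the retriever configuration.
    """
    scored = [(cosine(query, vector), label) for vector, label in references]
    if not scored:
        return None, 0.0  # nothing labeled yet, so nothing can match
    score, label = max(scored, key=lambda pair: pair[0])
    # Assumption: a score equal to the threshold counts as a match.
    return (label if score >= threshold else None), score


# Toy reference set; real embeddings have hundreds of dimensions.
references = [
    ([1.0, 0.0, 0.0], "Red Running Shoes"),
    ([0.0, 1.0, 0.0], "Blue Basketball Shoes"),
]

print(classify([0.9, 0.1, 0.0], references))
print(classify([0.0, 0.1, 1.0], references))
```

The first query sits close to a labeled example and receives its label. The second is far from every reference, so it stays unlabeled and falls into the manual review queue.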
Check the result:
GET /v1/collections/{collection_id}/documents/{document_id}
Matched:
{
"metadata": {
"product_label": "Red Running Shoes"
},
"taxonomy_match": {
"matched": true,
"confidence": 0.87,
"source_document_id": "doc_xyz123"
}
}
Unknown (needs manual review):
{
"metadata": {
"product_label": null
},
"taxonomy_match": {
"matched": false,
"confidence": 0.21
}
}
Step 5: Review and Label Unknowns
Find items that need manual labeling:
GET /v1/collections/{collection_id}/documents?filters={
"must": [
{
"key": "product_label",
"match": { "operator": "eq", "value": null }
}
]
}
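The filter above is written as readable JSON, but a query string has to be URL-encoded before it is sent. The sketch below builds such a URL with the standard library, assuming the endpoint accepts the `filters` parameter as URL-encoded JSON. The host `https://api.example.com` and the function name are hypothetical placeholders.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your API base


def unlabeled_documents_url(collection_id, field="product_label"):
    """Build a documents URL whose `filters` parameter selects null labels."""
    filters = {
        "must": [
            {"key": field, "match": {"operator": "eq", "value": None}}
        ]
    }
    # json.dumps turns None into null; urlencode escapes braces, quotes, etc.
    query = urlencode({"filters": json.dumps(filters, separators=(",", ":"))})
    return f"{BASE_URL}/v1/collections/{collection_id}/documents?{query}"


url = unlabeled_documents_url("col_products_unified")
print(url)

# Round-trip check: decoding the query string yields the original filter.
decoded = json.loads(parse_qs(urlsplit(url).query)["filters"][0])
print(decoded)
```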
Label them via bucket (automatically syncs to collection):
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{
"metadata": {
"product_label": "Blue Basketball Shoes"
}
}
Self-improvement in action: This newly labeled item becomes part of the reference set for future uploads!
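The feedback loop can be simulated in a few lines. This toy model is an illustration under stated assumptions rather than the service's behavior: the class name `UnifiedReferenceSet` is hypothetical, the vectors are made up, similarity is assumed to be cosine, and only manually labeled items are added to the reference set.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class UnifiedReferenceSet:
    """Toy stand-in for a collection that uses its own labeled items as references."""

    def __init__(self, threshold=0.30):
        self.threshold = threshold
        self.items = []  # (vector, label) pairs for labeled items only

    def add_label(self, vector, label):
        """Record a manual label; the item is a reference from now on."""
        self.items.append((vector, label))

    def auto_label(self, vector):
        """Return the closest reference's label, or None if below threshold."""
        best = max(self.items, key=lambda item: cosine(vector, item[0]), default=None)
        if best is None or cosine(vector, best[0]) < self.threshold:
            return None
        return best[1]


refs = UnifiedReferenceSet()
refs.add_label([1.0, 0.0], "Red Running Shoes")

unknown = [0.1, 1.0]
before = refs.auto_label(unknown)  # no similar reference exists yet

# A reviewer labels the unknown item, which adds it to the reference set.
refs.add_label(unknown, "Blue Basketball Shoes")
after = refs.auto_label([0.2, 0.9])  # a later, similar upload

print(before, after)
```

The first upload of a new category goes to manual review. Once it is labeled, later uploads that resemble it match without any configuration change.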
Option B: Separate Approach
For more control, keep reference data separate from production data:
- Reference bucket/collection: Curated, high-quality labeled examples
- Production bucket/collection: All data with auto-labels
When to use:
- Need strict quality control on reference set
- Want to prevent noisy auto-labels from affecting matching
- Prefer to manually review before promoting items to reference
Step 1: Create Reference Bucket and Collection
# Reference bucket
POST /v1/buckets
{
"bucket_name": "product_reference",
"bucket_schema": {
"properties": {
"product_label": { "type": "text" },
"image_url": { "type": "text" }
}
}
}
# Reference collection (no taxonomy needed)
POST /v1/collections
{
"collection_name": "product_reference",
"source": {
"type": "bucket",
"bucket_ids": ["bkt_product_reference"]
},
"feature_extractor": {
"feature_extractor_name": "image_extractor",
"version": "v1",
"input_mappings": { "image": "image_url" },
"field_passthrough": ["product_label"]
}
}
# Create taxonomy retriever
POST /v1/retrievers
{
"retriever_name": "product_classifier",
"collection_identifiers": ["product_reference"],
"stages": [
{
"stage_name": "labeled_filter",
"stage_type": "filter",
"config": {
"stage_id": "attribute_filter",
"parameters": {
"conditions": {
"AND": [
{ "field": "product_label", "operator": "exists", "value": true }
]
}
}
}
},
{
"stage_name": "image_match",
"stage_type": "filter",
"config": {
"stage_id": "feature_search",
"parameters": {
"searches": [
{
"feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
"query": { "input_mode": "content", "value": "{{INPUT.query_image}}" },
"top_k": 1,
"min_score": 0.30
}
]
}
}
}
]
}
Step 2: Upload and Label Reference Set
Upload 50-100 curated images to the reference bucket and manually label them:
# Upload to reference
POST /v1/buckets/bkt_product_reference/objects
{
"metadata": { "product_label": null },
"blobs": [{ "property": "image_url", "type": "image", "data": { "url": "https://example.com/image.jpg" } }]
}
# Label them
PATCH /v1/buckets/bkt_product_reference/objects/{object_id}
{
"metadata": { "product_label": "Red Running Shoes" }
}
Step 3: Create Production Bucket and Collection
# Production bucket
POST /v1/buckets
{
"bucket_name": "product_catalog",
"bucket_schema": {
"properties": {
"product_label": { "type": "text" },
"image_url": { "type": "text" }
}
}
}
# Production collection with taxonomy
POST /v1/collections
{
"collection_name": "product_catalog",
"source": {
"type": "bucket",
"bucket_ids": ["bkt_product_catalog"]
},
"feature_extractor": {
"feature_extractor_name": "image_extractor",
"version": "v1",
"input_mappings": { "image": "image_url" },
"field_passthrough": ["product_label"]
},
"taxonomy": {
"retriever_id": "ret_product_classifier",
"field_to_enrich": "product_label",
"confidence_threshold": 0.30
}
}
Step 4: Upload Production Data
New uploads auto-label based on the reference set:
POST /v1/buckets/bkt_product_catalog/objects
{
"blobs": [{ "property": "image_url", "type": "image", "data": { "url": "https://example.com/image.jpg" } }]
}
Periodically review production data and promote high-confidence matches:
# Find high-confidence items
GET /v1/collections/product_catalog/documents?filters={
"must": [
{
"key": "taxonomy_match.confidence",
"match": { "operator": "gte", "value": 0.85 }
}
]
}
# Copy to reference bucket
POST /v1/buckets/bkt_product_reference/objects
{
"metadata": { "product_label": "..." },
"blobs": [{ ... }]
}
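Before copying anything into the reference bucket, it helps to turn the query results into an ordered review list. The sketch below filters already-fetched documents client-side, assuming each document carries the `metadata` and `taxonomy_match` fields shown in the example responses earlier. The `document_id` key and the sample records are hypothetical.

```python
def promotion_candidates(documents, min_confidence=0.85):
    """Return matched, labeled documents at or above min_confidence.

    Results are sorted from most to least confident so a reviewer can
    start with the strongest candidates.
    """
    selected = []
    for doc in documents:
        match = doc.get("taxonomy_match", {})
        label = doc.get("metadata", {}).get("product_label")
        if match.get("matched") and label and match.get("confidence", 0.0) >= min_confidence:
            selected.append(doc)
    return sorted(selected, key=lambda d: d["taxonomy_match"]["confidence"], reverse=True)


# Hypothetical documents shaped like the responses shown earlier.
documents = [
    {"document_id": "doc_a", "metadata": {"product_label": "Red Running Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.87}},
    {"document_id": "doc_b", "metadata": {"product_label": "Blue Basketball Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.91}},
    {"document_id": "doc_c", "metadata": {"product_label": "Red Running Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.52}},
    {"document_id": "doc_d", "metadata": {"product_label": None},
     "taxonomy_match": {"matched": False, "confidence": 0.21}},
]

for doc in promotion_candidates(documents):
    print(doc["document_id"], doc["metadata"]["product_label"])
```

High confidence alone does not guarantee a correct label, so a manual check before promotion keeps the curated reference set clean, which is the point of the separate approach.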
Real-World Examples
Example 1: Face Recognition System
# Create bucket for employee photos
POST /v1/buckets
{
"bucket_name": "employee_photos",
"bucket_schema": {
"properties": {
"person_name": { "type": "text" },
"employee_id": { "type": "text" },
"photo_url": { "type": "text" }
}
}
}
# Bootstrap collection with face extraction
POST /v1/collections
{
"collection_name": "employee_faces",
"source": {
"type": "bucket",
"bucket_ids": ["bkt_employee_photos"]
},
"feature_extractor": {
"feature_extractor_name": "face_identity_extractor",
"version": "v1",
"input_mappings": { "image": "photo_url" },
"field_passthrough": ["person_name", "employee_id"]
}
}
# Upload 50 employee photos → manually label with names
# Create taxonomy retriever
# Security camera footage auto-identifies employees
Example 2: Document Classification
# Create bucket for documents
POST /v1/buckets
{
"bucket_name": "company_documents",
"bucket_schema": {
"properties": {
"document_type": { "type": "text" },
"content": { "type": "text" }
}
}
}
# Bootstrap collection with text extraction
POST /v1/collections
{
"collection_name": "document_types",
"source": {
"type": "bucket",
"bucket_ids": ["bkt_company_documents"]
},
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"input_mappings": { "text": "content" },
"field_passthrough": ["document_type"]
},
"taxonomy": {
"field_to_enrich": "document_type",
"confidence_threshold": 0.35
}
}
# Label 20 invoices, 20 contracts, 20 receipts
# New documents auto-classify by type
Advanced Configuration
Tuning Confidence Thresholds
The confidence_threshold determines how conservative auto-labeling is:
| Threshold | Behavior | Notes |
|---|---|---|
| 0.20-0.25 | Aggressive | High recall, more false positives |
| 0.30-0.35 | Balanced | Good starting point |
| 0.40-0.50 | Conservative | High precision, fewer auto-labels |
| 0.60+ | Very strict | Only exact matches |
Finding the right threshold:
- Start with 0.30
- Monitor false positive rate (wrong auto-labels)
- Check coverage (% of items auto-labeled)
- Adjust based on cost of errors:
  - High cost of errors (e.g., medical imaging) → Higher threshold
  - Low cost of errors (e.g., photo organization) → Lower threshold
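One way to follow the steps above is to hand-verify a sample of matches and then replay different thresholds over it offline. The sketch below assumes you have collected, for each sampled item, the match confidence, the label the system proposed, and the label a human confirmed. The sample data is made up, and "false positive rate" here means the share of auto-labels that are wrong.

```python
def sweep_thresholds(samples, thresholds):
    """Compute coverage and false positive rate for each candidate threshold.

    `samples` is a list of (confidence, predicted_label, true_label) tuples.
    """
    rows = []
    for threshold in thresholds:
        labeled = [s for s in samples if s[0] >= threshold]
        wrong = [s for s in labeled if s[1] != s[2]]
        rows.append({
            "threshold": threshold,
            # Share of all items that would receive an auto-label.
            "coverage": len(labeled) / len(samples) if samples else 0.0,
            # Share of auto-labels that disagree with the human label.
            "false_positive_rate": len(wrong) / len(labeled) if labeled else 0.0,
        })
    return rows


# Made-up, hand-verified sample: (confidence, predicted, actual).
samples = [
    (0.22, "Sandals", "Boots"),
    (0.28, "Boots", "Boots"),
    (0.33, "Sandals", "Boots"),
    (0.45, "Boots", "Boots"),
    (0.62, "Sandals", "Sandals"),
    (0.88, "Boots", "Boots"),
]

for row in sweep_thresholds(samples, [0.25, 0.30, 0.40]):
    print(row)
```

Raising the threshold trades coverage for precision. In this toy sample, moving from 0.30 to 0.40 removes the one wrong label but also reduces the number of items labeled automatically.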
Monitoring & Analytics
Track performance with these queries:
# Get distribution of labels
GET /v1/collections/{collection_id}/analytics/field-distribution?field=product_label
# Check match confidence distribution
GET /v1/collections/{collection_id}/documents?sort_by=taxonomy_match.confidence&limit=100
# Find low-confidence matches for review
GET /v1/collections/{collection_id}/documents?filters={
"must": [
{
"key": "taxonomy_match.matched",
"match": { "operator": "eq", "value": true }
},
{
"key": "taxonomy_match.confidence",
"match": { "operator": "lt", "value": 0.40 }
}
]
}
Key metrics:
- Auto-label coverage: % of new items auto-labeled
- Manual review queue: # of items with label: null
- Confidence distribution: Are matches clustered around threshold?
- False positive rate: Sample and manually verify auto-labels
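The first three metrics can be computed from a page of fetched documents without any ground truth. The sketch below assumes documents shaped like the example responses earlier (`metadata.product_label` and `taxonomy_match`). The 0.05 band used to decide what counts as "near the threshold" is an arbitrary choice for illustration, and the sample records are made up.

```python
def labeling_metrics(documents, threshold=0.30, band=0.05):
    """Summarize auto-label coverage, review queue size, and near-threshold share."""
    total = len(documents)
    matched = [d for d in documents if d.get("taxonomy_match", {}).get("matched")]
    review_queue = [
        d for d in documents if d.get("metadata", {}).get("product_label") is None
    ]
    # Matches that cleared the threshold by less than `band` deserve a closer look.
    near_threshold = [
        d for d in matched if d["taxonomy_match"]["confidence"] - threshold <= band
    ]
    return {
        "coverage": len(matched) / total if total else 0.0,
        "review_queue": len(review_queue),
        "near_threshold_share": len(near_threshold) / len(matched) if matched else 0.0,
    }


# Made-up documents for illustration.
documents = [
    {"metadata": {"product_label": "Boots"},
     "taxonomy_match": {"matched": True, "confidence": 0.32}},
    {"metadata": {"product_label": "Sandals"},
     "taxonomy_match": {"matched": True, "confidence": 0.87}},
    {"metadata": {"product_label": "Boots"},
     "taxonomy_match": {"matched": True, "confidence": 0.33}},
    {"metadata": {"product_label": None},
     "taxonomy_match": {"matched": False, "confidence": 0.21}},
]

print(labeling_metrics(documents))
```

A large near-threshold share suggests that many auto-labels are borderline, which is a signal to sample those items when estimating the false positive rate.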
Best Practices
Reference set quality:
- Include diverse examples (angles, lighting, backgrounds)
- Use consistent naming conventions
- Aim for balanced distribution across categories
- Maintain high-quality, unambiguous images
Labeling guidelines:
- Create a labeling style guide
- Consider hierarchical labels: "Shoes > Running > Red"
- Define rules for edge cases
- Version your taxonomy as it evolves
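Consistent naming is easier to enforce in code than by convention alone. The sketch below normalizes hand-typed hierarchical labels before they are written to `product_label`, and expands a label into its ancestor paths. Both helper names are hypothetical, and the `>` separator follows the example in the list above.

```python
def normalize_label(raw, separator=">"):
    """Normalize spacing and capitalization of a hierarchical label."""
    levels = []
    for part in raw.split(separator):
        # Collapse repeated whitespace and capitalize each word.
        words = [w[:1].upper() + w[1:].lower() for w in part.split()]
        if words:
            levels.append(" ".join(words))
    return " > ".join(levels)


def label_ancestors(label, separator=" > "):
    """Return every prefix path of a normalized label, shortest first."""
    parts = label.split(separator)
    return [separator.join(parts[:i]) for i in range(1, len(parts) + 1)]


label = normalize_label("shoes>running >  red")
print(label)
print(label_ancestors(label))
```

Storing one canonical spelling prevents near-duplicate categories such as "red running shoes" and "Red Running Shoes" from splitting the reference set.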
Continuous improvement:
- Review unknowns regularly
- Audit auto-labels periodically
- Add corrected examples when system makes mistakes
- Expand categories as needed
Production deployment:
- Start with a conservative threshold (0.40+)
- Implement human-in-the-loop for critical applications
- Enable feedback mechanism for corrections
- A/B test threshold changes
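For the A/B test in the last item, each document needs a stable assignment to a control or candidate threshold. The sketch below shows one common way to do that with a hash of the document ID. It assumes you compare thresholds offline against recorded `taxonomy_match.confidence` values. The function names and the 0.40 and 0.35 values are illustrative, and nothing here calls the API.

```python
import hashlib


def assign_threshold(document_id, control=0.40, candidate=0.35, candidate_share=0.5):
    """Deterministically assign a document to the control or candidate threshold."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value between 0 and 1.
    position = int.from_bytes(digest[:8], "big") / 2**64
    return candidate if position < candidate_share else control


def would_auto_label(confidence, threshold):
    """Replay the labeling decision for a recorded confidence value."""
    return confidence >= threshold


threshold = assign_threshold("doc_xyz123")
print(threshold, would_auto_label(0.37, threshold))
```

Hashing the ID keeps the assignment stable across runs, so a document never switches groups between evaluations.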
Troubleshooting
Too many unlabeled items
Causes: Threshold too high, insufficient reference examples, new categories
Solutions:
- Lower confidence_threshold to 0.25-0.30
- Add 20+ examples per category to reference set
- Review and label new categories
False positives (wrong labels)
Causes: Threshold too low, similar categories, poor quality references
Solutions:
- Raise confidence_threshold to 0.40+
- Add diverse examples to distinguish categories
- Clean up reference set
System not self-improving
Causes: Labels not syncing, configuration issues
Solutions:
- Verify field_passthrough includes label field
- Check retriever filters for non-null labels
- Confirm bucket-to-collection sync is working
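The first two checks can be run against your own request bodies before they are submitted. The sketch below inspects plain dictionaries shaped like the collection and retriever configurations from Option A. The function name is hypothetical, and it cannot verify bucket-to-collection sync, which has to be checked against the live system.

```python
def find_config_problems(collection, retriever):
    """Return problems that would stop new labels from feeding back into matching."""
    label_field = collection.get("taxonomy", {}).get("field_to_enrich")
    if label_field is None:
        return ["collection has no taxonomy.field_to_enrich"]

    problems = []
    passthrough = collection.get("feature_extractor", {}).get("field_passthrough", [])
    if label_field not in passthrough:
        problems.append(f"field_passthrough does not include '{label_field}'")

    # Look for an attribute_filter stage that keeps only labeled items.
    filters_labeled = False
    for stage in retriever.get("stages", []):
        config = stage.get("config", {})
        if config.get("stage_id") != "attribute_filter":
            continue
        conditions = config.get("parameters", {}).get("conditions", {}).get("AND", [])
        filters_labeled = filters_labeled or any(
            c.get("field") == label_field
            and c.get("operator") == "exists"
            and c.get("value") is True
            for c in conditions
        )
    if not filters_labeled:
        problems.append(f"retriever has no 'exists' filter on '{label_field}'")
    return problems


# Trimmed copies of the Option A configuration.
collection = {
    "feature_extractor": {"field_passthrough": ["product_label"]},
    "taxonomy": {"field_to_enrich": "product_label", "confidence_threshold": 0.30},
}
retriever = {
    "stages": [
        {
            "stage_name": "labeled_filter",
            "config": {
                "stage_id": "attribute_filter",
                "parameters": {
                    "conditions": {
                        "AND": [
                            {"field": "product_label", "operator": "exists", "value": True}
                        ]
                    }
                },
            },
        }
    ]
}

print(find_config_problems(collection, retriever))
```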
Summary
Workflow:
- Create bucket and collection with feature extraction
- Upload unlabeled data (50-100 items)
- Manually label reference set (10-20 per category)
- Create taxonomy retriever pointing to labeled items
- New uploads auto-label based on similarity
- Review and label unknowns to improve system
Key benefits:
- Start with zero labels, build incrementally
- Automate repetitive labeling
- Self-improving with each manual correction
- Scales from dozens to millions
Next steps:
- Choose unified (simpler) or separate (more control) approach
- Start with 50-100 reference items
- Test different confidence thresholds (start at 0.30)
- Monitor auto-label quality and adjust
Discover Clusters
Use clustering to find new categories before defining them manually:
POST /v1/clusters
{
"cluster_name": "product-discovery",
"collection_ids": ["col_products_unified"],
"cluster_type": "vector",
"vector_config": {
"feature_uris": ["mixpeek://image_extractor@v1/google_siglip_base_v1"],
"clustering_method": "hdbscan",
"hdbscan_parameters": { "min_cluster_size": 5 }
},
"llm_labeling": {
"enabled": true,
"input_mappings": [{ "source": "payload", "fields": ["product_label"] }]
},
"dimension_reduction": { "method": "umap", "n_components": 2 }
}
Clusters reveal groups you haven’t labeled yet — “sandals”, “boots”, “athletic wear” — without predefined categories. Once a cluster stabilizes, promote it to a taxonomy node so future items auto-classify into it. See Clusters.
Set Up Alerts
Get notified when items fail to auto-label (unknown categories needing manual review). An alert runs a retriever and fires on its results — so first create a retriever that surfaces unlabeled items, then point an alert at it.
# 1. Retriever that returns products with no label. attribute_filter as the
# first stage fetches + filters straight from the collection (no search needed).
POST /v1/retrievers
{
"retriever_name": "unlabeled-products",
"collection_identifiers": ["col_products_unified"],
"input_schema": {},
"stages": [
{ "stage_name": "missing_label", "stage_type": "filter",
"config": { "stage_id": "attribute_filter", "parameters": {
"field": "product_label", "operator": "exists", "value": false } } }
]
}
# -> { "retriever_id": "ret_unlabeled" }
# 2. Fire a webhook whenever that retriever returns any results
POST /v1/alerts
{
"name": "unknown-products",
"source": "retriever",
"retriever_id": "ret_unlabeled",
"trigger_on": "results",
"notification_config": {
"channels": [
{ "channel_type": "webhook", "config": { "url": "https://example.com/webhook" } }
]
}
}
See Alerts for system-metric alerts and Slack/email channels.
Set Up Webhooks
Forward ingestion and labeling events to your own systems:
POST /v1/organizations/webhooks
{
"webhook_name": "labeling-events",
"event_types": ["object.created", "object.updated"],
"channels": [
{ "channel": "webhook", "configs": { "url": "https://example.com/webhook", "method": "POST" } }
]
}
event_types accepts object/collection/cluster/taxonomy/alert lifecycle events (e.g. object.created, object.updated, object.deleted, collection.created). See Webhooks for the full event list and Slack/email channels.