Auto-Labeling Datasets

Build a labeled dataset from scratch and auto-classify new data using taxonomy-based matching.

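Auto-labeling uses the warehouse’s enrichment layer (taxonomies) to classify documents at query time. It is the multimodal equivalent of a SQL JOIN: each document is matched against labeled reference documents by similarity rather than by key.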

This tutorial shows how to:

Start with unlabeled data
Use feature extraction to find relevant items
Manually label a small reference set
Automatically classify new items based on the reference set
Create a self-improving system that gets better over time

Overview

This tutorial demonstrates two approaches to building an auto-labeling system:

Option A: Unified Approach (Recommended) - Single bucket/collection that grows smarter over time
Option B: Separate Approach - Dedicated reference set with production data separated

Both approaches follow the same core workflow:

Upload unlabeled data with feature extraction
Manually label a small reference set (10-20 examples per category)
Configure taxonomy to auto-label new items based on similarity
Review and label unknowns to continuously improve

Use Cases

Product Recognition: Label product images, auto-tag new inventory
People Identification: Build a face recognition system from photos
Document Classification: Categorize documents by type or topic
Object Detection: Label objects in images for training data

Option A: Unified Approach (Recommended)

The unified approach uses a single bucket and collection that references itself. As you label items, they immediately become part of the reference set for future matches.

Step 1: Create Bucket and Collection

Create a bucket and collection with self-referencing taxonomy:

# Create bucket
POST /v1/buckets
{
  "bucket_name": "products_unified",
  "bucket_schema": {
    "properties": {
      "product_label": { "type": "text" },
      "image_url": { "type": "text" }
    }
  }
}

# Create retriever (do this first, before collection)
POST /v1/retrievers
{
  "retriever_name": "products_unified_classifier",
  "collection_identifiers": ["products_unified"],
  "stages": [
    {
      "stage_name": "labeled_filter",
      "stage_type": "filter",
      "config": {
        "stage_id": "attribute_filter",
        "parameters": {
          "conditions": {
            "AND": [
              { "field": "product_label", "operator": "exists", "value": true }
            ]
          }
        }
      }
    },
    {
      "stage_name": "image_match",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
              "query": { "input_mode": "content", "value": "{{INPUT.query_image}}" },
              "top_k": 1,
              "min_score": 0.30
            }
          ]
        }
      }
    }
  ]
}

# Create collection that references itself
POST /v1/collections
{
  "collection_name": "products_unified",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_products_unified"]
  },
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": ["product_label"]
  },
  "taxonomy": {
    "retriever_id": "ret_products_unified_classifier",
    "field_to_enrich": "product_label",
    "confidence_threshold": 0.30
  }
}

Step 2: Upload Initial Unlabeled Data

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/bootstrap",
  "metadata": {
    "product_label": null
  },
  "blobs": [{
    "property": "image_url",
    "type": "image",
    "data": {
      "url": "s3://my-bucket/products/shoe-001.jpg"
    }
  }]
}

Upload 50-100 images. Feature extraction happens automatically, but no auto-labeling occurs yet (no labeled examples to match against).

Step 3: Manually Label Reference Set

Query documents and label them:

# Get documents
GET /v1/collections/{collection_id}/documents?return_presigned_urls=true

# Label via bucket (syncs to collection automatically)
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{
  "metadata": {
    "product_label": "Red Running Shoes"
  }
}

Labeling tips:

Label at least 10-20 examples per category
Include diverse examples (angles, lighting, backgrounds)
Use consistent naming conventions
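The tips above can be checked mechanically before you rely on a reference set. The following is a minimal sketch in plain Python, not part of the API: `normalize_label` and `audit_labels` are hypothetical helpers, and the default minimum of 10 examples is taken from the guidance above. It flags categories with too few examples and labels that differ only in case or spacing.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


def normalize_label(label: str) -> str:
    """Collapse whitespace and case so near-duplicate spellings compare equal."""
    return " ".join(label.split()).casefold()


def audit_labels(labels: Iterable[Optional[str]], min_per_category: int = 10) -> Dict[str, dict]:
    """Report under-represented categories and inconsistent spellings."""
    counts: Counter = Counter()
    variants = defaultdict(set)
    for raw in labels:
        if raw is None:  # unlabeled items are not part of the reference set
            continue
        key = normalize_label(raw)
        counts[key] += 1
        variants[key].add(raw)
    underrepresented = {k: n for k, n in counts.items() if n < min_per_category}
    inconsistent = {k: sorted(v) for k, v in variants.items() if len(v) > 1}
    return {"underrepresented": underrepresented, "inconsistent": inconsistent}
```

Running it over the `product_label` values of your labeled documents shows which categories need more examples and which spellings should be merged before they split one category into two.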

Step 4: Upload New Items - Auto-Labeling Works!

Now that you have labeled examples, new uploads are labeled automatically:

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/new-arrivals",
  "blobs": [{
    "property": "image_url",
    "type": "image",
    "data": {
      "url": "s3://my-bucket/new-arrivals/shoe-new.jpg"
    }
  }]
}

What happens automatically:

Feature extraction runs on the new image
Taxonomy searches your labeled items for similar matches
If similarity ≥ 0.30 → Auto-labels (e.g., "Red Running Shoes")
If similarity < 0.30 → Leaves as null for manual review
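Conceptually, the matching step is a top-1 nearest-neighbor lookup with a cutoff. The sketch below illustrates that idea in plain Python under stated assumptions: it uses cosine similarity on small hand-written vectors, treats a score exactly at the threshold as a match, and the function names are hypothetical. It is not the platform's implementation, whose scoring may differ.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def auto_label(
    query: Sequence[float],
    reference: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.30,
) -> Tuple[Optional[str], float]:
    """Return (label, score) of the best reference match, or (None, score) below threshold."""
    best_label: Optional[str] = None
    best_score = 0.0
    for label, vector in reference:
        score = cosine_similarity(query, vector)
        if best_label is None or score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label, best_score
    return None, best_score  # left unlabeled for manual review
```

With an empty reference set the function always returns `None`, which mirrors Step 2: nothing is auto-labeled until labeled examples exist.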

Check the result:

GET /v1/collections/{collection_id}/documents/{document_id}

Matched:

{
  "metadata": {
    "product_label": "Red Running Shoes"
  },
  "taxonomy_match": {
    "matched": true,
    "confidence": 0.87,
    "source_document_id": "doc_xyz123"
  }
}

Unknown (needs manual review):

{
  "metadata": {
    "product_label": null
  },
  "taxonomy_match": {
    "matched": false,
    "confidence": 0.21
  }
}
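Given responses shaped like the two examples above, a client can sort documents into work queues. This is a minimal sketch, assuming only the `metadata.product_label` and `taxonomy_match` fields shown above; the `triage` helper and the 0.40 review cutoff are illustrative choices, not API features.

```python
from typing import Dict, List


def triage(documents: List[dict], review_below: float = 0.40) -> Dict[str, List[dict]]:
    """Split documents into unknown, low-confidence, and accepted queues."""
    queues: Dict[str, List[dict]] = {"unknown": [], "low_confidence": [], "accepted": []}
    for doc in documents:
        label = (doc.get("metadata") or {}).get("product_label")
        match = doc.get("taxonomy_match") or {}
        if label is None:
            # No label was assigned: needs manual labeling.
            queues["unknown"].append(doc)
        elif match.get("matched") and match.get("confidence", 0.0) < review_below:
            # Auto-labeled, but close enough to the threshold to deserve a spot check.
            queues["low_confidence"].append(doc)
        else:
            # Confident auto-label, or a manual label with no taxonomy match.
            queues["accepted"].append(doc)
    return queues
```

Documents that were labeled by hand carry a label without a taxonomy match, so the sketch treats them as accepted rather than sending them back for review.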

Step 5: Review and Label Unknowns

Find items that need manual labeling:

GET /v1/collections/{collection_id}/documents?filters={
  "must": [
    {
      "key": "product_label",
      "match": { "operator": "eq", "value": null }
    }
  ]
}
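The request above is written with raw JSON in the query string for readability; an actual HTTP client has to URL-encode it. The sketch below shows one way to do that with the Python standard library. It assumes the endpoint accepts the filter as a URL-encoded JSON string in a `filters` parameter, as the snippet suggests, and `build_documents_url` and the base URL are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_documents_url(base_url: str, collection_id: str, filters: dict) -> str:
    """Build a documents URL with the filter JSON safely URL-encoded."""
    # Compact separators keep the encoded query string short.
    encoded = urlencode({"filters": json.dumps(filters, separators=(",", ":"))})
    return f"{base_url}/v1/collections/{collection_id}/documents?{encoded}"


# Filter for documents whose label is still null (Python None becomes JSON null).
unlabeled_filter = {
    "must": [
        {"key": "product_label", "match": {"operator": "eq", "value": None}}
    ]
}
url = build_documents_url("https://api.example.com", "col_example", unlabeled_filter)
```

Encoding the filter this way avoids broken requests caused by unescaped braces, quotes, and spaces in the query string.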

Label them via bucket (automatically syncs to collection):

PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{
  "metadata": {
    "product_label": "Blue Basketball Shoes"
  }
}

Self-improvement in action: This newly labeled item becomes part of the reference set for future uploads!

Option B: Separate Approach

For more control, keep reference data separate from production data:

Reference bucket/collection: Curated, high-quality labeled examples
Production bucket/collection: All data with auto-labels

When to use:

Need strict quality control on reference set
Want to prevent noisy auto-labels from affecting matching
Prefer to manually review before promoting items to reference

Step 1: Create Reference Bucket and Collection

# Reference bucket
POST /v1/buckets
{
  "bucket_name": "product_reference",
  "bucket_schema": {
    "properties": {
      "product_label": { "type": "text" },
      "image_url": { "type": "text" }
    }
  }
}

# Reference collection (no taxonomy needed)
POST /v1/collections
{
  "collection_name": "product_reference",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_product_reference"]
  },
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": ["product_label"]
  }
}

# Create taxonomy retriever
POST /v1/retrievers
{
  "retriever_name": "product_classifier",
  "collection_identifiers": ["product_reference"],
  "stages": [
    {
      "stage_name": "labeled_filter",
      "stage_type": "filter",
      "config": {
        "stage_id": "attribute_filter",
        "parameters": {
          "conditions": {
            "AND": [
              { "field": "product_label", "operator": "exists", "value": true }
            ]
          }
        }
      }
    },
    {
      "stage_name": "image_match",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
              "query": { "input_mode": "content", "value": "{{INPUT.query_image}}" },
              "top_k": 1,
              "min_score": 0.30
            }
          ]
        }
      }
    }
  ]
}

Step 2: Upload and Label Reference Set

Upload 50-100 curated images to the reference bucket and manually label them:

# Upload to reference
POST /v1/buckets/bkt_product_reference/objects
{
  "metadata": { "product_label": null },
  "blobs": [{ "property": "image_url", "type": "image", "data": { "url": "https://example.com/image.jpg" } }]
}

# Label them
PATCH /v1/buckets/bkt_product_reference/objects/{object_id}
{
  "metadata": { "product_label": "Red Running Shoes" }
}

Step 3: Create Production Bucket and Collection

# Production bucket
POST /v1/buckets
{
  "bucket_name": "product_catalog",
  "bucket_schema": {
    "properties": {
      "product_label": { "type": "text" },
      "image_url": { "type": "text" }
    }
  }
}

# Production collection with taxonomy
POST /v1/collections
{
  "collection_name": "product_catalog",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_product_catalog"]
  },
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": ["product_label"]
  },
  "taxonomy": {
    "retriever_id": "ret_product_classifier",
    "field_to_enrich": "product_label",
    "confidence_threshold": 0.30
  }
}

Step 4: Upload Production Data

New uploads auto-label based on the reference set:

POST /v1/buckets/bkt_product_catalog/objects
{
  "blobs": [{ "property": "image_url", "type": "image", "data": { "url": "https://example.com/image.jpg" } }]
}

Step 5: Promote High-Confidence Items to Reference

Periodically review production data and promote high-confidence matches:

# Find high-confidence items
GET /v1/collections/product_catalog/documents?filters={
  "must": [
    {
      "key": "taxonomy_match.confidence",
      "match": { "operator": "gte", "value": 0.85 }
    }
  ]
}

# Copy to reference bucket
POST /v1/buckets/bkt_product_reference/objects
{
  "metadata": { "product_label": "..." },
  "blobs": [{ ... }]
}
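Because the reference set in this approach is meant to stay curated, it helps to gate promotion on both confidence and an explicit human approval. The following sketch works on documents already fetched from the production collection. It assumes each document carries a `document_id` key alongside the `metadata` and `taxonomy_match` fields shown earlier; that key name and the `select_for_promotion` helper are illustrative.

```python
from typing import List, Optional, Set, Tuple


def select_for_promotion(
    documents: List[dict],
    min_confidence: float = 0.85,
    approved_ids: Optional[Set[str]] = None,
) -> List[Tuple[Optional[str], str]]:
    """Return (document_id, label) pairs that qualify for the reference set."""
    selected = []
    for doc in documents:
        match = doc.get("taxonomy_match") or {}
        label = (doc.get("metadata") or {}).get("product_label")
        doc_id = doc.get("document_id")
        if not match.get("matched") or label is None:
            continue  # only auto-labeled documents are candidates
        if match.get("confidence", 0.0) < min_confidence:
            continue  # below the promotion bar
        if approved_ids is not None and doc_id not in approved_ids:
            continue  # a reviewer has not signed off on this item
        selected.append((doc_id, label))
    return selected
```

Passing `approved_ids` keeps a reviewer in the loop, so a confidently wrong auto-label cannot enter the reference set unnoticed.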

Real-World Examples

Example 1: Face Recognition System

# Create bucket for employee photos
POST /v1/buckets
{
  "bucket_name": "employee_photos",
  "bucket_schema": {
    "properties": {
      "person_name": { "type": "text" },
      "employee_id": { "type": "text" },
      "photo_url": { "type": "text" }
    }
  }
}

# Bootstrap collection with face extraction
POST /v1/collections
{
  "collection_name": "employee_faces",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_employee_photos"]
  },
  "feature_extractor": {
    "feature_extractor_name": "face_identity_extractor",
    "version": "v1",
    "input_mappings": { "image": "photo_url" },
    "field_passthrough": ["person_name", "employee_id"]
  }
}

# Upload 50 employee photos → manually label with names
# Create taxonomy retriever
# Employees appearing in security camera footage are then identified automatically

Example 2: Document Classification

# Create bucket for documents
POST /v1/buckets
{
  "bucket_name": "company_documents",
  "bucket_schema": {
    "properties": {
      "document_type": { "type": "text" },
      "content": { "type": "text" }
    }
  }
}

# Bootstrap collection with text extraction
POST /v1/collections
{
  "collection_name": "document_types",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_company_documents"]
  },
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": { "text": "content" },
    "field_passthrough": ["document_type"]
  },
  "taxonomy": {
    "field_to_enrich": "document_type",
    "confidence_threshold": 0.35
  }
}

# Label 20 invoices, 20 contracts, 20 receipts
# New documents auto-classify by type

Advanced Configuration

Tuning Confidence Thresholds

The confidence_threshold determines how conservative auto-labeling is:

Threshold	Behavior	Trade-off
0.20-0.25	Aggressive	High recall, more false positives
0.30-0.35	Balanced	Good starting point
0.40-0.50	Conservative	High precision, fewer auto-labels
0.60+	Very strict	Only near-exact matches

Finding the right threshold:

Start with 0.30
Monitor false positive rate (wrong auto-labels)
Check coverage (% of items auto-labeled)
Adjust based on cost of errors:
- High cost of errors (e.g., medical imaging) → Higher threshold
- Low cost of errors (e.g., photo organization) → Lower threshold
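One way to make this tuning concrete is to sweep candidate thresholds over a small hand-verified sample and compare coverage against the false positive rate. The sketch below assumes you have collected, for each sampled item, the match confidence, the label the system proposed, and the label a human confirmed. `sweep_thresholds` is a hypothetical helper and the sample values are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# (confidence, predicted_label, true_label) for each hand-verified item
Sample = Tuple[float, str, str]


def sweep_thresholds(samples: Sequence[Sample], thresholds: Sequence[float]) -> List[Dict[str, float]]:
    """Compute coverage and false positive rate at each candidate threshold."""
    total = len(samples)
    rows = []
    for t in thresholds:
        # Items at or above the threshold would have been auto-labeled.
        labeled = [(pred, true) for conf, pred, true in samples if conf >= t]
        wrong = sum(1 for pred, true in labeled if pred != true)
        rows.append({
            "threshold": t,
            "coverage": len(labeled) / total if total else 0.0,
            "false_positive_rate": wrong / len(labeled) if labeled else 0.0,
        })
    return rows


sample = [
    (0.90, "Red Running Shoes", "Red Running Shoes"),
    (0.50, "Red Running Shoes", "Red Running Shoes"),
    (0.35, "Blue Basketball Shoes", "Red Running Shoes"),
    (0.20, "Blue Basketball Shoes", "Blue Basketball Shoes"),
]
for row in sweep_thresholds(sample, [0.30, 0.40]):
    print(row)
```

In this toy sample, raising the threshold from 0.30 to 0.40 drops the one wrong auto-label at the cost of lower coverage, which is the trade-off the list above describes.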

Monitoring & Analytics

Track performance with these queries:

# Get distribution of labels
GET /v1/collections/{collection_id}/analytics/field-distribution?field=product_label

# Check match confidence distribution
GET /v1/collections/{collection_id}/documents?sort_by=taxonomy_match.confidence&limit=100

# Find low-confidence matches for review
GET /v1/collections/{collection_id}/documents?filters={
  "must": [
    {
      "key": "taxonomy_match.matched",
      "match": { "operator": "eq", "value": true }
    },
    {
      "key": "taxonomy_match.confidence",
      "match": { "operator": "lt", "value": 0.40 }
    }
  ]
}

Key metrics:

Auto-label coverage: % of new items auto-labeled
Manual review queue: # of items with product_label: null
Confidence distribution: Are matches clustered around threshold?
False positive rate: Sample and manually verify auto-labels
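The first three metrics can be computed client-side from a page of documents. This is a minimal sketch, assuming documents shaped like the responses shown earlier in this tutorial (`metadata.product_label` and `taxonomy_match`); the `labeling_metrics` helper and the 0.05 band used to define "near the threshold" are illustrative choices.

```python
from typing import Dict, List


def labeling_metrics(documents: List[dict], threshold: float = 0.30, band: float = 0.05) -> Dict[str, float]:
    """Summarize coverage, review queue size, and matches close to the threshold."""
    total = len(documents)
    matched = [d for d in documents if (d.get("taxonomy_match") or {}).get("matched")]
    # Items still waiting for a label.
    queue = sum(1 for d in documents if (d.get("metadata") or {}).get("product_label") is None)
    # Matches that only just cleared the threshold.
    near = sum(
        1 for d in matched
        if d["taxonomy_match"].get("confidence", 0.0) < threshold + band
    )
    return {
        "auto_label_coverage": len(matched) / total if total else 0.0,
        "manual_review_queue": queue,
        "near_threshold_share": near / len(matched) if matched else 0.0,
    }
```

A high `near_threshold_share` suggests many labels were accepted by a narrow margin, which is a good reason to sample those items when estimating the false positive rate.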

Best Practices

Reference set quality:

Include diverse examples (angles, lighting, backgrounds)
Use consistent naming conventions
Aim for balanced distribution across categories
Maintain high-quality, unambiguous images

Labeling guidelines:

Create a labeling style guide
Consider hierarchical labels: "Shoes > Running > Red"
Define rules for edge cases
Version your taxonomy as it evolves
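If you adopt hierarchical labels such as "Shoes > Running > Red", a small client-side helper can split them into levels so you can report on any depth of the hierarchy. The sketch below is plain string handling with hypothetical function names; it assumes " > " is used as the separator throughout your style guide.

```python
from typing import List


def split_label(label: str, sep: str = ">") -> List[str]:
    """Split a hierarchical label into its levels, trimming whitespace."""
    return [part.strip() for part in label.split(sep) if part.strip()]


def label_paths(label: str, sep: str = ">") -> List[str]:
    """Return every prefix path of the label, from broadest to most specific."""
    parts = split_label(label, sep)
    joiner = f" {sep} "
    return [joiner.join(parts[:i]) for i in range(1, len(parts) + 1)]


print(label_paths("Shoes > Running > Red"))
# ['Shoes', 'Shoes > Running', 'Shoes > Running > Red']
```

Counting documents per prefix path gives category totals at each level without storing separate fields for each one.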

Continuous improvement:

Review unknowns regularly
Audit auto-labels periodically
Add corrected examples when the system makes mistakes
Expand categories as needed

Production deployment:

Start with a conservative threshold (0.40+)
Implement human-in-the-loop review for critical applications
Enable a feedback mechanism for corrections
A/B test threshold changes

Troubleshooting

Too many unlabeled items

Causes: Threshold too high, insufficient reference examples, new categories

Solutions:

Lower confidence_threshold to 0.25-0.30
Add 20+ examples per category to reference set
Review and label new categories

False positives (wrong labels)

Causes: Threshold too low, similar categories, poor quality references

Solutions:

Raise confidence_threshold to 0.40+
Add diverse examples to distinguish categories
Clean up reference set

System not self-improving

Causes: Labels not syncing, configuration issues

Solutions:

Verify field_passthrough includes label field
Check retriever filters for non-null labels
Confirm bucket-to-collection sync is working
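The first two checks can be run against the JSON bodies you used to create the collection and retriever. The sketch below assumes those bodies have the same shape as the Step 1 requests in this tutorial (`taxonomy.field_to_enrich`, `feature_extractor.field_passthrough`, and an `attribute_filter` stage with `conditions.AND`); `check_self_improving` is a hypothetical local helper, not an API call.

```python
from typing import List


def check_self_improving(collection: dict, retriever: dict) -> List[str]:
    """Return a list of configuration problems (empty if none were found)."""
    problems: List[str] = []
    field = (collection.get("taxonomy") or {}).get("field_to_enrich")
    if not field:
        return ["collection has no taxonomy.field_to_enrich"]

    # The label must be passed through to documents, or matches cannot return it.
    passthrough = (collection.get("feature_extractor") or {}).get("field_passthrough", [])
    if field not in passthrough:
        problems.append(f"field_passthrough does not include '{field}'")

    # The retriever should only match against documents that already have a label.
    has_exists_filter = False
    for stage in retriever.get("stages", []):
        params = (stage.get("config") or {}).get("parameters") or {}
        for cond in (params.get("conditions") or {}).get("AND", []):
            if (cond.get("field") == field
                    and cond.get("operator") == "exists"
                    and cond.get("value") is True):
                has_exists_filter = True
    if not has_exists_filter:
        problems.append(f"retriever has no 'exists' filter on '{field}'")
    return problems
```

An empty result means the configuration matches the pattern used in this tutorial; it does not confirm that bucket-to-collection sync is working, which still has to be verified by labeling an object and re-reading its document.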

Summary

Workflow:

Create bucket and collection with feature extraction
Upload unlabeled data (50-100 items)
Manually label reference set (10-20 per category)
Create taxonomy retriever pointing to labeled items
New uploads auto-label based on similarity
Review and label unknowns to improve system

Key benefits:

Start with zero labels, build incrementally
Automate repetitive labeling
Self-improving with each manual correction
Scales from dozens to millions

Next steps:

Choose unified (simpler) or separate (more control) approach
Start with 50-100 reference items
Test different confidence thresholds (start at 0.30)
Monitor auto-label quality and adjust

Discover Clusters

Use clustering to find new categories before defining them manually:

POST /v1/clusters
{
  "cluster_name": "product-discovery",
  "collection_ids": ["col_products_unified"],
  "cluster_type": "vector",
  "vector_config": {
    "feature_uris": ["mixpeek://image_extractor@v1/google_siglip_base_v1"],
    "clustering_method": "hdbscan",
    "hdbscan_parameters": { "min_cluster_size": 5 }
  },
  "llm_labeling": {
    "enabled": true,
    "input_mappings": [{ "source": "payload", "fields": ["product_label"] }]
  },
  "dimension_reduction": { "method": "umap", "n_components": 2 }
}

Clusters reveal groups you haven’t labeled yet — “sandals”, “boots”, “athletic wear” — without predefined categories. Once a cluster stabilizes, promote it to a taxonomy node so future items auto-classify into it. See Clusters.

Set Up Alerts

Get notified when items fail to auto-label (unknown categories needing manual review). An alert runs a retriever and fires on its results — so first create a retriever that surfaces unlabeled items, then point an alert at it.

# 1. Retriever that returns products with no label. attribute_filter as the
#    first stage fetches + filters straight from the collection (no search needed).
POST /v1/retrievers
{
  "retriever_name": "unlabeled-products",
  "collection_identifiers": ["col_products_unified"],
  "input_schema": {},
  "stages": [
    { "stage_name": "missing_label", "stage_type": "filter",
      "config": { "stage_id": "attribute_filter", "parameters": {
        "field": "product_label", "operator": "exists", "value": false } } }
  ]
}
# -> { "retriever_id": "ret_unlabeled" }

# 2. Fire a webhook whenever that retriever returns any results
POST /v1/alerts
{
  "name": "unknown-products",
  "source": "retriever",
  "retriever_id": "ret_unlabeled",
  "trigger_on": "results",
  "notification_config": {
    "channels": [
      { "channel_type": "webhook", "config": { "url": "https://example.com/webhook" } }
    ]
  }
}

See Alerts for system-metric alerts and Slack/email channels.

Set Up Webhooks

Forward ingestion and labeling events to your own systems:

POST /v1/organizations/webhooks
{
  "webhook_name": "labeling-events",
  "event_types": ["object.created", "object.updated"],
  "channels": [
    { "channel": "webhook", "configs": { "url": "https://example.com/webhook", "method": "POST" } }
  ]
}

event_types accepts object/collection/cluster/taxonomy/alert lifecycle events (e.g. object.created, object.updated, object.deleted, collection.created). See Webhooks for the full event list and Slack/email channels.
