
# Auto-Labeling Datasets

> Build a self-improving classification system using taxonomy auto-labeling

**Build a labeled dataset from scratch and auto-classify new data using taxonomy-based matching.**

<Tip>Auto-labeling uses the warehouse's enrichment layer (taxonomies) to classify documents at query time, the multimodal equivalent of a SQL JOIN.</Tip>

This tutorial shows how to:

1. Start with unlabeled data
2. Use feature extraction to find relevant items
3. Manually label a small reference set
4. Automatically classify new items based on the reference set
5. Create a self-improving system that gets better over time

<img src="https://mintcdn.com/mixpeek/TmiAqiYj-LwmWL2a/assets/tutorials/bootstrap-labeled-dataset.svg?fit=max&auto=format&n=TmiAqiYj-LwmWL2a&q=85&s=0bb8096d894c78c46141b27f8538e8a9" alt="Bootstrap Labeled Dataset Workflow" width="1200" height="500" data-path="assets/tutorials/bootstrap-labeled-dataset.svg" />

***

## Overview

This tutorial demonstrates two approaches to building an auto-labeling system:

* **Option A: Unified Approach** (Recommended) - Single bucket/collection that grows smarter over time
* **Option B: Separate Approach** - Dedicated reference set with production data separated

Both approaches follow the same core workflow:

1. Upload unlabeled data with feature extraction
2. Manually label a small reference set (10-20 examples per category)
3. Configure taxonomy to auto-label new items based on similarity
4. Review and label unknowns to continuously improve

***

## Use Cases

* **Product Recognition**: Label product images, auto-tag new inventory
* **People Identification**: Build a face recognition system from photos
* **Document Classification**: Categorize documents by type or topic
* **Object Detection**: Label objects in images for training data

***

## Option A: Unified Approach (Recommended)

The unified approach uses a single bucket and collection that references itself. As you label items, they immediately become part of the reference set for future matches.

### Step 1: Create Bucket and Collection

Create the bucket, the classifier retriever, and a collection whose taxonomy points back at that retriever:

```bash theme={null}
# Create bucket
POST /v1/buckets
{
  "bucket_name": "products_unified",
  "bucket_schema": {
    "properties": {
      "product_label": { "type": "text" },
      "image_url": { "type": "text" }
    }
  }
}

# Create retriever (do this first, before collection)
POST /v1/retrievers
{
  "retriever_name": "products_unified_classifier",
  "collection_identifiers": ["products_unified"],
  "stages": [
    {
      "stage_name": "labeled_filter",
      "stage_type": "filter",
      "config": {
        "stage_id": "attribute_filter",
        "parameters": {
          "conditions": {
            "AND": [
              { "field": "product_label", "operator": "exists", "value": true }
            ]
          }
        }
      }
    },
    {
      "stage_name": "image_match",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
              "query": { "input_mode": "content", "value": "{{INPUT.query_image}}" },
              "top_k": 1,
              "min_score": 0.30
            }
          ]
        }
      }
    }
  ]
}

# Create collection that references itself
POST /v1/collections
{
  "collection_name": "products_unified",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_products_unified"]
  },
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": ["product_label"]
  },
  "taxonomy": {
    "retriever_id": "ret_products_unified_classifier",
    "field_to_enrich": "product_label",
    "confidence_threshold": 0.30
  }
}
```
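The retriever's `min_score` and the collection's `confidence_threshold` are both set to 0.30 above. As a minimal sketch, the three request bodies can be generated from a single threshold value so the two settings cannot drift apart. The helper name `build_unified_payloads` is hypothetical, and the `bkt_`/`ret_` strings are only the placeholder IDs from the example; substitute whatever IDs your create calls return.

```python
import json

FEATURE_URI = "mixpeek://image_extractor@v1/google_siglip_base_v1"


def build_unified_payloads(name, threshold, bucket_id, retriever_id,
                           label_field="product_label", image_field="image_url"):
    """Build the bucket, retriever, and collection bodies from one threshold."""
    bucket = {
        "bucket_name": name,
        "bucket_schema": {"properties": {
            label_field: {"type": "text"},
            image_field: {"type": "text"},
        }},
    }
    # Stage 1: only documents that already carry a label can act as references.
    labeled_only = {
        "stage_name": "labeled_filter",
        "stage_type": "filter",
        "config": {"stage_id": "attribute_filter", "parameters": {
            "conditions": {"AND": [
                {"field": label_field, "operator": "exists", "value": True},
            ]},
        }},
    }
    # Stage 2: nearest labeled image, cut off at the shared threshold.
    nearest_image = {
        "stage_name": "image_match",
        "stage_type": "filter",
        "config": {"stage_id": "feature_search", "parameters": {
            "searches": [{
                "feature_uri": FEATURE_URI,
                "query": {"input_mode": "content", "value": "{{INPUT.query_image}}"},
                "top_k": 1,
                "min_score": threshold,  # same value as confidence_threshold below
            }],
        }},
    }
    retriever = {
        "retriever_name": f"{name}_classifier",
        "collection_identifiers": [name],
        "stages": [labeled_only, nearest_image],
    }
    collection = {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_ids": [bucket_id]},
        "feature_extractor": {
            "feature_extractor_name": "image_extractor",
            "version": "v1",
            "input_mappings": {"image": image_field},
            "field_passthrough": [label_field],
        },
        "taxonomy": {
            "retriever_id": retriever_id,
            "field_to_enrich": label_field,
            "confidence_threshold": threshold,
        },
    }
    return bucket, retriever, collection


bucket, retriever, collection = build_unified_payloads(
    "products_unified", 0.30,
    bucket_id="bkt_products_unified",                # placeholder ID from the example
    retriever_id="ret_products_unified_classifier",  # placeholder ID from the example
)
print(json.dumps(collection["taxonomy"], indent=2))
```

The function only builds dictionaries; sending them is left to whichever HTTP client you already use.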

### Step 2: Upload Initial Unlabeled Data

```bash theme={null}
POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/bootstrap",
  "metadata": {
    "product_label": null
  },
  "blobs": [{
    "property": "image_url",
    "type": "image",
    "data": {
      "url": "s3://my-bucket/products/shoe-001.jpg"
    }
  }]
}
```

Upload 50-100 images. Feature extraction happens automatically, but no auto-labeling occurs yet (no labeled examples to match against).

### Step 3: Manually Label Reference Set

Query documents and label them:

```bash theme={null}
# Get documents
GET /v1/collections/{collection_id}/documents?return_presigned_urls=true

# Label via bucket (syncs to collection automatically)
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{
  "metadata": {
    "product_label": "Red Running Shoes"
  }
}
```

**Labeling tips:**

* Label 10-20 examples per category minimum
* Include diverse examples (angles, lighting, backgrounds)
* Use consistent naming conventions
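The tips above can be checked before you rely on auto-labeling. The sketch below is a local helper, not part of the Mixpeek API: it counts labels per category and flags spellings that differ only by case or whitespace. The function names and the default minimum of 10 are illustrative choices taken from the "10-20 examples" guideline.

```python
from collections import Counter


def categories_below_minimum(labels, minimum=10):
    """Return {label: count} for categories with fewer than `minimum` examples."""
    counts = Counter(label for label in labels if label is not None)
    return {label: n for label, n in counts.items() if n < minimum}


def find_naming_collisions(labels):
    """Return groups of labels that differ only by case or extra whitespace."""
    groups = {}
    for label in labels:
        if label is None:
            continue
        key = " ".join(label.split()).casefold()
        groups.setdefault(key, set()).add(label)
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


labels = (
    ["Red Running Shoes"] * 12
    + ["red running shoes"]            # inconsistent spelling of the same category
    + ["Blue Basketball Shoes"] * 4
    + [None] * 30                      # still unlabeled
)
short = categories_below_minimum(labels)
collisions = find_naming_collisions(labels)
print(short)
print(collisions)
```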

### Step 4: Upload New Items - Auto-Labeling Works!

Now that you have labeled examples, new uploads are labeled automatically:

```bash theme={null}
POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/new-arrivals",
  "blobs": [{
    "property": "image_url",
    "type": "image",
    "data": {
      "url": "s3://my-bucket/new-arrivals/shoe-new.jpg"
    }
  }]
}
```

**What happens automatically:**

1. Feature extraction runs on the new image
2. Taxonomy searches your labeled items for similar matches
3. If the top match scores above the `confidence_threshold` (0.30 here) → Auto-labels the item (e.g., `"Red Running Shoes"`)
4. If it scores below the threshold → Leaves `product_label` as `null` for manual review
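The decision in steps 3 and 4 can be written out as a small function. This is a local illustration of the rule, not Mixpeek's implementation; the function name is hypothetical, the result mirrors the response shape shown in this step, and treating a score exactly equal to the threshold as a match is an assumption, since the boundary case is not specified here.

```python
def apply_taxonomy_match(best_label, best_score, source_document_id, threshold=0.30):
    """Turn the top-1 reference match into an auto-label decision."""
    matched = best_score >= threshold  # assumption: the boundary counts as a match
    result = {
        "metadata": {"product_label": best_label if matched else None},
        "taxonomy_match": {"matched": matched, "confidence": best_score},
    }
    if matched:
        # Keep a pointer to the reference item that supplied the label.
        result["taxonomy_match"]["source_document_id"] = source_document_id
    return result


hit = apply_taxonomy_match("Red Running Shoes", 0.87, "doc_xyz123")
miss = apply_taxonomy_match("Red Running Shoes", 0.21, "doc_xyz123")
print(hit["metadata"], miss["metadata"])
```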

**Check the result:**

```bash theme={null}
GET /v1/collections/{collection_id}/documents/{document_id}
```

**Matched:**

```json theme={null}
{
  "metadata": {
    "product_label": "Red Running Shoes"
  },
  "taxonomy_match": {
    "matched": true,
    "confidence": 0.87,
    "source_document_id": "doc_xyz123"
  }
}
```

**Unknown (needs manual review):**

```json theme={null}
{
  "metadata": {
    "product_label": null
  },
  "taxonomy_match": {
    "matched": false,
    "confidence": 0.21
  }
}
```

### Step 5: Review and Label Unknowns

Find items that need manual labeling:

```bash theme={null}
GET /v1/collections/{collection_id}/documents?filters={
  "must": [
    {
      "key": "product_label",
      "match": { "operator": "eq", "value": null }
    }
  ]
}
```
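A JSON object cannot be placed in a query string verbatim, so the `filters` value needs URL-encoding when you send the request from code. A minimal sketch, assuming the endpoint accepts the JSON-encoded `filters` parameter exactly as shown above; the base URL and the helper name are placeholders.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def unlabeled_documents_url(base_url, collection_id, label_field="product_label"):
    """Build the documents URL with a URL-encoded filter for missing labels."""
    filters = {
        "must": [
            {"key": label_field, "match": {"operator": "eq", "value": None}},
        ]
    }
    # json.dumps turns None into null; urlencode percent-encodes braces and quotes.
    query = urlencode({"filters": json.dumps(filters, separators=(",", ":"))})
    return f"{base_url}/v1/collections/{collection_id}/documents?{query}"


# "https://api.example.com" is a placeholder host, not the real API address.
url = unlabeled_documents_url("https://api.example.com", "col_products_unified")
print(url)
```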

Label them via bucket (automatically syncs to collection):

```bash theme={null}
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{
  "metadata": {
    "product_label": "Blue Basketball Shoes"
  }
}
```

**Self-improvement in action**: This newly labeled item becomes part of the reference set for future uploads!
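To see why one extra label helps, here is a toy simulation of the feedback loop. It stands in for the real pipeline with two-dimensional vectors and cosine similarity; the `ReferenceSet` class, the vectors, and the resulting scores are invented for illustration and say nothing about the scores the actual image embeddings produce.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ReferenceSet:
    """Labeled examples plus a top-1, thresholded nearest-neighbor lookup."""

    def __init__(self, threshold=0.30):
        self.threshold = threshold
        self.items = []  # list of (vector, label)

    def add(self, vector, label):
        self.items.append((vector, label))

    def classify(self, vector):
        if not self.items:
            return None, 0.0
        score, label = max((cosine(vector, v), lbl) for v, lbl in self.items)
        return (label if score >= self.threshold else None), score


refs = ReferenceSet()
refs.add([1.0, 0.0], "Red Running Shoes")

# A new kind of product: nothing similar is labeled yet, so it stays unknown.
first_label, first_score = refs.classify([0.0, 1.0])

# A reviewer labels it, which adds it to the reference set ...
refs.add([0.0, 1.0], "Blue Basketball Shoes")

# ... so the next similar upload is labeled without human input.
second_label, second_score = refs.classify([0.1, 0.9])
print(first_label, second_label)
```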

***

## Option B: Separate Approach

For more control, keep reference data separate from production data:

* **Reference bucket/collection**: Curated, high-quality labeled examples
* **Production bucket/collection**: All data with auto-labels

**When to use:**

* Need strict quality control on reference set
* Want to prevent noisy auto-labels from affecting matching
* Prefer to manually review before promoting items to reference

### Step 1: Create Reference Bucket and Collection

```bash theme={null}
# Reference bucket
POST /v1/buckets
{
  "bucket_name": "product_reference",
  "bucket_schema": {
    "properties": {
      "product_label": { "type": "text" },
      "image_url": { "type": "text" }
    }
  }
}

# Reference collection (no taxonomy needed)
POST /v1/collections
{
  "collection_name": "product_reference",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_product_reference"]
  },
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": ["product_label"]
  }
}

# Create taxonomy retriever
POST /v1/retrievers
{
  "retriever_name": "product_classifier",
  "collection_identifiers": ["product_reference"],
  "stages": [
    {
      "stage_name": "labeled_filter",
      "stage_type": "filter",
      "config": {
        "stage_id": "attribute_filter",
        "parameters": {
          "conditions": {
            "AND": [
              { "field": "product_label", "operator": "exists", "value": true }
            ]
          }
        }
      }
    },
    {
      "stage_name": "image_match",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
              "query": { "input_mode": "content", "value": "{{INPUT.query_image}}" },
              "top_k": 1,
              "min_score": 0.30
            }
          ]
        }
      }
    }
  ]
}
```

### Step 2: Upload and Label Reference Set

Upload 50-100 curated images to the reference bucket and manually label them:

```bash theme={null}
# Upload to reference
POST /v1/buckets/bkt_product_reference/objects
{
  "metadata": { "product_label": null },
  "blobs": [{ "property": "image_url", "type": "image", "data": "https://example.com/image.jpg" }]
}

# Label them
PATCH /v1/buckets/bkt_product_reference/objects/{object_id}
{
  "metadata": { "product_label": "Red Running Shoes" }
}
```

### Step 3: Create Production Bucket and Collection

```bash theme={null}
# Production bucket
POST /v1/buckets
{
  "bucket_name": "product_catalog",
  "bucket_schema": {
    "properties": {
      "product_label": { "type": "text" },
      "image_url": { "type": "text" }
    }
  }
}

# Production collection with taxonomy
POST /v1/collections
{
  "collection_name": "product_catalog",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_product_catalog"]
  },
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image_url" },
    "field_passthrough": ["product_label"]
  },
  "taxonomy": {
    "retriever_id": "ret_product_classifier",
    "field_to_enrich": "product_label",
    "confidence_threshold": 0.30
  }
}
```

### Step 4: Upload Production Data

New uploads auto-label based on the reference set:

```bash theme={null}
POST /v1/buckets/bkt_product_catalog/objects
{
  "blobs": [{ "property": "image_url", "type": "image", "data": "https://example.com/image.jpg" }]
}
```

### Step 5: Promote High-Confidence Items to Reference

Periodically review production data and promote high-confidence matches:

```bash theme={null}
# Find high-confidence items
GET /v1/collections/product_catalog/documents?filters={
  "must": [
    {
      "key": "taxonomy_match.confidence",
      "match": { "operator": "gte", "value": 0.85 }
    }
  ]
}

# Copy to reference bucket
POST /v1/buckets/bkt_product_reference/objects
{
  "metadata": { "product_label": "..." },
  "blobs": [{ ... }]
}
```
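If you fetch documents first and pick candidates client-side, the selection might look like the sketch below. It assumes each document carries the `metadata` and `taxonomy_match` fields shown in the responses earlier, plus a `document_id` key, which is an assumption about the payload. It only produces a shortlist; copying the blob into the reference bucket is left out because it depends on how your objects are stored.

```python
def select_promotions(documents, min_confidence=0.85, label_field="product_label"):
    """Return matched, labeled documents at or above `min_confidence`, best first."""
    candidates = []
    for doc in documents:
        match = doc.get("taxonomy_match") or {}
        label = (doc.get("metadata") or {}).get(label_field)
        confidence = match.get("confidence", 0.0)
        if match.get("matched") and label and confidence >= min_confidence:
            candidates.append({
                "document_id": doc.get("document_id"),  # assumed field name
                label_field: label,
                "confidence": confidence,
            })
    return sorted(candidates, key=lambda c: c["confidence"], reverse=True)


documents = [
    {"document_id": "doc_1", "metadata": {"product_label": "Red Running Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.91}},
    {"document_id": "doc_2", "metadata": {"product_label": "Red Running Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.52}},
    {"document_id": "doc_3", "metadata": {"product_label": None},
     "taxonomy_match": {"matched": False, "confidence": 0.21}},
    {"document_id": "doc_4", "metadata": {"product_label": "Blue Basketball Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.85}},
]
candidates = select_promotions(documents)
print([c["document_id"] for c in candidates])
```

Reviewing the shortlist by hand before uploading keeps the reference set curated, which is the main reason to choose the separate approach.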

***

## Real-World Examples

### Example 1: Face Recognition System

```bash theme={null}
# Create bucket for employee photos
POST /v1/buckets
{
  "bucket_name": "employee_photos",
  "bucket_schema": {
    "properties": {
      "person_name": { "type": "text" },
      "employee_id": { "type": "text" },
      "photo_url": { "type": "text" }
    }
  }
}

# Bootstrap collection with face extraction
POST /v1/collections
{
  "collection_name": "employee_faces",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_employee_photos"]
  },
  "feature_extractor": {
    "feature_extractor_name": "face_identity_extractor",
    "version": "v1",
    "input_mappings": { "image": "photo_url" },
    "field_passthrough": ["person_name", "employee_id"]
  }
}

# Remaining steps (not shown):
# 1. Upload 50 employee photos and manually label each with the person's name
# 2. Create a taxonomy retriever
# 3. Security camera footage is then matched against the labeled faces
```

### Example 2: Document Classification

```bash theme={null}
# Create bucket for documents
POST /v1/buckets
{
  "bucket_name": "company_documents",
  "bucket_schema": {
    "properties": {
      "document_type": { "type": "text" },
      "content": { "type": "text" }
    }
  }
}

# Bootstrap collection with text extraction
POST /v1/collections
{
  "collection_name": "document_types",
  "source": {
    "type": "bucket",
    "bucket_ids": ["bkt_company_documents"]
  },
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": { "text": "content" },
    "field_passthrough": ["document_type"]
  },
  "taxonomy": {
    "field_to_enrich": "document_type",
    "confidence_threshold": 0.35
  }
}

# Label 20 invoices, 20 contracts, 20 receipts
# New documents auto-classify by type
```

***

## Advanced Configuration

### Tuning Confidence Thresholds

The `confidence_threshold` determines how conservative auto-labeling is:

| Threshold   | Behavior     | Use Case                          |
| ----------- | ------------ | --------------------------------- |
| `0.20-0.25` | Aggressive   | High recall, more false positives |
| `0.30-0.35` | Balanced     | Good starting point               |
| `0.40-0.50` | Conservative | High precision, fewer auto-labels |
| `0.60+`     | Very strict  | Only near-identical matches       |

**Finding the right threshold:**

1. Start with `0.30`
2. Monitor false positive rate (wrong auto-labels)
3. Check coverage (% of items auto-labeled)
4. Adjust based on cost of errors:
   * **High cost of errors** (e.g., medical imaging) → Higher threshold
   * **Low cost of errors** (e.g., photo organization) → Lower threshold
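Steps 2 to 4 can be made concrete with a small hand-reviewed sample. The sketch below replays candidate thresholds over `(confidence, predicted_label, true_label)` triples and reports coverage and the share of auto-labels that would be wrong. The function name and the sample values are made up for illustration.

```python
def sweep_thresholds(reviewed, thresholds):
    """Compute coverage and false positive rate for each candidate threshold.

    reviewed: list of (confidence, predicted_label, true_label) tuples.
    """
    total = len(reviewed)
    rows = []
    for t in thresholds:
        # Items at or above the threshold would have been auto-labeled.
        accepted = [(pred, true) for conf, pred, true in reviewed if conf >= t]
        wrong = sum(1 for pred, true in accepted if pred != true)
        rows.append({
            "threshold": t,
            "coverage": len(accepted) / total if total else 0.0,
            "false_positive_rate": wrong / len(accepted) if accepted else 0.0,
        })
    return rows


reviewed = [
    (0.92, "A", "A"),
    (0.55, "A", "A"),
    (0.41, "B", "B"),
    (0.33, "A", "B"),  # wrong auto-label
    (0.27, "B", "A"),  # wrong auto-label
    (0.22, "B", "B"),
]
rows = sweep_thresholds(reviewed, [0.25, 0.30, 0.40])
for row in rows:
    print(row)
```

In this sample, raising the threshold from 0.25 to 0.40 removes both wrong labels but cuts coverage from five of six items to three of six, which is the trade-off the table above describes.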

### Monitoring & Analytics

Track performance with these queries:

```bash theme={null}
# Get distribution of labels
GET /v1/collections/{collection_id}/analytics/field-distribution?field=product_label

# Check match confidence distribution
GET /v1/collections/{collection_id}/documents?sort_by=taxonomy_match.confidence&limit=100

# Find low-confidence matches for review
GET /v1/collections/{collection_id}/documents?filters={
  "must": [
    {
      "key": "taxonomy_match.matched",
      "match": { "operator": "eq", "value": true }
    },
    {
      "key": "taxonomy_match.confidence",
      "match": { "operator": "lt", "value": 0.40 }
    }
  ]
}
```

**Key metrics:**

* **Auto-label coverage**: % of new items auto-labeled
* **Manual review queue**: # of items with `product_label: null`
* **Confidence distribution**: Are matches clustered around threshold?
* **False positive rate**: Sample and manually verify auto-labels
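The first three metrics can be computed from a page of documents once they are fetched. A minimal sketch, assuming each document has the `metadata` and `taxonomy_match` fields shown in the responses earlier; the function name and the `margin` parameter, which defines what counts as "close to the threshold", are hypothetical.

```python
def labeling_metrics(documents, threshold=0.30, margin=0.05, label_field="product_label"):
    """Summarize coverage, review queue size, and near-threshold matches."""
    total = len(documents)
    matched = [d for d in documents if d.get("taxonomy_match", {}).get("matched")]
    review_queue = [d for d in documents
                    if d.get("metadata", {}).get(label_field) is None]
    # Matches that barely cleared the threshold are the first ones to audit.
    near = [d for d in matched
            if d["taxonomy_match"]["confidence"] < threshold + margin]
    return {
        "coverage": len(matched) / total if total else 0.0,
        "review_queue": len(review_queue),
        "near_threshold_share": len(near) / len(matched) if matched else 0.0,
    }


documents = [
    {"metadata": {"product_label": "Red Running Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.87}},
    {"metadata": {"product_label": "Red Running Shoes"},
     "taxonomy_match": {"matched": True, "confidence": 0.32}},
    {"metadata": {"product_label": None},
     "taxonomy_match": {"matched": False, "confidence": 0.21}},
    {"metadata": {"product_label": None},
     "taxonomy_match": {"matched": False, "confidence": 0.10}},
]
metrics = labeling_metrics(documents)
print(metrics)
```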

### Best Practices

**Reference set quality:**

* Include diverse examples (angles, lighting, backgrounds)
* Use consistent naming conventions
* Aim for balanced distribution across categories
* Maintain high-quality, unambiguous images

**Labeling guidelines:**

* Create a labeling style guide
* Consider hierarchical labels: `"Shoes > Running > Red"`
* Define rules for edge cases
* Version your taxonomy as it evolves
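Hierarchical labels such as `"Shoes > Running > Red"` are plain strings, so any roll-up by level happens on your side. The helpers below are an illustrative sketch; their names and the choice of `>` as the separator follow the example above and are not a Mixpeek convention.

```python
from collections import Counter


def parse_label(label, separator=">"):
    """Split a hierarchical label into its trimmed path components."""
    return [part.strip() for part in label.split(separator) if part.strip()]


def rollup_counts(labels, depth=1):
    """Count labels after truncating each one to its first `depth` levels."""
    return Counter(" > ".join(parse_label(label)[:depth]) for label in labels)


labels = [
    "Shoes > Running > Red",
    "Shoes > Running > Blue",
    "Shoes > Basketball > Blue",
    "Boots > Hiking > Brown",
]
path = parse_label(labels[0])
top_level = rollup_counts(labels, depth=1)
second_level = rollup_counts(labels, depth=2)
print(path, dict(top_level), dict(second_level))
```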

**Continuous improvement:**

* Review unknowns regularly
* Audit auto-labels periodically
* Add corrected examples when the system makes mistakes
* Expand categories as needed

**Production deployment:**

* Start with a conservative threshold (0.40+)
* Implement human-in-the-loop for critical applications
* Enable feedback mechanism for corrections
* A/B test threshold changes

***

## Troubleshooting

### Too many unlabeled items

**Causes**: Threshold too high, insufficient reference examples, new categories

**Solutions**:

* Lower `confidence_threshold` to 0.25-0.30
* Add 20+ examples per category to reference set
* Review and label new categories

### False positives (wrong labels)

**Causes**: Threshold too low, similar categories, poor quality references

**Solutions**:

* Raise `confidence_threshold` to 0.40+
* Add diverse examples to distinguish categories
* Clean up reference set

### System not self-improving

**Causes**: Labels not syncing, configuration issues

**Solutions**:

* Verify `field_passthrough` includes the label field
* Check that the retriever's `attribute_filter` stage keeps only documents with a non-null label
* Confirm bucket-to-collection sync is working

***

## Summary

**Workflow:**

1. Create bucket and collection with feature extraction
2. Upload unlabeled data (50-100 items)
3. Manually label reference set (10-20 per category)
4. Create taxonomy retriever pointing to labeled items
5. New uploads auto-label based on similarity
6. Review and label unknowns to improve system

**Key benefits:**

* Start with zero labels, build incrementally
* Automate repetitive labeling
* Self-improving with each manual correction
* Scales from dozens to millions

**Next steps:**

* Choose unified (simpler) or separate (more control) approach
* Start with 50-100 reference items
* Test different confidence thresholds (start at 0.30)
* Monitor auto-label quality and adjust

***

## Discover Clusters

Use clustering to find new categories before defining them manually:

```bash theme={null}
POST /v1/clusters
{
  "cluster_name": "product-discovery",
  "collection_ids": ["col_products_unified"],
  "cluster_type": "vector",
  "vector_config": {
    "feature_uris": ["mixpeek://image_extractor@v1/google_siglip_base_v1"],
    "clustering_method": "hdbscan",
    "hdbscan_parameters": { "min_cluster_size": 5 }
  },
  "llm_labeling": {
    "enabled": true,
    "input_mappings": [{ "source": "payload", "fields": ["product_label"] }]
  },
  "dimension_reduction": { "method": "umap", "n_components": 2 }
}
```

Clusters reveal groups you haven't labeled yet, such as "sandals", "boots", or "athletic wear", without predefined categories. Once a cluster stabilizes, promote it to a taxonomy node so future items auto-classify into it. See [Clusters](/enrichment/clusters).

## Set Up Alerts

Get notified when items fail to auto-label (unknown categories needing manual review). An alert **runs a retriever and fires on its results** — so first create a retriever that surfaces unlabeled items, then point an alert at it.

```bash theme={null}
# 1. Retriever that returns products with no label. attribute_filter as the
#    first stage fetches + filters straight from the collection (no search needed).
POST /v1/retrievers
{
  "retriever_name": "unlabeled-products",
  "collection_identifiers": ["col_products_unified"],
  "input_schema": {},
  "stages": [
    { "stage_name": "missing_label", "stage_type": "filter",
      "config": { "stage_id": "attribute_filter", "parameters": {
        "field": "product_label", "operator": "exists", "value": false } } }
  ]
}
# -> { "retriever_id": "ret_unlabeled" }

# 2. Fire a webhook whenever that retriever returns any results
POST /v1/alerts
{
  "name": "unknown-products",
  "source": "retriever",
  "retriever_id": "ret_unlabeled",
  "trigger_on": "results",
  "notification_config": {
    "channels": [
      { "channel_type": "webhook", "config": { "url": "https://example.com/webhook" } }
    ]
  }
}
```

See [Alerts](/enrichment/alerts) for system-metric alerts and Slack/email channels.

## Set Up Webhooks

Forward ingestion and labeling events to your own systems:

```bash theme={null}
POST /v1/organizations/webhooks
{
  "webhook_name": "labeling-events",
  "event_types": ["object.created", "object.updated"],
  "channels": [
    { "channel": "webhook", "configs": { "url": "https://example.com/webhook", "method": "POST" } }
  ]
}
```

`event_types` accepts object/collection/cluster/taxonomy/alert lifecycle events (e.g. `object.created`, `object.updated`, `object.deleted`, `collection.created`). See [Webhooks](/operations/webhooks) for the full event list and Slack/email channels.
