TrainingDriftThemes

Dataset Versioning

Treat versioned object storage as your dataset's source of truth. Capture complete snapshots—raw assets, embeddings, and cluster assignments—for deterministic reconstruction at any point in time.

video

image

audio

text

Production

31.0K runs

Deploy Recipe

"Retrieve training dataset snapshot from October 1st with all embeddings and cluster assignments"

Why This Matters

When datasets stop slipping out from under you, everything downstream gets easier. True reproducibility means rebuilding exact training inputs, not reconstructing from memory.

import requests

API_URL = "https://api.mixpeek.com"
headers = {"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}

# Create collection linked to versioned object storage
collection = requests.post(f"{API_URL}/v1/collections", headers=headers, json={
    "collection_name": "training_data_v2",
    "source": {"type": "bucket", "bucket_id": "versioned-training-data"},
    "feature_extractor": {
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1"
    }
}).json()

# Index from versioned object storage (e.g., Tigris, S3)
requests.post(f"{API_URL}/v1/buckets/versioned-training-data/objects", headers=headers, json={
    "blobs": [{"property": "content", "url": "s3://bucket/training/"}],
    "metadata": {"version": "v2", "snapshot_date": "2024-10-01"}
})

# Create cluster snapshot for this version
cluster = requests.post(f"{API_URL}/v1/clusters", headers=headers, json={
    "cluster_name": "training_v2_snapshot",
    "source_collection_ids": [collection["collection_id"]],
    "feature_addresses": ["mixpeek://multimodal_extractor@v1/embedding"],
    "algorithm": "hdbscan"
}).json()

# Execute to create snapshot
execution = requests.post(
    f"{API_URL}/v1/clusters/{cluster['cluster_id']}/execute",
    headers=headers
).json()
print(f"Snapshot created: {execution['run_id']}")

# Query historical dataset state by filtering on metadata
results = requests.post(
    f"{API_URL}/v1/retrievers/versioned-search/execute",
    headers=headers,
    json={"query": {"text": "product demos"}}
).json()

print(f"Found {len(results['documents'])} documents")

Feature Extractors

Image Embedding

Generate visual embeddings for similarity search and clustering

752K runs

Video Embedding

Generate vector embeddings for video content

610K runs

Audio Transcription

Transcribe audio content to text

450K runs

Text Embedding

Extract semantic embeddings from documents, transcripts and text content

827K runs

Retriever Stages

attribute filter

Filter documents by metadata attribute values using boolean logic

filter

feature search

Search and filter documents by vector similarity using feature embeddings

filter

Resources Used

Clustering

Version Clusters

Cluster snapshots tied to dataset versions

Analytics

Version Metrics

Track changes between dataset versions

Track changes between versions

Documentation

Clusters Collections

Dataset Versioning

Why This Matters

Feature Extractors

Retriever Stages

Resources Used

Related Blog Posts

Documentation

Related Recipes & Resources

Video Embedding

Audio Transcription

Text Embedding

Image Embedding

Video Embedding

Image Embedding