Mixpeek Logo
    DatasetDrift

    Dataset Versioning & Reproducibility

    Immutable object versioning meets multimodal enrichment. Capture complete dataset snapshots—raw assets, derived artifacts, embeddings, and cluster assignments—for deterministic reconstruction at any point in time.

    video
    image
    audio
    text
    Production
    31.0K runs
    Deploy Recipe

    Why This Matters

    When datasets stop slipping out from under you, everything downstream gets easier to reason about. True reproducibility isn't about best effort—it's about rebuilding, not reconstructing.

    from mixpeek import Mixpeek
    client = Mixpeek(api_key="your-api-key")
    # Create namespace for versioned training data
    namespace = client.namespaces.create(
    namespace_name="training_data_v2"
    )
    # Create collection with multimodal extractors
    collection = client.collections.create(
    collection_name="training_data_v2",
    feature_extractors=[{
    "feature_extractor_name": "multimodal-embed",
    "version": "v1"
    }]
    )
    # Index objects from versioned object storage (e.g., Tigris)
    client.buckets.objects.create(
    bucket_id="raw-training-data",
    objects=[{
    "url": "s3://bucket/training_clip.mp4",
    "collection_destination": collection.collection_id
    }]
    )
    # Create cluster config for dataset snapshots
    cluster = client.clusters.create(
    cluster_name="q4_training_freeze",
    collection_id=collection.collection_id,
    algorithm="hdbscan"
    )
    # Execute clustering to capture current state
    execution = client.clusters.execute(cluster_id=cluster.cluster_id)
    # Query dataset with time-based filters
    results = client.retrievers.execute(
    retriever_id="versioned-search",
    query={"text": "product demos"},
    filters={"created_at": {"$gte": "2024-10-01T00:00:00Z"}}
    )
    # Compare cluster executions over time
    history = client.clusters.executions.list(cluster_id=cluster.cluster_id)
    print(f"Snapshots: {len(history.items)}")

    Retrieval Flow

    1

    Filter by dataset version timestamp

    2

    Search within versioned snapshot

    3
    compose(compose)

    Reconstruct full dataset state from multiple collections

    Tier 0 - Raw Signals

    Direct extraction from source media

    Tier 1 - Semantic

    Derived text and structured data

    Tier 2 - Aggregated

    Embeddings and high-level features

    Total: 4 extractors across 3 tiers

    Feature Extractors

    Image Embedding

    Generate visual embeddings for similarity search and clustering

    752K runs

    Video Embedding

    Generate vector embeddings for video content

    610K runs

    Audio Transcription

    Transcribe audio content to text

    450K runs

    Text Embedding

    Extract semantic embeddings from documents, transcripts and text content

    827K runs

    Retriever Stages

    attribute filter

    Filter documents by metadata attributes

    filter

    feature search

    Search collections using multimodal embeddings

    search

    compose

    Compose multiple retriever pipelines together

    compose

    Enrichment Resources

    Clustering

    HDBSCAN Clustering

    Density-based clustering for anomaly detection and grouping

    Cluster snapshots tied to dataset versions

    Taxonomy

    Content Status Taxonomy

    Labels content as current, deprecated, or unknown

    Label manifests frozen at version time

    Analytics

    Drift Analytics

    Monitor distribution shifts and content evolution

    Track distribution shifts between versions

    Audit Logs

    Compliance and access audit trail

    Object-to-artifact lineage tracking

    Studio Templates

    Clone pre-configured templates directly into Mixpeek Studio

    Dataset Snapshot Manager

    Create and manage versioned snapshots with full lineage tracking

    Clone in Studio

    Training Data Auditor

    Compare dataset versions and audit what changed between training runs

    Clone in Studio