Mixpeek Logo
    TrainingDriftThemes

    Dataset Versioning

    Treat versioned object storage as your dataset's source of truth. Capture complete snapshots—raw assets, embeddings, and cluster assignments—for deterministic reconstruction at any point in time.

    video
    image
    audio
    text
    Production
    31.0K runs
    Deploy Recipe

    "Retrieve from with all and "

    Why This Matters

    When datasets stop slipping out from under you, everything downstream gets easier. True reproducibility means rebuilding exact training inputs, not reconstructing from memory.

    import requests
    API_URL = "https://api.mixpeek.com"
    headers = {"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}
    # Create collection linked to versioned object storage
    collection = requests.post(f"{API_URL}/v1/collections", headers=headers, json={
    "collection_name": "training_data_v2",
    "source": {"type": "bucket", "bucket_id": "versioned-training-data"},
    "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1"
    }
    }).json()
    # Index from versioned object storage (e.g., Tigris, S3)
    requests.post(f"{API_URL}/v1/buckets/versioned-training-data/objects", headers=headers, json={
    "blobs": [{"property": "content", "url": "s3://bucket/training/"}],
    "metadata": {"version": "v2", "snapshot_date": "2024-10-01"}
    })
    # Create cluster snapshot for this version
    cluster = requests.post(f"{API_URL}/v1/clusters", headers=headers, json={
    "cluster_name": "training_v2_snapshot",
    "source_collection_ids": [collection["collection_id"]],
    "feature_addresses": ["mixpeek://multimodal_extractor@v1/embedding"],
    "algorithm": "hdbscan"
    }).json()
    # Execute to create snapshot
    execution = requests.post(
    f"{API_URL}/v1/clusters/{cluster['cluster_id']}/execute",
    headers=headers
    ).json()
    print(f"Snapshot created: {execution['run_id']}")
    # Query historical dataset state by filtering on metadata
    results = requests.post(
    f"{API_URL}/v1/retrievers/versioned-search/execute",
    headers=headers,
    json={"query": {"text": "product demos"}}
    ).json()
    print(f"Found {len(results['documents'])} documents")

    Feature Extractors

    Image Embedding

    Generate visual embeddings for similarity search and clustering

    752K runs

    Video Embedding

    Generate vector embeddings for video content

    610K runs

    Audio Transcription

    Transcribe audio content to text

    450K runs

    Text Embedding

    Extract semantic embeddings from documents, transcripts and text content

    827K runs

    Retriever Stages

    attribute filter

    Filter documents by metadata attributes

    filter

    feature search

    Search collections using multimodal embeddings

    search