Dataset Versioning & Reproducibility
Immutable object versioning meets multimodal enrichment. Capture complete dataset snapshots—raw assets, derived artifacts, embeddings, and cluster assignments—for deterministic reconstruction at any point in time.
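One way to picture a snapshot that supports deterministic reconstruction is a manifest of content hashes. The sketch below is a minimal illustration, not Mixpeek's storage format: `build_manifest`, the manifest layout, and the sample keys are all hypothetical. It fingerprints raw assets, embeddings, and cluster assignments; derived artifacts could be hashed the same way.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte payload."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(raw_assets, embeddings, cluster_assignments):
    """Build a snapshot manifest (hypothetical layout, not a Mixpeek format).

    raw_assets: dict of object key -> bytes
    embeddings: dict of object key -> list of floats
    cluster_assignments: dict of object key -> cluster label
    """
    entries = {}
    for key in sorted(raw_assets):
        entries[key] = {
            "asset_sha256": fingerprint(raw_assets[key]),
            "embedding_sha256": fingerprint(
                json.dumps(embeddings[key]).encode("utf-8")
            ),
            "cluster": cluster_assignments[key],
        }
    # Canonical JSON (sorted keys, fixed separators) keeps the snapshot ID
    # independent of dict insertion order.
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return {
        "snapshot_id": fingerprint(canonical.encode("utf-8")),
        "entries": entries,
    }


assets = {"clip_a.mp4": b"frame-bytes-a", "clip_b.mp4": b"frame-bytes-b"}
vectors = {"clip_a.mp4": [0.1, 0.2], "clip_b.mp4": [0.3, 0.4]}
clusters = {"clip_a.mp4": 0, "clip_b.mp4": 1}

first = build_manifest(assets, vectors, clusters)
# Same inputs supplied in a different order yield the same snapshot ID.
second = build_manifest(dict(reversed(list(assets.items()))), vectors, clusters)
print(first["snapshot_id"] == second["snapshot_id"])  # prints True
```

Because the ID is derived only from content, any change to an asset, an embedding, or a cluster label produces a different snapshot ID.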
Why This Matters
When datasets stop shifting underneath you, everything downstream gets easier to reason about. True reproducibility is not a best-effort exercise: it means rebuilding the exact dataset state from a recorded snapshot, not approximating it after the fact.
from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# Create namespace for versioned training data
namespace = client.namespaces.create(namespace_name="training_data_v2")

# Create collection with multimodal extractors
collection = client.collections.create(
    collection_name="training_data_v2",
    feature_extractors=[
        {"feature_extractor_name": "multimodal-embed", "version": "v1"}
    ],
)

# Index objects from versioned object storage (e.g., Tigris)
client.buckets.objects.create(
    bucket_id="raw-training-data",
    objects=[
        {
            "url": "s3://bucket/training_clip.mp4",
            "collection_destination": collection.collection_id,
        }
    ],
)

# Create cluster config for dataset snapshots
cluster = client.clusters.create(
    cluster_name="q4_training_freeze",
    collection_id=collection.collection_id,
    algorithm="hdbscan",
)

# Execute clustering to capture current state
execution = client.clusters.execute(cluster_id=cluster.cluster_id)

# Query dataset with time-based filters
results = client.retrievers.execute(
    retriever_id="versioned-search",
    query={"text": "product demos"},
    filters={"created_at": {"$gte": "2024-10-01T00:00:00Z"}},
)

# Compare cluster executions over time
history = client.clusters.executions.list(cluster_id=cluster.cluster_id)
print(f"Snapshots: {len(history.items)}")
Retrieval Flow
Filter by dataset version timestamp
Search within versioned snapshot
Reconstruct full dataset state from multiple collections
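The three steps above can be sketched in plain Python without calling any service. This is an illustration under stated assumptions: the record shape (`object_id`, `created_at`, `text`), the helper names `state_as_of` and `search`, and the sample collections are hypothetical. The sketch also assumes a snapshot means "everything written at or before a cutoff", which is one possible reading of a version timestamp.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-10-01T00:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def state_as_of(collections, cutoff: str):
    """Return {collection: {object_id: record}} as it stood at `cutoff`.

    For each object, keep the newest record created at or before the cutoff.
    """
    limit = parse_ts(cutoff)
    state = {}
    for name, records in collections.items():
        latest = {}
        for record in records:
            created = parse_ts(record["created_at"])
            if created > limit:
                continue  # written after the snapshot, so not part of it
            current = latest.get(record["object_id"])
            if current is None or created > parse_ts(current["created_at"]):
                latest[record["object_id"]] = record
        state[name] = latest
    return state


def search(state, term: str):
    """Naive substring search over the reconstructed state."""
    return sorted(
        (name, object_id)
        for name, objects in state.items()
        for object_id, record in objects.items()
        if term in record["text"]
    )


collections = {
    "video": [
        {"object_id": "v1", "created_at": "2024-09-15T00:00:00Z", "text": "product demo draft"},
        {"object_id": "v1", "created_at": "2024-10-05T00:00:00Z", "text": "product demo final"},
    ],
    "transcripts": [
        {"object_id": "t1", "created_at": "2024-09-20T00:00:00Z", "text": "welcome to the product demo"},
        {"object_id": "t2", "created_at": "2024-10-10T00:00:00Z", "text": "pricing update"},
    ],
}

snapshot = state_as_of(collections, "2024-10-01T00:00:00Z")
print(search(snapshot, "product demo"))
# prints [('transcripts', 't1'), ('video', 'v1')]
```

Records written after the cutoff (the final cut of `v1` and the pricing transcript) are excluded, so the search sees the dataset as it was on the snapshot date.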
Tier 0 - Raw Signals
Direct extraction from source media
Tier 1 - Semantic
Derived text and structured data
Tier 2 - Aggregated
Embeddings and high-level features
Total: 4 extractors across 3 tiers
Feature Extractors
Image Embedding
Generate visual embeddings for similarity search and clustering
Video Embedding
Generate vector embeddings for video content
Audio Transcription
Transcribe audio content to text
Text Embedding
Extract semantic embeddings from documents, transcripts and text content
Retriever Stages
attribute filter
Filter documents by metadata attributes
feature search
Search collections using multimodal embeddings
compose
Compose multiple retriever pipelines together
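To show how these stages might fit together, here is a small sketch that models each stage as a function from a document list to a document list. It is not the Mixpeek retriever implementation: the function names, the document shape (`meta`, `vector`), and the use of cosine similarity for feature search are assumptions made for illustration.

```python
import math


def attribute_filter(**expected):
    """Stage that keeps documents whose metadata matches every given attribute."""
    def stage(docs):
        return [
            d for d in docs
            if all(d["meta"].get(k) == v for k, v in expected.items())
        ]
    return stage


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def feature_search(query_vector, top_k):
    """Stage that ranks documents by cosine similarity to the query vector."""
    def stage(docs):
        ranked = sorted(
            docs, key=lambda d: cosine(d["vector"], query_vector), reverse=True
        )
        return ranked[:top_k]
    return stage


def compose(*stages):
    """Chain stages so each one consumes the previous stage's output."""
    def pipeline(docs):
        for stage in stages:
            docs = stage(docs)
        return docs
    return pipeline


docs = [
    {"id": "a", "meta": {"split": "train"}, "vector": [1.0, 0.0]},
    {"id": "b", "meta": {"split": "train"}, "vector": [0.6, 0.8]},
    {"id": "c", "meta": {"split": "eval"}, "vector": [1.0, 0.0]},
]

retriever = compose(
    attribute_filter(split="train"),
    feature_search([1.0, 0.1], top_k=1),
)
print([d["id"] for d in retriever(docs)])  # prints ['a']
```

Since a composed pipeline has the same signature as a single stage, pipelines can themselves be passed to `compose`, which mirrors the idea of composing retriever pipelines together.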
Enrichment Resources
Studio Templates
Clone pre-configured templates directly into Mixpeek Studio
Dataset Snapshot Manager
Create and manage versioned snapshots with full lineage tracking
Training Data Auditor
Compare dataset versions and audit what changed between training runs
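An audit between two training runs comes down to a set comparison once each version is reduced to a mapping of object key to content hash. The sketch below assumes that manifest shape; `diff_snapshots` and the sample hashes are hypothetical and are not part of the Training Data Auditor template.

```python
def diff_snapshots(old, new):
    """Compare two {object_key: content_hash} manifests."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        # Present in both versions, but with different content.
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }


run_1 = {"clip_a.mp4": "h1", "clip_b.mp4": "h2", "clip_c.mp4": "h3"}
run_2 = {"clip_a.mp4": "h1", "clip_b.mp4": "h2-relabelled", "clip_d.mp4": "h4"}

report = diff_snapshots(run_1, run_2)
print(report)
# prints {'added': ['clip_d.mp4'], 'removed': ['clip_c.mp4'], 'changed': ['clip_b.mp4']}
```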
