Dataset Versioning
Treat versioned object storage as your dataset's source of truth. Capture complete snapshots—raw assets, embeddings, and cluster assignments—for deterministic reconstruction at any point in time.
"Retrieve training dataset snapshot from October 1st with all embeddings and cluster assignments"
Why This Matters
When a dataset stops changing underneath you, every downstream step gets easier. True reproducibility means rebuilding the exact training inputs from a stored snapshot, not reconstructing them from memory.
import requests

API_URL = "https://api.mixpeek.com"
headers = {"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}

# Create collection linked to versioned object storage
collection = requests.post(
    f"{API_URL}/v1/collections",
    headers=headers,
    json={
        "collection_name": "training_data_v2",
        "source": {"type": "bucket", "bucket_id": "versioned-training-data"},
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
        },
    },
).json()

# Index from versioned object storage (e.g., Tigris, S3)
requests.post(
    f"{API_URL}/v1/buckets/versioned-training-data/objects",
    headers=headers,
    json={
        "blobs": [{"property": "content", "url": "s3://bucket/training/"}],
        "metadata": {"version": "v2", "snapshot_date": "2024-10-01"},
    },
)

# Create cluster snapshot for this version
cluster = requests.post(
    f"{API_URL}/v1/clusters",
    headers=headers,
    json={
        "cluster_name": "training_v2_snapshot",
        "source_collection_ids": [collection["collection_id"]],
        "feature_addresses": ["mixpeek://multimodal_extractor@v1/embedding"],
        "algorithm": "hdbscan",
    },
).json()

# Execute to create snapshot
execution = requests.post(
    f"{API_URL}/v1/clusters/{cluster['cluster_id']}/execute",
    headers=headers,
).json()
print(f"Snapshot created: {execution['run_id']}")

# Query historical dataset state by filtering on metadata
results = requests.post(
    f"{API_URL}/v1/retrievers/versioned-search/execute",
    headers=headers,
    json={"query": {"text": "product demos"}},
).json()
print(f"Found {len(results['documents'])} documents")
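One way to check that a reconstructed snapshot really matches the original is to fingerprint it. The sketch below is a local illustration, not part of the Mixpeek API: the function snapshot_fingerprint and the record keys (object_id, content_sha256, embedding_version, cluster_id) are hypothetical names chosen for this example. It hashes a canonical listing of the snapshot so that two reconstructions can be compared by a single digest.

```python
import hashlib
import json


def snapshot_fingerprint(records):
    """Return a SHA-256 hex digest identifying a dataset snapshot.

    `records` is a list of dicts with hypothetical keys: "object_id",
    "content_sha256", "embedding_version", and "cluster_id".
    """
    # Sort by object_id so the digest does not depend on listing order.
    ordered = sorted(records, key=lambda r: r["object_id"])
    # Canonical JSON (sorted keys, fixed separators) gives stable bytes.
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative records; the values are made up for this sketch.
records = [
    {
        "object_id": "video-002",
        "content_sha256": hashlib.sha256(b"raw bytes of video-002").hexdigest(),
        "embedding_version": "v1",
        "cluster_id": 3,
    },
    {
        "object_id": "image-001",
        "content_sha256": hashlib.sha256(b"raw bytes of image-001").hexdigest(),
        "embedding_version": "v1",
        "cluster_id": 1,
    },
]

fingerprint = snapshot_fingerprint(records)
print(f"Snapshot fingerprint: {fingerprint}")
```

Storing the digest alongside the snapshot metadata lets a later rebuild be verified: if any asset, embedding version, or cluster assignment differs, the fingerprint changes.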
Feature Extractors
Image Embedding
Generate visual embeddings for similarity search and clustering
Video Embedding
Generate vector embeddings for video content
Audio Transcription
Transcribe audio content to text
Text Embedding
Extract semantic embeddings from documents, transcripts and text content
Retriever Stages
Attribute Filter
Filter documents by metadata attributes
Feature Search
Search collections using multimodal embeddings
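To make the idea of filtering by metadata attributes concrete, here is a minimal local sketch of selecting documents for a historical snapshot. It does not call or reproduce the retriever stages listed above; the function filter_by_snapshot and the document shape are assumptions made for illustration. The metadata keys "version" and "snapshot_date" follow the indexing example earlier on this page.

```python
from datetime import date


def filter_by_snapshot(documents, version, as_of):
    """Keep documents tagged with `version` whose snapshot_date is on or before `as_of`.

    Each document is assumed to be a dict with a "metadata" dict holding
    "version" and an ISO-formatted "snapshot_date" string.
    """
    selected = []
    for doc in documents:
        meta = doc.get("metadata", {})
        raw_date = meta.get("snapshot_date")
        # Skip documents from other versions or without a snapshot date.
        if meta.get("version") != version or raw_date is None:
            continue
        if date.fromisoformat(raw_date) <= as_of:
            selected.append(doc)
    return selected


# Illustrative documents; ids and dates are made up for this sketch.
documents = [
    {"id": "doc-1", "metadata": {"version": "v2", "snapshot_date": "2024-10-01"}},
    {"id": "doc-2", "metadata": {"version": "v2", "snapshot_date": "2024-11-15"}},
    {"id": "doc-3", "metadata": {"version": "v1", "snapshot_date": "2024-09-01"}},
    {"id": "doc-4", "metadata": {"version": "v2"}},
]

october_snapshot = filter_by_snapshot(documents, "v2", date(2024, 10, 1))
print([doc["id"] for doc in october_snapshot])
```

Comparing parsed dates rather than raw strings keeps the cutoff correct even if the stored format later gains a time component that is stripped before parsing.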
