Dataset Audit & Drift Detection
Detects bias, coverage gaps, duplication, and distribution shifts by comparing current data against baseline clustering snapshots. This is the capability infrastructure buyers tend to care about most.
"Detect distribution drift in training data between Q1 baseline and current dataset with 15% alert threshold"
Why This Matters
Drift detection is operational monitoring, not a one-time audit. Baseline clusters become the source of truth for data quality over time.
from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# Create baseline cluster snapshot
baseline = client.analytics.cluster(
    collection_id="training_data",
    algorithm="hdbscan",
    snapshot_id="baseline_2024_q1"
)

# Detect drift in new data
drift_report = client.analytics.drift_detection(
    collection_id="training_data",
    baseline_snapshot="baseline_2024_q1",
    current_period={
        "start": "2024-10-01",
        "end": "2024-12-31"
    },
    alert_threshold=0.15
)

# Find outliers (potential novelty or data quality issues)
outliers = client.retrievers.execute(
    retriever_id="outlier-search",
    inputs={
        "drift_score_min": 0.8,
        "cluster_id": None  # Unassigned to any cluster
    }
)
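To make the 15% alert threshold concrete, here is a minimal sketch of one way a drift score could be computed from cluster assignments. It is an illustration under stated assumptions, not a description of how Mixpeek calculates drift: the helper names `cluster_shares` and `drift_score` are hypothetical, and the metric (total variation distance between the share of items in each cluster) is one common choice among several. The label -1 stands for points not assigned to any cluster, following the usual HDBSCAN convention.

```python
from collections import Counter

ALERT_THRESHOLD = 0.15  # mirrors the 15% threshold in the example above


def cluster_shares(labels):
    """Return the fraction of items assigned to each cluster label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def drift_score(baseline_labels, current_labels):
    """Total variation distance between two cluster-share distributions.

    Returns 0.0 when the distributions match and 1.0 when they share no clusters.
    This metric is an assumption made for illustration only.
    """
    base = cluster_shares(baseline_labels)
    cur = cluster_shares(current_labels)
    all_labels = set(base) | set(cur)
    return 0.5 * sum(abs(base.get(l, 0.0) - cur.get(l, 0.0)) for l in all_labels)


# Hypothetical cluster assignments: the baseline has three clusters; in the
# current period, 20% of items moved from cluster 0 to "unassigned" (-1).
baseline_labels = [0] * 50 + [1] * 30 + [2] * 20
current_labels = [0] * 30 + [1] * 30 + [2] * 20 + [-1] * 20

score = drift_score(baseline_labels, current_labels)
alert = score > ALERT_THRESHOLD
print(f"drift score: {score:.2f}, alert: {alert}")
```

In this toy data a fifth of the items shift out of cluster 0, so the score lands above the threshold and would raise an alert. A growing share of unassigned items is also the kind of signal the outlier search above is meant to surface.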
Feature Extractors
Image Embedding
Generate visual embeddings for similarity search and clustering
Text Embedding
Extract semantic embeddings from documents, transcripts and text content
Video Embedding
Generate vector embeddings for video content
