Reproducible Multimodal Datasets, Without Losing Your Mind

Imagine you’re retraining a model that worked great three weeks ago.
Same code.
Same hyperparameters.
Same architecture.
But the results are… different.
Not wildly different. Just enough to make you uneasy.
Was it a model change?
A preprocessing tweak?
A labeling update?
Or did the underlying dataset quietly shift without anyone noticing?

For most ML teams working with images, video, audio, or documents, this moment is painfully familiar. The hard part isn’t training models—it’s understanding exactly what data went into them, and whether you can ever get that dataset back again.
The real problem: datasets don’t stand still
Multimodal ML pipelines are inherently dynamic:
- New raw assets are ingested continuously
- Extractors generate derived clips, embeddings, transcripts, and metadata
- Labels evolve as taxonomies change
- Clusters shift as new data arrives
- Models are retrained, evaluated, and rolled forward
Over time, what we casually call “the dataset” is actually a moving target.

Most teams try to manage this with a mix of folder conventions, timestamps, and best intentions. But as scale grows, small inconsistencies compound. Reproducing last month’s training set becomes guesswork. Audits turn into archaeology.
Treating object storage as dataset lineage
This is where Tigris changes the baseline.
Tigris provides immutable, versioned object storage with fast listing and rich metadata—while remaining fully S3-compatible. Instead of thinking in terms of “latest files,” teams can reason about states over time.
Every object version captures:
- What existed
- When it changed
- What state it was in at any point

Object storage stops being a dumping ground. It becomes a durable system of record.
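To make that concrete, here is a minimal sketch of inspecting those versions through the standard S3 API. The endpoint URL, bucket, and prefix are placeholders for illustration, not real Tigris resources; any versioning-enabled, S3-compatible bucket behaves the same way.

```python
import boto3

# Placeholder endpoint and bucket; Tigris speaks the S3 API,
# so a standard client works unchanged.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# In a versioned bucket, every write creates a new object version
# rather than overwriting the old one.
resp = s3.list_object_versions(Bucket="ml-datasets", Prefix="raw/images/")

for v in resp.get("Versions", []):
    # Each entry records what existed, when it changed,
    # and whether it is the current state.
    print(v["Key"], v["VersionId"], v["LastModified"], v["IsLatest"])
```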
“Once you start treating object storage as a system of record, rather than just a place to put files, reproducibility becomes a property of the system.”
— Ovais Tariq, CEO, Tigris Data
Where Mixpeek fits in
Mixpeek builds directly on top of that foundation.
Rather than treating datasets as flat files, Mixpeek models multimodal data as a progression: raw assets come in, extractors derive artifacts and features from them, and each resulting state is captured as a dataset version.

For each dataset version, Mixpeek persists:
- Original raw assets
- Derived artifacts (clips, page segments)
- Extractor outputs (JSON, embeddings, transcripts)
- Cluster assignments and label manifests
Because these artifacts live as versioned objects in Tigris, any dataset snapshot can be reconstructed deterministically.
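In practice, “any dataset snapshot” boils down to a manifest that pins every object to an exact version ID. The sketch below illustrates that pattern; it is not Mixpeek’s actual API. It records the current version of everything under a prefix, then stores the manifest back into the bucket, where it is versioned like everything else.

```python
import json

import boto3

# Same placeholder endpoint and bucket as above.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

def pin_snapshot(bucket: str, prefix: str) -> dict:
    """Map each object key under a prefix to its current version ID."""
    pinned = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for v in page.get("Versions", []):
            if v["IsLatest"]:
                pinned[v["Key"]] = v["VersionId"]
    return pinned

manifest = pin_snapshot("ml-datasets", "datasets/v42/")

# The manifest is itself a versioned object, so the snapshot
# definition is as durable as the data it describes.
s3.put_object(
    Bucket="ml-datasets",
    Key="manifests/dataset-v42.json",
    Body=json.dumps(manifest, indent=2).encode(),
)
```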
Rebuilding a dataset, not reconstructing it
Need to retrain a model from six weeks ago?
Need to explain why a prediction changed?
Need to audit what data powered a decision?
You don’t reverse-engineer the past.
You rebuild it.

This shifts reproducibility from “best effort” to default behavior.
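Rebuilding is then just the manifest replayed: fetch each object at its pinned version ID and write it out. Another hedged sketch, reusing the hypothetical manifest from above.

```python
import json
from pathlib import Path

import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# Load the manifest that pins every object to an exact version.
raw = s3.get_object(Bucket="ml-datasets", Key="manifests/dataset-v42.json")
manifest = json.loads(raw["Body"].read())

for key, version_id in manifest.items():
    # Passing VersionId returns the bytes exactly as they existed at
    # snapshot time, regardless of later writes or deletes.
    obj = s3.get_object(Bucket="ml-datasets", Key=key, VersionId=version_id)
    dest = Path("rebuilt") / key
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(obj["Body"].read())
```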
Why this matters in practice
For ML and platform teams, this unlocks:
- True reproducibility — dataset snapshots, not guesses
- Safer retraining — re-runs start from known inputs
- Clear audit trails — derived signals trace back to raw objects
- Lower operational overhead — no bespoke versioning logic

The result isn’t just cleaner pipelines—it’s calmer ones.
We’ve packaged this pattern as a reusable Mixpeek recipe that shows how to version multimodal datasets using immutable object storage, enriched feature graphs, and rebuildable snapshots.
→ Dataset Versioning & Reproducibility Recipe
A small shift with outsized impact
There’s no new model architecture here.
No exotic optimization trick.
Just a reframing:
What if object storage were the source of truth for dataset lineage, and everything else were derived from it?

By combining Tigris’s immutable object versions with Mixpeek’s multimodal enrichment and retrieval layer, dataset versioning stops being an afterthought and becomes the default.
And when datasets stop slipping out from under you, everything downstream gets easier to reason about.
