Reproducible Multimodal Datasets, Without Losing Your Mind

Imagine you’re retraining a model that worked great three weeks ago.
Same code.
Same hyperparameters.
Same architecture.
But the results are… different.
Not wildly different. Just enough to make you uneasy.
Was it a model change?
A preprocessing tweak?
A labeling update?
Or did the underlying dataset quietly shift without anyone noticing?

For most ML teams working with images, video, audio, or documents, this moment is painfully familiar. The hard part isn’t training models—it’s understanding exactly what data went into them, and whether you can ever get that dataset back again.
The real problem: datasets don’t stand still
Multimodal ML pipelines are inherently dynamic:
- New raw assets are ingested continuously
- Extractors generate derived clips, embeddings, transcripts, and metadata
- Labels evolve as taxonomies change
- Clusters shift as new data arrives
- Models are retrained, evaluated, and rolled forward
Over time, what we casually call “the dataset” is actually a moving target.

Most teams try to manage this with a mix of folder conventions, timestamps, and best intentions. But as scale grows, small inconsistencies compound. Reproducing last month’s training set becomes guesswork. Audits turn into archaeology.
Treating object storage as dataset lineage
This is where Tigris changes the baseline.
Tigris provides immutable, versioned object storage with fast listing and rich metadata—while remaining fully S3-compatible. Instead of thinking in terms of “latest files,” teams can reason about states over time.
Every object version captures:
- What existed
- When it changed
- What state it was in at any point

Object storage stops being a dumping ground. It becomes a durable system of record.
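To make that concrete, here is a minimal sketch of inspecting those versions through the standard S3 API. The endpoint URL, bucket, and prefix are placeholders for illustration, not real Tigris resources; any versioning-enabled, S3-compatible bucket behaves the same way.

```python
import boto3

# Placeholder endpoint and bucket; Tigris speaks the S3 API,
# so a standard client works unchanged.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# In a versioned bucket, every write creates a new object version
# rather than overwriting the old one.
resp = s3.list_object_versions(Bucket="ml-datasets", Prefix="raw/images/")

for v in resp.get("Versions", []):
    # Each entry records what existed, when it changed,
    # and whether it is the current state.
    print(v["Key"], v["VersionId"], v["LastModified"], v["IsLatest"])
```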
“Once you start treating object storage as a system of record, rather than just a place to put files, reproducibility becomes a property of the system.”
— Ovais Tariq, CEO, Tigris Data
Where Mixpeek fits in
Mixpeek builds directly on top of that foundation.
Rather than treating datasets as flat files, Mixpeek models multimodal data as a progression: raw assets come in, extractors derive artifacts and features from them, and each resulting state is captured as a dataset version.

For each dataset version, Mixpeek persists:
- Original raw assets
- Derived artifacts (clips, page segments)
- Extractor outputs (JSON, embeddings, transcripts)
- Cluster assignments and label manifests
Because these artifacts live as versioned objects in Tigris, any dataset snapshot can be reconstructed deterministically.
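In practice, “any dataset snapshot” boils down to a manifest that pins every object to an exact version ID. The sketch below illustrates that pattern; it is not Mixpeek’s actual API. It records the current version of everything under a prefix, then stores the manifest back into the bucket, where it is versioned like everything else.

```python
import json

import boto3

# Same placeholder endpoint and bucket as above.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

def pin_snapshot(bucket: str, prefix: str) -> dict:
    """Map each object key under a prefix to its current version ID."""
    pinned = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for v in page.get("Versions", []):
            if v["IsLatest"]:
                pinned[v["Key"]] = v["VersionId"]
    return pinned

manifest = pin_snapshot("ml-datasets", "datasets/v42/")

# The manifest is itself a versioned object, so the snapshot
# definition is as durable as the data it describes.
s3.put_object(
    Bucket="ml-datasets",
    Key="manifests/dataset-v42.json",
    Body=json.dumps(manifest, indent=2).encode(),
)
```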
Rebuilding a dataset, not reconstructing it
Need to retrain a model from six weeks ago?
Need to explain why a prediction changed?
Need to audit what data powered a decision?
You don’t reverse-engineer the past.
You rebuild it.

This shifts reproducibility from “best effort” to default behavior.
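Rebuilding is then just the manifest replayed: fetch each object at its pinned version ID and write it out. Another hedged sketch, reusing the hypothetical manifest from above.

```python
import json
from pathlib import Path

import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# Load the manifest that pins every object to an exact version.
raw = s3.get_object(Bucket="ml-datasets", Key="manifests/dataset-v42.json")
manifest = json.loads(raw["Body"].read())

for key, version_id in manifest.items():
    # Passing VersionId returns the bytes exactly as they existed at
    # snapshot time, regardless of later writes or deletes.
    obj = s3.get_object(Bucket="ml-datasets", Key=key, VersionId=version_id)
    dest = Path("rebuilt") / key
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(obj["Body"].read())
```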
Why this matters in practice
For ML and platform teams, this unlocks:
- True reproducibility — dataset snapshots, not guesses
- Safer retraining — re-runs start from known inputs
- Clear audit trails — derived signals trace back to raw objects
- Lower operational overhead — no bespoke versioning logic

The result isn’t just cleaner pipelines—it’s calmer ones.
We’ve packaged this pattern as a reusable Mixpeek recipe that shows how to version multimodal datasets using immutable object storage, enriched feature graphs, and rebuildable snapshots.
→ Dataset Versioning & Reproducibility Recipe
A small shift with outsized impact
There’s no new model architecture here.
No exotic optimization trick.
Just a reframing:
What if object storage were the source of truth for dataset lineage, and everything else were derived from it?

By combining Tigris’s immutable object versions with Mixpeek’s multimodal enrichment and retrieval layer, dataset versioning stops being an afterthought and becomes the default.
And when datasets stop slipping out from under you, everything downstream gets easier to reason about.
