The practice of maintaining historical versions of datasets, enabling reproducibility, rollback, and auditing. Data versioning is critical for multimodal AI systems where model training, evaluation, and compliance depend on knowing exactly which data was used.
Data versioning systems track changes to datasets by recording snapshots, diffs, or references to data files at specific points in time. When data is modified (added, updated, or deleted), a new version is created while previous versions remain accessible. This enables teams to reproduce experiments, compare dataset versions, and roll back to a previous state if issues are discovered.
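To make the snapshot-and-reference idea concrete, here is a minimal sketch (not tied to any particular tool; the helper functions `snapshot` and `checkout` are illustrative names) that records a content-addressed manifest per version, stores each unique file once, and can restore any earlier version:

```python
# Minimal sketch of snapshot-based dataset versioning (illustrative only).
# Each version is a manifest mapping file paths to content hashes; file
# contents are stored once per unique hash, so old versions stay readable.
import hashlib
import json
import shutil
import time
from pathlib import Path


def snapshot(dataset_dir: str, store_dir: str) -> str:
    """Record a new version of dataset_dir and return its version id."""
    store = Path(store_dir)
    (store / "objects").mkdir(parents=True, exist_ok=True)
    (store / "versions").mkdir(parents=True, exist_ok=True)

    manifest = {}
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            obj = store / "objects" / digest
            if not obj.exists():  # store each unique blob only once
                shutil.copyfile(path, obj)
            manifest[str(path.relative_to(dataset_dir))] = digest

    version_id = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()
    ).hexdigest()[:12]
    (store / "versions" / f"{version_id}.json").write_text(
        json.dumps({"created": time.time(), "files": manifest}, indent=2)
    )
    return version_id


def checkout(version_id: str, store_dir: str, target_dir: str) -> None:
    """Materialize a previously recorded version into target_dir (rollback)."""
    store = Path(store_dir)
    meta = json.loads((store / "versions" / f"{version_id}.json").read_text())
    for rel_path, digest in meta["files"].items():
        dest = Path(target_dir) / rel_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(store / "objects" / digest, dest)
```

Real systems add diffing, garbage collection, and remote storage on top of this pattern, but the core mechanism of immutable, content-addressed versions is the same.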
Tools include DVC (Data Version Control, a git-like workflow for data), LakeFS (git-like branches for data lakes), and Delta Lake (versioned tables with time travel). DVC stores lightweight metadata files in git while the actual data resides in remote storage. LakeFS provides branch, commit, and merge operations on object storage. Version metadata typically includes timestamps, authors, commit messages, and lineage information.
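As a hedged sketch of how these tools are used in practice, the example below shows two common access patterns; the repository path, git tag, table URI, and version number are placeholders, and the calls assume the DVC and deltalake Python packages:

```python
# Reading pinned dataset versions (paths, revisions, and version numbers
# below are placeholders chosen for illustration).
import dvc.api
from deltalake import DeltaTable

# DVC: open a file exactly as it existed at a given git revision; the
# .dvc metadata tracked in git points to the matching blob in remote storage.
with dvc.api.open("data/train.csv", repo=".", rev="v1.2.0") as f:
    header = f.readline()

# Delta Lake: "time travel" to an earlier version of a versioned table.
table = DeltaTable("s3://bucket/datasets/captions", version=42)
df = table.to_pandas()
```

Pinning experiments to an explicit revision or table version like this is what makes training runs reproducible and auditable after the underlying data has changed.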