The practice of recording the complete history of data from its origin through all transformations and derivations. Data lineage is essential in multimodal AI systems for debugging, compliance, and understanding how processed outputs relate to raw inputs.
Data lineage systems track the flow of data through pipelines by recording provenance metadata at each processing step. For each data artifact, lineage records its source, the transformations applied, the code and configurations used, and the downstream artifacts it produced. This creates a directed acyclic graph (DAG) that can be traversed to understand how any output was derived from raw inputs.
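A minimal sketch of this idea is shown below, assuming a hypothetical `LineageRecord` structure and an in-memory store; real systems would persist these records and attach richer metadata, but the DAG traversal from an output back to raw inputs works the same way.

```python
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    """Provenance metadata for one data artifact (field names are illustrative)."""
    artifact_id: str
    sources: list[str]                      # upstream artifact IDs (DAG edges)
    transformation: str                     # e.g. "resize_images", "tokenize_captions"
    code_version: str                       # e.g. git commit of the pipeline code
    config: dict = field(default_factory=dict)

# Lineage store: artifact ID -> record. The `sources` lists define the DAG.
lineage: dict[str, LineageRecord] = {}

def record(rec: LineageRecord) -> None:
    lineage[rec.artifact_id] = rec

def trace_upstream(artifact_id: str) -> list[LineageRecord]:
    """Walk the DAG backwards to collect every record between an output and its raw inputs."""
    visited: set[str] = set()
    stack = [artifact_id]
    path: list[LineageRecord] = []
    while stack:
        current = stack.pop()
        rec = lineage.get(current)
        if rec is None or current in visited:
            continue
        visited.add(current)
        path.append(rec)
        stack.extend(rec.sources)
    return path

# Example: an embedding artifact derived from raw images and raw captions.
record(LineageRecord("raw_images_v1", [], "ingest", "a1b2c3"))
record(LineageRecord("captions_v1", [], "ingest", "a1b2c3"))
record(LineageRecord("embeddings_v1", ["raw_images_v1", "captions_v1"],
                     "encode_multimodal", "a1b2c3", {"model": "example-encoder"}))

for rec in trace_upstream("embeddings_v1"):
    print(rec.artifact_id, "<-", rec.sources)
```

Traversing in the other direction (from a raw input to all downstream artifacts) answers impact-analysis questions such as "which derived datasets must be regenerated if this source is corrected?"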
Lineage is captured at different granularities: table-level (which tables feed which others), column-level (which fields depend on which), and row-level (tracking individual records). Implementation approaches include explicit instrumentation in pipeline code, automatic extraction from query logs, and metadata APIs. Standards such as OpenLineage define a common format for lineage events, as sketched below. Lineage metadata is typically stored in graph databases or purpose-built metadata stores.
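The sketch below builds an OpenLineage-style run event as a plain dictionary rather than through the client library; the field names (eventType, run, job, inputs, outputs, producer) follow the core RunEvent model, while the namespaces, job name, and producer URI are placeholders invented for this example.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage-style run event for one pipeline step.
# In practice this would be POSTed to a lineage backend; here we just serialize it.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "multimodal_pipeline", "name": "encode_embeddings"},
    "inputs": [
        {"namespace": "s3://raw-data", "name": "images/batch_001"},
        {"namespace": "s3://raw-data", "name": "captions/batch_001"},
    ],
    "outputs": [
        {"namespace": "s3://processed", "name": "embeddings/batch_001"},
    ],
    "producer": "https://example.com/my-pipeline/v1.0",
}

print(json.dumps(event, indent=2))
```

Emitting events in a shared format like this lets a single lineage backend assemble the cross-pipeline DAG even when individual steps run on different orchestrators or processing engines.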