A storage architecture that holds vast amounts of raw data in its native format until it is needed for analysis. Data lakes often serve as the storage foundation for multimodal AI systems, holding diverse file types before they are processed into embeddings and structured metadata.
A data lake ingests and stores data in its original format without requiring an upfront schema definition. Files of any type (images, videos, audio, documents, JSON, CSV) land in object storage, organized by partitioning strategies such as date or source. When the data is needed, it is read and transformed on demand using a schema-on-read approach: the schema is inferred or applied at query time rather than at write time, as sketched below. This decouples storage from processing, allowing different tools to access the same raw data.
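To make the schema-on-read and partitioning ideas concrete, here is a minimal sketch using PyArrow datasets. The directory layout, partition keys (`year`/`month`), and column names are illustrative assumptions, not part of any standard; the same read call works against an `s3://` URI given a configured filesystem.

```python
# A minimal schema-on-read sketch with PyArrow. Paths and partition keys
# (year=/month=) are hypothetical, chosen only to illustrate the pattern.
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root = pathlib.Path(tempfile.mkdtemp())

# Ingest: write raw records as-is under a date-based partition path.
# No schema is registered anywhere up front; the files are the source of truth.
events = pa.table({"user_id": [1, 2], "action": ["view", "click"]})
partition_dir = root / "events" / "year=2024" / "month=06"
partition_dir.mkdir(parents=True)
pq.write_table(events, partition_dir / "part-0.parquet")

# Read: the schema is inferred at query time (schema-on-read), and the
# year=/month= directory names surface as queryable partition columns.
dataset = ds.dataset(str(root / "events"), format="parquet", partitioning="hive")
print(dataset.schema)
print(dataset.to_table().to_pydict())
```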
Data lakes typically use cloud object storage (S3, GCS, Azure Blob) as the storage layer. Table formats like Delta Lake, Apache Iceberg, and Apache Hudi add ACID transactions, schema evolution, and time travel on top of object storage. Catalog services (AWS Glue, Hive Metastore) provide metadata management. For multimodal data, the lake stores raw files while metadata databases track processing status and derived features.
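The table-format guarantees can be shown with the `deltalake` (delta-rs) Python bindings. This is a hedged sketch: the table path and columns are invented for illustration, and an `s3://` URI would work the same way given credentials. Each write is an atomic transaction that produces a new table version, which is what makes time travel possible.

```python
# ACID writes and time travel with Delta Lake via the `deltalake` package.
# The path and columns below are hypothetical examples.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

path = "/tmp/demo_delta_table"

# Version 0: initial commit.
write_deltalake(path, pa.table({"id": [1, 2], "label": ["cat", "dog"]}))

# Version 1: an overwrite, applied as a single atomic transaction.
write_deltalake(path, pa.table({"id": [3], "label": ["bird"]}), mode="overwrite")

# Time travel: read the table as of an earlier version.
print(DeltaTable(path, version=0).to_pyarrow_table())  # original two rows
print(DeltaTable(path).to_pyarrow_table())             # current state: one row
```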
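For the multimodal pattern, where raw files stay in the lake while a metadata database tracks processing state, a small relational table is often enough. The schema, status values, and paths below are assumptions sketched for illustration (shown here with SQLite so the example is self-contained); production systems would use a managed database and their own conventions.

```python
# A hypothetical metadata-tracking sketch: raw files remain in object storage,
# while this table records each object's processing status and derived outputs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE asset_metadata (
        object_key    TEXT PRIMARY KEY,  -- key of the raw file in the lake
        modality      TEXT,              -- image, video, audio, document
        status        TEXT,              -- e.g. 'ingested' -> 'embedded'
        embedding_uri TEXT               -- where the derived vectors landed
    )
""")

# Register a raw file at ingest time, then mark it once embeddings exist.
conn.execute(
    "INSERT INTO asset_metadata VALUES (?, ?, ?, ?)",
    ("raw/images/2024/06/cat_001.jpg", "image", "ingested", None),
)
conn.execute(
    "UPDATE asset_metadata SET status = 'embedded', embedding_uri = ? "
    "WHERE object_key = ?",
    ("derived/embeddings/cat_001.npy", "raw/images/2024/06/cat_001.jpg"),
)
for row in conn.execute("SELECT * FROM asset_metadata"):
    print(row)
```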