A data integration pattern that extracts data from sources, transforms it into a usable format, and loads it into a destination system. ETL is fundamental to preparing multimodal data for AI processing, converting raw files into structured, searchable content.
The Extract phase pulls data from diverse sources (databases, APIs, file systems, object storage). The Transform phase cleans, normalizes, enriches, and restructures the data for its intended use. The Load phase writes the processed data to the target system (data warehouse, vector database, search index). Modern variants like ELT load raw data first and transform it within the destination system.
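The three phases can be sketched as a minimal batch pipeline. This is an illustrative sketch using only the standard library: the JSON-lines string stands in for a real source (an API or file system), and an in-memory SQLite database stands in for the destination warehouse.

```python
import json
import sqlite3

# Stand-in source: one JSON record per line, as an API export might deliver.
RAW_RECORDS = """\
{"id": 1, "name": "  Alice ", "signup": "2024-01-05"}
{"id": 2, "name": "BOB", "signup": "2024-02-11"}
"""

def extract(raw: str) -> list[dict]:
    """Extract: parse one JSON record per line from the source."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def transform(records: list[dict]) -> list[tuple]:
    """Transform: clean and normalize fields for the target schema."""
    return [
        (r["id"], r["name"].strip().title(), r["signup"])
        for r in records
    ]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the processed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT, signup TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_RECORDS)), conn)
print(conn.execute("SELECT name FROM users ORDER BY id").fetchall())
# → [('Alice',), ('Bob',)]
```

An ELT variant would swap the last two steps: `load()` would insert the raw records as-is, and the cleanup in `transform()` would run as SQL inside the destination.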
ETL tools include dbt (transform-focused), Apache Spark (distributed processing), and Airbyte (extraction and loading). For multimodal AI, extraction handles diverse file formats (MP4, JPEG, PDF, WAV), transformation includes feature extraction and embedding generation, and loading writes to vector databases (Qdrant, Pinecone) and metadata stores (MongoDB). Batch ETL runs on schedules, while streaming ETL processes data continuously as it arrives.
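The multimodal flavor of this pipeline follows the same extract/transform/load shape. In this sketch the `embed()` function is a hypothetical placeholder deriving a vector from a hash (a real pipeline would call an embedding model), and plain dictionaries stand in for the vector database and metadata store:

```python
import hashlib
import math

# Stand-in source: file records of mixed media types.
FILES = [
    {"path": "talk.mp4", "kind": "video"},
    {"path": "cover.jpeg", "kind": "image"},
    {"path": "report.pdf", "kind": "document"},
]

def embed(path: str, dims: int = 4) -> list[float]:
    """Toy embedding: a unit vector derived from a hash of the path.
    A real transform step would run the file through an embedding model."""
    digest = hashlib.sha256(path.encode()).digest()
    vec = [b / 255 for b in digest[:dims]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

vector_store: dict[str, list[float]] = {}   # stand-in for Qdrant/Pinecone
metadata_store: dict[str, dict] = {}        # stand-in for MongoDB

for record in FILES:                              # Extract: read each source file
    vector = embed(record["path"])                # Transform: generate embedding
    vector_store[record["path"]] = vector         # Load: vector to vector DB
    metadata_store[record["path"]] = {"kind": record["kind"]}  # Load: metadata

print(len(vector_store), sorted(metadata_store))
# → 3 ['cover.jpeg', 'report.pdf', 'talk.mp4']
```

The same loop body works for either scheduling mode: batch ETL runs it over a full file listing on a schedule, while streaming ETL runs it per record as new files arrive.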