A series of automated data processing steps that move, transform, and enrich data from source to destination. Data pipelines are the backbone of multimodal AI systems, orchestrating the flow from raw files through feature extraction to searchable indices.
A data pipeline defines a sequence of stages that data passes through: ingestion from sources (APIs, object storage, databases), transformation (parsing, normalization, enrichment), processing (feature extraction, embedding generation), and loading into target systems (vector databases, search indices). Each stage can be configured independently, and pipelines run on schedules or in response to triggers.
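The following sketch illustrates that stage sequence as plain Python. The stage names, record fields, and function bodies are illustrative placeholders rather than any particular framework's API; a real pipeline would replace the stubs with calls to storage clients, model services, and an index writer.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]  # each stage transforms a record and hands it to the next


def ingest(source_uri: str) -> dict:
    # Pull a raw file from object storage or an API (stubbed payload here).
    return {"uri": source_uri, "raw": b"...", "type": "image"}


def transform(record: dict) -> dict:
    # Parse, normalize, and enrich the raw payload.
    record["normalized"] = True
    return record


def process(record: dict) -> dict:
    # Feature extraction or embedding generation (placeholder vector).
    record["embedding"] = [0.0, 0.1, 0.2]
    return record


def load(record: dict) -> dict:
    # Write to a vector database or search index (stubbed as a print).
    print(f"indexed {record['uri']}")
    return record


PIPELINE = [
    Stage("ingest", ingest),
    Stage("transform", transform),
    Stage("process", process),
    Stage("load", load),
]


def run_pipeline(source_uris: Iterable[str]) -> None:
    # In production this loop would be driven by a scheduler or trigger.
    for uri in source_uris:
        data: Any = uri
        for stage in PIPELINE:
            data = stage.run(data)


if __name__ == "__main__":
    run_pipeline(["s3://bucket/cat.jpg", "s3://bucket/talk.mp4"])
```

Because each stage is just a named callable, stages can be configured, retried, or swapped independently, which mirrors how orchestrators model tasks.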
Pipeline orchestrators include Apache Airflow, Prefect, Dagster, and custom Celery-based systems. Stages communicate via message queues (Redis, RabbitMQ) or shared storage (S3). For multimodal data, pipelines handle heterogeneous file types (video, audio, images, documents) and route each to appropriate processing services. Monitoring includes throughput, latency, error rates, and data quality metrics per stage.
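As a sketch of the routing step for heterogeneous multimodal files, the snippet below dispatches on MIME type. The handler names and the mapping are assumptions chosen for illustration; a production pipeline might instead publish to per-modality queue topics in Redis or RabbitMQ.

```python
import mimetypes


def process_video(path: str) -> str:
    return f"video-features:{path}"     # placeholder for a video processing service


def process_audio(path: str) -> str:
    return f"audio-features:{path}"     # placeholder for an audio processing service


def process_image(path: str) -> str:
    return f"image-features:{path}"     # placeholder for an image processing service


def process_document(path: str) -> str:
    return f"doc-features:{path}"       # placeholder for a document processing service


# Route on the top-level MIME type; documents (PDF, office files) fall under
# "application" or "text". This mapping is an illustrative assumption.
ROUTES = {
    "video": process_video,
    "audio": process_audio,
    "image": process_image,
    "application": process_document,
    "text": process_document,
}


def route(path: str) -> str:
    mime, _ = mimetypes.guess_type(path)
    top_level = (mime or "application/octet-stream").split("/")[0]
    handler = ROUTES.get(top_level, process_document)
    return handler(path)


if __name__ == "__main__":
    for f in ["clip.mp4", "podcast.mp3", "scan.png", "report.pdf"]:
        print(route(f))
```

Per-stage monitoring would wrap calls like `route` with counters and timers so that throughput, latency, error rates, and data quality can be tracked for each modality separately.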