Batch Processing
Multi-tier DAG processing that transforms bucket objects into searchable documents with feature extraction
Why do anything?
Raw objects need feature extraction to become searchable. Without batch processing, you can't generate embeddings or enrich content at scale.
Why now?
AI search requires vector embeddings. Manual processing doesn't scale.
Why this feature?
Multi-tier DAG processing handles complex pipelines: Tier 0 (bucket→collection) and Tier N (collection→collection), executed by Celery workers with Ray handling inference.
How It Works
Batch processing uses a multi-tier DAG architecture: Tier 0 processes raw bucket objects, while Tier N processes the outputs of upstream collections.
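
A rough sketch of the tier model is below. The `TierSpec` dataclass and `resolve_input` helper are illustrative names, not part of the actual API; the point is only that a tier reads either from a bucket (Tier 0) or from the documents emitted by an upstream collection (Tier N).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TierSpec:
    """Illustrative description of one tier in the processing DAG."""
    tier: int                                   # 0 = reads from a bucket, >= 1 = reads from a collection
    output_collection: str
    source_bucket: Optional[str] = None         # only meaningful for tier 0
    upstream_collection: Optional[str] = None   # only meaningful for tier >= 1


def resolve_input(spec: TierSpec) -> str:
    """Pick the input source for a tier: bucket objects at tier 0,
    documents produced by the upstream collection at tier N."""
    if spec.tier == 0:
        assert spec.source_bucket, "tier 0 must name a source bucket"
        return f"bucket:{spec.source_bucket}"
    assert spec.upstream_collection, "tier N must name an upstream collection"
    return f"collection:{spec.upstream_collection}"
```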
Batch Creation
Create the batch record and validate the source and collection configuration.
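
A minimal sketch of what batch creation and validation might look like; the `BatchRecord` fields and the `create_batch` helper are assumptions for illustration, not the service's actual schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class BatchRecord:
    """Hypothetical batch record created before any tasks are enqueued."""
    source: str       # e.g. "bucket:media-uploads" or "collection:frames"
    collection: str   # output collection the batch writes into
    tier: int
    batch_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"


def create_batch(source: str, collection: str, tier: int) -> BatchRecord:
    # Validate the source/collection pairing before persisting the record.
    if tier == 0 and not source.startswith("bucket:"):
        raise ValueError("tier 0 batches must read from a bucket")
    if tier > 0 and not source.startswith("collection:"):
        raise ValueError("tier N batches must read from an upstream collection")
    if not collection:
        raise ValueError("an output collection is required")
    return BatchRecord(source=source, collection=collection, tier=tier)
```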
Task Routing
Route the task to the Celery process_tier queue based on its tier level.
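
Celery's standard routing API can express this. The per-tier queue naming (`process_tier_<n>`) and the `pipeline.process_tier` task name below are assumptions for the sketch, not the pipeline's actual configuration.

```python
from celery import Celery

app = Celery("batch_processing", broker="redis://localhost:6379/0")


@app.task(name="pipeline.process_tier")
def process_tier(batch_id: str, tier: int) -> None:
    """Worker-side entry point; a real implementation would load the batch
    record, run extraction, and write documents to the output collection."""
    ...


def enqueue_batch(batch_id: str, tier: int) -> None:
    # Route the task to a per-tier queue (e.g. "process_tier_0", "process_tier_1")
    # so workers for each tier can be scaled independently.
    process_tier.apply_async(args=[batch_id, tier], queue=f"process_tier_{tier}")
```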
Feature Extraction
Run the feature extractor on the Ray inference engine to generate embeddings.
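
A sketch of tier-level extraction as a Ray actor, using a sentence-transformers model as a stand-in for whatever extractor the pipeline actually deploys; the actor keeps the model loaded so repeated micro-batches avoid reload cost.

```python
import ray
from sentence_transformers import SentenceTransformer

ray.init(ignore_reinit_error=True)


@ray.remote
class FeatureExtractor:
    """Ray actor that keeps the embedding model resident across requests."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def extract(self, texts: list[str]) -> list[list[float]]:
        # Encode a micro-batch of inputs into dense embedding vectors.
        return self.model.encode(texts).tolist()


extractor = FeatureExtractor.remote()
vectors = ray.get(extractor.extract.remote(["a sample object", "another object"]))
```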
Document Storage
Store the resulting documents in Qdrant with their vectors indexed for search.
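
Storage could look roughly like the following, using the qdrant-client API. The `documents` collection name, 384-dimension vectors (matching the MiniLM model above), and payload fields are placeholders.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create the output collection once; the vector size must match the extractor's output.
if not client.collection_exists("documents"):
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

# Upsert extracted documents with their embedding vectors and source metadata.
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"source": "bucket:media-uploads"}),
        PointStruct(id=2, vector=[0.2] * 384, payload={"source": "bucket:media-uploads"}),
    ],
)
```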
Why This Approach
The DAG model enables complex multi-stage pipelines (e.g., video→frames→faces→embeddings), Celery provides reliable task execution, and Ray handles ML inference at scale.
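
For example, the video pipeline above could be declared as a sequence of tiers; the extractor and collection names here are hypothetical.

```python
# Hypothetical declaration of the video→frames→faces→embeddings pipeline as DAG tiers.
PIPELINE = [
    {"tier": 0, "source": "bucket:videos",     "extractor": "frame_sampler", "output": "frames"},
    {"tier": 1, "source": "collection:frames", "extractor": "face_detector", "output": "faces"},
    {"tier": 2, "source": "collection:faces",  "extractor": "face_embedder", "output": "face_embeddings"},
]
```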
