Batch Processing
Multi-tier DAG processing that transforms bucket objects into searchable documents with feature extraction
Why do anything?
Raw objects need feature extraction to become searchable. Without batch processing, you can't generate embeddings or enrich content at scale.
Why now?
AI search requires vector embeddings. Manual processing doesn't scale.
Why this feature?
Multi-tier DAG processing handles complex pipelines: Tier 0 (bucket→collection) and Tier N (collection→collection), executed by Celery workers with Ray handling inference.
How It Works
Batch processing uses a multi-tier DAG architecture: Tier 0 processes raw bucket objects, while Tier N processes the outputs of upstream collections.
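
A rough sketch of the tier model is below. The `TierSpec` dataclass and `resolve_input` helper are illustrative names, not part of the actual API; the point is only that a tier reads either from a bucket (Tier 0) or from the documents emitted by an upstream collection (Tier N).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TierSpec:
    """Illustrative description of one tier in the processing DAG."""
    tier: int                                   # 0 = reads from a bucket, >= 1 = reads from a collection
    output_collection: str
    source_bucket: Optional[str] = None         # only meaningful for tier 0
    upstream_collection: Optional[str] = None   # only meaningful for tier >= 1


def resolve_input(spec: TierSpec) -> str:
    """Pick the input source for a tier: bucket objects at tier 0,
    documents produced by the upstream collection at tier N."""
    if spec.tier == 0:
        assert spec.source_bucket, "tier 0 must name a source bucket"
        return f"bucket:{spec.source_bucket}"
    assert spec.upstream_collection, "tier N must name an upstream collection"
    return f"collection:{spec.upstream_collection}"
```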
Batch Creation
Create the batch record and validate the source and collection configuration.
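
A minimal sketch of what batch creation and validation might look like; the `BatchRecord` fields and the `create_batch` helper are assumptions for illustration, not the service's actual schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class BatchRecord:
    """Hypothetical batch record created before any tasks are enqueued."""
    source: str       # e.g. "bucket:media-uploads" or "collection:frames"
    collection: str   # output collection the batch writes into
    tier: int
    batch_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"


def create_batch(source: str, collection: str, tier: int) -> BatchRecord:
    # Validate the source/collection pairing before persisting the record.
    if tier == 0 and not source.startswith("bucket:"):
        raise ValueError("tier 0 batches must read from a bucket")
    if tier > 0 and not source.startswith("collection:"):
        raise ValueError("tier N batches must read from an upstream collection")
    if not collection:
        raise ValueError("an output collection is required")
    return BatchRecord(source=source, collection=collection, tier=tier)
```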
Task Routing
Route the task to the Celery process_tier queue based on its tier level.
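
Celery's standard routing API can express this. The per-tier queue naming (`process_tier_<n>`) and the `pipeline.process_tier` task name below are assumptions for the sketch, not the pipeline's actual configuration.

```python
from celery import Celery

app = Celery("batch_processing", broker="redis://localhost:6379/0")


@app.task(name="pipeline.process_tier")
def process_tier(batch_id: str, tier: int) -> None:
    """Worker-side entry point; a real implementation would load the batch
    record, run extraction, and write documents to the output collection."""
    ...


def enqueue_batch(batch_id: str, tier: int) -> None:
    # Route the task to a per-tier queue (e.g. "process_tier_0", "process_tier_1")
    # so workers for each tier can be scaled independently.
    process_tier.apply_async(args=[batch_id, tier], queue=f"process_tier_{tier}")
```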
Feature Extraction
Run the feature extractor on the Ray inference engine to generate embeddings.
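
A sketch of tier-level extraction as a Ray actor, using a sentence-transformers model as a stand-in for whatever extractor the pipeline actually deploys; the actor keeps the model loaded so repeated micro-batches avoid reload cost.

```python
import ray
from sentence_transformers import SentenceTransformer

ray.init(ignore_reinit_error=True)


@ray.remote
class FeatureExtractor:
    """Ray actor that keeps the embedding model resident across requests."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def extract(self, texts: list[str]) -> list[list[float]]:
        # Encode a micro-batch of inputs into dense embedding vectors.
        return self.model.encode(texts).tolist()


extractor = FeatureExtractor.remote()
vectors = ray.get(extractor.extract.remote(["a sample object", "another object"]))
```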
Document Storage
Store the resulting documents in Qdrant with their vectors indexed for search.
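
Storage could look roughly like the following, using the qdrant-client API. The `documents` collection name, 384-dimension vectors (matching the MiniLM model above), and payload fields are placeholders.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create the output collection once; the vector size must match the extractor's output.
if not client.collection_exists("documents"):
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

# Upsert extracted documents with their embedding vectors and source metadata.
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"source": "bucket:media-uploads"}),
        PointStruct(id=2, vector=[0.2] * 384, payload={"source": "bucket:media-uploads"}),
    ],
)
```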
Why This Approach
The DAG model enables complex multi-stage pipelines (e.g., video→frames→faces→embeddings), Celery provides reliable task execution, and Ray handles ML inference at scale.
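
For example, the video pipeline above could be declared as a sequence of tiers; the extractor and collection names here are hypothetical.

```python
# Hypothetical declaration of the video→frames→faces→embeddings pipeline as DAG tiers.
PIPELINE = [
    {"tier": 0, "source": "bucket:videos",     "extractor": "frame_sampler", "output": "frames"},
    {"tier": 1, "source": "collection:frames", "extractor": "face_detector", "output": "faces"},
    {"tier": 2, "source": "collection:faces",  "extractor": "face_embedder", "output": "face_embeddings"},
]
```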
