    Content Processing

    Batch Processing

    Multi-tier DAG (directed acyclic graph) processing that transforms bucket objects into searchable documents through feature extraction.

    Why do anything?

    Raw objects need feature extraction to become searchable. Without batch processing, you can't generate embeddings or enrich content at scale.

    Why now?

    AI search requires vector embeddings. Manual processing doesn't scale.

    Why this feature?

    Multi-tier DAG processing handles complex pipelines: Tier 0 turns bucket objects into a collection (bucket→collection), and Tier N turns one collection into another (collection→collection). Celery workers execute the tasks and Ray runs inference.

    How It Works

    Batch processing uses a multi-tier DAG architecture. Tier 0 processes bucket objects; Tier N processes the outputs of upstream collections.

    1. Batch Creation: Create batch record, validate source and collection config

    2. Task Routing: Route to Celery process_tier queue based on tier level

    3. Feature Extraction: Ray engine runs feature extractor, generates embeddings

    4. Document Storage: Documents stored in Qdrant with vectors indexed
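
The four steps above can be sketched in plain Python to show how they hand off to one another. This is a minimal illustration, not Mixpeek's implementation: the `Batch` record, its field names, the routing message shape, and the hash-based "embedding" are all assumptions, and an in-memory dict stands in for Celery, Ray, and Qdrant.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Batch:
    """Hypothetical batch record; field names are illustrative only."""
    source_type: str          # "bucket" (Tier 0) or "collection" (Tier N)
    object_ids: List[str]
    tier: int
    batch_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"


def create_batch(source_type: str, object_ids: List[str], tier: int) -> Batch:
    # Step 1: validate the source and tier before creating the record.
    if source_type not in ("bucket", "collection"):
        raise ValueError(f"unknown source type: {source_type}")
    if tier < 0 or (tier == 0) != (source_type == "bucket"):
        raise ValueError("Tier 0 reads from a bucket; Tier N reads from a collection")
    if not object_ids:
        raise ValueError("batch has no objects")
    return Batch(source_type, object_ids, tier)


def route(batch: Batch) -> Dict[str, object]:
    # Step 2: build the message a worker queue would receive.
    # The message shape is an assumption; only the queue name comes from the text.
    return {"queue": "process_tier", "tier": batch.tier, "batch_id": batch.batch_id}


def extract_features(object_id: str, dims: int = 4) -> List[float]:
    # Step 3: stand-in for the feature extractor. A real extractor would run
    # a model; here a hash is turned into a small deterministic vector.
    digest = hashlib.sha256(object_id.encode()).digest()
    return [byte / 255 for byte in digest[:dims]]


def run_batch(batch: Batch, index: Dict[str, List[float]]) -> Dict[str, object]:
    message = route(batch)
    batch.status = "processing"
    for object_id in batch.object_ids:
        # Step 4: stand-in for a vector database upsert.
        index[object_id] = extract_features(object_id)
    batch.status = "completed"
    return message


index: Dict[str, List[float]] = {}
batch = create_batch("bucket", ["video-1.mp4", "video-2.mp4"], tier=0)
message = run_batch(batch, index)
print(batch.status, message["queue"], sorted(index))
```

Validating before any work is queued means a misconfigured batch fails at creation time rather than partway through extraction.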

    Why This Approach

    A DAG enables complex multi-stage pipelines (e.g., video→frames→faces→embeddings). Celery provides reliable task execution. Ray handles ML inference at scale.
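
To make the tier idea concrete, here is a small sketch that derives tier levels for the video→frames→faces→embeddings example. It assumes a collection's tier is one more than the highest tier among its upstream collections, and that a collection with no upstream reads from a bucket; the source does not spell out the numbering rule, and `assign_tiers` and the `pipeline` mapping are hypothetical names.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical pipeline: each collection lists its upstream collections.
# An empty list means the collection reads directly from a bucket.
pipeline: Dict[str, List[str]] = {
    "frames": [],              # bucket (video) -> frames
    "faces": ["frames"],       # frames -> faces
    "embeddings": ["faces"],   # faces -> embeddings
}


def assign_tiers(pipeline: Dict[str, List[str]]) -> Dict[str, int]:
    # Reject references to collections that were never declared.
    unknown = {up for ups in pipeline.values() for up in ups if up not in pipeline}
    if unknown:
        raise ValueError(f"undeclared upstream collections: {sorted(unknown)}")

    tiers: Dict[str, int] = {}
    # static_order() yields each collection after all of its upstreams and
    # raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(pipeline).static_order():
        upstreams = pipeline[name]
        tiers[name] = 0 if not upstreams else 1 + max(tiers[up] for up in upstreams)
    return tiers


tiers = assign_tiers(pipeline)
print(tiers)  # {'frames': 0, 'faces': 1, 'embeddings': 2}
```

Processing collections in this order guarantees that every tier runs only after the upstream documents it depends on exist.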

    Integration

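    # Assumes `client` is an initialized Mixpeek client and `collection_id` is the ID of an existing collection.
    batch = client.collections.trigger(collection_id=collection_id)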