Create a new batch for grouping bucket objects.
REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.
"Bearer YOUR_API_KEY"
"Bearer sk_xxxxxxxxxxxxx"
REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'
"ns_abc123def456"
"production"
"my-namespace"
The unique identifier of the bucket.
Skip object existence validation. Use this for large batches (>10k objects) or when you're certain all object IDs are valid. Improves performance significantly.
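To make the request shape concrete, here is a minimal sketch of assembling (not sending) the create-batch call. The URL path, the namespace header name, and the `skip_validation` query-parameter name are illustrative assumptions, not confirmed API details; check the Mixpeek API reference for the exact route and parameter names.

```python
import json

def build_create_batch_request(api_key, namespace, bucket_id, object_ids,
                               skip_validation=False):
    # Assemble (but do not send) the pieces of a create-batch call.
    # NOTE: the URL path, the "X-Namespace" header name, and the
    # skip_validation query parameter are illustrative guesses.
    if not object_ids:
        raise ValueError("a batch needs at least one object ID")
    headers = {
        "Authorization": f"Bearer {api_key}",  # e.g. Bearer sk_xxxxxxxxxxxxx
        "X-Namespace": namespace,              # namespace name or ns_... ID
        "Content-Type": "application/json",
    }
    params = {"skip_validation": "true"} if skip_validation else {}
    url = f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches"  # assumed route
    return url, headers, params, json.dumps({"object_ids": object_ids})

url, headers, params, body = build_create_batch_request(
    "sk_xxxxxxxxxxxxx", "my-namespace", "bkt_videos",
    ["obj_video_001", "obj_video_002"], skip_validation=True)
```

Passing `skip_validation=True` only makes sense for large batches (>10k objects) where you already know every object ID is valid.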
Request model for creating a new batch.
Batches group bucket objects for processing into collections. When you submit a batch, all objects in the batch are processed through the collections associated with the bucket.
Batch Processing Flow: create the batch (DRAFT), submit it, and the objects are processed through the bucket's associated collections tier by tier.
Examples:
Single object batch: {"object_ids": ["obj_123"]}
Multiple objects batch: {"object_ids": ["obj_123", "obj_456", "obj_789"]}
REQUIRED. List of object IDs to include in the batch. Objects must exist in the bucket where the batch is created. Minimum 1 object, no maximum limit. All objects will be processed when the batch is submitted. Collections with collection sources (decomposition trees) are processed automatically via DAG resolution - no need to create separate batches.
["object_789", "object_101"]
[
  "obj_video_001",
  "obj_video_002",
  "obj_video_003"
]
Successful Response
Model representing a batch of objects for processing through collections.
A batch groups bucket objects together for processing through one or more collections. Batches support multi-tier processing where collections are processed in dependency order (e.g., bucket → chunks → frames → scenes). Each tier has independent task tracking.
Use Cases:
- Process multiple objects through collections in a single batch
- Track progress of multi-tier decomposition pipelines
- Monitor and retry individual processing tiers
- Query batch status and tier-specific task information
Lifecycle:
1. Created in DRAFT status with object_ids
2. Submitted for processing → status changes to PENDING
3. Each tier processes sequentially (tier 0 → tier 1 → ... → tier N)
4. Batch completes when all tiers finish (status=COMPLETED) or any tier fails (status=FAILED)
Multi-Tier Processing:
- Tier 0: Bucket objects → Collections (bucket as source)
- Tier N (N > 0): Collection documents → Collections (upstream collection as source)
- Each tier gets independent task tracking via the tier_tasks array
- Processing proceeds tier-by-tier with automatic chaining
Requirements:
- batch_id: OPTIONAL (auto-generated if not provided)
- bucket_id: REQUIRED
- status: OPTIONAL (defaults to DRAFT)
- object_ids: REQUIRED for processing (must have at least 1 object)
- collection_ids: OPTIONAL (discovered via DAG resolution)
- tier_tasks: OPTIONAL (populated during processing)
- current_tier: OPTIONAL (set during processing)
- total_tiers: OPTIONAL (defaults to 1, set during DAG resolution)
- dag_tiers: OPTIONAL (populated during DAG resolution)
REQUIRED. Unique identifier of the bucket containing the objects to process. Must be a valid bucket ID that exists in the system. All object_ids must belong to this bucket. Format: Bucket ID as defined when bucket was created.
"bkt_videos"
"bkt_documents_q4"
OPTIONAL (auto-generated if not provided). Unique identifier for this batch. Format: 'btch_' prefix followed by 12-character secure token. Generated using generate_secure_token() from shared.utilities.helpers. Used to query batch status and track processing across tiers. Immutable after creation.
"btch_abc123xyz789"
"btch_video_batch_01"
OPTIONAL (defaults to DRAFT). Current processing status of the batch. Lifecycle: DRAFT → PENDING → IN_PROGRESS → COMPLETED/FAILED. DRAFT: Batch created but not yet submitted. PENDING: Batch submitted and queued for processing. IN_PROGRESS: Batch currently processing (one or more tiers active). COMPLETED: All tiers successfully completed. FAILED: One or more tiers failed. Aggregated from tier_tasks statuses during multi-tier processing.
Allowed values: PENDING, QUEUED, IN_PROGRESS, PROCESSING, COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELED, UNKNOWN, SKIPPED, DRAFT, ACTIVE, ARCHIVED, SUSPENDED
"DRAFT"
"PENDING"
"IN_PROGRESS"
"COMPLETED"
"FAILED"
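Since the status lifecycle above ends in a terminal state, a client typically polls until one is reached. A minimal, transport-agnostic sketch (the `fetch_status` callable stands in for whatever GET-batch wrapper you use; no specific endpoint is assumed):

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED"}

def wait_for_batch(fetch_status, poll_seconds=5.0, timeout=3600.0, sleep=time.sleep):
    # fetch_status: zero-argument callable returning the batch's current
    # status string, e.g. a thin wrapper around your GET-batch call.
    waited = 0.0
    while waited <= timeout:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("batch did not reach a terminal state in time")
```

Injecting `sleep` keeps the loop testable; in production the defaults (5 s poll, 1 h timeout) are reasonable starting points.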
List of object IDs to include in this batch. All objects must exist in the specified bucket_id. These objects are the source data for tier 0 processing. Collection-sourced batches may have empty object_ids. Objects are processed in parallel within each tier.
["obj_video_001", "obj_video_002"]
["obj_doc_123"]
OPTIONAL. List of all collection IDs involved in this batch's processing. Automatically populated during DAG resolution from dag_tiers. Includes collections from all tiers (flattened view of dag_tiers). Used for quick lookups without traversing tier structure. Format: List of collection IDs across all tiers.
["col_chunks"]
OPTIONAL. Legacy error message field for backward compatibility. None if the batch succeeded or is still processing. For multi-tier batches, typically contains the human-readable error description from the first failed tier. DEPRECATED: use tier_tasks[].errors for tier-specific error details and error_summary for aggregation.
"Failed to process batch: Object not found"
OPTIONAL. Human-readable explanation of why the batch failed. None if batch succeeded, is still processing, or is in DRAFT/PENDING state. Populated automatically when a batch transitions to FAILED status. Provides a concise, actionable summary of the root cause. Common reasons include: Ray job failure (spot preemption, OOM, code errors), 0 documents written (processing completed but produced no output), processing stall (no activity detected for extended period), or Celery task exception (submission/validation failures). Use this field for user-facing error displays and alerting.
"Ray job failed: ImportError: No module named 'google.genai'"
OPTIONAL. Aggregated summary of errors across ALL tiers in the batch. None if batch succeeded or is still processing. Maps error_type (category) to total count of affected documents across all tiers. Provides quick batch-wide overview of error distribution. Example: {'dependency': 15, 'authentication': 25, 'validation': 5} means across all tiers, 15 documents failed with dependency errors, 25 with auth errors, 5 with validation errors. Automatically aggregated from tier_tasks[].error_summary. Used for batch health dashboard and error trend analysis.
null
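The aggregation described above (per-tier error counts rolled up by error_type) can be sketched as follows; `tier_tasks` here is assumed to be a list of dicts shaped like the tier_tasks array, each optionally carrying its own `error_summary`:

```python
from collections import Counter

def aggregate_error_summary(tier_tasks):
    # Roll up error_type -> total affected documents across every
    # tier's own error_summary, mirroring the documented aggregation.
    totals = Counter()
    for tier in tier_tasks:
        totals.update(tier.get("error_summary") or {})
    return dict(totals) or None  # None when no tier reported errors
```

For example, tiers reporting `{'dependency': 10, 'authentication': 25}` and `{'dependency': 5, 'validation': 5}` aggregate to `{'dependency': 15, 'authentication': 25, 'validation': 5}`.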
OPTIONAL (defaults to BUCKET). Type of batch. BUCKET: Standard batch processing bucket objects through collections. COLLECTION: Reserved for future collection-only batch processing. Currently only BUCKET type is supported.
Allowed values: BUCKET, COLLECTION
"BUCKET"
OPTIONAL. S3 key where the batch manifest is stored. Contains metadata and row data (Parquet) for Engine processing. For tier 0, points to bucket object manifest. For tier N+, points to collection document manifest. Format: S3 path (e.g., 'namespace_id/internal_id/manifests/tier_0.parquet'). Generated during batch submission.
"ns_abc/org_123/manifests/tier_0.parquet"
OPTIONAL. Primary task ID for the batch (typically tier 0 task). Used for backward compatibility with single-tier batch tracking. For multi-tier batches, prefer querying tier_tasks array for granular tracking. Format: Task ID as generated for tier 0.
"task_tier0_abc123"
OPTIONAL. List of object IDs that were successfully validated and loaded into the batch. Subset of object_ids that passed validation. Used to track which objects are ready for processing. None if batch hasn't been validated yet.
["obj_video_001", "obj_video_002"]
OPTIONAL. Internal engine/job metadata for system use. May contain: job_id (provider-specific), engine_version, processing hints, last_health_check. last_health_check holds the most recent health check results: health_status, enriched_documents, vector_populated_count, stall_duration_seconds, recommendations, missing_features. Populated asynchronously via a Celery task (non-blocking, best-effort). Used for troubleshooting batch processing issues via the API. NOTE: in MongoDB this is stored under the '_internal.processing' path.
{
"include_history": true,
"last_health_check": {
"enriched_documents": 98,
"health_status": "HEALTHY",
"missing_features": ["text_embedding"],
"processed_documents": 100,
"recommendations": [],
"stall_duration_seconds": 0,
"timestamp": "2025-11-06T10:05:00Z",
"total_documents": 100,
"vector_populated_count": 98
}
}
OPTIONAL. User-defined metadata for the batch. Arbitrary key-value pairs for user tracking and organization. Persisted with the batch and returned in API responses. Not used by the system for processing logic. Examples: campaign_id, user_email, processing_notes.
{
"campaign_id": "Q4_2025",
"priority": "high"
}
{
"project": "video_analysis",
"user_email": "user@example.com"
}
OPTIONAL. List of tier task tracking information for multi-tier processing. Each element represents one tier in the processing pipeline. Empty array for simple single-tier batches. Populated during batch submission with tier 0 info, then appended as tiers progress. Each TierTaskInfo contains: tier_num, task_id, status, collection_ids, timestamps. Used for granular monitoring: 'Show me status of tier 2' or 'Retry tier 1'. Array index typically matches tier_num (tier_tasks[0] = tier 0, tier_tasks[1] = tier 1, etc.).
[]
[
{
"collection_ids": ["col_chunks"],
"status": "COMPLETED",
"task_id": "task_tier0_abc",
"tier_num": 0
}
]
OPTIONAL. Zero-based index of the currently processing tier. None if batch hasn't started processing (status=DRAFT or PENDING). Updated as batch progresses through tiers. Used to show processing progress: 'Processing tier 2 of 5'. Set to last tier number when batch completes. Example: If processing tier 1 (frames), current_tier=1.
x >= 0
0
OPTIONAL (defaults to 1). Total number of tiers in the collection DAG. Minimum 1 (tier 0 only = bucket → collection). Set during DAG resolution when batch is submitted. Equals len(dag_tiers) if dag_tiers is populated. Used to calculate progress: current_tier / total_tiers. Example: 5-tier pipeline (bucket → chunks → frames → scenes → summaries) has total_tiers=5.
x >= 1
1
3
5
OPTIONAL. Complete DAG tier structure for this batch. List of tiers, where each tier is a list of collection IDs to process at that stage. Tier 0 = bucket-sourced collections. Tier N (N > 0) = collection-sourced collections. Collections within same tier have no dependencies (can run in parallel). Collections in tier N+1 depend on collections in tier N. Populated during DAG resolution at batch submission. Used for tier-by-tier processing orchestration. Example: [['col_chunks'], ['col_frames', 'col_objects'], ['col_scenes']] = 3 tiers where frames and objects run in parallel at tier 1.
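The relationships stated above (total_tiers equals len(dag_tiers); collection_ids is the flattened view of dag_tiers) can be sketched directly:

```python
def resolve_dag_fields(dag_tiers):
    # total_tiers == len(dag_tiers), with a minimum of 1 tier;
    # collection_ids is the flattened, tier-ordered view of dag_tiers.
    total_tiers = max(len(dag_tiers), 1)
    collection_ids = [cid for tier in dag_tiers for cid in tier]
    return total_tiers, collection_ids

tiers, cols = resolve_dag_fields(
    [["col_chunks"], ["col_frames", "col_objects"], ["col_scenes"]])
# 3 tiers; col_frames and col_objects sit in the same tier, so they
# have no dependency on each other and can run in parallel.
```

This is a client-side reconstruction of the documented invariants, not the server's DAG-resolution code.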
[["col_chunks"]]
OPTIONAL (auto-set on creation). ISO 8601 timestamp when batch was created. Set using current_time() from shared.utilities.helpers. Immutable after creation. Used for batch age tracking and cleanup of old batches.
"2025-11-03T10:00:00Z"
OPTIONAL. Live progress snapshot updated approximately every 10 seconds while the batch is IN_PROGRESS. Written by the Ray ProgressActor inside the engine job. None when status is DRAFT or PENDING (job not started), or after COMPLETED/FAILED. Use this to show real-time progress bars: processed/total objects, percent complete, throughput (items_per_second), and estimated time remaining (eta_seconds).
null
OPTIONAL. Computed health status for actively processing batches. Only populated when status is PROCESSING or IN_PROGRESS. Values: 'healthy' (recent activity detected), 'stalled' (no activity for 5+ minutes), 'unknown' (no heartbeat data yet). Computed from tier_tasks[].last_activity_at and updated_at. Use this to detect stuck batches before the internal stall detector kills them.
"healthy"
OPTIONAL. Timestamp of the most recent activity across all tier tasks. Aggregated from tier_tasks[].last_activity_at — the latest heartbeat from any tier. Updated approximately every 10 seconds by the BatchJobPoller while processing. A stale value (minutes old) while status is PROCESSING indicates the batch may be stalled. None for batches that have not started processing or have no heartbeat data.
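The health classification described above (healthy / stalled after 5+ minutes of silence / unknown without heartbeat data) can be reproduced client-side from status and the last-activity timestamp. This is a sketch of the documented rule, not the server's internal stall detector:

```python
from datetime import datetime, timedelta, timezone

STALL_THRESHOLD = timedelta(minutes=5)

def compute_health(status, last_activity_at, now=None):
    # Health only applies to actively processing batches.
    if status not in ("PROCESSING", "IN_PROGRESS"):
        return None
    if last_activity_at is None:
        return "unknown"  # no heartbeat data yet
    now = now or datetime.now(timezone.utc)
    age = now - last_activity_at
    return "stalled" if age >= STALL_THRESHOLD else "healthy"
```

A "stalled" result while the server still reports PROCESSING is the cue to investigate before the internal stall detector kills the batch.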
OPTIONAL (defaults to 0). Number of times this batch has been auto-retried due to transient infrastructure failures (spot node preemption, OOM, actor death). Incremented each time the batch is automatically requeued after a retryable failure. User-facing: lets users see that retries happened transparently.
x >= 0
0
1
2
3
OPTIONAL (defaults to 3). Maximum number of automatic retries for transient failures. When retry_count reaches max_retries, the batch stays in FAILED state. Only transient/infrastructure failures trigger retries — validation and data errors do not.
x >= 0
3
5
OPTIONAL. ISO 8601 timestamp of the most recent auto-retry attempt. None if the batch has never been retried. Used to calculate exponential backoff for subsequent retries.
null
OPTIONAL. Human-readable reason for the most recent auto-retry. None if the batch has never been retried. Describes the transient failure that triggered the retry (e.g., 'Spot node preempted', 'Ray actor died', 'OOM killed').
null
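The retry fields above combine into a simple decision: retry only transient failures, only below max_retries, with exponentially growing delays. The exact backoff formula is not published here, so the doubling schedule and the base/cap values below are illustrative assumptions:

```python
def next_retry_delay_seconds(retry_count, base=60, cap=3600):
    # Assumed schedule: delay doubles per prior attempt
    # (60s, 120s, 240s, ...), capped at one hour.
    return min(base * (2 ** retry_count), cap)

def should_auto_retry(retry_count, max_retries, is_transient):
    # Only transient/infrastructure failures (spot preemption, OOM,
    # actor death) retry, and only while retry_count < max_retries;
    # validation and data errors never retry.
    return is_transient and retry_count < max_retries
```

Once `retry_count` reaches `max_retries` (default 3), the batch remains FAILED.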
OPTIONAL. URL to receive an HTTP POST notification when the batch reaches a terminal state (COMPLETED, FAILED, or CANCELED). Set at submit time via SubmitBatchRequest. The webhook is fire-and-forget: delivery failures are logged but never affect batch processing.
"https://example.com/webhooks/batch-complete"
OPTIONAL (auto-updated). ISO 8601 timestamp when batch was last modified. Updated using current_time() whenever batch status or tier_tasks change. Used to track batch activity and identify stale batches.
"2025-11-03T10:30:00Z"
COMPUTED. Human-readable description of the current batch state. Examples: 'Processing 724/50,000 objects (1.4%)', 'Queued — 2 batches ahead', 'Completed in 5m 23s', 'Loading model (stage 1/3)'. Computed on read, not stored in the database.
COMPUTED. Estimated completion timestamp based on current throughput. Derived from progress.eta_seconds + now. None if throughput data is unavailable. Computed on read, not stored in the database.