Quick Diagnosis
Call the diagnose endpoint to get a complete picture of a batch’s health:
- status and failure_category: programmatic failure classification (infrastructure, timeout, orphaned, pipeline)
- infrastructure_events: OOM, preemption, node failures, and Ray bugs, with timestamps
- per_tier: timing, submission params, and resource details per tier
- failed_documents_sample: first 10 failed documents with error details
- recommendations: actionable next steps based on the failure type
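Once the response has been parsed into a dictionary, the fields listed above can be condensed into a short triage summary. The sketch below is illustrative only: the HTTP call is omitted because the endpoint path is not given here, and the payload shape is an assumption (events as dictionaries with `type` and `timestamp` keys, recommendations as plain strings, a `FAILED` status value). The helper name `summarize_diagnosis` is hypothetical.

```python
from typing import Any


def summarize_diagnosis(report: dict[str, Any]) -> list[str]:
    """Condense a parsed diagnose response into short, readable lines."""
    lines = [
        f"status={report.get('status')} "
        f"failure_category={report.get('failure_category')}"
    ]
    # Assumed shape: each event is a dict with "type" and "timestamp" keys.
    for event in report.get("infrastructure_events", []):
        lines.append(f"event {event.get('type')} at {event.get('timestamp')}")
    failed = report.get("failed_documents_sample", [])
    lines.append(f"{len(failed)} failed document(s) in sample")
    # Assumed shape: recommendations is a list of plain strings.
    for rec in report.get("recommendations", []):
        lines.append(f"next step: {rec}")
    return lines


# Hypothetical payload, for illustration only.
example_report = {
    "status": "FAILED",
    "failure_category": "infrastructure",
    "infrastructure_events": [
        {"type": "oom", "timestamp": "2024-01-01T00:00:00Z"}
    ],
    "per_tier": {},
    "failed_documents_sample": [],
    "recommendations": ["Retry the failed tier with more resources"],
}

for line in summarize_diagnosis(example_report):
    print(line)
```

Using `.get()` throughout means a partial or empty response produces a short summary instead of raising `KeyError`.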
Common Failure Scenarios
Out of Memory (OOM)
The diagnose endpoint will show failure_category: "infrastructure" with an infrastructure event of type oom.
Fix: Retry the failed tier with more resources:
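Before retrying, it can help to confirm programmatically that the failure really was an OOM and to derive the larger resource request from the previous one. The following is a minimal sketch under stated assumptions: the `memory_gb` and `num_cpus` keys are hypothetical stand-ins for whatever resource fields the tier was submitted with, and the retry request itself is not shown because its endpoint and payload are not described here.

```python
from typing import Any


def is_oom_failure(report: dict[str, Any]) -> bool:
    """True when the diagnosis is an infrastructure failure with an oom event."""
    if report.get("failure_category") != "infrastructure":
        return False
    events = report.get("infrastructure_events", [])
    return any(event.get("type") == "oom" for event in events)


def bump_memory(resources: dict[str, Any], factor: int = 2) -> dict[str, Any]:
    """Return a copy of the resources with the memory request scaled up.

    "memory_gb" is a hypothetical key; substitute the real field name.
    """
    bumped = dict(resources)  # leave the caller's dict unmodified
    bumped["memory_gb"] = resources["memory_gb"] * factor
    return bumped


# Hypothetical values, for illustration only.
report = {
    "failure_category": "infrastructure",
    "infrastructure_events": [{"type": "oom"}],
}
previous = {"memory_gb": 8, "num_cpus": 2}

if is_oom_failure(report):
    print(bump_memory(previous))
```

Doubling is only a starting point; if the same tier fails with OOM again, the per-tier resource details in the diagnosis are a better guide than repeated doubling.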
Stuck Job
If a tier shows IN_PROGRESS but last_activity_at is stale (minutes old), the job may be stuck.
Fix: Run stuck detection, then cancel the stuck tier:
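The staleness condition described above can also be checked on the client side before calling stuck detection. This sketch assumes that each tier is a dictionary with a `status` key and an ISO 8601 `last_activity_at` string, and it uses an arbitrary 10-minute threshold; none of these details are confirmed by this guide. It is a pre-check only and does not replace the service's own stuck detection.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware datetime."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 onward.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def looks_stuck(
    tier: dict[str, Any],
    now: datetime,
    max_idle: timedelta = timedelta(minutes=10),  # arbitrary threshold
) -> bool:
    """True when the tier is IN_PROGRESS and idle for longer than max_idle."""
    if tier.get("status") != "IN_PROGRESS":
        return False
    last_activity = tier.get("last_activity_at")
    if last_activity is None:
        return False
    return now - parse_timestamp(last_activity) > max_idle


# Hypothetical tier record, for illustration only.
tier = {"status": "IN_PROGRESS", "last_activity_at": "2024-01-01T12:00:00Z"}
now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
print(looks_stuck(tier, now))
```

Passing `now` explicitly keeps the check deterministic; in practice it would be `datetime.now(timezone.utc)`.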
Duplicate Jobs
If multiple Ray jobs are running for the same extractor in a tier:
Cancel a Single Job (Not the Whole Batch)
If one extractor job in a multi-extractor tier is failing but others are fine:
Submission Parameters
Every batch tier now persists its submission parameters: the resources, GPU setting, plugins, and entrypoint used when the job was submitted. Access them in the batch response:
Stage Timing Breakdown
The batch progress now includes stage_history, a timing breakdown of each completed processing stage:
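A timing breakdown like this is most useful for spotting which stage dominates a batch. The sketch below assumes that `stage_history` is a list of dictionaries with `stage` and `duration_seconds` keys; both key names and the stage names in the sample are hypothetical and should be adjusted to the actual response.

```python
from typing import Any


def slowest_stages(
    stage_history: list[dict[str, Any]], top_n: int = 3
) -> list[tuple[str, float]]:
    """Return the top_n (stage, duration) pairs, longest first."""
    ranked = sorted(
        stage_history, key=lambda s: s["duration_seconds"], reverse=True
    )
    return [(s["stage"], s["duration_seconds"]) for s in ranked[:top_n]]


def share_of_total(stage_history: list[dict[str, Any]]) -> dict[str, float]:
    """Map each stage to its fraction of the total recorded time."""
    total = sum(s["duration_seconds"] for s in stage_history)
    if total == 0:
        return {}  # nothing recorded, so no meaningful fractions
    return {s["stage"]: s["duration_seconds"] / total for s in stage_history}


# Hypothetical stage_history, for illustration only.
history = [
    {"stage": "download", "duration_seconds": 30},
    {"stage": "inference", "duration_seconds": 150},
    {"stage": "write", "duration_seconds": 20},
]

print(slowest_stages(history, top_n=2))
print(share_of_total(history))
```

Only completed stages appear in the breakdown, so the fractions describe finished work rather than the batch as a whole.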

