Batch Diagnostics & Troubleshooting

The Mixpeek API provides complete observability into batch processing jobs. You can diagnose issues, cancel stuck jobs, retry failed tiers with modified resources, and trigger self-healing — all through the API.

Quick Diagnosis

Call the diagnose endpoint to get a complete picture of a batch’s health:

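import requests

# Replace with your own identifiers.
bucket_id = "YOUR_BUCKET_ID"
batch_id = "YOUR_BATCH_ID"

response = requests.get(
    f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches/{batch_id}/diagnose",
    headers={"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"},
)
response.raise_for_status()  # surface HTTP errors before parsing the body
diagnostic = response.json()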

print(f"Status: {diagnostic['status']}")
print(f"Failure category: {diagnostic['failure_category']}")
print(f"Failed docs: {diagnostic['failed_document_count']}")
for rec in diagnostic['recommendations']:
    print(f"  → {rec}")

The response includes:

status and failure_category — programmatic failure classification (infrastructure, timeout, orphaned, pipeline)
infrastructure_events — OOM, preemption, node failures, Ray bugs with timestamps
per_tier — timing, submission params, and resource details per tier
failed_documents_sample — first 10 failed documents with error details
recommendations — actionable next steps based on the failure type
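
The fields listed above can be collapsed into a compact summary for logging or alerting. The following is a minimal sketch, not part of the Mixpeek API: `summarize_diagnostic` is a hypothetical helper, and it assumes only that the collection fields are sized containers (lists or dicts) that may be missing or null.

```python
def summarize_diagnostic(diagnostic: dict) -> dict:
    """Reduce a diagnose response to a few values suitable for a log line.

    Hypothetical helper. Field names are taken from the list above;
    the shape of each collection's items is not assumed.
    """
    return {
        "status": diagnostic.get("status"),
        "failure_category": diagnostic.get("failure_category"),
        # `or []` tolerates fields that are absent or explicitly null.
        "infrastructure_event_count": len(diagnostic.get("infrastructure_events") or []),
        "tier_count": len(diagnostic.get("per_tier") or []),
        "sampled_failures": len(diagnostic.get("failed_documents_sample") or []),
        "recommendations": list(diagnostic.get("recommendations") or []),
    }
```

Because every lookup uses `.get`, the helper still returns a usable summary when a field is absent from the response.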

Common Failure Scenarios

Out of Memory (OOM)

The diagnose endpoint reports failure_category: "infrastructure" together with an infrastructure event of type oom.

Fix: Retry the failed tier with modified resources:

requests.post(
    f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches/{batch_id}/tiers/{tier_num}/retry",
    headers={"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"},
    json={"requires_gpu": True, "priority": 50}
)
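
Before retrying, it can help to confirm that the failure really was an OOM. This is a sketch under one assumption: each entry in infrastructure_events is taken to be a dict that carries the event type under a `type` key, which the description above suggests but does not spell out. `has_oom_event` is a hypothetical helper.

```python
def has_oom_event(diagnostic: dict) -> bool:
    """Return True if the diagnose response points at an out-of-memory failure.

    Assumption: events are dicts with a "type" key (not confirmed by the docs).
    """
    if diagnostic.get("failure_category") != "infrastructure":
        return False
    events = diagnostic.get("infrastructure_events") or []
    return any(isinstance(e, dict) and e.get("type") == "oom" for e in events)
```

A caller could gate the retry request on `has_oom_event(diagnostic)` so that other infrastructure failures, such as preemption, are not retried with the same resource changes.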

Stuck Job

If a tier shows IN_PROGRESS but its last_activity_at has not changed for several minutes, the job may be stuck.

Fix: Run stuck detection, then cancel the stuck tier:

# Detect stuck jobs
requests.post(
    f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches/{batch_id}/tiers/{tier_num}/heal",
    headers={"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"},
    json={"action": "detect_stuck"}
)

# Cancel the stuck tier
requests.post(
    f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches/{batch_id}/tiers/{tier_num}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}
)
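
The staleness check can also be done on the client before calling either endpoint. The sketch below assumes last_activity_at is an ISO 8601 timestamp and treats a value without a timezone as UTC; neither the format nor the five-minute default threshold is specified in this guide, and `is_stale` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_activity_at: str,
    now: Optional[datetime] = None,
    threshold: timedelta = timedelta(minutes=5),  # illustrative default
) -> bool:
    """Return True if the last activity is older than `threshold`.

    Assumption: `last_activity_at` is ISO 8601, e.g. "2024-01-01T12:00:00Z".
    """
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    last = datetime.fromisoformat(last_activity_at.replace("Z", "+00:00"))
    if last.tzinfo is None:
        last = last.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last > threshold
```

Passing `now` explicitly keeps the function deterministic in tests; in production code it can be left unset.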

Duplicate Jobs

If multiple Ray jobs are running for the same extractor in a tier, call the heal endpoint with the kill_duplicates action:

requests.post(
    f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches/{batch_id}/tiers/{tier_num}/heal",
    headers={"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"},
    json={"action": "kill_duplicates"}
)

Cancel a Single Job (Not the Whole Batch)

If one extractor job in a multi-extractor tier is failing but the others are fine, cancel only that job:

requests.post(
    f"https://api.mixpeek.com/v1/buckets/{bucket_id}/batches/{batch_id}/tiers/{tier_num}/jobs/{ray_job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}
)

Submission Parameters

Every batch tier now persists its submission parameters — the resources, GPU setting, plugins, and entrypoint used when the job was submitted. Access them in the batch response:

{
  "tier_tasks": [{
    "submission_params": {
      "entrypoint": "python -m engine.pipelines.entrypoint",
      "deployment_mode": "gke",
      "requires_gpu": true,
      "num_cpus": 1,
      "memory_bytes": 8589934592,
      "priority": 100,
      "plugin_archives": null,
      "extractor_name": "universal_extractor_v1"
    }
  }]
}
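
The raw values are easier to compare across tiers once memory_bytes is converted to GiB. A small sketch using only the fields shown in the example above; `describe_resources` is a hypothetical helper, not part of the API.

```python
def describe_resources(params: dict) -> str:
    """Render a submission_params object as a one-line resource summary."""
    memory_gib = params["memory_bytes"] / 2**30  # bytes to GiB
    accelerator = "GPU" if params.get("requires_gpu") else "CPU-only"
    return (
        f"{params['extractor_name']}: {params['num_cpus']} CPU, "
        f"{memory_gib:.1f} GiB, {accelerator}, priority {params['priority']}"
    )


example = {
    "requires_gpu": True,
    "num_cpus": 1,
    "memory_bytes": 8589934592,
    "priority": 100,
    "extractor_name": "universal_extractor_v1",
}
print(describe_resources(example))
# universal_extractor_v1: 1 CPU, 8.0 GiB, GPU, priority 100
```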

Stage Timing Breakdown

The batch progress now includes stage_history — a timing breakdown of each completed processing stage:

{
  "progress": {
    "stage_history": [
      {"name": "loading", "duration_seconds": 2.5},
      {"name": "processing", "duration_seconds": 45.3},
      {"name": "writing", "duration_seconds": 8.1}
    ]
  }
}

Use this to identify bottlenecks — if “processing” takes 90% of the time, the extractor itself is the bottleneck. If “writing” is slow, the vector store may be under pressure.
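
The share of time spent in each stage can be computed directly from stage_history. The sketch below uses the example values above; `stage_shares` and `slowest_stage` are hypothetical helpers.

```python
def stage_shares(stage_history: list) -> dict:
    """Return each stage's fraction of the total recorded duration."""
    total = sum(stage["duration_seconds"] for stage in stage_history)
    if total == 0:
        return {}  # nothing recorded yet, avoid dividing by zero
    return {
        stage["name"]: stage["duration_seconds"] / total
        for stage in stage_history
    }


def slowest_stage(stage_history: list) -> str:
    """Return the name of the stage with the longest duration."""
    return max(stage_history, key=lambda stage: stage["duration_seconds"])["name"]


history = [
    {"name": "loading", "duration_seconds": 2.5},
    {"name": "processing", "duration_seconds": 45.3},
    {"name": "writing", "duration_seconds": 8.1},
]
for name, share in stage_shares(history).items():
    print(f"{name}: {share:.0%}")
print("bottleneck:", slowest_stage(history))
```

For the example values, "processing" accounts for roughly 81% of the total 55.9 seconds, so the extractor would be the first place to look.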
