Errors & Retries
Errors surface in two places:
- HTTP responses on the submit / get / update endpoints — validation, auth, rate-limit, or server issues.
- Step status inside a returned workflow: the request succeeded, but a step reached failed, expired, or canceled.
The orchestrator races providers internally, so transient provider failures (one worker crashing, one region being slow) are usually invisible to you — a different provider claims the job. You only see step-level failures once every viable provider has been exhausted.
HTTP error shape
Validation and error responses use the standard RFC 7807 application/problem+json shape (ProblemDetails / ValidationProblemDetails):
{
  "type": "https://tools.ietf.org/html/rfc9110#section-15.5.1",
  "title": "One or more validation errors occurred.",
  "status": 400,
  "errors": {
    "steps[0].input.resolution": [
      "The value '4k' is not valid for resolution."
    ]
  }
}
Fields:
- status: the HTTP status code (also in the response line).
- title: short, human-readable summary.
- detail: longer description when the orchestrator can give one.
- errors: present on 400 validation failures; maps JSON Pointer–style paths to lists of messages.
- instance / type: optional URIs identifying the specific occurrence and error class.
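As a rough illustration of how a client might consume this shape, the sketch below flattens a problem+json body into printable messages. The helper name summarize_problem and the fallback wording are hypothetical, not part of the API; only the field names come from the list above.

```python
import json


def summarize_problem(body: str) -> list[str]:
    """Flatten a problem+json body into human-readable messages.

    Hypothetical helper: only the field names (status, title, detail,
    errors) are taken from the documented response shape.
    """
    problem = json.loads(body)
    messages = []
    # `errors` is only present on 400 validation failures.
    for path, field_messages in problem.get("errors", {}).items():
        for message in field_messages:
            messages.append(f"{path}: {message}")
    if not messages:
        # `detail` is optional, so fall back to `title` when it is missing.
        summary = problem.get("detail") or problem.get("title") or "Unknown error"
        messages.append(f"{problem.get('status', '?')}: {summary}")
    return messages
```

For the sample body above, this would yield a single message keyed by steps[0].input.resolution.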
HTTP status taxonomy
| Code | Meaning | Retry? |
|---|---|---|
| 400 Bad Request | Body failed validation. The errors map tells you exactly which fields. | No. Fix the request. |
| 401 Unauthorized | Missing / expired / malformed bearer token. | No. Obtain a new token. |
| 403 Forbidden | Token is valid but can't perform this operation (recipe not enabled for your tier, mature content not permitted, etc.). | No. Request access; don't retry as-is. |
| 404 Not Found | Workflow or blob ID doesn't exist (or the token can't see it). | No. Check the ID. |
| 409 Conflict | The workflow is in a state that blocks this mutation (e.g. updating a step that already started). | Maybe. Refetch and reconcile. |
| 429 Too Many Requests | Rate-limit hit. | Yes, with backoff. |
| 5xx Server Error | Transient orchestrator issue. | Yes, with backoff. |
Retry guidance for 429 / 5xx: exponential backoff with jitter, capped at ~30 s between attempts, give up after ~5 tries. Don't retry 400 / 401 / 403 / 404 until you've fixed the underlying issue.
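The retry guidance above could be sketched roughly as follows. This is one possible reading, not a reference implementation: send is a hypothetical zero-argument callable returning (status, body), the 1-second base delay is an assumption, and "full jitter" (a uniform draw between 0 and the capped exponential delay) is just one common jitter strategy.

```python
import random
import time


def is_retryable(status: int) -> bool:
    """Per the status table: only 429 and 5xx are worth retrying as-is."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay for a zero-based attempt, never above `cap` seconds.

    The 1 s base is an assumption; the 30 s cap follows the guidance above.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send()` up to `max_attempts` times, backing off between tries.

    `send` is a hypothetical callable returning a (status, body) tuple.
    Returns the last response, whether or not it succeeded.
    """
    status, body = send()
    for attempt in range(max_attempts - 1):
        if not is_retryable(status):
            break
        sleep(backoff_delay(attempt))
        status, body = send()
    return status, body
```

A 409 is deliberately not retried here, since the table suggests refetching and reconciling state before trying again.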
Step-level failures
A 200 / 202 on submit means the workflow was accepted — individual steps can still fail later. The Workflow payload you get back (or that arrives by webhook) carries per-step status:
{
  "id": "wf_01HXYZ...",
  "status": "failed",
  "steps": [
    {
      "name": "0",
      "$type": "videoGen",
      "status": "failed",
      "jobs": [
        {
          "status": "failed",
          "reason": "no_provider_available",
          "blockedReason": null
        }
      ]
    }
  ]
}
Workflow / step / job statuses share the same enum: unassigned, preparing, scheduled, processing, succeeded, failed, expired, canceled. Terminal states are succeeded, failed, expired, and canceled; once reached, they do not change.
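A minimal sketch of how a client might act on these statuses, assuming the workflow payload has already been parsed into a dict. The helper names is_terminal and unsuccessful_steps are hypothetical; only the status strings and the steps / status keys come from the payload shown above.

```python
# Terminal statuses as listed above; once reached they do not change.
TERMINAL_STATUSES = frozenset({"succeeded", "failed", "expired", "canceled"})


def is_terminal(status: str) -> bool:
    """True when no further status changes are expected (safe to stop polling)."""
    return status in TERMINAL_STATUSES


def unsuccessful_steps(workflow: dict) -> list[dict]:
    """Return the steps that finished in a terminal state other than succeeded."""
    return [
        step
        for step in workflow.get("steps", [])
        if is_terminal(step.get("status", "")) and step["status"] != "succeeded"
    ]
```

A polling loop could stop as soon as is_terminal(workflow["status"]) holds, then inspect unsuccessful_steps(workflow) to see which steps need attention.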
The reason and blockedReason fields on failed jobs are the best hint at why:
| reason | What it means | Your move |
|---|---|---|
| no_provider_available | No provider can run this job with the given inputs (unusual resolution, unsupported duration, restricted region, etc.). | Relax inputs, try another provider/version, or retry later. |
| blocked | The job was blocked by content moderation. blockedReason explains further. | Don't retry the same input; rework the prompt or image. |
| timeout / expired | Job exceeded its internal deadline. | Safe to resubmit, possibly with a smaller workload. |
| canceled | Someone (you or an operator) canceled the workflow via DeleteWorkflow. | No retry unless you actually want to re-run it. |
When reason is absent, the failure is generic — safe to retry once with the same body.
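One way to encode this guidance in client code is a small lookup from reason to next action, sketched below. The Action enum and next_move function are hypothetical names, and sending unrecognized reasons to manual inspection is an assumption; the text above only covers the listed reasons and the absent case.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical client-side actions, named after the guidance above."""
    RELAX_INPUTS = "relax inputs, try another provider/version, or retry later"
    REWORK_INPUT = "rework the prompt or image; do not retry the same input"
    RESUBMIT = "resubmit, possibly with a smaller workload"
    DO_NOT_RETRY = "no retry unless a re-run is actually wanted"
    RETRY_ONCE = "retry once with the same body"
    INSPECT = "unrecognized reason; inspect manually"


REASON_ACTIONS = {
    "no_provider_available": Action.RELAX_INPUTS,
    "blocked": Action.REWORK_INPUT,
    "timeout": Action.RESUBMIT,
    "expired": Action.RESUBMIT,
    "canceled": Action.DO_NOT_RETRY,
}


def next_move(job: dict) -> Action:
    """Pick a follow-up action for a failed job based on its `reason` field."""
    reason = job.get("reason")
    if not reason:
        # Absent (or null) reason: generic failure, safe to retry once.
        return Action.RETRY_ONCE
    # Assumption: reasons not listed in the table are surfaced for inspection.
    return REASON_ACTIONS.get(reason, Action.INSPECT)
```

For the failed job in the sample payload, next_move would return Action.RELAX_INPUTS.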
Webhook retries
If you've registered callbacks, the orchestrator retries transient failures on your endpoint automatically. See Results & webhooks → Delivery semantics for the serialization guarantees you can rely on.
Common gotchas
- Blob URLs 403 after a few minutes. The signed URL expired; refetch the workflow (or call GetBlob) for a fresh one. This isn't a real failure.
- 202 after wait=90. The workflow didn't finish within the 100-second request timeout. Expected for video / training / large-batch jobs; continue via webhooks or polling.
- Step canceled unexpectedly. Check whether another process called DeleteWorkflow. The orchestrator itself only cancels on explicit request or when a dependent step already failed.