
Transcription

The transcription step type takes an audio or video URL and returns the spoken text, using Qwen3-ASR-1.7B for recognition plus Qwen3-ForcedAligner-0.6B for timestamp alignment. It handles dozens of languages out of the box, auto-detects the spoken language, and can return phrase-level timestamps (one entry per spoken phrase/clause, each containing multiple words) suitable for captions and seek-aware UIs.

Common uses:

  • Transcribe podcasts, interviews, voice memos
  • Generate captions (SRT / VTT) for video content via timestamps
  • Feed speech into text-processing pipelines (summarization, search indexing)
  • Pull the dialogue out of an existing video

Prerequisites

  • A Civitai orchestration token (Quick start → Prerequisites)
  • A publicly-fetchable audio or video URL (.mp3, .wav, .m4a, .mp4 with an audio track, etc.). Civitai CDN URLs work directly.

The simplest request

Use the per-recipe endpoint when you just need the text from one piece of audio:

http
POST https://orchestration.civitai.com/v2/consumer/recipes/transcription?wait=0
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "mediaUrl": "https://.../interview.mp3"
}

Defaults: language is auto-detected and phrase-level timestamps are returned. The response is a full Workflow whose single step carries the transcript in output.text.
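
As a sketch, the same call from Python using the requests library. This assumes that omitting wait=0 makes the request block until the step completes; for long audio, use wait=0 and poll instead (see below).

python
# Minimal synchronous call; placeholders (<your-token>, media URL) are yours to fill in.
import requests

resp = requests.post(
    "https://orchestration.civitai.com/v2/consumer/recipes/transcription",
    headers={"Authorization": "Bearer <your-token>"},
    json={"mediaUrl": "https://.../interview.mp3"},
    timeout=120,
)
resp.raise_for_status()
workflow = resp.json()
# The single transcription step carries the transcript in output.text.
print(workflow["steps"][0]["output"]["text"])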

Choosing a source URL

The URL must be publicly fetchable and served by a host that supports HTTP range requests with consistent responses across requests (ffprobe streams and seeks rather than downloading the whole file). Sites that inject per-request session cookies (common on wp-content/uploads endpoints behind AWS ALBs) often break the seek and fail with "Failed to read frame size: Could not seek to N". CDN-served files (jsdelivr, GitHub raw, S3 without redirect) are safe defaults; the Civitai CDN works directly.
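
If you want to sanity-check a host before submitting, a ranged GET is a rough local heuristic (not part of the API):

python
# Rough pre-flight check: does the host honor HTTP range requests?
import requests

url = "https://.../interview.mp3"  # your candidate mediaUrl
r = requests.get(url, headers={"Range": "bytes=0-1023"}, stream=True, timeout=30)
# 206 Partial Content with a Content-Range header (and no per-request
# Set-Cookie) is a good sign; a plain 200 means the Range header was ignored.
print(r.status_code, r.headers.get("Content-Range"), r.headers.get("Set-Cookie"))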

Use wait=0 for long audio

Billing is computed per 30 s of audio (minimum 1 unit). Processing time scales with audio length but runs well below real time (see Runtime below); queue wait adds on top. Anything longer than ~90 s of audio is a candidate for wait=0 + polling, as in the sketch below: a long file plus queue wait can blow past the 100-second synchronous request timeout.
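
A minimal polling sketch. It assumes the submitted workflow can be fetched by its id via GET /v2/consumer/workflows/{id} and that the response carries the id and status fields shown in the example output below; check the workflows reference for the exact status endpoint.

python
# Submit with wait=0, then poll until the workflow resolves.
import time
import requests

BASE = "https://orchestration.civitai.com/v2/consumer"
HEADERS = {"Authorization": "Bearer <your-token>"}

resp = requests.post(
    f"{BASE}/recipes/transcription",
    params={"wait": 0},
    headers=HEADERS,
    json={"mediaUrl": "https://.../long-interview.mp3"},
)
resp.raise_for_status()
workflow_id = resp.json()["id"]  # assumes the queued workflow echoes its id

while True:
    wf = requests.get(f"{BASE}/workflows/{workflow_id}", headers=HEADERS).json()
    if wf["status"] in ("succeeded", "failed", "canceled"):  # assumed terminal states
        break
    time.sleep(5)  # transcription runs faster than real time; 5 s is plenty

if wf["status"] == "succeeded":
    print(wf["steps"][0]["output"]["text"])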

Via the generic workflow endpoint

Equivalent request through SubmitWorkflow — use this path when you need webhooks, tags, or to chain with other steps:

http
POST https://orchestration.civitai.com/v2/consumer/workflows?wait=0
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "steps": [{
    "$type": "transcription",
    "input": {
      "mediaUrl": "https://.../interview.mp3",
      "returnTimeStamps": true
    }
  }]
}

Input fields

See the TranscriptionInput schema for the full definition.

  • mediaUrl — required. URL of the audio (or video with an audio track). Must be publicly fetchable without auth; ffprobe must be able to stream and seek the response (see the tip above).
  • language — optional, default auto-detect. ISO 639-1 hint like "en", "zh", "es", "ja". Omit to let the model detect; setting it anyway usually improves accuracy on short or noisy clips. Note that the output language comes back as a full English name ("English", "Spanish", …), not the ISO code.
  • context — optional. Free-text prompt describing the subject matter; helps the model spell unusual proper nouns, technical terms, or domain jargon correctly.
  • returnTimeStamps — optional, default true. Whether to return phrase-level startTime / endTime pairs. Leave true unless you don't need them; the extra cost is negligible.

Language hints

Auto-detection is reliable on clear speech but can flip on short clips, heavily accented speakers, or audio that starts with non-speech (music, silence). If you know the language upfront, set it:

json
{
  "mediaUrl": "https://.../audio.mp3",
  "language": "en"
}

The detected (or forced) language is returned in output.language — note it comes back as the full English name ("English", "Japanese", …), not the ISO code you passed in.

Context hints for accuracy

Provide a short free-text context to bias the model toward correct spellings for proper nouns, acronyms, or technical vocabulary. For a tech podcast:

json
{
  "mediaUrl": "https://.../podcast.mp3",
  "language": "en",
  "context": "Technical discussion about Kubernetes, CRDs, and Flux CD."
}

Think of context like a prompt passed to the ASR model — a sentence or two of topic / vocabulary hints usually helps more than a long verbose description.

Generating captions / SRT

Video files work as a mediaUrl too — pass an .mp4 (or any container FFmpeg understands) and the audio track is extracted automatically. Combine with returnTimeStamps: true to get everything you need to emit an SRT or VTT file:

json
{
  "mediaUrl": "https://.../clip.mp4",
  "returnTimeStamps": true
}

The output.timeStamps array holds one entry per spoken phrase/clause, each with { text, startTime, endTime } in seconds. Each entry maps naturally onto one caption line; split or merge entries client-side if you need shorter or longer cues (see the sketch below).
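
A sketch of the client-side conversion into SRT, assuming a workflow response shaped like the example under "Reading the result" below. The SRT cue numbering and HH:MM:SS,mmm timing format are standard:

python
# Convert output.timeStamps entries (one per phrase) into SRT cues.

def to_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(time_stamps: list[dict]) -> str:
    cues = []
    for i, entry in enumerate(time_stamps, start=1):
        cues.append(
            f"{i}\n"
            f"{to_srt_time(entry['startTime'])} --> {to_srt_time(entry['endTime'])}\n"
            f"{entry['text']}\n"
        )
    return "\n".join(cues)

# Usage, given a succeeded workflow response:
# srt = to_srt(workflow["steps"][0]["output"]["timeStamps"])
# open("captions.srt", "w", encoding="utf-8").write(srt)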

Reading the result

A successful transcription step emits the full transcript plus structured timing. Sample output for a short clip of JFK's inaugural address:

json
{
  "status": "succeeded",
  "steps": [{
    "name": "0",
    "$type": "transcription",
    "status": "succeeded",
    "output": {
      "text": "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.",
      "language": "English",
      "timeStamps": [
        { "text": "And so my fellow Americans",         "startTime": 0.32, "endTime": 2.16 },
        { "text": "ask not",                           "startTime": 3.28, "endTime": 4.32 },
        { "text": "what your country can do for you",  "startTime": 5.36, "endTime": 7.52 },
        { "text": "Ask what you can do for your country", "startTime": 8.16, "endTime": 10.48 }
      ],
      "elapsedSeconds": 0.876
    }
  }]
}

Fields:

  • text — the full transcript as one string, with punctuation and casing restored
  • language — the detected (or hinted) language as a full English name (e.g. "English", "Mandarin", "Spanish"). Not the ISO code you pass in.
  • timeStamps — one entry per phrase/clause (spans multiple words each); empty array if returnTimeStamps: false
  • elapsedSeconds — server-side model runtime (excludes queue wait — this is just the recognition pass)

Cost

Billed in Buzz on the workflow's transactions. Use whatif=true for an exact preview; see Payments (Buzz) for currency selection.

Duration-based with a minimum floor of 1:

total = max(1, ceil(durationSeconds / 30))
Audio length   Buzz
≤ 30 s         1
31–60 s        2
5 min          ~10
30 min         ~60
60 min         ~120

Transcription is the cheapest speech path Civitai exposes — every 30 seconds of source is one Buzz, rounded up. The language, context, and returnTimeStamps fields don't affect cost.
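
The same arithmetic in code, for a quick local estimate before submitting (whatif=true remains the authoritative preview):

python
from math import ceil

def transcription_buzz(duration_seconds: float) -> int:
    # One Buzz per started 30 s of source audio, minimum 1.
    return max(1, ceil(duration_seconds / 30))

assert transcription_buzz(12) == 1        # <= 30 s
assert transcription_buzz(45) == 2        # 31-60 s
assert transcription_buzz(5 * 60) == 10   # 5 min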

Runtime

Real-time factor (processing time ÷ audio length) is well below 1 on Qwen3-ASR: a 5-minute recording typically finishes in well under a minute of server-side compute, plus queue wait. Plan for wait=0 + polling on anything beyond ~90 s of audio.

Troubleshooting

  • 400 with "Unable to analyze audio file: … Failed to read frame size: Could not seek to N" — the host doesn't honor HTTP range requests or injects per-request session state (AWS ALB cookies, etc.), so ffprobe's streaming seek fails. Fix: use a CDN-served file (jsdelivr, GitHub raw, S3 direct, Civitai CDN) instead.
  • 400 with "Unable to analyze audio file" (no seek error) — the source couldn't be probed (corrupt file, wrong container, DNS failure, 403/404, redirect loop). Fix: verify the URL resolves with a direct curl and returns valid audio.
  • 400 with "Input audio resource does not exist" — an AIR was passed that doesn't resolve. Fix: pass a plain URL instead, or confirm the AIR is correct.
  • output.language is wrong — auto-detection failed on a short or noisy clip. Fix: set language explicitly.
  • Proper nouns / jargon misspelled — the model hasn't seen the term often. Fix: add a context string describing the subject matter and vocabulary.
  • Empty or partial transcript — the audio contains long silence, music, or very low-level speech. Fix: trim silence / pre-normalize the audio; confirm speech is actually audible at a reasonable volume.
  • Request timed out (wait expired) — the audio was too long to finish in the synchronous window. Fix: resubmit with wait=0 and poll, or register a webhook.
  • Step failed, reason = "blocked" — the audio hit content moderation. Fix: don't retry the same input; see Errors & retries → Step-level failures.
