Text-to-speech
The textToSpeech step type synthesises audio from text. Two modes on the same step:
- Built-in speakers — nine curated voices (`aiden`, `dylan`, `eric`, `ono_anna`, `ryan`, `serena`, `sohee`, `uncle_fu`, `vivian`). Pass `speaker: "<name>"` and go.
- Voice cloning — pass a `refAudioUrl` (and optionally the reference's transcript) and the output speaks in that voice.
Output is an Ogg Vorbis audio blob.
Prerequisites
- A Civitai orchestration token (Quick start → Prerequisites)
- Text to synthesise (English or Chinese work best; language auto-detected by default)
- For voice cloning only: a short reference audio clip URL (≤ ~10 s clean speech) and, ideally, its transcript
The simplest request
Built-in speaker, auto-detected language, wait=0 because TTS typically runs longer than the synchronous 100-second window:
POST https://orchestration.civitai.com/v2/consumer/recipes/textToSpeech?wait=0
Authorization: Bearer <your-token>
Content-Type: application/json
{
"text": "Welcome to Civitai, the home of open-source generative AI.",
"engine": "custom",
"speaker": "vivian",
"xVectorOnlyMode": false,
"language": "English"
}
The response (after polling to succeeded) is a full Workflow whose step carries an audioBlob with a signed streaming URL.
Use wait=0 for TTS
End-to-end processing for a single sentence is ~60–120 s including model load and queue wait. Short clips can sneak in under wait=30 on a warm node, but relying on it is brittle — wait=0 + polling is the safe default.
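The submit-then-poll loop can be sketched with only the standard library. This is a rough sketch, not a definitive client: it assumes the created workflow can be fetched back at `GET /v2/consumer/workflows/{id}` and reports a top-level `status`, and the terminal-status set is an assumption.

```python
import json
import time
import urllib.request

BASE = "https://orchestration.civitai.com/v2/consumer"
TERMINAL = {"succeeded", "failed", "canceled", "expired"}  # assumed terminal statuses

def submit_tts(token: str, body: dict) -> str:
    """POST the recipe with wait=0 and return the workflow id from the response."""
    req = urllib.request.Request(
        f"{BASE}/recipes/textToSpeech?wait=0",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def poll(token: str, workflow_id: str, interval: float = 5.0, timeout: float = 300.0) -> dict:
    """GET the workflow every few seconds until it reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            f"{BASE}/workflows/{workflow_id}",
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            workflow = json.load(resp)
        if workflow["status"] in TERMINAL:
            return workflow
        time.sleep(interval)
    raise TimeoutError(f"workflow {workflow_id} not finished after {timeout}s")
```

A webhook registered on submission avoids the polling loop entirely for production traffic.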
Via the generic workflow endpoint
Use this path for webhooks, tags, or chaining with other steps:
POST https://orchestration.civitai.com/v2/consumer/workflows?wait=0
Authorization: Bearer <your-token>
Content-Type: application/json
{
"steps": [{
"$type": "textToSpeech",
"input": {
"text": "Welcome to Civitai, the home of open-source generative AI.",
"engine": "custom",
"speaker": "dylan",
"xVectorOnlyMode": false,
"language": "English"
}
}]
}
Input fields
See the TextToSpeechInput schema for the complete definition. The engine property is a discriminator — each engine has its own set of valid fields.
Shared fields
| Field | Required | Default | Notes |
|---|---|---|---|
| `text` | ✅ | — | The text to synthesise. No hard length cap, but generation time scales roughly linearly with length. |
| `engine` | ✅ | custom | Which TTS backend to route to. custom covers both built-in speakers and voice cloning on one schema. |
| `language` | | Auto | Full English language name: "English", "Chinese", or "Auto". ISO codes like "en" may not be recognised by the model — use the full English name. |
custom engine
The custom engine flattens both modes into one request. Whether you're using a built-in speaker or cloning depends on which fields you provide:
| Field | Required | Used in mode | Notes |
|---|---|---|---|
| `speaker` | ✅ for built-in | CustomVoice | One of `aiden`, `dylan`, `eric`, `ono_anna`, `ryan`, `serena`, `sohee`, `uncle_fu`, `vivian`. |
| `instruct` | | CustomVoice | Optional free-text style/tone instruction (e.g. "Speak in a cheerful and enthusiastic tone."). |
| `refAudioUrl` | ✅ for cloning | Base | URL of a reference audio clip (HTTP(S) URL or AIR URN). |
| `refText` | ✅ for cloning (unless `xVectorOnlyMode: true`) | Base | Accurate transcript of the reference audio — helps the model align voice features to phonemes. |
| `xVectorOnlyMode` | ✅ | both | `false` for most cases. `true` skips `refText` and uses only the speaker embedding from the reference — faster to wire up, slightly lower quality. |
| `maxNewTokens` | | both | Generation cap, optional. Leave unset unless you're seeing runaway output. |
Built-in speakers with a style prompt
instruct lets you nudge tone/pacing without switching voices. Example: the dylan voice as a cheerful broadcaster:
{
"text": "Breaking news — we are live from the Civitai studios.",
"engine": "custom",
"speaker": "dylan",
"instruct": "Speak in a cheerful and enthusiastic broadcaster tone.",
"xVectorOnlyMode": false,
"language": "English"
}
Short, specific directions work better than long prose ("slow and serious" beats a paragraph). The model doesn't always follow the instruction — treat it as a bias, not a guarantee.
Voice cloning (Base mode)
Pass refAudioUrl + refText to clone the voice from a short reference clip:
{
"text": "This sentence is synthesized by cloning the voice from the reference audio.",
"engine": "custom",
"refAudioUrl": "https://.../reference.wav",
"refText": "She had your dark suit in greasy wash water all year.",
"xVectorOnlyMode": false,
"language": "English"
}
Guidance for the reference:
- Length: 5–15 seconds of clean speech works best. Longer clips don't help and may time out.
- Quality: single speaker, minimal background noise, no music. Podcast intros or call-recording noise floor drop quality noticeably.
- `refText` accuracy: transcribe the reference exactly — including punctuation and capitalisation — or skip it via `xVectorOnlyMode: true`.
- Reach: the `refAudioUrl` must be fetchable by the orchestrator the same way transcription's `mediaUrl` is. CDN-served files are safe; sites that inject per-request session state break the streaming fetch.
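A quick preflight from your own machine catches most reachability problems before you submit. A rough sketch: a plain, cookie-less GET that approximates, but does not guarantee, what the orchestrator's fetch will see.

```python
import urllib.request

def reference_is_fetchable(url: str, timeout: float = 10.0) -> bool:
    """Rough preflight: can the clip be fetched with a plain, cookie-less GET?"""
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "preflight-check"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # urlopen raises on HTTP errors (4xx/5xx), so reaching here means success
            return len(resp.read(1024)) > 0
    except Exception:
        return False
```

A `True` here doesn't prove the orchestrator can fetch it (geo-blocks and IP allowlists differ), but a `False` almost certainly means it can't.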
xVectorOnlyMode: true — skip the reference transcript
If you don't have (or don't want to supply) a transcript, set xVectorOnlyMode: true. The model uses only the speaker embedding from the reference clip, no alignment:
{
"text": "Using only the speaker embedding from the reference — no transcript required.",
"engine": "custom",
"refAudioUrl": "https://.../reference.wav",
"xVectorOnlyMode": true,
"language": "English"
}
Trade-off: one less input to get right; slightly less faithful cloning on sentences whose phonetic content differs from the reference. Start with xVectorOnlyMode: false when quality matters.
Chaining: transcribe then re-speak
A common pipeline — transcribe an existing clip, then synthesise the same text in a different voice:
{
"steps": [
{
"$type": "transcription",
"name": "quote",
"input": {
"mediaUrl": "https://cdn.jsdelivr.net/gh/openai/whisper@main/tests/jfk.flac"
}
},
{
"$type": "textToSpeech",
"name": "reread",
"input": {
"text": { "$ref": "quote", "path": "output.text" },
"engine": "custom",
"speaker": "dylan",
"instruct": "Speak in a confident presidential tone.",
"xVectorOnlyMode": false,
"language": "English"
}
}
]
}
The `{ "$ref": "quote", "path": "output.text" }` reference feeds the transcribed string into `reread`'s `text` field at runtime. See Workflows → Dependencies.
Reading the result
{
"status": "succeeded",
"steps": [{
"name": "0",
"$type": "textToSpeech",
"status": "succeeded",
"output": {
"audioBlob": {
"id": "XSSD3Y6B6BSPFBC3QHV0WD8QJ0.ogg",
"url": "https://orchestration-new.civitai.com/v2/consumer/streaming-blobs/XSSD3Y6B6BSPFBC3QHV0WD8QJ0.ogg?sig=...",
"urlExpiresAt": "2027-04-13T23:44:20Z",
"type": "audio",
"duration": null
},
"speaker": "vivian",
"modelType": "custom_voice"
}
}]
}
Fields:
- `audioBlob.url` — signed streaming URL for the generated audio (Ogg Vorbis, `.ogg`). Stream it directly in an `<audio src>` tag or download the bytes.
- `audioBlob.id` — blob identifier, also usable via GetBlob if you need a fresh URL later.
- `audioBlob.duration` — output length in seconds when available (may be `null` until the blob is fully materialised).
- `speaker` — the speaker name used (only populated for CustomVoice / built-in modes).
- `modelType` — `"custom_voice"` for built-in speakers, `"base"` for reference-audio cloning; `null` if the backend didn't classify.
Blob URLs are signed and expire — store the audio locally or call GetBlob when the URL expires.
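Persisting the bytes is a few lines with the standard library. A minimal sketch: the workflow shape follows the response above, and the output path is arbitrary.

```python
import urllib.request

def save_audio(workflow: dict, path: str = "tts_output.ogg") -> str:
    """Download the generated audio from the signed URL and write it to disk."""
    blob = workflow["steps"][0]["output"]["audioBlob"]
    with urllib.request.urlopen(blob["url"]) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```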
Cost
Billed in Buzz on the workflow's transactions. Use whatif=true for an exact preview; see Payments (Buzz) for currency selection.
Character-based with a minimum floor, multiplied by 2.6 when a built-in speaker is used:
textLength = max(1, ceil(len(text) / 100))
total = textLength × (speaker != null ? 2.6 : 1)
| Shape | Buzz |
|---|---|
| Base mode (voice cloning), 60 characters | 1 |
| Base mode, 500 characters | ~5 |
| Base mode, 2 000 characters | ~20 |
| CustomVoice (built-in speaker), 60 characters | ~2.6 |
| CustomVoice, 500 characters | ~13 |
| CustomVoice, 2 000 characters | ~52 |
Voice cloning via refAudioUrl is cheaper per character than picking a built-in speaker — the 2.6× multiplier only applies when speaker is set. instruct, language, and maxNewTokens don't affect cost.
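The formula translates directly to Python. This is an illustrative calculator only; use `whatif=true` for the authoritative figure.

```python
import math

def tts_cost_buzz(text: str, uses_builtin_speaker: bool) -> float:
    """Buzz estimate: ceil(len/100) with a floor of 1, times 2.6 for built-in speakers."""
    units = max(1, math.ceil(len(text) / 100))
    return units * 2.6 if uses_builtin_speaker else float(units)
```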
Runtime
End-to-end time for one short sentence is typically 60–120 seconds including model load, queue wait, and inference. Longer text scales roughly linearly. Always submit with wait=0 and poll or subscribe to webhooks; wait=30 synchronous calls will usually time out.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| 400 validation error on `speaker` | Value not in the nine built-in names | Use one of the allowed speakers; check the CustomTextToSpeechInput schema. |
| 400 validation error on `language` | Passed an ISO code like "en" | Use the full English name: "English", "Chinese", or "Auto". |
| 400 validation error on `xVectorOnlyMode` | Missing — the field is required on the custom engine | Always include it; set to `false` unless you explicitly want x-vector-only cloning. |
| Voice cloning output sounds robotic | Reference clip is too noisy, too short, or contains multiple speakers | Supply a cleaner 5–15 s single-speaker reference. |
| Voice cloning ignores the reference entirely | refAudioUrl couldn't be fetched by the worker (cookie-gated host, 403, redirect loop) | Host the reference on a CDN / S3 direct / GitHub raw URL. |
| Prosody doesn't match `instruct` | The directive is too long or contradictory with the speaker's natural register | Keep `instruct` short and specific; try a different built-in speaker. |
| Request timed out (wait expired) | Synthesis too slow to finish in the synchronous window | Resubmit with `wait=0` and poll, or register a webhook. |
| Step failed, reason = "blocked" | Text or reference audio hit content moderation | Don't retry the same input — see Errors & retries → Step-level failures. |
Related
- InvokeTextToSpeechStepTemplate — the per-recipe endpoint
- Endpoint OpenAPI spec — standalone OpenAPI 3.1 YAML for this endpoint, ready to import into Postman / Insomnia / OpenAPI Generator
- SubmitWorkflow — generic path for chaining
- Transcription — the inverse: audio → text
- ACE-Step music generation — lyrics + style → full song audio (different recipe, sibling capability)
- Results & webhooks — handling long-running workflows
- Workflows → Dependencies — feeding `output.text` into a TTS step via `$ref`