Multi-speaker dialogue

The audioMix step overlays multiple audio clips on a single timeline, each placed at its own start offset with optional per-track volume and fades. Pair it with N textToSpeech steps to produce multi-speaker dialogue, debate, or audio drama — including overlap, interruption, and cross-talk that single-utterance TTS can't model on its own.

Why not a multi-speaker TTS engine?

Qwen3 TTS synthesises one continuous utterance per request with no silence-injection or speaker switching. Asking the model to "say A, then pause, then say B" produces unpredictable prosody. Generating each line as its own short, clean TTS step and overlaying them with audioMix keeps every utterance natural while letting you place them anywhere on the output timeline — including overlapping intervals, which is the only way to get genuine cross-talk.

How it composes

Every dialogue workflow has the same shape:

  1. One textToSpeech step per spoken line — each step picks its own speaker (built-in voice, voice clone, or voice design) and produces a short clean clip.
  2. One trailing audioMix step referencing each TTS output via $ref. By default, tracks play back-to-back in array order, so no timing math is required.

json
{
  "steps": [
    { "$type": "textToSpeech", "name": "alice", "input": { /* ... */ } },
    { "$type": "textToSpeech", "name": "bob",   "input": { /* ... */ } },
    {
      "$type": "audioMix",
      "input": {
        "tracks": [
          { "url": { "$ref": "alice", "path": "output.audioBlob.url" } },
          { "url": { "$ref": "bob",   "path": "output.audioBlob.url" } }
        ]
      }
    }
  ]
}

Timeline rules

Each track resolves a start time on the output timeline using these rules:

  • Implicit (default): track i starts when track i-1 ends (in array order). No fields needed.
  • offset: float in seconds, nudges the implicit position. Negative = overlap/interrupt; positive = gap. offset: -0.5 means "start 500 ms before the previous track ends".
  • startSeconds: absolute timeline anchor. When set, the track plays at exactly this time and is excluded from the implicit chain — perfect for music beds. Other tracks in the array compute their implicit position as if anchored tracks weren't there.

If both startSeconds and offset are set on the same track, startSeconds wins.

The output is a single Ogg Vorbis blob plus a tracks[] array echoing each input's resolved startSeconds and probed duration — convenient for rendering subtitles or speaker highlights without re-probing.
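
The rules above can be mirrored client-side, for instance to predict cue times before submitting. The following is a minimal sketch, not the server's implementation. `resolve_starts` and `total_duration` are hypothetical helper names, each track is assumed to carry a `duration` you have already measured, and the implicit cursor is assumed to advance to the end of the most recent non-anchored track after its offset is applied. Negative start times are not clamped, since this page does not say how the server treats them.

```python
from typing import Sequence


def resolve_starts(tracks: Sequence[dict]) -> list[float]:
    """Return one start time (seconds) per track, following the timeline rules.

    Each track dict needs 'duration' and may carry 'offset' and/or 'startSeconds'.
    """
    starts: list[float] = []
    cursor = 0.0  # end of the previous non-anchored track (assumption, see lead-in)
    for track in tracks:
        anchor = track.get("startSeconds")
        if anchor is not None:
            # Anchored: startSeconds wins over offset and the chain is left untouched.
            starts.append(float(anchor))
            continue
        start = cursor + float(track.get("offset", 0.0))
        starts.append(start)
        cursor = start + float(track["duration"])
    return starts


def total_duration(tracks: Sequence[dict], starts: Sequence[float]) -> float:
    """Output length: the latest end time across all tracks."""
    return max(s + float(t["duration"]) for s, t in zip(starts, tracks))


tracks = [
    {"duration": 2.0},                        # implicit, starts at 0
    {"duration": 3.0, "offset": -0.5},        # interrupts 0.5 s early
    {"duration": 10.0, "startSeconds": 0.0},  # anchored bed, outside the chain
    {"duration": 1.5, "offset": 0.25},        # 0.25 s gap after the second track
]
starts = resolve_starts(tracks)
print(starts, total_duration(tracks, starts))
```

Note that the anchored third track does not shift the fourth one: the fourth track is placed relative to the second, which is the previous non-anchored track.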

Sequential reading

Three speakers, each clip placed after the previous one ends. No overlap; the gaps between clips are silent.
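
A sequential request body can be assembled programmatically. The sketch below is illustrative only: `dialogue_workflow` is a hypothetical helper, the step names `opener`, `pro`, and `con` are borrowed from the result example on this page, and the textToSpeech inputs are left as empty placeholders because their fields are documented in the text-to-speech recipe, not here.

```python
import json


def dialogue_workflow(lines: dict[str, dict]) -> dict:
    """Build one textToSpeech step per line plus a trailing audioMix step.

    `lines` maps a unique step name to that step's textToSpeech input.
    Tracks carry no timing fields, so they play back-to-back in dict order.
    """
    tts_steps = [
        {"$type": "textToSpeech", "name": name, "input": tts_input}
        for name, tts_input in lines.items()
    ]
    tracks = [
        {"url": {"$ref": name, "path": "output.audioBlob.url"}}
        for name in lines
    ]
    mix_step = {"$type": "audioMix", "input": {"tracks": tracks}}
    return {"steps": tts_steps + [mix_step]}


# Placeholder inputs: fill each dict in per the text-to-speech recipe.
workflow = dialogue_workflow({"opener": {}, "pro": {}, "con": {}})
print(json.dumps(workflow, indent=2))
```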

All examples on this page are submitted to POST /v2/consumer/workflows.

Crosstalk and interruption

A speaker starts before the previous one finishes. ffmpeg's amix sums the overlapping samples, so both voices are audible simultaneously. A small fadeInMs on the interrupter softens the entry.

For a "hot debate" effect, set offset: -0.3 to -1.0 on each interrupter — negative offsets pull the track earlier on the timeline. Use mild attenuation (volumeDb: -1 to -3) on whichever speaker should sit slightly back in the mix.
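
As a rough illustration of the overlap settings described above, the sketch below builds the tracks array for an audioMix input. `ref_track` and `interrupting` are hypothetical helpers, `pro_rebuttal` is a made-up step name, and the numeric values are simply picked from the ranges this page suggests.

```python
def ref_track(step_name: str, **fields) -> dict:
    """Track that points at a prior step's audio via $ref; extra fields pass through."""
    return {"url": {"$ref": step_name, "path": "output.audioBlob.url"}, **fields}


def interrupting(step_name: str, overlap_seconds: float,
                 fade_in_ms: int = 80, volume_db: float = 0.0) -> dict:
    """Track that starts `overlap_seconds` before the previous track ends."""
    if overlap_seconds <= 0:
        raise ValueError("overlap_seconds must be positive")
    # A negative offset pulls the track earlier on the timeline.
    return ref_track(step_name, offset=-overlap_seconds,
                     fadeInMs=fade_in_ms, volumeDb=volume_db)


audio_mix_input = {
    "tracks": [
        ref_track("pro"),
        interrupting("con", overlap_seconds=0.5),
        # Sits slightly back in the mix with mild attenuation.
        interrupting("pro_rebuttal", overlap_seconds=0.3, volume_db=-2.0),
    ]
}
print(audio_mix_input)
```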

Adding a music or ambience bed

The url field also accepts a direct URL string — no $ref needed — so you can drop in static background music or ambience under a voice track.
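
A minimal sketch of such a track list follows. `with_bed` is a hypothetical helper and the `example.com` URL is a placeholder, not a real asset. Anchoring the bed with `startSeconds: 0` is an assumption made here so that the bed stays out of the implicit chain.

```python
def with_bed(voice_tracks: list[dict], bed_url: str, volume_db: float = -18.0) -> list[dict]:
    """Prepend a background bed, anchored at t=0, to a list of voice tracks."""
    bed = {
        "url": bed_url,        # direct URL string, no $ref
        "startSeconds": 0.0,   # anchored: excluded from implicit sequencing
        "volumeDb": volume_db,
        "fadeInMs": 500,
        "fadeOutMs": 1500,
    }
    return [bed, *voice_tracks]


voice = [{"url": {"$ref": "narrator", "path": "output.audioBlob.url"}}]  # hypothetical step name
tracks = with_bed(voice, "https://example.com/ambience.ogg")
print(tracks)
```

Because the output length is the latest end time across all tracks, a bed file that is longer than the speech will lengthen the mixed output accordingly.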

The bed sits at volumeDb: -18 (well under speech), fades in over 500 ms, and fades out over 1.5 s. Keep beds at -15 dB or lower against speech.

Input fields

audioMix step

| Field | Required | Default | Notes |
|---|---|---|---|
| tracks | yes | | Array of tracks to overlay. At least one. |
| normalize | no | false | When true, ffmpeg's amix divides by N to avoid clipping. Keep false when you've set per-track volumeDb and want the levels you specified. |
| maxDurationSeconds | no | 600 | Server-side cap on output length. The job fails early if the union of track intervals exceeds this. |

Per-track fields

| Field | Required | Default | Notes |
|---|---|---|---|
| url | yes | | Either a direct "https://..." URL string, or a { "$ref": "<step-name>", "path": "output.audioBlob.url" } referencing a prior step's output. |
| startSeconds | no | implicit | Absolute timeline anchor. Set this to pin a track to a fixed time (music bed, ambience). When set, the track is taken out of the implicit-sequencing chain. When unset, the track plays after the previous non-anchored track ends. |
| offset | no | 0 | Seconds to nudge this track from its implicit position. Negative = overlap/interrupt; positive = gap. Ignored when startSeconds is set. |
| volumeDb | no | 0 | Per-track gain in dB. -6 roughly halves amplitude (-3 halves power; about -10 is needed to halve perceived loudness); -18 is a typical music-bed level. |
| fadeInMs | no | 0 | Linear fade-in length, in milliseconds, applied at the track's resolved start. |
| fadeOutMs | no | 0 | Linear fade-out length, in milliseconds, applied at the track's tail. The fade begins at resolved start + duration − fadeOutMs/1000 seconds. |
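
To make the volumeDb values concrete, the sketch below converts decibels to linear factors using the standard formulas (amplitude = 10^(dB/20), power = 10^(dB/10)). The helper names are hypothetical, and `worst_case_peak` is only a pessimistic upper bound that assumes every source clip peaks at full scale and that all peaks line up.

```python
def db_to_amplitude(db: float) -> float:
    """Linear amplitude factor for a gain in dB."""
    return 10 ** (db / 20)


def db_to_power(db: float) -> float:
    """Linear power factor for a gain in dB."""
    return 10 ** (db / 10)


def worst_case_peak(volume_dbs: list[float]) -> float:
    """Upper bound on the summed peak of overlapping tracks (1.0 = full scale)."""
    return sum(db_to_amplitude(db) for db in volume_dbs)


print(round(db_to_amplitude(-6.0), 3))   # about half amplitude
print(round(db_to_power(-3.0), 3))       # about half power
print(round(db_to_amplitude(-18.0), 3))  # typical music-bed level
print(worst_case_peak([0.0, 0.0]))       # two full-level voices can exceed full scale
```

A result above 1.0 from `worst_case_peak` suggests the overlap could clip, in which case attenuating a track or setting normalize to true is worth considering.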

Reading the result

json
{
  "status": "succeeded",
  "steps": [
    { "name": "opener", "$type": "textToSpeech", "output": { "audioBlob": { /* ... */ } } },
    { "name": "pro",    "$type": "textToSpeech", "output": { "audioBlob": { /* ... */ } } },
    { "name": "con",    "$type": "textToSpeech", "output": { "audioBlob": { /* ... */ } } },
    {
      "name": "3",
      "$type": "audioMix",
      "status": "succeeded",
      "output": {
        "audioBlob": {
          "id": "ZXNS7C...ogg",
          "url": "https://orchestration-new.civitai.com/v2/consumer/blobs/ZXNS7C...ogg?sig=...",
          "duration": 17.8
        },
        "tracks": [
          { "startSeconds":  0.0, "duration": 5.7 },
          { "startSeconds":  5.7, "duration": 5.9 },
          { "startSeconds": 11.6, "duration": 6.2 }
        ]
      }
    }
  ]
}

  • audioBlob.url: signed URL for the mixed Ogg Vorbis output. Stream it directly in an <audio src> tag.
  • audioBlob.duration: total output length in seconds (max of resolved startSeconds + duration across tracks).
  • tracks[]: per-input timing in the order the tracks were submitted, useful for rendering subtitle overlays or speaker-highlight UI.
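
One way to use tracks[] for subtitles is to turn each entry into a WebVTT cue. The sketch below is an assumption-laden example, not part of the API: `ts` and `to_webvtt` are hypothetical helpers, and the caption strings are placeholders that you would replace with the text sent to each textToSpeech step, in the same order as the tracks.

```python
def ts(seconds: float) -> str:
    """Format non-negative seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(tracks: list[dict], captions: list[str]) -> str:
    """Build a WebVTT document with one cue per track, in submission order."""
    cues = []
    for track, text in zip(tracks, captions):
        start = track["startSeconds"]
        end = start + track["duration"]
        cues.append(f"{ts(start)} --> {ts(end)}\n{text}")
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


result_tracks = [
    {"startSeconds": 0.0, "duration": 5.7},
    {"startSeconds": 5.7, "duration": 5.9},
    {"startSeconds": 11.6, "duration": 6.2},
]
print(to_webvtt(result_tracks, ["Opener line", "Pro line", "Con line"]))
```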

Runtime

The audioMix step itself is cheap: typically a second or two of ffmpeg on a CPU worker. Wall-clock time for the whole workflow is dominated by the TTS steps, which run in parallel on the Qwen workers and each take 60–120 s for a short line. Submit with wait=0 and poll, the same as for plain textToSpeech.
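
A polling loop for the submit-and-poll pattern might look like the sketch below. How a workflow is fetched (endpoint, authentication) is not covered on this page, so that part is left to a caller-supplied `fetch` function. `poll_workflow` is a hypothetical helper, and treating "succeeded" and "failed" as the only terminal statuses is an assumption based on the values that appear on this page.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed"}  # assumption: other terminal states may exist


def poll_workflow(fetch: Callable[[], dict],
                  interval_s: float = 5.0,
                  timeout_s: float = 600.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch` until the workflow reports a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        workflow = fetch()
        if workflow.get("status") in TERMINAL_STATUSES:
            return workflow
        if time.monotonic() >= deadline:
            raise TimeoutError("workflow did not finish before the timeout")
        sleep(interval_s)
```

Given TTS steps that take 60–120 s each, an interval of a few seconds keeps request volume low without adding noticeable latency. The `sleep` parameter exists only so the loop can be exercised without real waiting.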

Cost

audioMix itself is free (Factors: [], no fixed cost). The expensive work, generating each utterance, is already priced on the underlying textToSpeech steps under their existing per-character formula. See the text-to-speech recipe for TTS pricing.

The server enforces maxDurationSeconds (default 600) as a guardrail against runaway mixes; raise it on the input if you genuinely need longer output.

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| 400 validation error, "AudioMix track has no resolved url" | A $ref failed to resolve: the referenced step name doesn't match, or its output's path doesn't exist | Confirm each prior step has a unique name and that the $ref.path is "output.audioBlob.url" (case-sensitive). |
| Output clips/distorts when speakers overlap | Two or more tracks at volumeDb: 0 summing to over full-scale | Either set "normalize": true, or attenuate the louder track with volumeDb: -3 to -6. |
| Music bed too loud against speech | Bed volumeDb too high | Drop the bed to -18 to -24 dB; voice tracks at 0 dB then sit cleanly on top. |
| Cross-talk sounds abrupt | Interrupter starts with no fade | Add fadeInMs: 60–120 on the interrupting track. |
| failed, "AudioMix output would be Xs, exceeding MaxDurationSeconds" | Resolved startSeconds + duration exceeds the cap on at least one track | Raise maxDurationSeconds on the input, or shorten the tracks. |
| Mix succeeded but tracks[] is empty | The middleware succeeded but the timing payload didn't propagate (rare) | The audioBlob.duration is still reliable; recompute per-track timing client-side from the inputs you sent. |
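
The first row of the table can be caught before submission with a small pre-flight check. The sketch below is a hypothetical client-side helper (`check_refs` is not part of the API) that looks for duplicate step names, $ref values that do not match an earlier step, and paths that differ from "output.audioBlob.url".

```python
EXPECTED_PATH = "output.audioBlob.url"  # case-sensitive


def check_refs(workflow: dict) -> list[str]:
    """Return human-readable problems found in audioMix $ref tracks (empty if none)."""
    problems: list[str] = []
    seen_names: set[str] = set()
    for step in workflow.get("steps", []):
        if step.get("$type") == "audioMix":
            for i, track in enumerate(step.get("input", {}).get("tracks", [])):
                url = track.get("url")
                if isinstance(url, str):
                    continue  # direct URL string, nothing to resolve
                if not isinstance(url, dict) or "$ref" not in url:
                    problems.append(f"track {i}: url is neither a string nor a $ref")
                    continue
                if url["$ref"] not in seen_names:
                    problems.append(f"track {i}: $ref {url['$ref']!r} matches no earlier step name")
                if url.get("path") != EXPECTED_PATH:
                    problems.append(f"track {i}: path {url.get('path')!r} is not {EXPECTED_PATH!r}")
        name = step.get("name")
        if name is not None:
            if name in seen_names:
                problems.append(f"duplicate step name {name!r}")
            seen_names.add(name)
    return problems
```

An empty list does not guarantee the server will accept the request; it only rules out the naming and path mistakes described above.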

Civitai Developer Documentation