Series · Aliyun Bailian · Chapter 5

Aliyun Bailian (5): Qwen-TTS for Multilingual Voice

Qwen-TTS-Flash for production: native-only API, the 40+ voice catalog (including Cantonese and Sichuanese), streaming synthesis, and the SSML quirks that the docs gloss over.

The reason every Chinese-language product I’ve worked on ends up calling Qwen-TTS-Flash isn’t price — there are cheaper TTS APIs. It’s that Qwen-TTS is the only one that handles mainland Chinese dialects (Cantonese, Sichuanese, Wu) and English in the same SDK, with voices that don’t sound like a 2019 customs announcement. After about six months of using it for a marketing-video voice-over pipeline, this is what I wish someone had told me on day one.

Voice catalog

Per the model card, Qwen-TTS-Flash exposes 40+ voices. The ones I use most:

Qwen-TTS voice catalogue

For Mandarin product narration my default is Cherry (warm, 30-something female) for marketing content and Ethan (steady, 40-something male) for tutorial / explainer content. For Cantonese ad spots Sunny is the safe choice. The voice names are stable but new voices are added regularly — fetch the canonical list from the model card before you pin one in production code.
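
In code I pin these choices in one place, so swapping a voice is a one-line change. A minimal sketch; the use-case keys are my own labels, only the voice names themselves ("Cherry", "Ethan", "Sunny") are API values:

```python
# Pinned voice defaults. The dict keys are internal labels (my naming);
# only the values are real voice names passed to the API.
VOICE_DEFAULTS = {
    "mandarin_marketing": "Cherry",  # warm female, marketing narration
    "mandarin_tutorial": "Ethan",    # steady male, tutorial / explainer
    "cantonese_ads": "Sunny",        # safe default for Cantonese spots
}

def voice_for(use_case: str) -> str:
    """Look up the pinned voice, falling back to Cherry."""
    return VOICE_DEFAULTS.get(use_case, "Cherry")
```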

Native API only

Qwen-TTS does not go through the OpenAI compat layer. You call it via the DashScope native SDK:

Qwen-TTS native call structure

A minimal request:

import os, dashscope, requests
from dashscope.audio.qwen_tts import SpeechSynthesizer

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]

resp = SpeechSynthesizer.call(
    model="qwen3-tts-flash",
    text="欢迎来到杭州,全城最安静的咖啡馆就在西湖边上。",
    voice="Cherry",
    format="mp3",
)

audio_url = resp.output.audio["url"]
with open("/tmp/out.mp3", "wb") as f:
    f.write(requests.get(audio_url, timeout=30).content)

Two things to underline:

  • The output is a URL by default, not bytes. As with Wanxiang, the URL expires, so download it within 24 hours (I do it immediately and re-upload to my own OSS bucket).
  • format defaults to mp3. WAV is also available; for downstream concatenation work I prefer WAV because the chunks are raw PCM and concatenate cleanly, whereas MP3 frames need re-muxing.

Streaming TTS — when latency matters

For voice-bot use cases (real-time conversational UIs) you want streaming. The deltas are audio bytes you can write straight to a player or a file:

Streaming TTS

from dashscope.audio.qwen_tts import SpeechSynthesizer

with open("/tmp/streamed.mp3", "wb") as f:
    for resp in SpeechSynthesizer.call(
        model="qwen3-tts-flash",
        text="这段语音是从模型那边一段一段流式返回的。",
        voice="Cherry",
        stream=True,
    ):
        if resp.output and resp.output.audio:
            f.write(resp.output.audio["data"])

Time-to-first-byte on streaming is typically under 400 ms in the Shanghai region, fast enough that a user perceives it as immediate. Non-streaming for a 30-second utterance is closer to 4-6 seconds of wall clock: fine for batch narration, sluggish for chat.

Multi-language and dialect specifics

Qwen-TTS detects the language from the text, but if you mix scripts you should set language explicitly. My production rules:

  • Pure Mandarin text → language="zh" (default).
  • Pure English text → language="en". Voices like Eric shine here.
  • Cantonese text in Traditional Chinese → language="zh-yue", voice Sunny or Lily.
  • Mixed CJK + English (the common case for tech narration) → leave language unset, the model handles code-switch surprisingly well.

Tip: For dialect work, always A/B against native speakers before launch. Qwen-TTS Cantonese is good but not perfect on tones — a one-syllable tone error in Cantonese can change the meaning entirely.

SSML — what works, what doesn’t

The docs list SSML support but are quiet about which tags actually behave. From experience:

  • <break time="500ms"/> — works. Use for pauses between sentences in marketing copy.
  • <emphasis level="strong"> — works.
  • <prosody rate="slow"> — works. slow, medium, fast, or numeric percentage.
  • <prosody pitch="..."> — works for relative changes (e.g. +10%).
  • <say-as interpret-as="digits"> — works for phone numbers, codes, dates.
  • <phoneme> — partial. Tone-marked pinyin is more reliable than IPA on Chinese.
  • <voice> — does NOT work. You cannot switch voices mid-utterance; use separate calls and concatenate the clips.
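
Putting the tags that do work together, marketing copy ends up looking like this. The helper is plain string assembly (my own, not an SDK utility); whether the service also wants a <speak> root wrapper is not something the docs pin down, so test that before shipping:

```python
def ssml_sentence(text: str, rate: str = "medium", pause_ms: int = 500) -> str:
    """Wrap one sentence in the tags that behave: a prosody rate plus a
    trailing break. No <voice> tag -- it is ignored."""
    return (f'<prosody rate="{rate}">{text}</prosody>'
            f'<break time="{pause_ms}ms"/>')

script = "".join([
    ssml_sentence("全新产品,今日上线。", rate="slow"),
    ssml_sentence("限时八折,先到先得。"),
])
# `script` then goes in as the text= argument to SpeechSynthesizer.call
```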

Concatenating clips for narration

Marketing scripts are long. The pattern for a 60-second voiceover:

def synthesize_long(script: str, voice: str = "Cherry") -> str:
    sentences = split_sentences(script)  # your splitter; basic regex is fine
    parts = []
    for s in sentences:
        resp = SpeechSynthesizer.call(model="qwen3-tts-flash", text=s,
                                      voice=voice, format="wav")
        parts.append(download(resp.output.audio["url"]))
    # ffmpeg concat is the simplest reliable concat for WAV
    return concat_wavs(parts, output="/tmp/full.wav")

Why split? Two reasons: (1) per-call latency is much lower for short utterances, so you can synthesize the parts in parallel (the loop above is sequential for clarity; swap it for a thread pool); (2) you can patch a single bad sentence by re-rolling just that one without redoing the whole take. We use this for ad spots: the <break> tags handle the inter-sentence pauses, and parallel synthesis gets a 60-second clip ready in ~4 seconds.

Cost

Billing is per second of output audio, and streaming and non-streaming bill identically. A 60-second ad spot lands in the few-RMB range: much cheaper than a voice actor's hourly rate, and fast enough for the marketing team to iterate dozens of variations in an afternoon.
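
The back-of-envelope math is one multiplication. The rate below is a placeholder, not the actual price; check the current pricing page and substitute:

```python
def estimate_cost_rmb(audio_seconds: float, rate_per_second: float) -> float:
    """Per-second-of-output-audio billing; streaming and non-streaming
    bill identically, so the mode doesn't enter the formula."""
    return audio_seconds * rate_per_second

# HYPOTHETICAL rate for illustration only -- look up the real number.
spot_cost = estimate_cost_rmb(60, rate_per_second=0.05)
```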

Closing the series

That’s the five. To recap:

  • Article 1 — Bailian / DashScope orientation.
  • Article 2 — the Qwen LLM family and the fiddly bits (function calling, JSON mode, enable_thinking).
  • Article 3 — Qwen-Omni for multimodal understanding.
  • Article 4 — Wanxiang for video generation.
  • Article 5 — Qwen-TTS for voice (this article).

The companion Aliyun PAI series covers DSW / DLC / EAS — the self-managed GPU layer where you train and serve your own models. Most teams I work with end up using both: Bailian when they want someone else’s pre-trained model, PAI when they need control of the weights. Pick by what you actually need to control, not by what looks more impressive on a resume.

Liked this piece?

Follow on GitHub for the next one — usually one a week.
