Aliyun Bailian (5): Qwen-TTS for Multilingual Voice
Qwen-TTS-Flash for production: native-only API, the 40+ voice catalog (including Cantonese and Sichuanese), streaming synthesis, and the SSML quirks that the docs gloss over.
The reason every Chinese-language product I’ve worked on ends up calling Qwen-TTS-Flash isn’t price — there are cheaper TTS APIs. It’s that Qwen-TTS is the only one that handles mainland Chinese dialects (Cantonese, Sichuanese, Wu) and English in the same SDK, with voices that don’t sound like a 2019 customs announcement. After about six months of using it for a marketing-video voice-over pipeline, this is what I wish someone had told me on day one.
Voice catalog
Per the model card, Qwen-TTS-Flash exposes 40+ voices. The ones I use most:

For Mandarin product narration my default is Cherry (warm, 30-something female) for marketing content and Ethan (steady, 40-something male) for tutorial / explainer content. For Cantonese ad spots Sunny is the safe choice. The voice names are stable but new voices are added regularly — fetch the canonical list from the model card before you pin one in production code.
Native API only
Qwen-TTS does not go through the OpenAI compat layer. You call it via the DashScope native SDK:

A minimal request:
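A minimal sketch of the call, assuming the `dashscope` Python SDK's `qwen_tts` module and the `qwen-tts` model name — check the current model card for the exact parameter names in your SDK version:

```python
import os
import urllib.request


def synthesize(text: str, voice: str = "Cherry") -> bytes:
    """Call Qwen-TTS and return the downloaded audio bytes.

    Sketch only: assumes the dashscope SDK's qwen_tts module;
    verify field names against the current DashScope docs.
    """
    import dashscope  # lazy import: pip install dashscope

    response = dashscope.audio.qwen_tts.SpeechSynthesizer.call(
        model="qwen-tts",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text,
        voice=voice,
    )
    # The output is a URL, not bytes; download it immediately,
    # because the link expires after 24 hours.
    audio_url = response.output.audio["url"]
    with urllib.request.urlopen(audio_url, timeout=30) as resp:
        return resp.read()


def clip_filename(voice: str, fmt: str = "mp3") -> str:
    """Local filename for a downloaded clip (illustrative helper)."""
    return f"{voice.lower()}.{fmt}"
```

In my pipeline the bytes go straight to our own OSS bucket rather than local disk, but the shape of the call is the same.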
Two things to underline:
- The output is a URL by default, not bytes. Same as Wanxiang, download it within 24 hours (I do it immediately and re-upload to my own OSS bucket).
- `format` defaults to `mp3`. WAV is also available; for downstream concatenation work I prefer WAV because there’s no MP3 header overhead per chunk.
Streaming TTS — when latency matters
For voice-bot use cases (real-time conversational UIs) you want streaming. The deltas are audio bytes you can write straight to a player or a file:

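A sketch of the streaming loop, assuming `stream=True` yields partial responses whose `output.audio["data"]` field carries base64-encoded audio — the field names are from my notes, so verify them against the current DashScope docs:

```python
import base64
import os


def decode_delta(b64_audio: str) -> bytes:
    """Deltas arrive base64-encoded; decode to raw audio bytes."""
    return base64.b64decode(b64_audio)


def stream_to_file(text: str, path: str, voice: str = "Cherry") -> int:
    """Stream synthesis deltas straight to a file; returns bytes written.

    Sketch only -- assumes the dashscope SDK's streaming response shape.
    """
    import dashscope  # pip install dashscope

    responses = dashscope.audio.qwen_tts.SpeechSynthesizer.call(
        model="qwen-tts",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text,
        voice=voice,
        stream=True,
    )
    written = 0
    with open(path, "wb") as f:
        for chunk in responses:
            delta = decode_delta(chunk.output.audio["data"])
            f.write(delta)  # or push to an audio player's buffer
            written += len(delta)
    return written
```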
Time-to-first-byte on streaming is typically under 400ms in Shanghai region, which is fast enough that a user perceives it as immediate. Non-streaming for a 30-second utterance is closer to 4-6 seconds wall clock — fine for batch narration, sluggish for chat.
Multi-language and dialect specifics
Qwen-TTS does language detection from the text, but if you mix scripts you should set language explicitly. My production rules:
- Pure Mandarin text → `language="zh"` (default).
- Pure English text → `language="en"`. Voices like `Eric` shine here.
- Cantonese text in Traditional Chinese → `language="zh-yue"`, voice `Sunny` or `Lily`.
- Mixed CJK + English (the common case for tech narration) → leave `language` unset; the model handles code-switching surprisingly well.
Tip: For dialect work, always A/B against native speakers before launch. Qwen-TTS Cantonese is good but not perfect on tones — a one-syllable tone error in Cantonese can change the meaning entirely.
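These routing rules can be encoded as a small pre-flight helper. This is my own heuristic, not anything in the SDK — the Unicode-range check is deliberately crude and cannot tell Cantonese from Mandarin by script alone, so the caller flags Cantonese content explicitly:

```python
from typing import Optional


def pick_language(text: str, cantonese: bool = False) -> Optional[str]:
    """Map text to an explicit `language` value, or None for auto-detect.

    Hypothetical helper encoding the production rules above; crude
    Han-vs-Latin detection only.
    """
    if cantonese:
        return "zh-yue"
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_han and has_latin:
        return None  # mixed CJK + English: let the model code-switch
    if has_han:
        return "zh"
    if has_latin:
        return "en"
    return None  # digits/punctuation only: auto-detect
```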
SSML — what works, what doesn’t
The docs list SSML support but are quiet about which tags actually behave. From experience:
- `<break time="500ms"/>` — works. Use for pauses between sentences in marketing copy.
- `<emphasis level="strong">` — works.
- `<prosody rate="slow">` — works. Accepts `slow`, `medium`, `fast`, or a numeric percentage.
- `<prosody pitch="...">` — works for relative changes (e.g. `+10%`).
- `<say-as interpret-as="digits">` — works for phone numbers, codes, dates.
- `<phoneme>` — partial. Tone-marked pinyin is more reliable than IPA for Chinese.
- `<voice>` — does NOT work. You cannot switch voice mid-utterance. Use separate calls and concatenate.
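Putting the working tags together, a marketing-copy snippet looks like this (illustrative text; check the current docs for how SSML input is flagged on the request):

```python
# Illustrative SSML built only from the tags that behave in practice.
ssml = (
    '<speak>'
    '全新升级，<emphasis level="strong">限时特惠</emphasis>。'
    '<break time="500ms"/>'
    '<prosody rate="slow">详情请拨打'
    '<say-as interpret-as="digits">4008123456</say-as>。'
    '</prosody>'
    '</speak>'
)
```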
Concatenating clips for narration
Marketing scripts are long. The pattern for a 60-second voiceover:
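A sketch of the split-synthesize-stitch pipeline. The sentence splitter and WAV concatenation (stdlib `wave`) are real; the `synth` callable stands in for whatever wrapper you have around the TTS call:

```python
import io
import wave
from concurrent.futures import ThreadPoolExecutor


def split_script(script: str) -> list[str]:
    """Split a script into sentences on CJK/Latin sentence enders."""
    out, buf = [], ""
    for ch in script:
        buf += ch
        if ch in "。！？.!?":
            out.append(buf.strip())
            buf = ""
    if buf.strip():
        out.append(buf.strip())
    return out


def concat_wav(clips: list[bytes]) -> bytes:
    """Concatenate WAV clips (same sample rate/width) into one file."""
    merged = io.BytesIO()
    out = None
    for clip in clips:
        with wave.open(io.BytesIO(clip)) as w:
            if out is None:
                out = wave.open(merged, "wb")
                out.setparams(w.getparams())
            out.writeframes(w.readframes(w.getnframes()))
    if out:
        out.close()  # rewrites the header with the final frame count
    return merged.getvalue()


def render_voiceover(script: str, synth) -> bytes:
    """Synthesize sentences in parallel, then stitch.

    `synth` is any sentence -> WAV-bytes callable (e.g. a wrapper
    around the TTS call above); hypothetical pipeline shape.
    """
    sentences = split_script(script)
    with ThreadPoolExecutor(max_workers=8) as pool:
        clips = list(pool.map(synth, sentences))
    return concat_wav(clips)
```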
Why split? Two reasons: (1) per-call latency is much lower for short utterances, so synthesizing in parallel is faster; (2) you can patch a single bad sentence by re-rolling just that one without redoing the whole take. We use this for ad spots — the `<break>` tags handle the inter-sentence pauses, and the parallel synth means a 60-second clip is ready in ~4 seconds.
Cost
Per-second-of-output-audio billing. Streaming and non-streaming bill identically. A 60-second ad spot is in the few-RMB range — much cheaper than the cost of a voice actor’s hourly rate, and fast enough for the marketing team to iterate dozens of variations in an afternoon.
Closing the series
That’s the five. To recap:
- Article 1 — Bailian / DashScope orientation.
- Article 2 — the Qwen LLM family and the fiddly bits (function calling, JSON mode, `enable_thinking`).
- Article 3 — Qwen-Omni for multimodal understanding.
- Article 4 — Wanxiang for video generation.
- Article 5 — Qwen-TTS for voice (this article).
The companion Aliyun PAI series covers DSW / DLC / EAS — the self-managed GPU layer where you train and serve your own models. Most teams I work with end up using both: Bailian when they want someone else’s pre-trained model, PAI when they need control of the weights. Pick by what you actually need to control, not by what looks more impressive on a resume.