Aliyun Bailian (2): The Qwen LLM API in Production
Picking a Qwen variant by latency and cost, function calling done right, JSON mode without tears, and the enable_thinking + streaming requirement that the docs gloss over.
This is the article in the series where most of the production wins live. The other models are interesting; the LLMs are what every product I have shipped on Bailian has called every minute of every day. The official Qwen API reference is dense and complete; this article is the readable companion that picks one path through it.
Pick the right Qwen variant for the workload
The Qwen family is large. Most teams overspend by defaulting to qwen-max everywhere. Most teams underspend on quality by defaulting to qwen-turbo. The right answer is “match variant to job”:

My production rules of thumb:
- `qwen-turbo` — classification, intent detection, short summarization, anything you call >10× per user request. It's the cheapest sane Qwen and surprisingly good at extraction.
- `qwen-plus` — daily driver for chat, RAG synthesis, multi-step reasoning. The cost-vs-quality knee.
- `qwen-max` / `qwen3-max` — code review, complex reasoning, anything where being wrong costs more than being slow.
- `qwen3-coder-plus` — every code task. It is meaningfully better at code than general `qwen-plus` even at the same parameter scale.
- `qwen3-vl-plus` / `qwen3-omni-flash` — image / video / audio in. Article 3 is dedicated to this.
Tip: a common mistake is using `qwen-max` for embedding-style classification. Don't. Use `qwen-turbo` with a tight system prompt and you'll cut cost 10× with no quality loss on tasks where you only need a label.
What actually goes over the wire
Whichever surface you use, the OpenAI compatibility layer or the native DashScope API, the substance of a chat-completion request is the same: a model ID, a `messages` array, and a parameter block.

The fields you’ll touch most often:
- `messages` — an array of `{role, content}`. Role is `system` / `user` / `assistant` / `tool`. The official docs note that for multimodal models `content` can be an array of typed parts (text, image_url, input_audio, video_url) — see article 3.
- `temperature` — 0.0-2.0. I use 0.0 for extraction / classification, 0.2-0.4 for default chat, 0.7+ only for creative writing. The official docs default is around 0.7, which is too high for most agentic uses.
- `top_p` — leave it at the default unless you know exactly why you want to change it. Tweaking both `temperature` and `top_p` at once is a recipe for confusion.
- `max_tokens` (compat) / `parameters.max_tokens` (native) — this is the output token cap, not the total. Set it. Otherwise a runaway can cost you.
- `stream` — toggle SSE streaming. See below.
- `response_format={"type": "json_object"}` — JSON mode. Strongly recommended over "please return JSON" prompting.
- `tools` / `tool_choice` — function calling.
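As a concrete sketch of those fields in use, here is a minimal label-only call, assuming the OpenAI Python SDK pointed at the documented compatibility endpoint; the label set and system prompt are illustrative, not part of the API:

```python
import os

def build_request(text: str) -> dict:
    """Assemble chat-completion kwargs. Pure function, so it is testable
    without network access; the label set here is illustrative."""
    return {
        "model": "qwen-turbo",       # cheap variant: label-only task
        "temperature": 0.0,          # deterministic extraction
        "max_tokens": 8,             # output cap: always set it
        "messages": [
            {"role": "system",
             "content": "Reply with exactly one label: billing, tech, other."},
            {"role": "user", "content": text},
        ],
    }

def classify(text: str) -> str:
    from openai import OpenAI  # lazy import: only needed at call time
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    resp = client.chat.completions.create(**build_request(text))
    return resp.choices[0].message.content.strip()
```

Keeping request assembly in a pure function makes it easy to unit-test your prompts and parameters without spending tokens.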
Function calling: the round trip
The Qwen function-calling protocol is the OpenAI tool-calls protocol. Two LLM calls plus your code in the middle:

A complete worked example — a tiny agent that can look up the weather:
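A hedged sketch of that agent, assuming the OpenAI Python SDK; `get_weather`, its schema, and the canned weather data are illustrative stand-ins for your real tool:

```python
import json

# Illustrative tool schema: the shape is the OpenAI tools protocol,
# the weather tool itself is a stand-in.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> dict:
    # A real implementation would call a weather service here.
    return {"city": city, "temp_c": 21, "condition": "clear"}

def run_weather_agent(client, question: str) -> str:
    """client is an OpenAI-compat client already pointed at DashScope."""
    messages = [{"role": "user", "content": question}]
    # Call 1: the model decides whether and how to call the tool.
    msg = client.chat.completions.create(
        model="qwen-plus", messages=messages, tools=[WEATHER_TOOL]
    ).choices[0].message
    messages.append(msg)  # required: the model must see its own tool_calls
    for call in msg.tool_calls or []:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(get_weather(**args)),
        })
    # Call 2: the model turns the tool result into the final answer.
    final = client.chat.completions.create(
        model="qwen-plus", messages=messages, tools=[WEATHER_TOOL]
    )
    return final.choices[0].message.content
```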
Three things that bite:
- `messages.append(msg)` is required between the first response and the tool result. The model needs to see its own tool_call message in the history, otherwise the second call returns a 400 about an "orphan tool result".
- `tool_choice="auto"` is the default. Force a specific tool with `tool_choice={"type": "function", "function": {"name": "..."}}` when you must — useful for the first call in a workflow.
- `parallel_tool_calls=True` is supported. Use it when you have independent tools — the model will return multiple `tool_calls` in one shot.
JSON mode
For structured output, do not rely on prompting. Use:
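A minimal sketch of a JSON-mode request, assuming the OpenAI compatibility layer; the extraction schema in the system prompt is illustrative:

```python
def build_json_request(text: str) -> dict:
    """Chat-completion kwargs for JSON mode. The extraction schema in the
    system prompt is illustrative; note the prompt still has to say 'JSON'
    for JSON mode to be accepted."""
    return {
        "model": "qwen-plus",
        "temperature": 0.0,
        "max_tokens": 256,                            # cap structured output
        "response_format": {"type": "json_object"},   # the JSON-mode switch
        "messages": [
            {"role": "system",
             "content": ('Extract the contact as JSON with keys "name" and '
                         '"email". Reply with JSON only.')},
            {"role": "user", "content": text},
        ],
    }
```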
Two caveats from production:
- The model will sometimes wrap the JSON in Markdown code fences anyway. Defensive parsing with `json.loads` after stripping the fences is wise.
- For structured JSON (Pydantic schema), use the function-calling pattern instead. It's stricter and the failure modes are easier to debug.
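The defensive parse from the first caveat can be sketched as a small helper; the regex here is one reasonable way to strip fences, not the only one:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON, tolerating optional Markdown code fences
    that the model sometimes adds even in JSON mode."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)
```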
enable_thinking and the streaming trap
Qwen3 series models support enable_thinking=True — it asks the model to produce a reasoning chain before the final answer. Quality goes up, especially on reasoning-heavy tasks. But you must use streaming. Non-stream returns a 400.

Practical pattern — collect the reasoning into a side log and stream the answer to your UI:
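A sketch of that pattern, assuming the OpenAI compatibility layer surfaces reasoning in a `reasoning_content` delta field and takes `enable_thinking` through `extra_body` (both per the DashScope docs, but verify against your SDK version):

```python
def collect_deltas(deltas):
    """Split streamed deltas into a reasoning log and the visible answer.
    Pure function over objects with .reasoning_content / .content."""
    reasoning, answer = [], []
    for d in deltas:
        if getattr(d, "reasoning_content", None):
            reasoning.append(d.reasoning_content)   # side log only
        if getattr(d, "content", None):
            answer.append(d.content)                # what the user sees
    return "".join(reasoning), "".join(answer)

def ask_with_thinking(client, question: str):
    """client is an OpenAI-compat client pointed at DashScope."""
    # enable_thinking is a DashScope-side parameter, so it travels in
    # extra_body on the OpenAI-compatible client.
    stream = client.chat.completions.create(
        model="qwen-plus",
        messages=[{"role": "user", "content": question}],
        stream=True,                      # required when thinking is on
        extra_body={"enable_thinking": True},
    )
    return collect_deltas(chunk.choices[0].delta for chunk in stream)
```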
I forward reasoning to my logging system and never to the user. Three reasons: (1) it leaks chain-of-thought IP if customers ever see it, (2) it confuses non-technical readers, (3) it doubles your visible response length.
Async chat (rare but useful)
If you have a very long-running chat (e.g. a 30k-token RAG synthesis), you can submit it async with X-DashScope-Async: enable and poll, same pattern as Wanxiang. The Qwen API reference documents this under “Asynchronous calling”. I use it for cron-batch summarization jobs that don’t need an immediate user-facing response.
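A hedged sketch of that submit-and-poll loop using only the standard library; the generation path and the `/tasks/{id}` polling endpoint are assumptions based on the native DashScope API and the Wanxiang pattern the text references, so check them against the current reference:

```python
import json
import os
import time
import urllib.request

BASE = "https://dashscope.aliyuncs.com/api/v1"

def task_url(task_id: str) -> str:
    # Assumption: async tasks poll the same /tasks endpoint Wanxiang uses.
    return f"{BASE}/tasks/{task_id}"

def _headers(asynchronous: bool = False) -> dict:
    h = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
         "Content-Type": "application/json"}
    if asynchronous:
        h["X-DashScope-Async"] = "enable"   # queue instead of waiting
    return h

def submit_async(payload: dict) -> str:
    """Submit a native-format text-generation request; returns a task id."""
    req = urllib.request.Request(
        f"{BASE}/services/aigc/text-generation/generation",
        data=json.dumps(payload).encode(),
        headers=_headers(asynchronous=True),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["output"]["task_id"]

def poll(task_id: str, interval: float = 5.0) -> dict:
    """Poll until the task reaches a terminal status."""
    while True:
        req = urllib.request.Request(task_url(task_id), headers=_headers())
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        if body["output"]["task_status"] in ("SUCCEEDED", "FAILED"):
            return body
        time.sleep(interval)
```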
Cost controls that actually work
- Always set `max_tokens`. A default cap of "model max" means a runaway loop costs you a fortune.
- Use a workspace key per environment. Set a hard daily budget on the prod key in the console under the workspace.
- Log token counts. `usage.prompt_tokens` and `usage.completion_tokens` are in every response. Aggregate them weekly and you'll spot the prompt that bloated by 3× without anyone noticing.
- Cache identical prompts at your edge. DashScope does not currently expose prompt caching the way Anthropic does, so cache yourself for high-volume identical-prefix patterns.
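The token-logging point can be sketched as a minimal in-process ledger; swap it for your real metrics pipeline, and note that `usage` here is just the `.usage` object every chat-completion response carries:

```python
from collections import defaultdict

class UsageLedger:
    """Minimal in-process token ledger, keyed by a free-form tag such as
    the prompt or feature name. Illustrative; a production system would
    ship these counts to a metrics backend instead."""

    def __init__(self):
        self.totals = defaultdict(lambda: [0, 0])  # tag -> [prompt, completion]

    def record(self, tag: str, usage) -> None:
        # usage has .prompt_tokens and .completion_tokens on every response
        self.totals[tag][0] += usage.prompt_tokens
        self.totals[tag][1] += usage.completion_tokens

    def report(self) -> dict:
        return {tag: {"prompt": p, "completion": c}
                for tag, (p, c) in self.totals.items()}
```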
What’s next
Article 3 is Qwen-Omni — the multimodal sibling. The big differences are: streaming is required (not optional), the content array gets typed parts for image / audio / video, and you have to think about pixel budgets and frame rates. It’s the highest-leverage capability in Bailian if your product touches non-text content.