Series · Aliyun Bailian · Chapter 2

Aliyun Bailian (2): The Qwen LLM API in Production

Picking a Qwen variant by latency and cost, function calling done right, JSON mode without tears, and the enable_thinking + streaming requirement that the docs gloss over.

This is the article in the series where most of the production wins live. The other models are interesting, but the LLMs are what every product I have shipped on Bailian calls every minute of every day. The official Qwen API reference is dense and complete; this article is the readable companion that picks one path through it.

Pick the right Qwen variant for the workload

The Qwen family is large. Most teams overspend by defaulting to qwen-max everywhere. Most teams underspend on quality by defaulting to qwen-turbo. The right answer is “match variant to job”:

Qwen model family

My production rules of thumb:

  • qwen-turbo — classification, intent detection, short summarization, anything you call >10× per user request. It’s the cheapest sane Qwen and surprisingly good at extraction.
  • qwen-plus — daily driver for chat, RAG synthesis, multi-step reasoning. The cost-vs-quality knee.
  • qwen-max / qwen3-max — code review, complex reasoning, anything where being wrong costs more than being slow.
  • qwen3-coder-plus — every code task. It is meaningfully better at code than general qwen-plus even at the same parameter scale.
  • qwen3-vl-plus / qwen3-omni-flash — image / video / audio in. Article 3 is dedicated to this.

Tip: A common mistake is using qwen-max for embedding-style classification. Don’t. Use qwen-turbo with a tight system prompt and you’ll cut cost 10× with no quality loss on tasks where you only need a label.
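To make the turbo-for-labels point concrete, here is a minimal sketch: a tight label-only system prompt plus a normalizer for the reply. The label set and the helper names are mine, not from the docs; the point is that constraining the model to a single label is what lets qwen-turbo match qwen-max on this kind of task.

```python
# Illustrative label set and prompt -- adapt to your own taxonomy.
LABELS = {"billing", "bug", "feature_request", "other"}

SYSTEM_PROMPT = (
    "Classify the user's message into exactly one of: "
    + ", ".join(sorted(LABELS))
    + ". Reply with the label only, nothing else."
)

def parse_label(raw: str) -> str:
    """Normalize the model's reply; fall back to 'other' on anything odd."""
    label = raw.strip().lower()
    return label if label in LABELS else "other"

print(parse_label("  Billing \n"))                # -> billing
print(parse_label("I think it's a bug, maybe?"))  # -> other
```

Pass `SYSTEM_PROMPT` as the system message to qwen-turbo and run every reply through `parse_label`; the fallback means a chatty answer degrades to `"other"` instead of crashing your pipeline.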

What actually goes over the wire

Independent of whether you use the OpenAI compat layer or DashScope native, the substance of a chat-completion request is the same: a model id, a messages array, and a parameter block.

Chat completion request flow

The fields you’ll touch most often:

  • messages — an array of {role, content}. Role is system / user / assistant / tool. The official docs note that for multimodal models content can be an array of typed parts (text, image_url, input_audio, video_url) — see article 3.
  • temperature — 0.0-2.0. I use 0.0 for extraction / classification, 0.2-0.4 for default chat, 0.7+ only for creative writing. The official docs default is around 0.7, which is too high for most agentic uses.
  • top_p — leave it at default unless you know exactly why you want to change it. Tweaking both temperature and top_p at once is a recipe for confusion.
  • max_tokens (compat) / parameters.max_tokens (native) — this is the output token cap, not total. Set it. Otherwise a runaway can cost you.
  • stream — toggle SSE streaming. See below.
  • response_format={"type": "json_object"} — JSON mode. Strongly recommended over “please return JSON” prompting.
  • tools / tool_choice — function calling.
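To make the list concrete, here is a sketch of the payload those fields form. The values are illustrative defaults I use for agentic calls, not DashScope's own defaults:

```python
import json

# A conservative parameter block for an agentic call: low temperature,
# explicit output cap, JSON mode. Values are illustrative, not required.
payload = {
    "model": "qwen-plus",
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Summarize: the cat sat on the mat."},
    ],
    "temperature": 0.2,   # extraction/chat range from the list above
    "max_tokens": 512,    # output cap, not total -- always set it
    "stream": False,
    "response_format": {"type": "json_object"},
}
print(json.dumps(payload, indent=2))
```

With the OpenAI compat layer these are exactly the keyword arguments to `client.chat.completions.create(**payload)`; DashScope native nests most of them under a `parameters` block.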

Function calling: the round trip

The Qwen function-calling protocol is the OpenAI tool-calls protocol. Two LLM calls plus your code in the middle:

Function calling round-trip

A complete worked example — a tiny agent that can look up the weather:

import json, os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['DASHSCOPE_API_KEY'],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def call_weather(city):
    # Real impl: call your API. Stub here.
    return {"city": city, "temp_c": 22, "conditions": "sunny"}

messages = [{"role": "user", "content": "Should I bring an umbrella to Shanghai?"}]
resp = client.chat.completions.create(
    model="qwen-plus", messages=messages, tools=tools,
)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # the model must see its own tool_calls turn first
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = call_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(model="qwen-plus", messages=messages)
    print(final.choices[0].message.content)
else:
    print(msg.content)

Three things that bite:

  • Appending the assistant message to the history (messages.append(msg)) before the tool results is required. The model needs to see its own tool_call turn in the history, otherwise the second call returns a 400 about an “orphan tool result”.
  • tool_choice="auto" is the default. Force a specific tool with tool_choice={"type": "function", "function": {"name": "..."}} when you must — useful for the first call in a workflow.
  • parallel_tool_calls=True is supported. Use it when you have independent tools — the model will return multiple tool_calls in one shot.
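The first bullet is easiest to see as data. Here is the exact history shape the second call must receive (IDs and arguments are made up for illustration), plus a sanity check that catches the orphan-tool-result case before the API does:

```python
import json

# The history the second call must see, in order: the user turn, the
# assistant turn *with its tool_calls*, then one "tool" message per call.
# The id and arguments below are invented for illustration.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"id": "call_abc", "type": "function",
         "function": {"name": "get_weather",
                      "arguments": json.dumps({"city": "Shanghai"})}},
    ],
}
history = [
    {"role": "user", "content": "Should I bring an umbrella to Shanghai?"},
    assistant_turn,
    {"role": "tool", "tool_call_id": "call_abc",
     "content": json.dumps({"temp_c": 22, "conditions": "sunny"})},
]

# Sanity check: every tool result must answer a tool_call id that appears
# earlier in the history -- otherwise you get the "orphan tool result" 400.
call_ids = {c["id"] for m in history if m.get("tool_calls") for c in m["tool_calls"]}
assert all(m["tool_call_id"] in call_ids for m in history if m["role"] == "tool")
```

I run a check like this in tests whenever I touch agent plumbing; it turns a cryptic 400 into a local assertion failure.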

JSON mode

For structured output, do not rely on prompting. Use:

resp = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "Return JSON: {\"sentiment\": \"positive|negative|neutral\"}"},
        {"role": "user", "content": "I love this product."},
    ],
    response_format={"type": "json_object"},
)
data = json.loads(resp.choices[0].message.content)

Two caveats from production:

  • The model will sometimes wrap JSON in ```json fences anyway. Defensive parsing with json.loads after stripping fences is wise.
  • For structured JSON (Pydantic schema) use the function-calling pattern instead. It’s stricter and the failure modes are easier to debug.
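A minimal defensive parser for the first caveat, assuming only that the fences, when present, wrap the whole reply:

```python
import json
import re

def parse_model_json(raw: str):
    """Parse model output that may or may not be wrapped in ```json fences."""
    text = raw.strip()
    # Strip a leading ```json (or bare ```) fence and the trailing ``` if present.
    m = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if m:
        text = m.group(1)
    return json.loads(text)

print(parse_model_json('{"sentiment": "positive"}'))
print(parse_model_json('```json\n{"sentiment": "positive"}\n```'))
```

Both calls return the same dict; the second would have crashed a naive `json.loads`.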

enable_thinking and the streaming trap

Qwen3 series models support enable_thinking=True — it asks the model to produce a reasoning chain before the final answer. Quality goes up, especially on reasoning-heavy tasks. But you must use streaming. Non-stream returns a 400.

enable_thinking + streaming

Practical pattern — collect the reasoning into a side log and stream the answer to your UI:

stream = client.chat.completions.create(
    model="qwen3-max",
    messages=[{"role": "user", "content": "If a clock loses 5 minutes a day, by how much will it be off after 10 days?"}],
    extra_body={"enable_thinking": True},
    stream=True,
)

reasoning, answer = [], []
for chunk in stream:
    delta = chunk.choices[0].delta
    # Qwen3 streams the reasoning chain in delta.reasoning_content
    rc = getattr(delta, "reasoning_content", None)
    if rc:
        reasoning.append(rc)
    if delta.content:
        answer.append(delta.content)

print("ANSWER:", "".join(answer))
print("(reasoning hidden,", sum(len(r) for r in reasoning), "chars)")

I forward reasoning to my logging system and never to the user. Three reasons: (1) it leaks chain-of-thought IP if customers ever see it, (2) it confuses non-technical readers, (3) it doubles your visible response length.

Async chat (rare but useful)

If you have a very long-running chat (e.g. a 30k-token RAG synthesis), you can submit it async with X-DashScope-Async: enable and poll, same pattern as Wanxiang. The Qwen API reference documents this under “Asynchronous calling”. I use it for cron-batch summarization jobs that don’t need an immediate user-facing response.
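The poll half of that pattern factors into a generic loop. This is a sketch: the status strings mirror the Wanxiang async pattern and should be checked against the “Asynchronous calling” section of the Qwen API reference; the actual submit and task-query HTTP calls are omitted here.

```python
import time

def poll_until_done(fetch_status, interval_s=2.0, timeout_s=600.0):
    """Generic poll loop for a DashScope async task.

    fetch_status() should return the task_status string from the
    task-query endpoint. The terminal states below follow the Wanxiang
    async convention and are an assumption for Qwen.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("async task did not finish in time")

# Stubbed usage: a task that reports RUNNING twice, then SUCCEEDED.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
print(poll_until_done(lambda: next(states), interval_s=0.01))
```

In production `fetch_status` is a closure over the task id returned by the submit call; the stub keeps the loop testable without a network.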

Cost controls that actually work

  • Always set max_tokens. Default cap of “model max” means a runaway loop costs you a fortune.
  • Use a workspace key per environment. Set a hard daily budget on the prod key in the console under the workspace.
  • Log token counts. usage.prompt_tokens and usage.completion_tokens are in every response. Aggregate them weekly and you’ll spot the prompt that bloated by 3x without anyone noticing.
  • Cache identical prompts at your edge. DashScope does not currently expose prompt caching the way Anthropic does — so cache yourself for high-volume identical-prefix patterns.
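The last bullet in code: a minimal edge cache keyed on the exact request. `cache_key` and `cached_chat` are hypothetical helpers of mine, and this is only safe for deterministic (temperature 0.0) calls:

```python
import hashlib
import json

_cache = {}

def cache_key(model, messages, **params):
    """Stable hash of the full request; sort_keys makes it order-insensitive."""
    blob = json.dumps({"model": model, "messages": messages, "params": params},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_chat(model, messages, call_fn, **params):
    """call_fn is your real API call; identical requests hit the cache."""
    key = cache_key(model, messages, **params)
    if key not in _cache:
        _cache[key] = call_fn(model=model, messages=messages, **params)
    return _cache[key]

# Stubbed usage: the second identical call never reaches the API.
calls = []
def fake_api(**kwargs):
    calls.append(kwargs)
    return {"content": "label", "usage": {"prompt_tokens": 12, "completion_tokens": 1}}

msgs = [{"role": "user", "content": "classify this"}]
cached_chat("qwen-turbo", msgs, fake_api, temperature=0.0)
cached_chat("qwen-turbo", msgs, fake_api, temperature=0.0)
print(len(calls))  # -> 1
```

Swap the dict for Redis with a TTL in production; the key function stays the same.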

What’s next

Article 3 is Qwen-Omni — the multimodal sibling. The big differences are: streaming is required (not optional), the content array gets typed parts for image / audio / video, and you have to think about pixel budgets and frame rates. It’s the highest-leverage capability in Bailian if your product touches non-text content.

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub