LLM on Chen Kai Blog

Alibaba Cloud Full Stack (10): Bailian and DashScope — The LLM Layer

Thu, 07 May 2026 09:00:00 +0000

When I first needed an LLM API for a production app in China, my options were limited and expensive. Most international providers had no mainland endpoint, billing required a foreign credit card, and latency from calling US-based APIs was 800ms+ before a single token came back. Then Qwen showed up on DashScope with an OpenAI-compatible endpoint, and suddenly building AI products in China became as straightforward as anywhere else. Same SDK, same request shape, same streaming protocol — just a different base_url and a key from the Bailian console. I have been running production workloads against it for over a year now, and this article is the comprehensive walkthrough I wish I had on day one.

LLM Engineering (12): Production — Deployment, Monitoring, Cost

Tue, 07 Apr 2026 09:00:00 +0000

This is the last chapter. The previous ones covered building the model, the prompt, the retrieval, and the evaluation. This chapter focuses on maintaining it without breaking the bank. Production LLM serving is more like running a high-traffic web service than classical ML serving, except each web request costs money and can take up to two minutes.

I’ll focus more on numbers here than in earlier chapters. In production, the difference between a profitable feature and a money pit often boils down to a 2-5x cost factor that no one is tracking. The most useful skill to develop is back-of-the-envelope cost arithmetic for LLM workloads. The numbers below are accurate as of late 2025 / early 2026; verify them against current pricing before committing.
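A minimal sketch of that back-of-the-envelope arithmetic. The traffic figures and per-million-token prices are placeholders invented for illustration, not quotes from any provider; `monthly_cost_usd` is a hypothetical helper name.

```python
def monthly_cost_usd(requests_per_day, input_tokens, output_tokens,
                     usd_per_m_input, usd_per_m_output, days=30):
    """Estimate monthly spend for one LLM-backed feature."""
    # Cost of a single request: tokens times price per million tokens.
    per_request = (input_tokens * usd_per_m_input
                   + output_tokens * usd_per_m_output) / 1_000_000
    return requests_per_day * days * per_request

# Hypothetical workload: 50k requests/day, 2,000 prompt tokens, 300 output tokens.
baseline = monthly_cost_usd(50_000, 2_000, 300,
                            usd_per_m_input=1.0, usd_per_m_output=4.0)
# Same traffic after trimming the prompt to 800 tokens.
trimmed = monthly_cost_usd(50_000, 800, 300,
                           usd_per_m_input=1.0, usd_per_m_output=4.0)

print(f"baseline ${baseline:,.0f}/month, trimmed ${trimmed:,.0f}/month, "
      f"ratio {baseline / trimmed:.1f}x")
```

Under these made-up numbers, prompt length alone moves the bill by a factor of 1.6, which is the kind of untracked multiplier the paragraph above is warning about.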

LLM Engineering (11): Safety and Alignment

Mon, 06 Apr 2026 09:00:00 +0000

Safety has the worst signal-to-noise ratio of any topic in LLM engineering. There’s a lot of philosophy, a lot of marketing, and not a lot of engineering specifics. This chapter is the engineering specifics: what RLHF actually optimizes when it talks about “safety,” how refusal calibration breaks, what red-teaming looks like in practice, the hallucination measures that actually predict customer impact, and the small but significant 2024-2026 papers (Sleeper Agents, refusal as a feature direction, weak-to-strong generalization) that should change how you think about alignment in production.

LLM Engineering (10): Evaluation

Sun, 05 Apr 2026 09:00:00 +0000

Evaluation is the part of the LLM stack where everyone has opinions but no one is confident. The leaderboards are gamed, the public benchmarks are contaminated, and most teams I’ve worked with had no eval set when I joined. This chapter covers what evaluation actually tells you, what the benchmarks hide, the LLM-as-judge biases that go unaddressed, the calibration metrics most teams skip, and the production patterns that catch regressions before customers notice.

LLM Engineering (9): Prompting at Production Scale

Sat, 04 Apr 2026 09:00:00 +0000

A prompt that works on 100 examples in a notebook can fail on 10% of inputs in production for reasons unrelated to cleverness. This chapter covers prompting as an engineering task: where chain-of-thought helps (and where it doesn’t), how prompt caching affects costs, how to combine few-shot, chain-of-thought, and self-consistency without using every trick, and how to defend against jailbreaks and injections that production traffic will generate within a week of launch.

LLM Engineering (8): Retrieval-Augmented Generation

Fri, 03 Apr 2026 09:00:00 +0000

RAG is the most over-deployed and under-engineered pattern in LLM applications. The 2024 demo loop — embed everything with text-embedding-3-large, dump into pgvector, top-5 cosine — works for 1000 documents and a forgiving demo. It does not survive 100K real documents and a customer who notices when the answer is wrong. This chapter is what I wish more teams knew before they built their second generation of RAG.

The original RAG paper (Lewis et al., 2020) framed retrieval-augmented generation as a hybrid model: a dense retriever (DPR) trained jointly with a generator (BART) so the retrieval objective optimized end-task accuracy. Production RAG in 2026 doesn’t look much like Lewis’s RAG: modern systems use frozen pre-trained embedders, separate rerankers, and decoder-only generators that don’t train against the retriever. But the core insight (parameterize knowledge separately from reasoning) survived and became the dominant paradigm. The Gao et al. (2023) RAG survey is the best comprehensive overview of the post-2020 evolution into “Naive RAG → Advanced RAG → Modular RAG.”

LLM Engineering (7): Function Calling and Tool Use

Thu, 02 Apr 2026 09:00:00 +0000

Function calling connects an LLM to the world outside its weights. It combines chat-template details (Chapter 2), structured-output kernels (Chapter 5), and prompt engineering (Chapter 9). This chapter explores what happens under the hood, the guarantees you can rely on, and the agent-loop patterns that handle real workloads.

The intellectual lineage matters. Tool use as an LLM capability traces back to two near-simultaneous papers in 2022: MRKL Systems (Karpas et al., AI21), which proposed expert-routing among neuro-symbolic modules, and ReAct (Yao et al., 2022), which interleaved chain-of-thought reasoning with tool actions. Toolformer (Schick et al., 2023) showed self-supervised teaching of tool use, generating training data by having a model insert tool-call markers into existing text. By 2024 every frontier model had post-training data structured around the tool-use format, and tool calling moved from “research demo” to “API feature.”

LLM Engineering (6): Long Context — RoPE, YaRN, Sinks

Wed, 01 Apr 2026 09:00:00 +0000

“1M token context” is one of the most over-claimed numbers in LLMs. A model can attend to 1M tokens — that’s an architecture statement. A model can use information at position 800K to answer a question — that’s a behavior statement, and it’s more challenging. This chapter covers the math of position encoding, the engineering tricks that extend context beyond the training length, and why most long-context claims fail needle-in-a-haystack tests.

LLM Engineering (5): Inference Optimization

Tue, 31 Mar 2026 09:00:00 +0000

Inference is where the money goes. A single 70B-class model serving 1000 concurrent users at 50 tok/s consumes the GPU budget used to train the model in about 3 months. This chapter focuses on two key metrics: time-to-first-token (TTFT) and inter-token latency (ITL), and one ratio: GPU-seconds per million output tokens.

Training is a one-time capital expense, with costs spread over millions of inference calls. Inference, however, is a recurring operating expense that doesn’t amortize. A 50% (1.5x) improvement in tokens-per-GPU-second compounds daily over the product’s lifetime. That’s why every serious LLM team has at least one full-time engineer focused on inference, and why the open-source community has released four distinct waves of inference engines (FasterTransformer → DeepSpeed-Inference → vLLM → SGLang/TensorRT-LLM/llama.cpp) in five years.
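The two metrics and the one ratio can be written down in a few lines. This is a sketch under stated assumptions: the throughput, GPU count, and latency figures are hypothetical, and the helper names are mine.

```python
def request_latency_s(ttft_s, itl_s, output_tokens):
    """End-to-end streaming latency for one request."""
    # The first token arrives after TTFT; each later token adds one inter-token gap.
    return ttft_s + (output_tokens - 1) * itl_s

def gpu_seconds_per_million(tokens_per_second, num_gpus):
    """GPU-seconds spent to produce one million output tokens."""
    return num_gpus * 1_000_000 / tokens_per_second

# Hypothetical deployment: 8 GPUs sustaining 20,000 output tokens/s in aggregate.
before = gpu_seconds_per_million(20_000, num_gpus=8)
# The same hardware after a 1.5x throughput improvement.
after = gpu_seconds_per_million(30_000, num_gpus=8)

latency = request_latency_s(ttft_s=0.4, itl_s=0.02, output_tokens=500)
print(f"{before:.0f} -> {after:.0f} GPU-seconds per million output tokens")
print(f"a 500-token answer takes about {latency:.2f} s")
```

A 1.5x throughput gain cuts GPU-seconds per million tokens by a third, and because inference is an operating expense, that saving recurs on every day of traffic.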

LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

Mon, 30 Mar 2026 09:00:00 +0000

A base model from pretraining can complete text but cannot follow instructions, refuse harmful requests, or maintain a persona—these are post-training behaviors. Post-training is where the gap between a research paper’s claims and a production-grade model lies. This chapter covers what each post-training algorithm optimizes, why most reward models are subtly flawed, and the effective methods for 2026.

LLM Engineering (3): Pretraining at Scale

Sun, 29 Mar 2026 09:00:00 +0000

Pretraining is where most of an LLM’s capability comes from, and it’s also where the leaderboard-vs-reality gap is widest. Most published runs are heroic engineering more than they are scientific results. This chapter is about the parts of pretraining that you actually have to get right when you’re not OpenAI: the data, the parallelism choice, and the failure modes that only show up when the cluster is large enough to make a single bad NCCL all-reduce kill a 30-day run.

LLM Engineering (2): Tokenization Deep Dive

Sat, 28 Mar 2026 09:00:00 +0000

Tokenization is the layer everyone skips. It’s also the layer where I’ve debugged the most production bugs — silent quality regressions, mysterious cost spikes, models refusing to follow instructions because someone formatted the chat template wrong. This chapter is everything I wish I’d internalized before shipping a multilingual product.

What a tokenizer actually does

A tokenizer maps a string to a list of integer IDs; the reverse operation maps IDs back to a string. Both directions are deterministic, but the mapping is not bijective in general: round-tripping tokenizer.decode(tokenizer.encode(s)) can lose whitespace, normalize Unicode, or collapse repeated punctuation, depending on the algorithm.
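To make the lossy round-trip concrete, here is a toy illustration. `ToyTokenizer` is a hypothetical word-level tokenizer written only for this example; real BPE or SentencePiece tokenizers differ in detail, but they can lose information in similar ways.

```python
import unicodedata

class ToyTokenizer:
    """Hypothetical word-level tokenizer, only to illustrate lossy round-trips."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def encode(self, text):
        # NFKC normalization folds compatibility characters such as ligatures.
        normalized = unicodedata.normalize("NFKC", text)
        ids = []
        for token in normalized.split():  # split() discards runs of whitespace
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.id_to_token)
                self.id_to_token.append(token)
            ids.append(self.token_to_id[token])
        return ids

    def decode(self, ids):
        # Tokens are rejoined with a single space, whatever the input had.
        return " ".join(self.id_to_token[i] for i in ids)

tokenizer = ToyTokenizer()
original = "\ufb01ne  tuning"  # "fi" ligature followed by a double space
round_tripped = tokenizer.decode(tokenizer.encode(original))
print(repr(original), "->", repr(round_tripped))
```

Encoding the same string twice yields the same IDs, so the mapping is deterministic, yet the decoded string is no longer equal to the input.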

LLM Engineering (1): Architectures from Transformer to MoE

Fri, 27 Mar 2026 09:00:00 +0000

The 2017 Transformer block is still the silhouette of every production LLM in 2026, but almost every internal piece has been swapped, sparsified, or specialized. This series covers the modern stack end to end — architecture, training, inference, retrieval, evaluation, safety, deployment. Chapter 1 is about the block itself: what attention looks like in a 2026 model, how MoE breaks the param-FLOPs link, and where the non-attention alternatives (Mamba, RWKV) actually beat the Transformer.

Terraform for AI Agents (6): LLM Gateway and Secrets Management

Sun, 22 Mar 2026 09:00:00 +0000

A pattern I see repeatedly in immature agent stacks: each agent has its own copy of OPENAI_API_KEY in its own .env file. Sometimes the same key, sometimes different ones, sometimes a colleague’s personal key from when they prototyped. When the bill arrives nobody can tell which agent caused which token spend, and when a key leaks (it always does) you’re playing whack-a-mole across a dozen .env files.

The real wake-up call hit me two years ago. A contractor finished his three-month engagement on a Friday, his laptop went home, and on the following Tuesday DashScope billing flagged 12 million tokens of qwen-max traffic from an IP we didn’t recognise. His personal API key — copy-pasted into a side project — was still sitting in our agent’s .env. Rotating it took six hours: three engineers, four repos, two CI pipelines, one panicked Slack thread. Never again.

Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain

Sat, 07 Mar 2026 09:00:00 +0000

A DSW notebook is for one engineer on one GPU. When you need eight GPUs across two nodes or training that runs longer than eight hours, you switch to DLC. DLC is PAI’s job-submission front-end for a managed Kubernetes cluster. You describe what you want (image, command, resources, data mounts), and DLC schedules pods, runs them to completion, persists logs, and reports the results. The docs call this Deep Learning Containers; we just say “DLC job”.

Aliyun Bailian (2): The Qwen LLM API in Production

Thu, 26 Feb 2026 09:00:00 +0000

Of all the articles in this series, this one covers most of the production wins. The other models are interesting, but the LLMs are what every product I’ve shipped on Bailian calls every minute of every day. The official Qwen API reference is dense and complete; this article is the readable companion that guides you through it.

Pick the right Qwen variant for the workload

The Qwen family is large. Some teams overspend by defaulting to qwen-max everywhere; others underspend on quality by defaulting to qwen-turbo. The right answer is “match variant to job”:

Aliyun Bailian (1): Platform Overview and First Request

Wed, 25 Feb 2026 09:00:00 +0000

If you ship anything that touches Chinese-language users, sooner or later you will end up calling a Bailian model. Qwen-Max is the cheapest sane way to get GPT-4-class Chinese understanding, the Wanxiang video models are the only production-grade text-to-video API I can buy with a Chinese invoice, and Qwen-TTS-Flash is the only TTS that handles Cantonese and Sichuanese without sounding like a customs announcement. After about a year of running these in production for an AI-marketing platform, this series is what I wish someone had handed me on day one.

AI Agents Complete Guide: From Theory to Industrial Practice

Mon, 19 Jan 2026 09:00:00 +0000

A chatbot answers questions. An agent gets things done — it browses, runs code, calls APIs, queries databases, and iterates until the job is complete. The same LLM powers both, but the wrapper differs: an agent runs in a loop with tools, memory, and the ability to inspect its own work.

This guide is the expanded version of that idea. It covers the four core capabilities (planning, memory, tool use, reflection), major framework families, multi-agent collaboration, evaluation, and the production concerns that determine whether an agent succeeds or fails.
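The loop described above can be sketched in a few lines. Everything here is a stand-in: `scripted_model` replaces a real LLM call with hard-coded decisions, and `calculator` is a toy tool that only handles `a+b` sums.

```python
def calculator(expression):
    # Toy tool: handles sums of two integers written as "a+b".
    left, right = expression.split("+")
    return str(int(left) + int(right))

TOOLS = {"calculator": calculator}

def scripted_model(memory):
    """Stand-in for an LLM call: picks the next step from what is in memory."""
    observations = [entry for entry in memory if entry[0] == "observation"]
    if not observations:
        return ("tool", "calculator", "19+23")
    return ("final", f"The answer is {observations[-1][1]}.")

def run_agent(task, model, max_steps=5):
    memory = [("task", task)]
    for _ in range(max_steps):
        action = model(memory)
        if action[0] == "final":
            return action[1], memory
        _, tool_name, tool_input = action
        # Run the tool and store the result so the next step can inspect it.
        memory.append(("observation", TOOLS[tool_name](tool_input)))
    return "Stopped: step budget exhausted.", memory

answer, trace = run_agent("What is 19 + 23?", scripted_model)
print(answer)
```

The step budget matters as much as the tools: without `max_steps`, a model that never emits a final answer would loop forever.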

Recommendation Systems (12): Large Language Models and Recommendation

Sat, 03 Jan 2026 09:00:00 +0000

A user opens a movie app and types: “Something like Inception, but less depressing.” A traditional recommender — collaborative filtering, two-tower DNN, even DIN — sees zero useful tokens here. It has no like button to count, no co-watch graph to traverse, no user ID with history. The query has to be turned into IDs before the system can do anything.

A Large Language Model has the opposite problem: it has too much world knowledge but doesn’t know who this user is. It knows Inception is a Christopher Nolan film with non-linear narrative and a hopeful-but-ambiguous ending; it knows what “depressing” means in cinema; it can name twenty films that fit. But it can’t tell you which of those twenty the current user has already seen, rated badly, or left half-watched.

NLP (12): Frontiers and Practical Applications

Tue, 25 Nov 2025 09:00:00 +0000

We have spent eleven chapters climbing from raw text to multimodal foundation models. This twelfth and final chapter sits at the frontier and at the runway. It is where research stops being a paper and starts being a service: an LLM that calls tools, writes and debugs code, reasons through hundred-step problems, ingests a 200K-token contract, and serves a thousand concurrent users behind a FastAPI endpoint with p95 latency under 300 ms.

NLP (11): Multimodal Large Language Models

Thu, 20 Nov 2025 09:00:00 +0000

Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. Multimodal Large Language Models (MLLMs) close the gap by aligning images, audio, and video into the same representation space the language model already speaks.

NLP (10): RAG and Knowledge Enhancement Systems

Sat, 15 Nov 2025 09:00:00 +0000

A frozen language model is a confident liar. It can’t read yesterday’s incident report, your company wiki, or the patch notes that shipped this morning, so when you ask, it confabulates an answer that is grammatically perfect but factually wrong. Retrieval-Augmented Generation (RAG) breaks the deadlock by separating memory from reasoning: keep the LLM small and stable, and put the volatile knowledge in an external store that you can update anytime. Before generating, retrieve the relevant evidence and condition the model on it.
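A toy version of "retrieve, then condition" looks like this. It assumes a bag-of-words cosine score in place of a real embedding model, and the three documents in the store are invented for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, evidence):
    context = "\n".join(f"- {doc}" for doc in evidence)
    return f"Answer using only this evidence:\n{context}\n\nQuestion: {query}"

# External store: update it anytime without touching the model.
store = [
    "incident report: the payment service was down for 40 minutes yesterday",
    "patch notes: version 2.3 fixes the login timeout bug",
    "wiki: expense claims are approved by the team lead",
]
question = "which version fixes the login bug"
evidence = retrieve(question, store, k=1)
prompt = build_prompt(question, evidence)
print(prompt)
```

The prompt that reaches the model now carries the patch notes, so the answer can be grounded in evidence the frozen weights never saw.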

NLP (9): Deep Dive into LLM Architecture

Mon, 10 Nov 2025 09:00:00 +0000

The 2017 Transformer paper drew one block. Every production LLM today still uses that diagram as a silhouette, but almost every internal piece has been replaced. Pre-norm replaced post-norm. RMSNorm replaced LayerNorm. SwiGLU replaced GELU. Rotary embeddings replaced sinusoids. Multi-head attention became grouped-query attention. The dense FFN sometimes became a sparse mixture of experts. And the inference loop is dominated by a data structure that doesn’t appear in the original paper at all: the KV cache.

NLP (8): Model Fine-tuning and PEFT

Wed, 05 Nov 2025 09:00:00 +0000

In 2020, fine-tuning a 7-billion-parameter language model was a project budget item: eight A100s, several days, and an engineer who knew how to babysit gradient checkpointing. In 2024, a graduate student does it on a laptop. The distance between those two worlds is almost entirely covered by one paper — Hu et al.’s LoRA (ICLR 2022) — and one follow-up — Dettmers et al.’s QLoRA (NeurIPS 2023).

The shift is not just engineering. Parameter-Efficient Fine-Tuning (PEFT) reframes what it means to “have a model.” Instead of one binary blob per task, you keep a single frozen base model and a directory of small adapter files, each a few tens of megabytes. Switching tasks becomes loading a new adapter; serving N domains becomes O(1) base + N · ε.
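The "O(1) base + N · ε" claim is easy to check with arithmetic. The model shape below (32 layers, hidden size 4096, four square attention projections adapted per layer, rank 16, a 7B-parameter base in 16-bit weights) is a hypothetical configuration chosen for round numbers, not a specific published model.

```python
def lora_param_count(d_in, d_out, rank):
    # LoRA factors a d_out x d_in update into B (d_out x rank) and A (rank x d_in).
    return rank * (d_in + d_out)

HIDDEN, LAYERS, MATRICES_PER_LAYER, RANK = 4096, 32, 4, 16  # hypothetical shape
BYTES_PER_PARAM = 2                                         # fp16 / bf16
BASE_PARAMS = 7_000_000_000                                 # hypothetical 7B base

adapter_params = LAYERS * MATRICES_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)
adapter_bytes = adapter_params * BYTES_PER_PARAM
base_bytes = BASE_PARAMS * BYTES_PER_PARAM

def storage_gb(n_tasks, shared_base):
    """Total weight storage for n_tasks fine-tuned variants, in GB."""
    if shared_base:
        total = base_bytes + n_tasks * adapter_bytes  # one base + N adapters
    else:
        total = n_tasks * base_bytes                  # one full checkpoint per task
    return total / 1e9

print(f"adapter: {adapter_params:,} params, {adapter_bytes / 1e6:.1f} MB")
print(f"10 tasks, full checkpoints: {storage_gb(10, shared_base=False):.1f} GB")
print(f"10 tasks, shared base:      {storage_gb(10, shared_base=True):.2f} GB")
```

With these assumptions each adapter lands in the tens-of-megabytes range, and ten tasks cost barely more than one copy of the base model.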

NLP (7): Prompt Engineering and In-Context Learning

Fri, 31 Oct 2025 09:00:00 +0000

The same model can produce a sharp answer or a confident hallucination. The difference lies in the framing, not the weights. A vague request like “analyze this text” yields a generic summary; a prompt with a role, two clear examples, and a strict output schema produces something a parser can use. Prompt engineering turns that gap into a repeatable system, not just a lucky shot.

In-Context Learning (ICL) is the mechanism that makes this work. When you include a few examples in the prompt, the model doesn’t retrain; it conditions its forward pass on those examples and effectively infers a task from them. Understanding ICL’s capabilities and limitations separates a developer who struggles with the model from one who guides it.

Prompt Engineering Complete Guide: From Zero to Advanced Optimization

Tue, 30 Sep 2025 09:00:00 +0000

The same model, two prompts: one achieves 17% accuracy on grade-school math, the other 78%. The difference isn’t magic—it’s prompt engineering. This guide covers the techniques that work, the research behind them, and how to systematically optimize prompts for production.

What You Will Learn

Foundations — zero-shot, few-shot, many-shot, task decomposition, and the five-block prompt skeleton.
Reasoning techniques — Chain-of-Thought, Self-Consistency, Tree of Thoughts, Graph of Thoughts, ReAct.
Automation — Automatic Prompt Engineering (APE), DSPy, LLMLingua compression.
Practical templates — structured output, code generation, data extraction, multi-turn chat.
Evaluation and debugging — metrics, A/B testing, error analysis, the failure-mode toolkit.

Prerequisites. Basic Python; experience calling any LLM API. No math background required.

LLM Workflows and Application Architecture: Enterprise Implementation Guide

Thu, 31 Jul 2025 09:00:00 +0000

Most LLM tutorials end where the interesting work begins. They show you how to call a chat completion endpoint, attach a vector store, and wrap the whole thing in a Streamlit demo. None of that is wrong, but none of it is what breaks at 3 a.m. when 10,000 users hit your service at once and every other answer is a hallucination.

This article is about everything that comes after the demo. It is opinionated on purpose: production LLM systems are mostly plain distributed systems with one non-deterministic component bolted on, and most of the engineering effort goes into containing that non-determinism. We will work through seven dimensions — application architecture, workflow patterns, the RAG-vs-fine-tune decision, deployment topology, cost, observability, and enterprise integration — keeping each one short, concrete, and grounded in the levers that actually move the needle.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Tue, 29 Jul 2025 09:00:00 +0000

Fine-tuning a 1.5B-parameter GPT-2 model for each downstream task means saving a fresh 1.5B-parameter checkpoint every time. Across a dozen tasks, that is a substantial storage and serving headache, and it makes sharing a single base model essentially impossible. Prefix-Tuning (Li & Liang, 2021) takes the opposite stance: freeze every weight of the language model, and learn a tiny block of continuous vectors — the prefix — that is fed into the attention layers as if it were context the model already attended to. The model never changes; only the prefix does, and a different prefix produces a different “personality” on demand.

MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation

Sun, 01 Sep 2024 09:00:00 +0000

LoRA is the default tool for adapting a frozen base model: cheap, stable, mergeable, and good enough for most single-task settings. But the moment your fine-tuning data is genuinely heterogeneous (code mixed with math, instruction following mixed with creative writing, several domains in one adapter), a single low-rank subspace starts to feel cramped. You can grow $r$, but cost grows with it and you still get one subspace, just a fatter one.

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

Fri, 30 Jun 2023 09:00:00 +0000

Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.
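The equivariance is easy to verify numerically. The sketch below assumes a single attention head with Q = K = V = x, so no learned projections, no mask, and no position signal; the three token vectors are arbitrary.

```python
import math

def self_attention(x):
    """Single-head attention with Q = K = V = x (no projections, no positions)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)                          # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]
perm = [2, 0, 1]

# Path 1: permute the inputs, then attend.
out_of_permuted = self_attention([tokens[i] for i in perm])
# Path 2: attend, then permute the outputs the same way.
permuted_out = [self_attention(tokens)[i] for i in perm]

max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_of_permuted, permuted_out)
               for a, b in zip(row_a, row_b))
print(f"largest difference between the two paths: {max_diff:.2e}")
```

Both paths agree up to floating-point rounding, which is exactly why the layer cannot distinguish one ordering of the tokens from another without an injected position signal.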

That single design decision — how to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner’s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — length extrapolation, the ability to handle sequences longer than anything seen in training.

LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation

Sun, 22 Jan 2023 09:00:00 +0000

Session-based recommendation relies on the click graph. New items lack edges, and long-tail items have a few noisy ones. Each item has a title and description, but the model never uses them. LLMGR addresses this by treating the LLM as a “semantic engine” that converts text into representations a graph encoder can use, then lets a GNN handle ranking. On Amazon Music/Beauty/Pantry, the results show HR@20 up ~8.68%, NDCG@20 up ~10.71%, and MRR@20 up ~11.75% over the strongest GNN baseline, with the biggest gains for cold-start items.

Optimization (4): Learning Rate and Schedules

Sun, 18 Sep 2022 09:00:00 +0000

Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, learning rate is the one that most often decides whether training converges, crawls, or blows up. This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.

Optimization (3): The Gradient Descent Family from SGD to AdamW

Fri, 16 Sep 2022 09:00:00 +0000

Why is “tuning the LR is an art” a meme for ResNet, while every modern LLM paper just writes “AdamW, $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$” and moves on? It is not an accident; it is the end-point of three decades of optimizer evolution.

This post walks the lineage end-to-end on a single thread: each step exists because of a specific failure of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.

Multimodal LLMs and Downstream Tasks: A Practitioner's Guide

Sat, 09 Apr 2022 09:00:00 +0000

Stuffing pixels, audio, and video into a language model so it can “see,” “hear,” and reason — that was a research curiosity before CLIP landed in 2021. Today it’s table stakes for most consumer-facing AI products. But shipping a Multimodal LLM (MLLM) in production turns out to be hard in places people rarely talk about. Almost never the vision encoder. Almost always these four:

Alignment. How does the language model “understand” what the vision encoder produces? Is the projector a 2-layer MLP or a Q-Former? Which parameters thaw during training?
Task framing. The same MLLM has to do captioning, VQA, grounding, OCR. Each needs a prompt template that doesn’t quietly drop several points of accuracy.
Cost. A 1024x1024 image becomes hundreds of visual tokens. Prefill is brutal. Stretch that to video and the bill goes vertical. Token compression, KV cache reuse, and batching are not optional.
Evaluation. A model that scores 80 on MMBench can still hallucinate confidently on your customer’s invoice. Public benchmarks are the easy part.

This post follows the natural research arc — architecture, model families, downstream tasks, fine-tuning, evaluation, deployment — and tries to be specific enough at each stop that you can act on it. Less “what’s possible,” more “what to actually pick.”