LLM

May 7, 2026 Alibaba Cloud Full Stack 24 min read

Alibaba Cloud Full Stack (10): Bailian and DashScope — The LLM Layer

The complete LLM toolkit on Alibaba Cloud: Qwen model family, DashScope API (OpenAI-compatible), Wanxiang image/video generation, Qwen TTS, async task patterns, and fine-tuning. Build a multi-modal AI pipeline.

Apr 7, 2026 LLM Engineering 36 min read

LLM Engineering (12): Production — Deployment, Monitoring, Cost

Serving stack choices in detail, autoscaling LLMs, latency budgets, prompt+completion cost tracking, multi-model routing, FrugalGPT cascading, observability you need from day one, and the on-call patterns that work.

Apr 6, 2026 LLM Engineering 36 min read

LLM Engineering (11): Safety and Alignment

What alignment means engineering-wise, refusal calibration, the red-team taxonomy, hallucination metrics, sleeper agents, refusal as a feature vector, constitutional AI, and what shipping safely actually requires in …

Apr 5, 2026 LLM Engineering 36 min read

LLM Engineering (10): Evaluation

Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B testing patterns that actually catch regressions in production.

Apr 4, 2026 LLM Engineering 40 min read

LLM Engineering (9): Prompting at Production Scale

Chain-of-thought when it actually helps, self-consistency, prompt-caching economics, jailbreak taxonomy, prompt-injection defenses, and the prompts that survive in production.

Apr 3, 2026 LLM Engineering 30 min read

LLM Engineering (8): Retrieval-Augmented Generation

Chunking strategies, dense vs sparse vs hybrid retrieval, reranker selection, the long-context-vs-RAG tradeoff in 2026, and the failure modes that show up at 100K+ documents.

Apr 2, 2026 LLM Engineering 34 min read

LLM Engineering (7): Function Calling and Tool Use

JSON-mode vs function-mode vs free-form, parallel tool calls, structured-output guarantees with grammars, error recovery patterns, and the agent loops that survive contact with reality.

Apr 1, 2026 LLM Engineering 34 min read

LLM Engineering (6): Long Context — RoPE, YaRN, Sinks

How RoPE encodes position, why naive extension breaks, NTK-aware and YaRN scaling, ALiBi vs RoPE, attention sinks for streaming, and why 1M-context claims often fail at retrieval.

Mar 31, 2026 LLM Engineering 42 min read

LLM Engineering (5): Inference Optimization

KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the vLLM vs SGLang vs TensorRT-LLM tradeoffs.

Mar 30, 2026 LLM Engineering 48 min read

LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

What SFT, DPO, RLHF, and RLAIF each actually optimize, when reward models fail, KL constraints, the LoRA-vs-full-FT debate, and the production post-training recipes that ship in 2026.

Mar 29, 2026 LLM Engineering 42 min read

LLM Engineering (3): Pretraining at Scale

Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and the failure modes that only appear above 1000 GPUs.

Mar 28, 2026 LLM Engineering 18 min read

LLM Engineering (2): Tokenization Deep Dive

BPE vs SentencePiece vs WordPiece, byte-level fallback, the CJK token-bloat problem, vocabulary expansion costs, and the chat-template tokens that silently shape every model's behavior.

Mar 27, 2026 LLM Engineering 56 min read

LLM Engineering (1): Architectures from Transformer to MoE

MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives Mamba and RWKV — what each costs and where each wins.

Mar 22, 2026 Terraform Agents 36 min read

Terraform for AI Agents (6): LLM Gateway and Secrets Management

Centralise LLM API access through one gateway: per-agent quotas, request logging, and zero secrets outside KMS. Terraform-provisioned API Gateway plus self-hosted LiteLLM on ECS, with DashScope/OpenAI/Anthropic keys …

Mar 7, 2026 Aliyun PAI 26 min read

Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain

Submit a real multi-GPU training job on PAI-DLC, understand the resource pools (Lingjun vs general vs preemptible), and use AIMaster + EasyCKPT so a flaky node doesn't cost you a day.

Feb 26, 2026 Aliyun Bailian 12 min read

Aliyun Bailian (2): The Qwen LLM API in Production

Picking a Qwen variant by latency and cost, function calling done right, JSON mode without tears, and the enable_thinking + streaming requirement that the docs gloss over.

Feb 25, 2026 Aliyun Bailian 10 min read

Aliyun Bailian (1): Platform Overview and First Request

A practitioner's tour of Alibaba Cloud Bailian (DashScope) — what's actually in the model catalog, the two endpoint flavors, the async task pattern, and a working sample request to ground the rest of the series.

Jan 19, 2026 Standalone 46 min read

AI Agents Complete Guide: From Theory to Industrial Practice

A practitioner-grade guide to building AI agents: planning (CoT/ReAct/ToT), memory architectures, tool use, reflection, multi-agent patterns, frameworks (LangChain, LangGraph, AutoGen, CrewAI), evaluation, and production …

Jan 3, 2026 Recommendation Systems 40 min read

Recommendation Systems (12): Large Language Models and Recommendation

How LLMs reshape recommendation: enhancers (P5, M6-Rec), predictors (TALLRec, GenRec), and agents (LlamaRec, Chat-REC). Hybrid pipelines, cold-start wins, prompt design, and the cost/quality Pareto frontier.

Nov 25, 2025 NLP 36 min read

NLP (12): Frontiers and Practical Applications

Series finale: agents and tool use (Function Calling, ReAct), code generation (Code Llama, Codex), long-context attention (Longformer, Infini-attention), reasoning models (o1, R1), safety and alignment, evaluation, and …

Nov 20, 2025 NLP 32 min read

NLP (11): Multimodal Large Language Models

A deep dive into multimodal LLMs: contrastive vision-language pre-training with CLIP, parameter-efficient bridging with BLIP-2's Q-Former, visual instruction tuning with LLaVA, robust speech recognition with Whisper, …

Nov 15, 2025 NLP 34 min read

NLP (10): RAG and Knowledge Enhancement Systems

Build production-grade RAG systems from first principles: the retrieve-then-generate decomposition, vector indexes (FAISS / Milvus / Chroma / Weaviate / Pinecone), dense+sparse hybrid retrieval with RRF, cross-encoder …

Nov 10, 2025 NLP 32 min read

NLP (9): Deep Dive into LLM Architecture

Inside modern LLMs: pre-norm + RMSNorm + SwiGLU + RoPE + GQA, KV cache mechanics, FlashAttention's IO-aware schedule, sparse Mixture-of-Experts, and INT8 / INT4 quantization.

Nov 5, 2025 NLP 18 min read

NLP (8): Model Fine-tuning and PEFT

A deep dive into Parameter-Efficient Fine-Tuning. Why LoRA's low-rank update works, the math and memory accounting behind QLoRA, how Adapters and Prefix-Tuning differ, and how to choose between them in production.

Oct 31, 2025 NLP 36 min read

NLP (7): Prompt Engineering and In-Context Learning

From prompt anatomy to chain-of-thought, self-consistency and ReAct: a working theory of in-context learning, the variance you have to fight, and the patterns that scale to real systems.

Sep 30, 2025 Standalone 30 min read

Prompt Engineering Complete Guide: From Zero to Advanced Optimization

Master prompt engineering from zero-shot basics to Tree of Thoughts, DSPy, and automated optimization. Includes benchmarks, code, and a debugging toolkit.

Jul 31, 2025 Standalone 30 min read

LLM Workflows and Application Architecture: Enterprise Implementation Guide

From a single API call to a production LLM platform — workflow patterns, RAG, model routing, deployment, cost levers, observability, and enterprise integration, with the trade-offs that actually matter.

Jul 29, 2025 Standalone 22 min read

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Prefix-Tuning adapts frozen LLMs by learning continuous key/value vectors injected into attention. Covers the method, reparameterization, KV-cache mechanics, and comparisons with prompt tuning, adapters, and LoRA.

Sep 1, 2024 Standalone 26 min read

MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation

MoSLoRA boosts LoRA expressivity by mixing multiple low-rank subspaces with a lightweight mixer. Covers when vanilla LoRA fails, mixer design choices, and tuning tips.

Jun 30, 2023 Standalone 12 min read

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

A practitioner's tour of Transformer position encoding: why attention needs it at all, how sinusoidal/learned/relative/RoPE/ALiBi schemes differ, and which one to pick when long-context extrapolation matters.

Jan 22, 2023 Standalone 24 min read

LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation

LLMGR uses an LLM as the semantic engine for session-based recommendation and a GNN as the ranker. Covers the hybrid encoding layer, two-stage prompt tuning, ~8.68% HR@20 lift, and how to deploy without running an LLM …

Sep 18, 2022 Optimization Theory 40 min read

Optimization (4): Learning Rate and Schedules

A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …

Sep 16, 2022 Optimization Theory 24 min read

Optimization (3): The Gradient Descent Family from SGD to AdamW

One article that traces the full lineage GD → SGD → Momentum → NAG → AdaGrad → RMSProp → Adam → AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …

Apr 9, 2022 Standalone 40 min read

Multimodal LLMs and Downstream Tasks: A Practitioner's Guide

End-to-end map of multimodal LLMs: vision-language alignment, cross-modal fusion, the CLIP/BLIP/LLaVA families, downstream tasks (VQA, captioning, grounding, OCR), fine-tuning trade-offs, benchmarks, and what it takes to …