LLM
Alibaba Cloud Full Stack (10): Bailian and DashScope — The LLM Layer
The complete LLM toolkit on Alibaba Cloud: Qwen model family, DashScope API (OpenAI-compatible), Wanxiang image/video generation, Qwen TTS, async task patterns, and fine-tuning. Build a multi-modal AI pipeline.
LLM Engineering (12): Production — Deployment, Monitoring, Cost
Serving stack choices in detail, autoscaling LLMs, latency budgets, prompt+completion cost tracking, multi-model routing, FrugalGPT cascading, observability you need from day one, and the on-call patterns that work.
LLM Engineering (11): Safety and Alignment
What alignment means engineering-wise, refusal calibration, the red-team taxonomy, hallucination metrics, sleeper agents, refusal as a feature vector, constitutional AI, and what shipping safely actually requires in …
LLM Engineering (10): Evaluation
Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B testing patterns that actually catch regressions in production.
LLM Engineering (9): Prompting at Production Scale
Chain-of-thought when it actually helps, self-consistency, prompt-caching economics, jailbreak taxonomy, prompt-injection defenses, and the prompts that survive in production.
LLM Engineering (8): Retrieval-Augmented Generation
Chunking strategies, dense vs sparse vs hybrid retrieval, reranker selection, the long-context-vs-RAG tradeoff in 2026, and the failure modes that show up at 100K+ documents.
LLM Engineering (7): Function Calling and Tool Use
JSON-mode vs function-mode vs free-form, parallel tool calls, structured-output guarantees with grammars, error recovery patterns, and the agent loops that survive contact with reality.
LLM Engineering (6): Long Context — RoPE, YaRN, Sinks
How RoPE encodes position, why naive extension breaks, NTK-aware and YaRN scaling, ALiBi vs RoPE, attention sinks for streaming, and why 1M-context claims often fail at retrieval.
LLM Engineering (5): Inference Optimization
KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the vLLM vs SGLang vs TensorRT-LLM tradeoffs.
LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF
What SFT, DPO, RLHF, and RLAIF each actually optimize, when reward models fail, KL constraints, the LoRA-vs-full-FT debate, and the production post-training recipes that ship in 2026.
LLM Engineering (3): Pretraining at Scale
Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and the failure modes that only appear above 1000 GPUs.
LLM Engineering (2): Tokenization Deep Dive
BPE vs SentencePiece vs WordPiece, byte-level fallback, the CJK token-bloat problem, vocabulary expansion costs, and the chat-template tokens that silently shape every model's behavior.
LLM Engineering (1): Architectures from Transformer to MoE
MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives Mamba and RWKV — what each costs and where each wins.
Terraform for AI Agents (6): LLM Gateway and Secrets Management
Centralise LLM API access through one gateway: per-agent quotas, request logging, and zero secrets outside KMS. Terraform-provisioned API Gateway plus self-hosted LiteLLM on ECS, with DashScope/OpenAI/Anthropic keys …
Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain
Submit a real multi-GPU training job on PAI-DLC, understand the resource pools (Lingjun vs general vs preemptible), and use AIMaster + EasyCKPT so a flaky node doesn't cost you a day.
Aliyun Bailian (2): The Qwen LLM API in Production
Picking a Qwen variant by latency and cost, function calling done right, JSON mode without tears, and the enable_thinking + streaming requirement that the docs gloss over.
Aliyun Bailian (1): Platform Overview and First Request
A practitioner's tour of Alibaba Cloud Bailian (DashScope) — what's actually in the model catalog, the two endpoint flavors, the async task pattern, and a working sample request to ground the rest of the series.
AI Agents Complete Guide: From Theory to Industrial Practice
A practitioner-grade guide to building AI agents: planning (CoT/ReAct/ToT), memory architectures, tool use, reflection, multi-agent patterns, frameworks (LangChain, LangGraph, AutoGen, CrewAI), evaluation, and production …
Recommendation Systems (12): Large Language Models and Recommendation
How LLMs reshape recommendation: enhancers (P5, M6-Rec), predictors (TALLRec, GenRec), and agents (LlamaRec, Chat-REC). Hybrid pipelines, cold-start wins, prompt design, and the cost/quality Pareto frontier.
NLP (12): Frontiers and Practical Applications
Series finale: agents and tool use (Function Calling, ReAct), code generation (Code Llama, Codex), long-context attention (Longformer, Infini-attention), reasoning models (o1, R1), safety and alignment, evaluation, and …
NLP (11): Multimodal Large Language Models
A deep dive into multimodal LLMs: contrastive vision-language pre-training with CLIP, parameter-efficient bridging with BLIP-2's Q-Former, visual instruction tuning with LLaVA, robust speech recognition with Whisper, …
NLP (10): RAG and Knowledge Enhancement Systems
Build production-grade RAG systems from first principles: the retrieve-then-generate decomposition, vector indexes (FAISS / Milvus / Chroma / Weaviate / Pinecone), dense+sparse hybrid retrieval with RRF, cross-encoder …
NLP (9): Deep Dive into LLM Architecture
Inside modern LLMs: pre-norm + RMSNorm + SwiGLU + RoPE + GQA, KV cache mechanics, FlashAttention's IO-aware schedule, sparse Mixture-of-Experts, and INT8 / INT4 quantization.
NLP (8): Model Fine-tuning and PEFT
A deep dive into Parameter-Efficient Fine-Tuning. Why LoRA's low-rank update works, the math and memory accounting behind QLoRA, how Adapters and Prefix-Tuning differ, and how to choose between them in production.
NLP (7): Prompt Engineering and In-Context Learning
From prompt anatomy to chain-of-thought, self-consistency and ReAct: a working theory of in-context learning, the variance you have to fight, and the patterns that scale to real systems.
Prompt Engineering Complete Guide: From Zero to Advanced Optimization
Master prompt engineering from zero-shot basics to Tree of Thoughts, DSPy, and automated optimization. Includes benchmarks, code, and a debugging toolkit.
LLM Workflows and Application Architecture: Enterprise Implementation Guide
From a single API call to a production LLM platform — workflow patterns, RAG, model routing, deployment, cost levers, observability, and enterprise integration, with the trade-offs that actually matter.
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-Tuning adapts frozen LLMs by learning continuous key/value vectors injected into attention. Covers the method, reparameterization, KV-cache mechanics, and comparisons with prompt tuning, adapters, and LoRA.
MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation
MoSLoRA boosts LoRA expressivity by mixing multiple low-rank subspaces with a lightweight mixer. Covers when vanilla LoRA fails, mixer design choices, and tuning tips.
Position Encoding Brief: From Sinusoidal to RoPE and ALiBi
A practitioner's tour of Transformer position encoding: why attention needs it at all, how sinusoidal/learned/relative/RoPE/ALiBi schemes differ, and which one to pick when long-context extrapolation matters.
LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation
LLMGR uses an LLM as the semantic engine for session-based recommendation and a GNN as the ranker. Covers the hybrid encoding layer, two-stage prompt tuning, ~8.68% HR@20 lift, and how to deploy without running an LLM …
Optimization (4): Learning Rate and Schedules
A practitioner's guide to the single most important hyperparameter: why too large a learning rate makes training diverge, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …
Optimization (3): The Gradient Descent Family from SGD to AdamW
One article that traces the full lineage GD → SGD → Momentum → NAG → AdaGrad → RMSProp → Adam → AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …
Multimodal LLMs and Downstream Tasks: A Practitioner's Guide
End-to-end map of multimodal LLMs: vision-language alignment, cross-modal fusion, the CLIP/BLIP/LLaVA families, downstream tasks (VQA, captioning, grounding, OCR), fine-tuning trade-offs, benchmarks, and what it takes to …