Inference

Mar 31, 2026 LLM Engineering 42 min read

LLM Engineering (5): Inference Optimization

KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the vLLM vs SGLang vs TensorRT-LLM tradeoffs.

Mar 8, 2026 Aliyun PAI 26 min read

Aliyun PAI (4): PAI-EAS — Model Serving, Cold Starts, and the TPS Lie

End-to-end PAI-EAS for production: image-based deploy from OSS-mounted weights, the three inference modes, an autoscaler that doesn't blow your budget, and canary releases via service groups. Includes a working vLLM …