LLM Engineering (11): Safety and Alignment

Mon, 06 Apr 2026 09:00:00 +0000

Safety has the worst signal-to-noise ratio of any topic in LLM engineering. There’s a lot of philosophy, a lot of marketing, and not a lot of engineering specifics. This chapter supplies those specifics: what RLHF actually optimizes when it talks about “safety,” how refusal calibration breaks, what red-teaming looks like in practice, which hallucination metrics actually predict customer impact, and the small but significant set of 2024-2026 papers (Sleeper Agents, refusal as a feature direction, weak-to-strong generalization) that should change how you think about alignment in production.

Constitutional-Ai on Chen Kai Blog
