LLM Engineering (10): Evaluation

Sun, 05 Apr 2026 09:00:00 +0000

Evaluation is the part of the LLM stack where everyone has opinions but no one is confident. The leaderboards are gamed, the public benchmarks are contaminated, and most teams I’ve worked with had no eval set when I joined. This chapter covers what evaluation actually tells you, what the benchmarks hide, the LLM-as-judge biases that go unaddressed, the calibration metrics most teams skip, and the production patterns that catch regressions before customers notice.

Benchmarks on Chen Kai Blog
