<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Benchmarks on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/benchmarks/</link><description>Recent content in Benchmarks on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 05 Apr 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/benchmarks/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Engineering (10): Evaluation</title><link>https://www.chenk.top/en/llm-engineering/10-evaluation/</link><pubDate>Sun, 05 Apr 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/llm-engineering/10-evaluation/</guid><description>&lt;p>Evaluation is the part of the LLM stack where everyone has opinions but no one is confident. The leaderboards are gamed, the public benchmarks are contaminated, and most teams I&amp;rsquo;ve worked with had no eval set when I joined. This chapter covers what evaluation actually tells you, what the benchmarks hide, the LLM-as-judge biases that go unaddressed, the calibration metrics most teams skip, and the production patterns that catch regressions before customers notice.&lt;/p></description></item></channel></rss>