LLM Engineering (10): Evaluation
Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B testing patterns that actually catch regressions in production.