<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>BERT on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/bert/</link><description>Recent content in BERT on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 21 Oct 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/bert/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (5): BERT and Pretrained Models</title><link>https://www.chenk.top/en/nlp/bert-pretrained-models/</link><pubDate>Tue, 21 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/bert-pretrained-models/</guid><description>&lt;p>In October 2018, Google released BERT and set new state-of-the-art results on eleven NLP tasks at once. The recipe is almost embarrassingly simple: take a Transformer encoder, train it to predict randomly hidden words from both their left and right context, and then fine-tune the same pretrained model for whatever downstream task you have. Before BERT, most tasks came with their own model trained largely from scratch. After BERT, &amp;ldquo;pretrain once, fine-tune everywhere&amp;rdquo; became the default mental model for the entire field.&lt;/p></description></item><item><title>Transfer Learning (2): Pre-training and Fine-tuning</title><link>https://www.chenk.top/en/transfer-learning/02-pre-training-and-fine-tuning/</link><pubDate>Wed, 07 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/02-pre-training-and-fine-tuning/</guid><description>&lt;p>BERT changed NLP overnight. A model pre-trained on Wikipedia and BookCorpus could be fine-tuned on a few thousand labelled examples and beat task-specific architectures that researchers had spent years hand-crafting. The same pattern repeated in vision (ImageNet pre-training, then SimCLR, MAE), in speech (wav2vec 2.0), and in code (Codex).
Today, &amp;ldquo;pre-train once, fine-tune everywhere&amp;rdquo; is the default recipe of modern deep learning.&lt;/p>
&lt;p>But &lt;em>why&lt;/em> does pre-training work? When should you freeze layers, when should you use LoRA, and how small does your learning rate need to be? This article unpacks both the theory and the engineering practice behind the most successful transfer paradigm we have.&lt;/p></description></item></channel></rss>