<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PEFT on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/peft/</link><description>Recent content in PEFT on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 05 Nov 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/peft/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (8): Model Fine-tuning and PEFT</title><link>https://www.chenk.top/en/nlp/fine-tuning-peft/</link><pubDate>Wed, 05 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/fine-tuning-peft/</guid><description>&lt;p>In 2020, fine-tuning a 7-billion-parameter language model was a project budget item: eight A100s, several days, and an engineer who knew how to babysit gradient checkpointing. In 2024, a graduate student does it on a laptop. The distance between those two worlds is almost entirely covered by one paper, Hu et al.&amp;rsquo;s LoRA (ICLR 2022), and one follow-up, Dettmers et al.&amp;rsquo;s QLoRA (NeurIPS 2023).&lt;/p>
&lt;p>The shift is not just engineering. Parameter-Efficient Fine-Tuning (PEFT) reframes what it means to &amp;ldquo;have a model.&amp;rdquo; Instead of one binary blob per task, you keep a single frozen base model and a directory of small adapter files, each a few tens of megabytes. Switching tasks becomes loading a new adapter; serving N domains costs one shared base model plus N small adapters (O(1) base + N · ε).&lt;/p></description></item><item><title>Prefix-Tuning: Optimizing Continuous Prompts for Generation</title><link>https://www.chenk.top/en/standalone/prefix-tuning/</link><pubDate>Tue, 29 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/prefix-tuning/</guid><description>&lt;p>Fine-tuning a 1.5B-parameter GPT-2 model for each downstream task means saving a fresh 1.5B-parameter checkpoint every time. Across a dozen tasks, that is a substantial storage and serving headache, and it makes sharing a single base model essentially impossible. &lt;em>Prefix-Tuning&lt;/em> (Li &amp;amp; Liang, 2021) takes the opposite stance: freeze every weight of the language model, and learn a tiny block of continuous vectors (the &lt;em>prefix&lt;/em>) that is fed into the attention layers as if it were context the model already attended to. The model never changes; only the prefix does, and a different prefix produces a different &amp;ldquo;personality&amp;rdquo; on demand.&lt;/p></description></item><item><title>Transfer Learning (9): Parameter-Efficient Fine-Tuning</title><link>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</link><pubDate>Wed, 18 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</guid><description>&lt;p>How do you fine-tune a 175B-parameter model without updating all 175B parameters? Train only about 0.1% of them. Parameter-Efficient Fine-Tuning (PEFT) makes this possible, and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.&lt;/p>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/transfer-learning/09-parameter-efficient-fine-tuning/illustration_1.png" alt="Transfer Learning (9): Parameter-Efficient Fine-Tuning — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Why the low-rank assumption holds for weight updates&lt;/li>
&lt;li>LoRA: derivation, initialization, scaling, and weight merging&lt;/li>
&lt;li>Adapter: bottleneck architecture and where to insert it&lt;/li>
&lt;li>Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2&lt;/li>
&lt;li>QLoRA: how 4-bit quantisation gets a 65B model on one GPU&lt;/li>
&lt;li>Method comparison and a selection guide grounded in GLUE numbers&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites" class="heading-anchor">Prerequisites&lt;a href="#prerequisites" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Transformer architecture (attention, FFN, residual + LayerNorm)&lt;/li>
&lt;li>Matrix decomposition basics (rank, SVD)&lt;/li>
&lt;li>Transfer learning fundamentals (Parts 1-6)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="the-full-fine-tuning-problem" class="heading-anchor">The Full Fine-Tuning Problem&lt;a href="#the-full-fine-tuning-problem" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
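&lt;/h2>&lt;p>Full fine-tuning updates every parameter of the pretrained model to minimise the task loss:&lt;/p>
&lt;span class="math-block">$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$&lt;/span>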
&lt;p>
For GPT-3 (175B params) this means roughly &lt;strong>700 GB of FP32 weights&lt;/strong> (175B parameters × 4 bytes each), plus gradients, plus optimiser states, and one full copy per task. Even if the model fits in memory, the per-task storage and serving cost is brutal: 100 customers means 100 copies of a 700 GB checkpoint.&lt;/p></description></item><item><title>MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation</title><link>https://www.chenk.top/en/standalone/moslora/</link><pubDate>Sun, 01 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/moslora/</guid><description>&lt;p>LoRA is the default tool for adapting a frozen base model: cheap, stable, mergeable, and good enough for most single-task settings. But the moment your fine-tuning data is genuinely heterogeneous (code mixed with math, instruction following mixed with creative writing, several domains in one adapter), a single low-rank subspace starts to feel cramped. You can grow &lt;span class="math-inline">$r$&lt;/span>, but cost grows with it and you still get &lt;em>one&lt;/em> subspace, just a fatter one.&lt;/p></description></item></channel></rss>