<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Parameter Efficiency on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/parameter-efficiency/</link><description>Recent content in Parameter Efficiency on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 18 Jun 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/parameter-efficiency/index.xml" rel="self" type="application/rss+xml"/><item><title>Transfer Learning (9): Parameter-Efficient Fine-Tuning</title><link>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</link><pubDate>Wed, 18 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</guid><description>&lt;p>How do you fine-tune a 175B-parameter model on a single GPU? Update only 0.1% of the parameters. Parameter-Efficient Fine-Tuning (PEFT) makes this possible — and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.&lt;/p>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/transfer-learning/09-parameter-efficient-fine-tuning/illustration_1.png" alt="Transfer Learning (9): Parameter-Efficient Fine-Tuning — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Why the low-rank assumption holds for weight updates&lt;/li>
&lt;li>LoRA: derivation, initialization, scaling, and weight merging&lt;/li>
&lt;li>Adapter: bottleneck architecture and where to insert it&lt;/li>
&lt;li>Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2&lt;/li>
&lt;li>QLoRA: how 4-bit quantisation makes it possible to fine-tune a 65B model on one GPU&lt;/li>
&lt;li>Method comparison and a selection guide grounded in GLUE numbers&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites" class="heading-anchor">Prerequisites&lt;a href="#prerequisites" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Transformer architecture (attention, FFN, residual + LayerNorm)&lt;/li>
&lt;li>Matrix decomposition basics (rank, SVD)&lt;/li>
&lt;li>Transfer learning fundamentals (Parts 1-6)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="the-full-fine-tuning-problem" class="heading-anchor">The Full Fine-Tuning Problem&lt;a href="#the-full-fine-tuning-problem" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
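&lt;/h2>&lt;p>Full fine-tuning treats every parameter $\boldsymbol{\theta}$ of the pre-trained model as trainable and minimises the task loss $\mathcal{L}$ over all of them:&lt;/p>
&lt;span class="math-block">$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$&lt;/span>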
&lt;p>
For GPT-3 (175B params) this means roughly &lt;strong>700 GB of FP32 weights&lt;/strong> (175B parameters × 4 bytes each), plus gradients, plus optimiser states, and one full copy per task. Even if the model fits in memory, the per-task storage and serving cost is brutal: 100 customers means 100 copies of a 700 GB checkpoint, about 70 TB in total.&lt;/p></description></item></channel></rss>