PEFT on Chen Kai Blog

NLP (8): Model Fine-tuning and PEFT

Wed, 05 Nov 2025 09:00:00 +0000

In 2020, fine-tuning a 7-billion-parameter language model was a project budget item: eight A100s, several days, and an engineer who knew how to babysit gradient checkpointing. In 2024, a graduate student does it on a laptop. The distance between those two worlds is almost entirely covered by one paper — Hu et al.’s LoRA (ICLR 2022) — and one follow-up — Dettmers et al.’s QLoRA (NeurIPS 2023).

The shift is not just engineering. Parameter-Efficient Fine-Tuning (PEFT) reframes what it means to “have a model.” Instead of one binary blob per task, you keep a single frozen base model and a directory of small adapter files, each a few tens of megabytes. Switching tasks becomes loading a new adapter; serving N domains becomes O(1) base + N · ε.
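To put rough numbers on "O(1) base + N · ε", here is a back-of-the-envelope sketch. The sizes are assumptions for illustration, not measurements: a 7B model stored in FP16 (about 14 GB) and a hypothetical 40 MB adapter, in line with "a few tens of megabytes".

```python
def storage_gb(n_tasks, base_gb, adapter_mb):
    """Return (full fine-tuning storage, PEFT storage) in GB for n_tasks tasks."""
    full_copies = n_tasks * base_gb                   # one full checkpoint per task
    shared_base = base_gb + n_tasks * adapter_mb / 1000  # one base + N small adapters
    return full_copies, shared_base

# Assumed sizes: 14 GB base (7B params x 2 bytes), 40 MB per adapter.
full_gb, peft_gb = storage_gb(n_tasks=100, base_gb=14, adapter_mb=40)
print(f"full fine-tuning: {full_gb} GB, base + adapters: {peft_gb} GB")
```

With these assumed sizes, the adapters for 100 tasks together take less space than a single extra copy of the base model.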

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Tue, 29 Jul 2025 09:00:00 +0000

Fine-tuning a 1.5B-parameter GPT-2 model for each downstream task means saving a fresh 1.5B-parameter checkpoint every time. Across a dozen tasks, that is a substantial storage and serving headache, and it makes sharing a single base model essentially impossible. Prefix-Tuning (Li & Liang, 2021) takes the opposite stance: freeze every weight of the language model, and learn a tiny block of continuous vectors — the prefix — that is fed into the attention layers as if it were context the model already attended to. The model never changes; only the prefix does, and a different prefix produces a different “personality” on demand.
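The mechanism can be sketched in a few lines of plain Python. This is a toy single-head attention for one query, not the paper's implementation: the function names, the two-dimensional vectors, and the two "task prefixes" are all made up for illustration. The only point is that the frozen keys and values stay untouched, while a learned prefix is prepended to them.

```python
import math

def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    # The prefix is prepended to the keys and values; the model's own
    # keys and values (and all of its weights) are left exactly as they were.
    return attend(query, prefix_keys + keys, prefix_values + values)

# Toy frozen context: two tokens of dimension 2 (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

# Two hypothetical task prefixes, each a single learned key/value pair.
prefix_a = ([[2.0, 2.0]], [[5.0, 5.0]])
prefix_b = ([[2.0, 2.0]], [[-5.0, -5.0]])

plain = attend(query, keys, values)
out_a = attend_with_prefix(query, keys, values, *prefix_a)
out_b = attend_with_prefix(query, keys, values, *prefix_b)
print(plain, out_a, out_b)
```

Same query, same frozen context, two different outputs: swapping the prefix is the only thing that changed.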

Transfer Learning (9): Parameter-Efficient Fine-Tuning

Wed, 18 Jun 2025 09:00:00 +0000

How do you fine-tune a 175B-parameter model on a single GPU? Update only 0.1% of the parameters. Parameter-Efficient Fine-Tuning (PEFT) makes this possible — and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.

What You Will Learn

Why the low-rank assumption holds for weight updates
LoRA: derivation, initialization, scaling, and weight merging
Adapter: bottleneck architecture and where to insert it
Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2
QLoRA: how 4-bit quantisation gets a 65B model on one GPU
Method comparison and a selection guide grounded in GLUE numbers

Prerequisites

Transformer architecture (attention, FFN, residual + LayerNorm)
Matrix decomposition basics (rank, SVD)
Transfer learning fundamentals (Parts 1-6)

The Full Fine-Tuning Problem

Full fine-tuning updates every parameter of the pretrained model to minimise the task loss:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$

For GPT-3 (175B params) this means roughly 700 GB of FP32 weights, plus gradients, plus optimiser states — and one full copy per task. Even after the model fits, the per-task storage and serving cost is brutal: 100 customers means 100 copies of a 700 GB checkpoint.
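The arithmetic behind those figures can be sketched as follows. The 4 bytes per value matches the FP32 figure above; the two optimiser states per parameter are an assumption (an Adam-style optimiser keeping two FP32 moments), and activations are ignored entirely.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough training-state memory in GB (1 GB = 1e9 bytes), ignoring activations."""
    weights = n_params * bytes_per_value / 1e9
    gradients = weights                               # one gradient per weight
    optimizer = weights * optimizer_states_per_param  # assumed: two Adam moments
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }

memory = full_finetune_memory_gb(175_000_000_000)
print(memory)

# Per-task storage if every customer gets a full FP32 checkpoint.
customers = 100
checkpoint_storage_tb = customers * memory["weights"] / 1000
print(f"{checkpoint_storage_tb} TB for {customers} customers")
```

Under these assumptions the training state is about four times the size of the weights alone, before a single activation is stored.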

MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation

Sun, 01 Sep 2024 09:00:00 +0000

LoRA is the default tool for adapting a frozen base model: cheap, stable, mergeable, and good enough for most single-task settings. But the moment your fine-tuning data is genuinely heterogeneous (code mixed with math, instruction following mixed with creative writing, several domains in one adapter), a single low-rank subspace starts to feel cramped. You can grow $$r$$, but the parameter count grows linearly with it and you still get one subspace, just a fatter one.
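To see how the cost scales with $$r$$, recall that LoRA replaces the update to a weight matrix of shape d_out × d_in with a product of two factors of shapes d_out × r and r × d_in. The sketch below only counts parameters; the 4096 × 4096 layer is an assumed size chosen for illustration, not a figure from the post.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters in one LoRA update: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

d_out = d_in = 4096                 # assumed layer size, for illustration only
full_params = d_out * d_in          # parameters touched by full fine-tuning

for r in (8, 16, 64):
    n = lora_params(d_out, d_in, r)
    print(f"r={r:>2}: {n:>7} trainable params ({n / full_params:.2%} of the full matrix)")
```

Doubling the rank doubles the adapter, yet the update still lives in a single rank-$$r$$ subspace.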