Transfer Learning (3): Domain Adaptation

Your autonomous-driving stack works perfectly on sunny California freeways. Then it rains in Seattle. Top-1 accuracy drops from 95% to 70%. The model did not get worse — the data distribution shifted, and your training set never told it what wet asphalt looks like at dusk.

This is the everyday problem of domain adaptation: you have abundant labelled data in one distribution (the source) and unlabelled data in another (the target), and you need the model to perform on the target. This article shows you how, from first-principles theory to a working DANN implementation.

What You Will Learn#

Three flavours of distribution shift — covariate, label, concept — and how each is fixed
The Ben-David bound: why adaptation is possible, and the precise quantity it lets you reduce
DANN: adversarial alignment with the gradient reversal layer, in one backward pass
MMD and CORAL: explicit, non-adversarial distribution-matching losses
Self-training, AdaBN, CycleGAN, ADDA — the rest of the modern toolbox
A complete DANN implementation in PyTorch
A decision tree for picking a method, plus benchmark numbers on Office-31 and DomainNet

Prerequisites: Parts 1–2 of this series, basic familiarity with GAN-style adversarial training.

Three Faces of Distribution Shift#

A domain is a feature space $\mathcal{X}$ with a marginal distribution $P(X)$. A task is a label space $\mathcal{Y}$ with a conditional distribution $P(Y \mid X)$. Domain adaptation studies what happens when the source and target disagree on at least one of these.

Setting	Source	Target	Goal
Source domain $\mathcal{D}_S$	many labelled $$(x_i, y_i)$$	—	—
Target domain $\mathcal{D}_T$	—	mostly unlabelled $$x_j$$	learn $f: \mathcal{X} \to \mathcal{Y}$ that works on $\mathcal{D}_T$

The figure is the entire game in one picture: before adaptation, the source-trained boundary slices through empty target space; after adaptation, both domains share a feature manifold and the same boundary works.

Covariate shift — the input distribution moved#

$$P_S(X) \neq P_T(X), \qquad P_S(Y \mid X) = P_T(Y \mid X)$$

The labelling rule is unchanged; only what you observe is different. Examples:

A spam filter trained on 2020 email and deployed in 2026: topics drift, but spam is still spam.
CT scans from a Siemens scanner used to evaluate scans from a GE machine: the imaging characteristics differ, but radiologists score them the same way.

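Standard fix. Importance weighting: because $P(Y \mid X)$ is shared, the target risk equals the source risk reweighted by the density ratio:

$$\mathbb{E}_{P_T}[\ell(f(X), Y)] = \mathbb{E}_{P_S}\!\left[\frac{P_T(X)}{P_S(X)}\,\ell(f(X), Y)\right].$$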

Estimating densities in high dimensions is hopeless, so practitioners estimate the ratio directly with KLIEP, uLSIF, or a probabilistic classifier (the odds of a Bayes-optimal source-vs-target classifier are proportional to the ratio, so you get it for free).
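As a minimal sketch of the classifier route, the NumPy snippet below fits a logistic regression to separate source from target on a 1-D toy problem and turns its odds into importance weights. The Gaussian data, the `fit_logistic` helper and its step-size settings are illustrative assumptions, not part of any library.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Gradient descent on the mean logistic loss (X already has a bias column)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 5000)   # source inputs ~ N(0, 1)
xt = rng.normal(1.0, 1.0, 5000)   # target inputs ~ N(1, 1)

# Domain classifier: label 1 = target, 0 = source.
X = np.column_stack([np.concatenate([xs, xt]), np.ones(len(xs) + len(xt))])
y = np.concatenate([np.zeros(len(xs)), np.ones(len(xt))])
w = fit_logistic(X, y)

# Odds of "target" on the source points, corrected for the sample sizes,
# estimate the density ratio P_T(x) / P_S(x).
p_tgt = 1.0 / (1.0 + np.exp(-X[: len(xs)] @ w))
weights = p_tgt / (1.0 - p_tgt) * (len(xs) / len(xt))

# The importance-weighted source average of g(x) = x approximates E_T[X] = 1.
est_target_mean = np.average(xs, weights=weights)
print(f"plain source mean  : {xs.mean():.3f}")
print(f"reweighted estimate: {est_target_mean:.3f}")
```

In a real pipeline the same weights would multiply the per-sample training loss. They are usually clipped or self-normalised, because a few very large weights can dominate the estimate.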

Label shift — the prevalence moved#

$$P_S(Y) \neq P_T(Y), \qquad P_S(X \mid Y) = P_T(X \mid Y)$$

Class-conditional appearance is unchanged; only base rates differ. Examples:

An ICU model deployed in outpatient clinics where disease prevalence is much lower.
A recommender trained on a young-skewing pilot, deployed across all age cohorts.

Standard fix. Estimate the target prior $P_T(Y)$ from unlabelled target data, either by EM on the source model's predicted posteriors or with confusion-matrix estimators such as BBSE / RLLS, then rescale each source-trained probability by $P_T(y) / P_S(y)$ and renormalise.
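The rescale-and-renormalise step, plus a simple EM loop for the target prior, can be sketched in a few lines of NumPy. The function names are hypothetical, and the EM update assumes the source model's posteriors are reasonably calibrated.

```python
import numpy as np

def adjust_posteriors(probs, p_src, p_tgt):
    """Rescale source-trained posteriors by P_T(y) / P_S(y) and renormalise."""
    scaled = probs * (p_tgt / p_src)
    return scaled / scaled.sum(axis=1, keepdims=True)

def em_target_prior(probs, p_src, n_iter=200):
    """EM estimate of the target prior from source-model posteriors on target data."""
    p_tgt = p_src.copy()
    for _ in range(n_iter):
        # E-step: posteriors under the current prior; M-step: average them.
        p_tgt = adjust_posteriors(probs, p_src, p_tgt).mean(axis=0)
    return p_tgt

p_src = np.array([0.5, 0.5])
probs = np.array([[0.6, 0.4]])            # one source-model prediction
print(adjust_posteriors(probs, p_src, np.array([0.1, 0.9])))
```

With a target prior of $[0.1, 0.9]$, the prediction that leaned 60/40 towards class 0 flips to roughly 14/86 in favour of class 1, which is the whole effect of prior correction.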

Concept shift — the rule itself moved#

$$P_S(Y \mid X) \neq P_T(Y \mid X)$$

This is the hard case. “Sick” is positive in a music review and negative in a product review even though the word is identical. With no target labels at all, no method can untangle this — concept shift demands at least a few labelled target examples (the semi-supervised DA setting).

Theory: the Ben-David Bound#

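Ben-David et al. bound the target error $\epsilon_T(h)$ of any hypothesis $h \in \mathcal{H}$ by three terms:

$$\epsilon_T(h) \;\leq\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda^{*}.$$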

Term	Meaning	What you can do about it
$\epsilon_S(h)$	source-domain error	train better on the source
$d_{\mathcal{H}\Delta\mathcal{H}}$	symmetric-difference divergence between domains	this is what domain adaptation reduces
$\lambda^{*}$	error of the best joint predictor	irreducible — if it is large, no method will save you

Two takeaways:

Adaptation is bounded by an oracle. If source and target tasks are fundamentally different ( $\lambda^*$ large), you are out of luck — you need new labels, not a fancier loss.
Domain divergence has a tractable proxy. Train a binary classifier to distinguish source from target features. If it gets near 50% accuracy, your features are domain-invariant. This is exactly the mechanism DANN automates.
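One common way to turn the domain classifier's held-out error into a single number is the proxy A-distance, $\hat d_{\mathcal{A}} = 2(1 - 2\,\text{err})$. The helper below is a sketch; its name is made up for illustration.

```python
def proxy_a_distance(domain_error):
    """Proxy A-distance from the held-out error of a source-vs-target classifier.

    0 means the classifier is at chance (features look domain-invariant);
    2 means the two domains are perfectly separable.
    """
    if not 0.0 <= domain_error <= 1.0:
        raise ValueError("domain_error must be in [0, 1]")
    return 2.0 * (1.0 - 2.0 * domain_error)

print(proxy_a_distance(0.50))  # chance-level discriminator
print(proxy_a_distance(0.05))  # domains almost perfectly separable
```

Tracking this number before and after adaptation gives a cheap check that the divergence term in the bound is actually shrinking.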

DANN — Adversarial Alignment in One Backward Pass#

Figure: DANN training animation, with target features migrating toward source class clusters.

Figure: before vs after DANN. Domains merge while classes stay separable.

Domain-Adversarial Neural Network (Ganin et al., 2016) is the most influential adversarial method, and the cleanest implementation of “minimise the domain divergence proxy”.

Figure: DANN architecture with the Gradient Reversal Layer.

Three subnetworks, one shared trunk#

Subnet	Role	Trained on
Feature extractor $$G_f$$	maps $$x$$ to $$f = G_f(x)$$	both domains
Label predictor $$G_y$$	classifies $f \to \hat{y}$	source labels
Domain discriminator $$G_d$$	classifies $f \to$ source/target	both domains

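$G_f$ and $G_y$ minimise the label loss $\mathcal{L}_y$, while $G_d$ minimises the domain loss $\mathcal{L}_d$ that $G_f$ tries to maximise, with $\lambda$ weighting the two:

$$\min_{G_f,\, G_y}\; \max_{G_d}\quad \mathcal{L}_y(G_y \circ G_f) \;-\; \lambda\, \mathcal{L}_d(G_d \circ G_f).$$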

$$G_d$$ wants to tell the domains apart; $$G_f$$ wants to fool $$G_d$$ while still letting $$G_y$$ classify the source correctly.

The Gradient Reversal Layer (GRL)#

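The GRL is the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass:

$$\text{forward: }\; \text{GRL}(x) = x, \qquad \text{backward: }\; \frac{\partial\,\text{GRL}}{\partial x} = -\lambda\, I.$$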

GRL sits on the path from features to the domain head. During backprop, the discriminator’s gradient flips sign before reaching $G_f$, so a single `loss.backward()` call:

updates $$G_y$$ to classify better (normal gradients),
updates $$G_d$$ to discriminate better (normal gradients),
updates $$G_f$$ to confuse $$G_d$$ (reversed gradients on the domain term) while still helping $$G_y$$ .

No alternating training, no separate optimisers, no manual freezing.

The adversarial weight schedule#

Rather than fixing $\lambda$, ramp it from 0 towards 1 over training:

$$\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1, \qquad \gamma \approx 10,$$

where $p \in [0, 1]$ is training progress. Early on ( $\lambda \approx 0$ ), the network just learns good source features. As training proceeds ( $\lambda \to 1$ ), domain alignment kicks in. Skipping this schedule is the single most common cause of “DANN trains but does worse than source-only”.
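To get a feel for how fast the ramp rises, here is a small sketch that evaluates it at a few progress values. The function name `dann_lambda` is an assumption for illustration. Algebraically the schedule equals $\tanh(\gamma p / 2)$.

```python
import math

def dann_lambda(p, gamma=10.0):
    """Adversarial weight at training progress p in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

for p in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(f"p={p:.2f}  lambda={dann_lambda(p):.4f}")
```

With $\gamma = 10$ the weight is already above 0.98 at the halfway point, so the "source features first" phase is confined to roughly the first fifth of training. A smaller $\gamma$ stretches it out.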

MMD — Matching Means in an RKHS#

Adversarial alignment is powerful but unstable. The non-adversarial alternative is to define an explicit distance between distributions and minimise it directly. Maximum Mean Discrepancy (Gretton et al., 2012) is the standard choice.

Figure: Maximum Mean Discrepancy as the distance between kernel mean embeddings.

The idea#

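Embed each distribution $P$ as a single point in the RKHS $\mathcal{H}$ of a kernel $k$ with feature map $\phi$, the kernel mean embedding:

$$\mu_P = \mathbb{E}_{X \sim P}[\phi(X)] \;\in\; \mathcal{H}.$$

MMD is the RKHS distance between the source and target embeddings:

$$\text{MMD}^2(P_S, P_T) = \|\mu_{P_S} - \mu_{P_T}\|_{\mathcal{H}}^2.$$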

The figure shows this graphically: even when raw histograms overlap a little, the kernel mean embeddings make the gap explicit, and the shaded area is exactly $\text{MMD}^2$ .

The estimator you actually compute#

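Expanding the squared norm with the kernel trick $k(x, x') = \langle \phi(x), \phi(x') \rangle$ gives an estimator over $n_s$ source and $n_t$ target samples that needs only pairwise kernel evaluations:

$$\widehat{\text{MMD}}^2 = \frac{1}{n_s^2}\sum_{i,j} k(x_i^s, x_j^s) + \frac{1}{n_t^2}\sum_{i,j} k(x_i^t, x_j^t) - \frac{2}{n_s n_t}\sum_{i,j} k(x_i^s, x_j^t).$$

Added to the task loss with weight $\lambda$, it becomes a feature-alignment regulariser:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \widehat{\text{MMD}}^2\!\big(G_f(X_S),\, G_f(X_T)\big).$$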

This is DAN / DDC (Long et al., 2015; Tzeng et al., 2014).

Practical tips#

Use multi-kernel MMD. A mixture $k = \sum_u \beta_u k_{\sigma_u}$ of Gaussian RBFs at several bandwidths is robust to bandwidth misspecification.
Median heuristic for $\sigma$ . Set the bandwidth to the median pairwise distance in the batch — cheap, robust, almost always good enough.
Apply MMD to deeper layers. Lower layers carry domain-specific texture; the abstraction at the top is what you want aligned.
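The first two tips combine naturally: pick the median pairwise distance as a base bandwidth, then average kernels at a few multiples of it. The NumPy sketch below does that on random features. The scale set `(0.5, 1.0, 2.0)` and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

def median_bandwidth(x, y):
    """Median heuristic: median pairwise distance over the pooled batch."""
    z = np.concatenate([x, y])
    d2 = pairwise_sq_dists(z, z)
    iu = np.triu_indices(len(z), k=1)          # distinct pairs only
    return np.sqrt(np.median(d2[iu]))

def mk_mmd2(x, y, scales=(0.5, 1.0, 2.0)):
    """Biased multi-kernel MMD^2 with RBF bandwidths scale * median distance."""
    sigma = median_bandwidth(x, y)

    def gram(a, b):
        d2 = pairwise_sq_dists(a, b)
        return sum(np.exp(-d2 / (2 * (s * sigma) ** 2)) for s in scales)

    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, (200, 8))
same = rng.normal(0.0, 1.0, (200, 8))       # same distribution as src
shifted = rng.normal(1.0, 1.0, (200, 8))    # mean-shifted target
print("MMD^2 (same distribution):", mk_mmd2(src, same))
print("MMD^2 (shifted target)   :", mk_mmd2(src, shifted))
```

The biased estimator never reaches exactly zero on finite samples, so compare it against a same-distribution baseline rather than against zero.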

MMD vs DANN at a glance#

	MMD	DANN
Distance	Kernel-based RKHS norm	Jensen–Shannon (via discriminator)
Optimisation	Direct minimisation	Adversarial minimax (GRL)
Stability	Very stable	Sometimes oscillates
Expressiveness	Tied to kernel choice	More flexible
Best when	Small/medium gap, less data	Large gap, abundant data

A reasonable default workflow: try MMD first; switch to DANN if MMD plateaus.

CORAL — Aligning Second-Order Statistics#

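If matching means is good, matching covariances as well is often better. CORAL (Sun & Saenko, 2016) aligns second-order statistics by penalising the gap between the source and target feature covariance matrices $C_S$ and $C_T$, where $d$ is the feature dimension:

$$\mathcal{L}_{\text{CORAL}} = \frac{1}{4 d^2} \|C_S - C_T\|_F^2.$$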

Intuition — whitening + recolouring. Multiplying the source features by $C_S^{-1/2} C_T^{1/2}$ first removes the source’s covariance fingerprint, then paints on the target’s. Deep CORAL just adds the loss above to a deep network and lets the gradients do the same job implicitly.

CORAL is dirt cheap (one matrix and one Frobenius norm per batch), entirely deterministic, and surprisingly competitive on mild shifts. It is a great baseline before reaching for MMD or DANN.
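The whitening + recolouring view can be checked directly in NumPy. This is a sketch of the closed-form (non-deep) transform under two assumptions of mine: the source features are mean-centred first, and a small ridge `eps` keeps the covariances invertible.

```python
import numpy as np

def sym_matrix_power(c, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(c)
    return (vecs * vals ** power) @ vecs.T

def coral_transform(xs, xt, eps=1e-6):
    """Whiten source features, then recolour them with the target covariance."""
    d = xs.shape[1]
    cs = np.cov(xs, rowvar=False) + eps * np.eye(d)
    ct = np.cov(xt, rowvar=False) + eps * np.eye(d)
    a = sym_matrix_power(cs, -0.5) @ sym_matrix_power(ct, 0.5)
    return (xs - xs.mean(0)) @ a

rng = np.random.default_rng(0)
xs = rng.normal(size=(500, 3)) @ np.diag([1.0, 2.0, 0.5])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 2.0]])
xt = rng.normal(size=(500, 3)) @ mix

aligned = coral_transform(xs, xt)
c_t = np.cov(xt, rowvar=False)
gap_before = np.linalg.norm(np.cov(xs, rowvar=False) - c_t)
gap_after = np.linalg.norm(np.cov(aligned, rowvar=False) - c_t)
print(f"Frobenius gap before: {gap_before:.4f}, after: {gap_after:.6f}")
```

After the transform the source covariance matches the target's up to the ridge term, which is what the Deep CORAL loss pushes towards through gradients instead of a closed-form solve.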

AdaBN — The Free Lunch You Should Always Try First#

The simplest domain adaptation trick of all: recompute batch-norm statistics on the target.

Standard BN at test time uses the running mean and variance accumulated during source training. If the target has a different distribution, those statistics are wrong, and they sit between every conv layer and the next non-linearity. AdaBN (Li et al., 2017):

Train normally on source.
With weights frozen, run forward passes over unlabelled target data and recompute $\mu_T, \sigma_T^2$ for every BN layer.
At deployment, swap source statistics for target ones.

Cost: minutes. Code change: replacing a few BatchNorm running stats. Effect: routinely reclaims 2–10 points of accuracy under covariate shift. Always try this first before any fancier method.
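The three steps above can be sketched in PyTorch as follows. The helper name `adapt_batchnorm` is hypothetical. It relies on `reset_running_stats()` and on `momentum=None`, which makes BatchNorm keep a cumulative average over batches, and it assumes the loader yields either tensors or tuples whose first element is the input.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_batchnorm(model, target_loader, device="cpu"):
    """Re-estimate BatchNorm running statistics on unlabelled target batches."""
    model.eval()                                   # keep dropout etc. disabled
    bn_layers = [m for m in model.modules()
                 if isinstance(m, nn.modules.batchnorm._BatchNorm)]
    saved_momentum = [bn.momentum for bn in bn_layers]
    for bn in bn_layers:
        bn.reset_running_stats()                   # forget source statistics
        bn.momentum = None                         # cumulative average over batches
        bn.train()                                 # update stats during forward
    for batch in target_loader:
        x = batch[0] if isinstance(batch, (list, tuple)) else batch
        model(x.to(device))                        # weights are never updated
    for bn, momentum in zip(bn_layers, saved_momentum):
        bn.momentum = momentum
    model.eval()
    return model

# Toy usage: a linear layer followed by BatchNorm, adapted to shifted inputs.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
target_x = torch.randn(256, 4) + 2.0
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(target_x), batch_size=64)
adapt_batchnorm(model, loader)
print(model[1].running_mean)
```

Only the BatchNorm layers are switched to training mode, so dropout stays off and no gradient is ever computed. The learned scale and shift parameters are left untouched.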

GAN-Based and Pixel-Level Adaptation#

Sometimes the gap is so visual — synthetic to real, day to night — that aligning features is too late. You want to translate the inputs themselves.

CycleGAN learns two generators $G: \mathcal{X}_S \to \mathcal{X}_T$ and $F: \mathcal{X}_T \to \mathcal{X}_S$ subject to cycle consistency $F(G(x)) \approx x$ . Translate source images into target style, then train your classifier on the translated images with the original source labels. Beware: cycle consistency does not guarantee semantic preservation; combine with a perceptual or identity loss for safety.
ADDA decouples the source and target encoders. Stage 1: train a source encoder + classifier normally. Stage 2: initialise a target encoder from the source, then adapt it adversarially against a domain discriminator while keeping the classifier frozen. Stage 3: at test time, route target inputs through the target encoder and the source classifier. This asymmetry gives ADDA more capacity than DANN at the cost of an extra training stage.

Self-Training — Bootstrapping Labels on the Target#

Adversarial and statistical alignment treat the target as one undifferentiated cloud. Self-training (also called pseudo-labelling) goes further: it uses your current model to produce target labels and then trains on them.

The loop is:

Train $$f$$ on the source.
Predict on every target sample; keep only those where $\max_y f(x)_y > \tau$ (a high confidence threshold).
Treat the kept (input, prediction) pairs as new labelled data and retrain.
Iterate.

Self-training is powerful and underestimated, but it has one infamous failure mode: confirmation bias. Wrong but confident predictions get re-fed into training and amplified. The standard mitigations are:

a high threshold $\tau$ (typically 0.9+),
class-balanced selection (cap the number kept per class),
consistency regularisation under augmentations (FixMatch-style),
restarting from the source model at each round rather than from the previous self-trained one.
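The first two mitigations, a confidence threshold and a per-class cap, fit in one selection function. The NumPy sketch below uses a hypothetical name and signature and assumes `probs` holds softmax outputs on target samples.

```python
import numpy as np

def select_pseudo_labels(probs, tau=0.9, max_per_class=None):
    """Return (indices, labels) of target samples kept as pseudo-labelled data.

    Keeps samples whose top probability exceeds tau and, if max_per_class is
    set, only the most confident max_per_class samples of each predicted class.
    """
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    keep = np.flatnonzero(conf > tau)
    if max_per_class is not None and len(keep) > 0:
        chosen = []
        for c in np.unique(pred[keep]):
            idx = keep[pred[keep] == c]
            idx = idx[np.argsort(-conf[idx])][:max_per_class]
            chosen.append(idx)
        keep = np.sort(np.concatenate(chosen))
    return keep, pred[keep]

probs = np.array([[0.95, 0.05],
                  [0.60, 0.40],
                  [0.92, 0.08],
                  [0.03, 0.97],
                  [0.99, 0.01]])
idx, labels = select_pseudo_labels(probs, tau=0.9, max_per_class=2)
print(idx, labels)
```

Without the cap, a class the model already over-predicts would dominate each round. Capping keeps the pseudo-labelled set closer to balanced at the cost of discarding some confident samples.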

Decision Tree — Which Method, When?#

In practice a strong pipeline often combines methods: AdaBN for the easy gains, MMD or DANN for feature alignment, then a self-training round for the last few points.

Benchmarks — How Much Does This Actually Help?#

The numbers are representative literature averages with a ResNet-50 backbone. Two things worth noticing:

The biggest jump is from “nothing” to “anything”. Even AdaBN closes a meaningful chunk of the gap. Doing something matters far more than choosing the perfect method.
DomainNet is genuinely harder than Office-31. A 40% accuracy on DomainNet still represents a strong method — the dataset has 345 classes across 6 wildly different visual styles. Always interpret DA accuracies relative to a source-only baseline, not in absolute terms.

Where Domain Adaptation Earns Its Keep#

Medical imaging — Siemens vs GE scanners, 1.5T vs 3T MRI, hospital A vs hospital B.
Autonomous driving — sunny to rainy, city A to city B, simulation to real.
Recommendation — country to country, year to year, web to mobile.
NLP — movie reviews to product reviews, news to social, formal to informal.
Sim-to-real — synthetic data to real sensor data in robotics and self-driving.

The common pattern: source labels are abundant, target labels are expensive or impossible, and the model has to ship anyway.

Visualising the Effect — t-SNE Before and After#

A standard sanity check after training a DA model: project source and target features through t-SNE. Before adaptation, samples cluster by domain; after, they cluster by class.

If your “after” plot still shows two domain blobs, alignment failed. If it shows one blob with class structure, alignment worked. This single picture is more diagnostic than any single number.

Complete Implementation: DANN#

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torch.autograd import Function
import numpy as np
from sklearn.metrics import accuracy_score

class GradientReversalFunction(Function):
    """Identity in the forward pass, negates the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.lambda_, None

class GradientReversalLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.lambda_ = 1.0

    def set_lambda(self, val):
        self.lambda_ = val

    def forward(self, x):
        return GradientReversalFunction.apply(x, self.lambda_)

class FeatureExtractor(nn.Module):
    def __init__(self, input_dim=28 * 28, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

class LabelPredictor(nn.Module):
    def __init__(self, feature_dim=256, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

class DomainDiscriminator(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

class DANN(nn.Module):
    """Domain-Adversarial Neural Network."""

    def __init__(self, input_dim=28 * 28, hidden_dim=256, num_classes=10):
        super().__init__()
        self.feature_extractor = FeatureExtractor(input_dim, hidden_dim)
        self.label_predictor = LabelPredictor(hidden_dim, num_classes)
        self.domain_discriminator = DomainDiscriminator(hidden_dim)
        self.grl = GradientReversalLayer()

    def forward(self, x, alpha=1.0):
        features = self.feature_extractor(x)
        class_logits = self.label_predictor(features)
        self.grl.set_lambda(alpha)
        domain_logits = self.domain_discriminator(self.grl(features))
        return class_logits, domain_logits

class DANNTrainer:
    def __init__(self, model, source_loader, target_loader, test_loader,
                 num_epochs=100, lr=1e-3, device="cpu", gamma=10.0):
        self.model = model.to(device)
        self.source_loader = source_loader
        self.target_loader = target_loader
        self.test_loader = test_loader
        self.num_epochs = num_epochs
        self.device = device
        self.gamma = gamma
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        self.class_criterion = nn.CrossEntropyLoss()
        self.domain_criterion = nn.BCELoss()

    def _adaptive_lambda(self, epoch):
        # Sigmoid ramp from 0 -> 1 across training.
        p = epoch / self.num_epochs
        return 2.0 / (1.0 + np.exp(-self.gamma * p)) - 1.0

    def train_epoch(self, epoch):
        self.model.train()
        source_iter = iter(self.source_loader)
        target_iter = iter(self.target_loader)
        n_batches = min(len(self.source_loader), len(self.target_loader))
        total_loss = 0.0
        lambda_p = self._adaptive_lambda(epoch)

        for _ in range(n_batches):
            try:
                src_x, src_y = next(source_iter)
            except StopIteration:
                source_iter = iter(self.source_loader)
                src_x, src_y = next(source_iter)
            try:
                tgt_x, _ = next(target_iter)
            except StopIteration:
                target_iter = iter(self.target_loader)
                tgt_x, _ = next(target_iter)

            src_x = src_x.to(self.device)
            src_y = src_y.to(self.device)
            tgt_x = tgt_x.to(self.device)

            # Forward — both heads, both domains.
            src_class_logits, src_dom_logits = self.model(src_x, lambda_p)
            _, tgt_dom_logits = self.model(tgt_x, lambda_p)

            # Source classification loss.
            class_loss = self.class_criterion(src_class_logits, src_y)
            # Domain discrimination loss (source = 1, target = 0).
            d_loss_s = self.domain_criterion(
                src_dom_logits, torch.ones_like(src_dom_logits))
            d_loss_t = self.domain_criterion(
                tgt_dom_logits, torch.zeros_like(tgt_dom_logits))
            domain_loss = d_loss_s + d_loss_t

            loss = class_loss + domain_loss
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            total_loss += loss.item()

        return total_loss / n_batches

    @torch.no_grad()
    def evaluate(self):
        self.model.eval()
        preds, labels = [], []
        for x, y in self.test_loader:
            x = x.to(self.device)
            logits, _ = self.model(x, alpha=0.0)
            preds.extend(logits.argmax(dim=1).cpu().numpy())
            labels.extend(y.numpy())
        return accuracy_score(labels, preds)

    def train(self):
        best = 0.0
        for epoch in range(self.num_epochs):
            loss = self.train_epoch(epoch)
            acc = self.evaluate()
            if (epoch + 1) % 10 == 0:
                lam = self._adaptive_lambda(epoch)
                print(f"epoch {epoch + 1:3d}  loss={loss:.4f}  "
                      f"target_acc={acc:.4f}  lambda={lam:.3f}")
            best = max(best, acc)
        print(f"best target accuracy: {best:.4f}")

def main():
    N, D, C = 10000, 28 * 28, 10
    # Simulated source and target with distribution shift. Labels are random,
    # so this only smoke-tests the training loop (expect chance-level accuracy).
    src_x = torch.randn(N, 1, 28, 28)
    src_y = torch.randint(0, C, (N,))
    tgt_x = torch.randn(N, 1, 28, 28) + 0.5     # shifted
    tgt_y = torch.randint(0, C, (N,))           # not used in training
    test_x = torch.randn(2000, 1, 28, 28) + 0.5
    test_y = torch.randint(0, C, (2000,))

    BS = 128
    src_loader = DataLoader(TensorDataset(src_x, src_y), BS, shuffle=True)
    tgt_loader = DataLoader(TensorDataset(tgt_x, tgt_y), BS, shuffle=True)
    test_loader = DataLoader(TensorDataset(test_x, test_y), BS)

    model = DANN(D, 256, C)
    trainer = DANNTrainer(model, src_loader, tgt_loader, test_loader,
                          num_epochs=100, lr=1e-3)
    trainer.train()

if __name__ == "__main__":
    main()

How this code works#

Component	Role
`GradientReversalLayer`	Identity forward, negated-gradient backward — turns the minimax into a single backward pass.
`_adaptive_lambda`	Sigmoid ramp $\frac{2}{1 + e^{-\gamma p}} - 1$ — start small so the network learns features first.
`class_loss`	Standard cross-entropy on source labels only (no target labels used).
`domain_loss`	BCE: source = 1, target = 0 — trains the discriminator.
GRL + domain head	Reversed gradients flow back to $$G_f$$ → it learns to hide the domain.
`evaluate`	At test time the model is called with `alpha=0.0` ($\lambda = 0$); the GRL is irrelevant because only the classification head is used.

CORAL vs MMD vs DANN: Empirical Comparison#

The three alignment losses look different on paper but solve the same problem — pull source and target features into the same region of representation space. To make the trade-offs concrete, fix a single benchmark and run all three with the same backbone.

Setup. Office-31, Amazon $\to$ Webcam. Source $|\mathcal{D}_S| = 2817$ labelled images across 31 classes; target $|\mathcal{D}_T| = 795$ unlabelled images. ResNet-50 ImageNet-pretrained, last block fine-tuned, 256-d bottleneck before the classifier. Batch 32 source + 32 target, SGD with momentum 0.9, base lr $10^{-3}$, 50 epochs. The only thing that changes between runs is the alignment loss attached to the bottleneck.

CORAL#

$$\mathcal{L}_{\text{CORAL}} = \frac{1}{4 d^2} \|C_S - C_T\|_F^2.$$

def coral_loss(fs, ft):
    # fs, ft : (B, d) bottleneck features for source and target.
    d = fs.size(1)
    fs_c = fs - fs.mean(0, keepdim=True)
    ft_c = ft - ft.mean(0, keepdim=True)
    cs = (fs_c.t() @ fs_c) / (fs.size(0) - 1)
    ct = (ft_c.t() @ ft_c) / (ft.size(0) - 1)
    return ((cs - ct) ** 2).sum() / (4 * d * d)

def train_step_coral(model, src_x, src_y, tgt_x, opt, lam=1.0):
    fs = model.bottleneck(model.backbone(src_x))
    ft = model.bottleneck(model.backbone(tgt_x))
    logits = model.classifier(fs)
    ce = F.cross_entropy(logits, src_y)
    align = coral_loss(fs, ft)
    loss = ce + lam * align
    opt.zero_grad(); loss.backward(); opt.step()
    return ce.item(), align.item()

MMD (multi-kernel)#

$$\widehat{\text{MMD}}^2 = \tfrac{1}{n_s^2}\!\sum k(x_i^s,x_j^s) + \tfrac{1}{n_t^2}\!\sum k(x_i^t,x_j^t) - \tfrac{2}{n_s n_t}\!\sum k(x_i^s,x_j^t).$$

def mk_mmd2(fs, ft, sigmas=(1, 2, 4, 8, 16)):
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return sum(torch.exp(-d2 / (2 * s * s)) for s in sigmas)
    Kss = gram(fs, fs); Ktt = gram(ft, ft); Kst = gram(fs, ft)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()

def train_step_mmd(model, src_x, src_y, tgt_x, opt, lam=1.0):
    fs = model.bottleneck(model.backbone(src_x))
    ft = model.bottleneck(model.backbone(tgt_x))
    logits = model.classifier(fs)
    ce = F.cross_entropy(logits, src_y)
    align = mk_mmd2(fs, ft)
    loss = ce + lam * align
    opt.zero_grad(); loss.backward(); opt.step()
    return ce.item(), align.item()

DANN#

$$\min_{G_f, G_y}\, \max_{G_d}\; \mathcal{L}_y - \lambda\, \mathcal{L}_d.$$

def train_step_dann(model, disc, grl, src_x, src_y, tgt_x, opt, lam_p):
    grl.set_lambda(lam_p)
    fs = model.bottleneck(model.backbone(src_x))
    ft = model.bottleneck(model.backbone(tgt_x))
    logits = model.classifier(fs)
    ce = F.cross_entropy(logits, src_y)
    d_s = disc(grl(fs)); d_t = disc(grl(ft))
    y_s = torch.ones_like(d_s); y_t = torch.zeros_like(d_t)
    d_loss = F.binary_cross_entropy(d_s, y_s) + F.binary_cross_entropy(d_t, y_t)
    loss = ce + d_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return ce.item(), d_loss.item()

Results#

Method	Target Acc	Time / run	Hyperparam sensitivity
Source-only (no DA)	68.2%	8 min	—
CORAL	76.0%	12 min	Low
MMD multi-kernel	78.4%	18 min	Med (kernel bandwidth)
DANN	80.1%	22 min	High (GRL schedule)

CORAL has nothing to tune beyond $\lambda$ , which mostly does not matter — anything in $$[0.1, 10]$$ gives within a point of the best. MMD is sensitive to the bandwidth set; the multi-kernel variant fixes most of that, but still rewards a sweep. DANN is the most sensitive of the three — get the $\lambda$ ramp wrong and you do worse than source-only.

Convergence behaviour#

CORAL and MMD both reach plateau within roughly 30 epochs and stay there. The loss curves are monotone, the target accuracy curve is monotone too, and you can early-stop on the source validation set with confidence.

DANN looks different. The classification loss and the domain loss fight each other, so target accuracy oscillates by 2–4 points across late epochs. You need the sigmoid ramp on $\lambda$ to keep the early epochs sane, and even then the right strategy is to track target accuracy on a tiny held-out set if you have one — picking the last epoch is often a mistake.

Practical recipe#

Start with CORAL. Zero hyperparameters worth sweeping, deterministic, twelve lines of code. If this closes most of the gap, ship it.
If CORAL plateaus, move to multi-kernel MMD. One additional knob (the bandwidth set), still stable, usually 1–3 points better.
Only reach for DANN when adversarial training is operationally feasible — you have the budget for multiple runs to find the right $\lambda$ schedule, and you can monitor a small target-validation signal.

The honest summary: complexity buys you a few points of accuracy, not an order of magnitude. If those points matter (medical, autonomous driving), go DANN. If they do not, CORAL will pay for itself in debugging time saved.

Bridge: knowing which alignment loss to use presupposes that you know what kind of shift you are dealing with. The next section gives you the diagnostic.

Detecting Which Type of Shift Occurred#

The three shift types (covariate $P(X)$, label $P(Y)$, concept $P(Y \mid X)$) call for different remedies, and the wrong remedy can hurt more than it helps. Importance weighting fixes covariate shift but does nothing for concept shift. Prior correction fixes label shift but is irrelevant if the inputs themselves moved. Before reaching for a method, run the diagnostic.

Algorithm 1 — covariate shift via a domain classifier#

Train a binary classifier to distinguish source inputs (label 1) from target inputs (label 0). Hold out a portion for validation and read the AUC.

AUC near 0.5: source and target inputs are indistinguishable — no covariate shift to correct.
AUC near 1.0: the input distributions are very different — large covariate shift, importance weights or feature alignment are warranted.
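Anything in between is a graded signal. The classifier’s predictions on source samples also give you the density-ratio estimate $w(x) = P_T(x) / P_S(x) \approx \frac{n_S}{n_T} \cdot \frac{1 - p}{p}$, where $p = P(\text{source} \mid x)$ (the factor $n_S / n_T$ drops out when the two samples are the same size).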

Algorithm 2 — label shift via prior comparison#

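Run the source model $f$ over the $n_T$ unlabelled target samples and histogram its predicted classes to get a rough target prior:

$$\hat P_T(y) = \frac{1}{n_T} \sum_{j=1}^{n_T} \mathbb{1}[\arg\max_y f(x_j^t) = y].$$

Then measure how far it sits from the source prior with the KL divergence:

$$\mathrm{KL}(P_S(Y) \,\|\, \hat P_T(Y)) = \sum_y P_S(y) \log \frac{P_S(y)}{\hat P_T(y)}.$$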

A KL above $\sim$ 0.05 is suspicious; above 0.2 is a strong signal of label shift. (For rigour, BBSE / RLLS deconvolve the source-model confusion matrix to recover an unbiased estimate of $$P_T(Y)$$ — for a quick diagnostic, the noisy version is enough.)
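A worked instance of the two formulas, as a NumPy sketch with made-up predictions: a balanced binary source prior and a target on which the source model predicts class 1 for 70 of 100 samples. The helper names are illustrative.

```python
import numpy as np

def predicted_prior(pred_labels, num_classes):
    """Histogram of predicted classes, normalised to a probability vector."""
    return np.bincount(pred_labels, minlength=num_classes) / len(pred_labels)

def prior_kl(p_src, p_tgt_hat, eps=1e-8):
    """KL(P_S(Y) || P_T_hat(Y)) with a small eps to guard empty classes."""
    p_src = np.asarray(p_src, dtype=float)
    p_tgt_hat = np.asarray(p_tgt_hat, dtype=float)
    return float(np.sum(p_src * (np.log(p_src + eps) - np.log(p_tgt_hat + eps))))

preds = np.array([0] * 30 + [1] * 70)       # hypothetical target predictions
p_hat = predicted_prior(preds, num_classes=2)
kl = prior_kl([0.5, 0.5], p_hat)
print(p_hat, round(kl, 4))
```

By hand: $0.5 \ln(0.5/0.3) + 0.5 \ln(0.5/0.7) \approx 0.255 - 0.168 = 0.087$, which is past the "suspicious" mark but short of a strong signal.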

Algorithm 3 — concept shift via per-class confidence#

Concept shift is the sneakiest of the three: the inputs look the same, the prevalence is the same, but the labelling rule has changed. The signature is confident-but-wrong predictions on the target.

If you have even a tiny labelled target slice (50–100 examples per class), compute mean predicted confidence per class on that slice and compare against the corresponding slice’s accuracy. Big confidence with low accuracy on a class is the fingerprint of concept shift on that class.
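A per-class version of that comparison can be sketched as below. The function name is hypothetical, and I group samples by the predicted class, which is one reasonable reading of "per class" here.

```python
import numpy as np

def confidence_accuracy_gap(probs, labels):
    """Mean confidence minus accuracy, per predicted class, on a labelled slice.

    A large positive gap marks classes where the model is confidently wrong.
    """
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    gaps = {}
    for c in np.unique(pred):
        mask = pred == c
        accuracy = (labels[mask] == c).mean()
        gaps[int(c)] = float(conf[mask].mean() - accuracy)
    return gaps

# Toy slice: every true label is class 0, but half the predictions say class 1.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = np.array([0, 0, 0, 0])
print(confidence_accuracy_gap(probs, labels))
```

In the toy slice, class 1 shows high confidence with zero accuracy, the pattern described above, while class 0 is slightly under-confident.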

A diagnostic helper#

import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

@torch.no_grad()
def _features(model, loader, device):
    """Collect bottleneck features, softmax outputs and (if present) labels."""
    feats, labels, probs = [], [], []
    for batch in loader:
        x = batch[0].to(device)
        y = batch[1] if len(batch) > 1 else None
        f = model.bottleneck(model.backbone(x)).cpu()
        p = F.softmax(model.classifier(f.to(device)), dim=1).cpu()
        feats.append(f); probs.append(p)
        if y is not None: labels.append(y)
    return (torch.cat(feats), torch.cat(probs),
            torch.cat(labels) if labels else None)

def diagnose_shift(source_model, source_loader, target_loader,
                   target_labelled_loader=None, device="cpu"):
    source_model.eval()
    fs, ps, ys = _features(source_model, source_loader, device)
    ft, pt, _ = _features(source_model, target_loader, device)

    # 1. Covariate shift: domain-classifier AUC on a held-out split
    #    (an in-sample AUC would overstate the shift).
    X = torch.cat([fs, ft]).numpy()
    d = torch.cat([torch.ones(len(fs)), torch.zeros(len(ft))]).numpy()
    X_tr, X_va, d_tr, d_va = train_test_split(
        X, d, test_size=0.3, stratify=d, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
    auc = roc_auc_score(d_va, clf.predict_proba(X_va)[:, 1])
    covariate = max(0.0, 2 * (auc - 0.5))         # 0 = none, 1 = max

    # 2. Label shift — KL between source prior and predicted target prior.
    C = ps.size(1)
    p_src = torch.bincount(ys, minlength=C).float() / len(ys)
    yhat_t = pt.argmax(dim=1)
    p_tgt = torch.bincount(yhat_t, minlength=C).float() / len(yhat_t)
    eps = 1e-8
    kl = (p_src * ((p_src + eps).log() - (p_tgt + eps).log())).sum().item()
    label = min(1.0, kl / 0.2)                    # normalise to [0,1]

    # 3. Concept shift — confidence vs accuracy gap on labelled target.
    concept = None
    if target_labelled_loader is not None:
        ftl, ptl, ytl = _features(source_model, target_labelled_loader, device)
        conf, pred = ptl.max(dim=1)
        acc = (pred == ytl).float().mean().item()
        gap = max(0.0, conf.mean().item() - acc)
        concept = min(1.0, gap / 0.3)

    return {"covariate": covariate, "label": label, "concept": concept,
            "auc": auc, "kl": kl}

A toy numerical example#

Synthetic 2-d data, three scenarios, same diagnostic run on each:

Scenario	AUC (cov)	KL (label)	Conf-Acc (concept)	Diagnosis
Clean Gaussians, same labels	0.51	0.01	0.02	no shift
Target shifted by $$+1$$ in $$x_1$$	0.94	0.04	0.03	covariate
Target class prior $$[0.1, 0.9]$$ vs source $$[0.5, 0.5]$$	0.52	0.41	0.04	label
Decision boundary flipped on target	0.50	0.02	0.38	concept

Each shift type lights up exactly one column. Mixed shifts light up several — and the magnitudes tell you the order in which to fix them.

Bridge: with the diagnosis in hand, you can pick the method, but every method that uses pseudo-labels — self-training, FixMatch, joint training with target predictions — depends on those labels being trustworthy. That requires calibrated confidence, which the next section addresses.

Confidence Calibration Under Domain Shift#

Self-training and most semi-supervised DA methods filter pseudo-labels by confidence: keep $(x, \hat y)$ when $\max_y f(x)_y > \tau$. The implicit assumption is that high softmax confidence implies high accuracy. Under domain shift this assumption breaks.

A source-trained model on the target domain is typically over-confident — it assigns 95% probability to predictions that are right only 70% of the time. The expected calibration error (ECE) on Webcam after Amazon training routinely lands at 18–22%. Filter pseudo-labels with $\tau = 0.9$ and you keep a sea of confident wrong labels — the textbook recipe for confirmation bias.
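
For readers who want to measure this themselves, here is a minimal sketch of ECE. It assumes the common equal-width binning with 10 bins; the text does not say which estimator produced the 18–22% figure, so treat the binning choice and the function name as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum_b (n_b / N) * |acc_b - conf_b|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy case matching the text: 95% confidence, 70% accuracy.
confs = [0.95] * 10
hits = [1] * 7 + [0] * 3
print(round(expected_calibration_error(confs, hits), 2))  # 0.25
```

Here `correct` holds 1 for a right prediction and 0 for a wrong one, so the function needs a labelled target split, however small.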

Two complementary fixes, in increasing order of nuance.

Fix 1 — temperature scaling#

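Divide the logits $z$ by a single scalar temperature $T > 0$ before the softmax, and fit $T$ by minimising the negative log-likelihood on a small labelled target-validation set:

$$\hat p_y = \frac{\exp(z_y / T)}{\sum_{y'} \exp(z_{y'} / T)}.$$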

$T > 1$ softens overconfident predictions; $T < 1$ sharpens underconfident ones. The optimisation is one-dimensional, and the negative log-likelihood is convex in $1/T$, so L-BFGS converges in a handful of iterations.
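
To see the effect of $T$ on a single prediction, the following dependency-free sketch applies the formula above to a made-up logit vector. The function name and the logits are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]  # made-up logits for illustration
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Dividing by a positive $T$ never changes the argmax, so temperature scaling alters the reported confidence but leaves accuracy untouched.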

Fix 2 — focal loss on pseudo-labels#

When you train on the filtered pseudo-labels, weight each one by $(1 - \hat p)^\gamma$, the focal-loss trick, where $\hat p$ is the calibrated confidence of the pseudo-label. The most confident pseudo-labels (the over-confident ones most likely to be wrong) get the smallest weights; pseudo-labels that only just clear the threshold (where the model still has signal) get relatively larger ones.

import torch
import torch.nn.functional as F

def calibrate_temperature(model, val_loader, device="cpu", max_iter=50):
    """Learn a single scalar T on a small target validation set."""
    model.eval()
    logits_list, labels_list = [], []
    with torch.no_grad():
        for x, y in val_loader:
            f = model.bottleneck(model.backbone(x.to(device)))
            logits_list.append(model.classifier(f).cpu())
            labels_list.append(y)
    logits = torch.cat(logits_list)
    labels = torch.cat(labels_list)

    T = torch.nn.Parameter(torch.ones(1) * 1.5)
    opt = torch.optim.LBFGS([T], lr=0.1, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / T.clamp(min=1e-2), labels)
        loss.backward()
        return loss
    opt.step(closure)
    return float(T.detach().clamp(min=1e-2))

def filter_and_weight(logits, T, tau=0.9, gamma=2.0):
    """Calibrated filtering + focal weighting for pseudo-labels."""
    p = F.softmax(logits / T, dim=1)
    conf, yhat = p.max(dim=1)
    keep = conf > tau
    w = (1.0 - conf[keep]) ** gamma
    return yhat[keep], w, keep
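
Because every kept label has confidence above `tau`, the raw focal weights are small in absolute terms (below $(1 - \tau)^\gamma = 0.01$ with the defaults). One possible remedy, sketched here as an assumption rather than as part of the original recipe, is to rescale the weights so they average to one over the kept batch. This preserves their relative ordering while keeping the pseudo-label loss on a familiar scale. The helper names `focal_weights` and `normalise_weights` are hypothetical, and plain Python lists stand in for tensors.

```python
def focal_weights(confidences, tau=0.9, gamma=2.0):
    """Focal weights (1 - p)^gamma for confidences that pass the tau filter."""
    return [(1.0 - p) ** gamma for p in confidences if p > tau]


def normalise_weights(weights):
    """Rescale weights to mean 1; returns [] for an empty batch."""
    if not weights:
        return []
    mean = sum(weights) / len(weights)
    if mean == 0.0:  # every confidence was exactly 1.0: fall back to uniform
        return [1.0] * len(weights)
    return [w / mean for w in weights]


confs = [0.85, 0.91, 0.95, 0.99]  # made-up calibrated confidences
raw = focal_weights(confs)         # 0.85 is dropped by the tau filter
print([round(w, 4) for w in raw])
print([round(w, 3) for w in normalise_weights(raw)])
```

Whether normalisation helps depends on how the pseudo-label loss is balanced against the supervised source loss, so it is worth validating on the target split.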

Numerical effect#

On the same Amazon $\to$ Webcam setup as before, with a 100-example target-validation split:

Stage	ECE on target	Self-training F1
Source-only logits	18.4%	74.6
+ temperature scaling ($T \approx 2.1$)	4.1%	76.9
+ focal weighting on pseudo-labels	3.8%	77.8

Calibration alone adds 2.3 points of downstream self-training F1; focal weighting on top adds another 0.9. The ECE drop from 18.4% to 4.1% is the more important number: it tells you that confidences now mean what they say, so a $\tau = 0.9$ filter selects labels that are roughly 90% correct rather than roughly 70% correct.
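
The claim about what the $\tau$ filter selects can be checked directly on the labelled target-validation split. The sketch below, with a hypothetical helper name and made-up data, computes the fraction of kept pseudo-labels that are actually correct.

```python
def pseudo_label_precision(confidences, correct, tau=0.9):
    """Fraction of pseudo-labels with confidence > tau that are correct.

    Returns None when nothing passes the filter.
    """
    kept = [ok for c, ok in zip(confidences, correct) if c > tau]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Made-up confidences and correctness flags (1 = right, 0 = wrong).
confs = [0.97, 0.95, 0.93, 0.92, 0.60]
hits = [1, 1, 0, 1, 0]
print(pseudo_label_precision(confs, hits))  # 0.75
```

Running it before and after temperature scaling shows whether the precision of the kept set moves towards the threshold value.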

Bridge: that completes the practical toolkit — diagnose the shift, pick an alignment method, calibrate before you pseudo-label. The summary that follows distils the whole pipeline into a checklist.

Summary#

Domain adaptation tackles the most practical problem in transfer learning: training data and deployment data come from different distributions. The toolkit, in roughly increasing order of effort:

AdaBN — recompute batch-norm statistics on target; free, no retraining, always try first.
CORAL — match source and target covariance matrices; cheap, deterministic.
MMD (DAN) — match kernel mean embeddings; stable, principled, multi-kernel default.
DANN — adversarial domain alignment via the gradient reversal layer; one backward pass.
CDAN / ADDA — more flexible variants for larger gaps.
CycleGAN — pixel-level translation when feature alignment is not enough.
Self-training — pseudo-labels with a confidence gate; the last few points of accuracy.

The Ben-David bound tells you what is possible: shrink the source error and the domain divergence, and target error follows — as long as the joint optimal error is small. If it is not, no amount of alignment will help; you need labels.

Next: Part 4 (Few-Shot Learning), where we drop the assumption of abundant source data altogether and learn from a handful of examples per class.

References#

Ganin et al. (2016). Domain-Adversarial Training of Neural Networks. JMLR. arXiv:1505.07818
Long et al. (2015). Learning Transferable Features with Deep Adaptation Networks. ICML. arXiv:1502.02791
Sun & Saenko (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. ECCV. arXiv:1607.01719
Zhu et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (CycleGAN). ICCV. arXiv:1703.10593
Tzeng et al. (2017). Adversarial Discriminative Domain Adaptation (ADDA). CVPR. arXiv:1702.05464
Long et al. (2018). Conditional Adversarial Domain Adaptation (CDAN). NeurIPS. arXiv:1705.10667
Ben-David et al. (2010). A Theory of Learning from Different Domains. Machine Learning.
Li et al. (2017). Revisiting Batch Normalization for Practical Domain Adaptation (AdaBN). arXiv:1603.04779
Gretton et al. (2012). A Kernel Two-Sample Test (MMD). JMLR.
Lipton et al. (2018). Detecting and Correcting for Label Shift with Black Box Predictors. ICML. arXiv:1802.03916
Sohn et al. (2020). FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. NeurIPS. arXiv:2001.07685