Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

In 2026, why are you still reading about kernel methods? Aren’t transformers supposed to have eaten the entire ML stack? Yes and no. Transformers eat the headlines, but kernels still eat the corners — the regimes with 200 samples, the regimes where the model has to publish calibrated error bars, the regimes where a physicist needs to know which basis function caused the prediction. This final part is the field manual: when kernels actually win, how to debug them when they don’t, and how to bolt them on top of a neural network when you want the best of both worlds.

Two paths converging into a deep kernel learning gateway

The previous seven parts climbed a tall mountain: linear ceiling (Part 1), positive-definite kernels (Part 2), the RKHS (Part 3), the kernel catalogue (Part 4), the kernel algorithms (Part 5), the Gaussian process worldview (Part 6), and the large-scale escape hatches (Part 7). This part is the descent. We come down into the messy field of decisions a practitioner actually makes, and we end with a 5-step flowchart and a concept map that ties the whole series together.

Deep Kernel Learning (DKL)#

The two communities — kernel people and deep-learning people — used to talk past each other. Kernels could not learn representations from raw pixels; deep nets could not give calibrated uncertainty without expensive hacks. Deep kernel learning, due to Wilson, Hu, Salakhutdinov, and Xing in 2016, is the conceptual bridge.

The DKL kernel is just function composition:

$$k_{DKL}(x, x') = k_{base}\bigl(g_\theta(x),\, g_\theta(x')\bigr).$$

Here $g_\theta : \mathcal{X} \to \mathbb{R}^d$ is any parametric feature extractor (typically a small CNN or MLP) and $k_{base}$ is a standard kernel such as RBF or Matern. The composition is still positive-definite because PD kernels are closed under composition with arbitrary feature maps (Part 2, Theorem 4).
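The closure property is easy to check numerically. The sketch below is illustrative only: the `encoder` is a hypothetical fixed random MLP standing in for $g_\theta$, and the base kernel is a plain RBF with unit length scale. It builds the composed Gram matrix and confirms that its smallest eigenvalue is non-negative up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(X, W1, W2):
    """Hypothetical stand-in for g_theta: a fixed two-layer tanh MLP."""
    return np.tanh(X @ W1) @ W2

def rbf(Z, lengthscale=1.0):
    """RBF base kernel evaluated on all pairs of rows of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

X = rng.normal(size=(40, 10))    # 40 inputs in R^10
W1 = rng.normal(size=(10, 16))   # random weights, chosen only for illustration
W2 = rng.normal(size=(16, 2))    # encoded features live in R^2

K_dkl = rbf(encoder(X, W1, W2))  # k_base(g(x), g(x')) for every pair
min_eig = np.linalg.eigvalsh(K_dkl).min()
print(f"smallest eigenvalue of the DKL Gram matrix: {min_eig:.2e}")
```

Any other encoder weights give the same conclusion, which is the point: positive-definiteness comes from the base kernel, not from anything the network learns.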

DKL architecture: NN encoder feeds into base kernel into GP/SVM head

Why DKL exists. A vanilla RBF kernel on raw 224x224 pixels is useless — Euclidean distance in pixel space is dominated by lighting and translation. A pretrained ResNet feature is much smarter: distance in feature space respects semantic content. DKL says: instead of using fixed pretrained features, learn the feature extractor jointly with the kernel, by backpropagating through the marginal likelihood of a Gaussian process placed on top.

The training objective. For DKL with a GP head, you optimise the negative log marginal likelihood

$$\mathcal{L}(\theta, \phi) = \tfrac{1}{2} y^\top \bigl(K_{\theta,\phi} + \sigma^2 I\bigr)^{-1} y + \tfrac{1}{2} \log\bigl|K_{\theta,\phi} + \sigma^2 I\bigr| + \tfrac{n}{2}\log 2\pi,$$

with respect to both the neural network weights $\theta$ and the base-kernel parameters $\phi$ (length scale, signal variance, noise variance). The gradient flows through the kernel evaluation back into the network, exactly like a regular loss flows through a final dense layer. The clever bit is that you only need autodiff plus a Cholesky solver — modern frameworks (GPyTorch, GPflow) make it a 20-line script.
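To make the objective concrete, here is a minimal NumPy sketch of the same three terms computed through a Cholesky factor. It assumes a zero-mean GP and a fixed, precomputed Gram matrix; the function name and the toy data are illustrative, and a real DKL setup would let autodiff differentiate this quantity with respect to $\theta$ and $\phi$.

```python
import numpy as np

def neg_log_marginal_likelihood(K, y, noise_var):
    """NLML of y ~ N(0, K + noise_var * I), computed via a Cholesky factor."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s^2 I)^{-1} y
    data_fit = 0.5 * y @ alpha
    complexity = np.sum(np.log(np.diag(L)))              # equals 0.5 * log|K + s^2 I|
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * d2)            # RBF Gram matrix with unit length scale
y = rng.normal(size=30)

nlml = neg_log_marginal_likelihood(K, y, noise_var=0.1)
print(f"negative log marginal likelihood: {nlml:.4f}")
```

The first term rewards fitting the data, the second penalises flexible kernels, and the third is a constant; the Cholesky factor serves both the solve and the log-determinant.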

What DKL gets you.

Representation learning from raw inputs: images, audio, text are fair game now.
Calibrated uncertainty: the GP head still gives $\mathcal{N}(\mu(x), \sigma^2(x))$ predictions, so downstream decision systems (Bayesian optimization, active learning, safety-critical control) get the error bars they need.
Sample efficiency: on regression tasks with $n \sim 10^3$ to $10^4$ labelled samples, DKL routinely beats deep ensembles.

The caveats. DKL is not free.

The GP head still costs $O(n^3)$ in training and $O(n^2)$ in memory. For $n > 5 \cdot 10^4$ you need sparse / inducing-point variants (Part 7).
The expressivity of the neural part can swamp the regularisation of the GP part. If $g_\theta$ has 10M parameters and you have 1k labels, the network can learn a representation where every point is far from every other, and the GP degenerates into noise.
Hyperparameter coupling: the length scale of the base kernel and the scale of the encoder output interact. A common failure mode is the encoder collapsing all inputs into a tiny ball.

DKL in practice#

A minimal worked example using GPyTorch, which is the standard library for GP regression with PyTorch backprop.

import torch
import gpytorch

class LargeFeatureExtractor(torch.nn.Sequential):
    """A small MLP encoder; replace with a CNN for images."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.add_module("l1", torch.nn.Linear(input_dim, 256))
        self.add_module("r1", torch.nn.ReLU())
        self.add_module("l2", torch.nn.Linear(256, 64))
        self.add_module("r2", torch.nn.ReLU())
        self.add_module("l3", torch.nn.Linear(64, output_dim))

class DKLModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, feature_extractor):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=2)),
            num_dims=2, grid_size=100,
        )
        self.feature_extractor = feature_extractor

    def forward(self, x):
        z = self.feature_extractor(x)
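        # Rescale each feature to [-1, 1]: the grid kernel needs bounded inputs,
        # and bounding the features also guards against representation collapse.
        z = z - z.min(0)[0]
        z = 2 * (z / z.max(0)[0].clamp_min(1e-12)) - 1  # avoid division by zero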
        mean_x = self.mean_module(z)
        covar_x = self.covar_module(z)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

The training loop is the canonical PyTorch loop — loss.backward() flows gradients into both the GP hyperparameters and the network weights:

# Assumes train_x is a float tensor of shape (n, 10) and train_y of shape (n,).
feature_extractor = LargeFeatureExtractor(input_dim=10, output_dim=2)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DKLModel(train_x, train_y, likelihood, feature_extractor)

optimizer = torch.optim.Adam([
    {"params": model.feature_extractor.parameters(), "lr": 1e-3},
    {"params": model.covar_module.parameters(), "lr": 1e-2},
    {"params": model.mean_module.parameters(), "lr": 1e-2},
    {"params": model.likelihood.parameters(), "lr": 1e-2},
])

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
model.train(); likelihood.train()
for i in range(200):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

Two practical tips that save the most pain.

1. Lower learning rate for the encoder than for the kernel hyperparameters. The kernel parameters are interpretable and stable; the encoder is over-parametrised and prone to drift. A 10x ratio is a sensible default.

2. Normalise the encoder output explicitly. Without normalisation, the encoder can collapse all inputs into a vanishingly small ball or push them all to infinity. The first failure mode produces a Gram matrix that is uniformly 1 (rank-1, useless); the second drives every off-diagonal entry to $e^{-\infty} = 0$ and leaves the identity (also useless). Wrap the encoder output in a tanh or rescale it to a fixed cube such as $[-1, 1]^d$.
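Both collapse modes are easy to reproduce without training anything. In this sketch the matrix `Z` is a hypothetical stand-in for encoder outputs, and the scale factors `1e-4` and `1e4` are arbitrary values picked to exaggerate the effect under a unit-length-scale RBF kernel.

```python
import numpy as np

def rbf_gram(Z, lengthscale=1.0):
    """RBF Gram matrix on the rows of Z."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale**2))

def off_diag_mean(K):
    """Average of the off-diagonal entries of a square matrix."""
    n = K.shape[0]
    return (K.sum() - np.trace(K)) / (n * (n - 1))

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))       # stand-in for encoder outputs g(x)

collapsed = rbf_gram(1e-4 * Z)     # features squeezed into a tiny ball
exploded = rbf_gram(1e4 * Z)       # features pushed far apart
bounded = rbf_gram(np.tanh(Z))     # tanh keeps every feature in (-1, 1)

for name, K in [("collapsed", collapsed), ("exploded", exploded), ("bounded", bounded)]:
    print(f"{name:>9}: mean off-diagonal entry = {off_diag_mean(K):.3g}")
```

The collapsed case gives a Gram matrix of ones, the exploded case gives the identity, and the bounded case sits in between, which is where the GP head can actually tell points apart.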

Kernels vs deep learning: when to pick which#

The clearest mental model is a 2D grid: training set size versus feature dimensionality. Plot it once and you stop arguing about it.

Recommendation grid for method choice across (n, D) regimes

The grid is opinionated, but five trade-off axes back it up.

Axis 1: data volume. Kernels are $O(n^2)$ memory and $O(n^3)$ compute. At $n = 10^3$ a kernel solve takes milliseconds and beats a deep net trained from scratch on the same data: the kernel’s inductive bias is doing the work, and there is not enough data to train the deep net’s millions of parameters. At $n = 10^5$ the kernel solve takes minutes (a dense float64 Gram matrix already needs about 80 GB, so in practice you lean on the approximations from Part 7), the deep net takes about the same on a GPU, and they perform comparably. At $n = 10^7$ the kernel solve is intractable and the deep net is comfortable; this regime belongs to deep learning.

Axis 2: input dimensionality and structure. A 100-dim tabular vector with smooth columns is kernel territory. A 224x224 RGB image is deep territory because pixel distance is meaningless. The intermediate case — 100-dim ResNet features computed for free from a pretrained model — is again kernel territory: a kernel on top of pretrained features often beats fine-tuning the entire network when labelled data is scarce. This is the transfer-learning kernel pattern: pretrain a big model on a huge corpus, freeze it, then put a small kernel head on the frozen features.

Axis 3: uncertainty requirements. This is the kernel community’s permanent moat. A Gaussian process produces a full predictive distribution $\mathcal{N}(\mu(x), \sigma^2(x))$ as a closed-form output of the same equations that produce the mean. Deep networks have to bolt on uncertainty: Monte Carlo dropout, deep ensembles, Bayesian neural networks. All of these work, sort of, but none of them is as clean as the GP. If your downstream system (Bayesian optimization, active learning, safety-critical control, scientific discovery) cares about calibration, start with a GP and only leave it for compelling reasons.

Axis 4: interpretability. A kernel SVM has support vectors — actual training points the model points at and says “this is why I predicted what I predicted”. A linear kernel has weights you can read off. A polynomial kernel has explicit feature interactions you can name. A deep net has a hairball of 10M floats and a SHAP plot if you are lucky. In regulated industries (medical, finance, legal) the gap matters.

Axis 5: theoretical guarantees. Kernels have a full theory: Mercer decomposition, representer theorem, generalization bounds via Rademacher complexity, convergence rates. Deep networks have neural tangent kernel theory (interestingly, itself a kernel theory!) for the infinite-width limit, but for the finite-width networks people actually train, theory is mostly empirical. If you are writing a paper for an applied-math journal, kernels are friendlier.
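The closed-form predictive distribution mentioned under Axis 3 fits in a few lines. This is a minimal sketch assuming a zero-mean GP, a unit-variance RBF kernel and a fixed noise variance; the helper names `rbf` and `gp_predict` are illustrative, not taken from any library.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel between the rows of A and the rows of B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale**2))

def gp_predict(X_train, y_train, X_test, noise_var=1e-2):
    """Zero-mean GP regression: posterior mean and variance of the latent f."""
    K = rbf(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)   # prior variance k(x, x) is 1 for this kernel
    return mean, var

X_train = np.linspace(-3, 3, 20)[:, None]
y_train = np.sin(X_train[:, 0])
X_test = np.array([[0.0], [10.0]])     # one point inside the data, one far outside

mean, var = gp_predict(X_train, y_train, X_test)
for x, m, s2 in zip(X_test[:, 0], mean, var):
    print(f"x={x:5.1f}  mean={m:+.3f}  var={s2:.3f}")
```

Inside the data the variance shrinks towards zero; far outside it returns to the prior variance. That behaviour comes from the same linear algebra that produces the mean, with no extra machinery.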

The two camps are not mutually exclusive — and DKL is the cleanest synthesis. The honest summary: deep learning has won the regime of huge data on raw modalities; kernels have won the regime of moderate data with structure and uncertainty; DKL has won the regime where you want both at once and can pay the GP cost.

Hyperparameter tuning playbook#

Kernel methods have very few hyperparameters compared to deep nets — usually 2 to 5 — but those few are multiplicative, and getting them wrong is catastrophic. Three tuning regimes work in practice, and you pick by problem size and uncertainty appetite.

Cross-validation (the default)#

The workhorse. Pick a 5-fold or 10-fold CV split, sweep hyperparameters on a log grid, score by accuracy (classification) or negative MSE (regression).

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])
param_grid = {
    "svc__gamma": np.logspace(-4, 2, 13),
    "svc__C": np.logspace(-2, 4, 13),
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring="accuracy")
search.fit(X_train, y_train)
print("best:", search.best_params_, "acc:", search.best_score_)

CV landscape over (log gamma, log C) with grid-search optimum marked

Two non-obvious rules that save the most hours:

Always log-scale. $\gamma$, $C$, length scale, signal variance: these are all positive multiplicative parameters. A linear grid from $0.01$ to $100$ spends 99% of its points on values $\geq 1$. A log grid covers six orders of magnitude with thirteen points, as in the grid above.

Always inside a Pipeline. If you call scaler.fit(X) once and then run CV on the scaled data, you leak the scale statistics of the test fold into the training fold. The right pattern is Pipeline([scaler, model]) so that scaling is refit inside each fold. The difference between leaky and clean CV is often 1-3% accuracy, which is exactly the gap you would have published as your headline result.
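The arithmetic behind the log-scale rule can be checked directly. The grid sizes below (1000 linear points, 13 log points over the same range) are illustrative choices, not values from the tuning code above.

```python
import numpy as np

linear_grid = np.linspace(0.01, 100, 1000)   # evenly spaced values
log_grid = np.logspace(-2, 2, 13)            # evenly spaced exponents

frac_linear_above_1 = np.mean(linear_grid >= 1.0)
frac_log_above_1 = np.mean(log_grid >= 1.0)

print(f"linear grid: {frac_linear_above_1:.0%} of points are >= 1")
print(f"log grid:    {frac_log_above_1:.0%} of points are >= 1")
```

The linear grid barely samples the two decades below 1, while the log grid splits its points roughly evenly between the decades below and above it.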

Marginal likelihood (when you have a GP)#

For Gaussian processes, the marginal likelihood — the probability of the training data with the latent function integrated out — provides a closed-form differentiable objective that automatically balances fit and complexity. You do gradient descent on the log marginal likelihood with respect to all kernel hyperparameters at once. No nested CV loop, no grid.

import torch
import gpytorch

class GPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=train_x.shape[1])
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    opt.step()

The output gives you per-dimension length scales (ARD) — and these per-dimension length scales are themselves an interpretable feature importance ranking: features with small length scales matter most, features with huge length scales are effectively ignored. This is one of the prettier results in classical ML.
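Here is a small sketch of the ARD read-out. To keep it short it swaps in scikit-learn's `GaussianProcessRegressor` rather than GPyTorch, and the synthetic data (three features, only the first one informative) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(80, 3))
y = np.sin(2.0 * X[:, 0]) + 0.05 * rng.normal(size=80)  # only feature 0 matters

# One length scale per input dimension makes the RBF kernel an ARD kernel.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0, 1.0]) + WhiteKernel(0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

# kernel_ is the fitted kernel: (Constant * RBF) + White, so k1.k2 is the RBF part.
length_scales = gpr.kernel_.k1.k2.length_scale
ranking = np.argsort(length_scales)          # smallest length scale first
print("fitted length scales:", np.round(length_scales, 3))
print("features ranked by relevance:", ranking)
```

The informative feature should end up with the shortest length scale, while the two noise features drift towards very long ones. In the GPyTorch model above, the corresponding values should be readable from `model.covar_module.base_kernel.lengthscale`.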

Bayesian optimization (when each evaluation is expensive)#

When training one model takes hours (large datasets, large kernels, complex pipelines), grid search wastes evaluations on obviously bad regions. Bayesian optimization fits its own GP to the hyperparameter landscape and uses an acquisition function (expected improvement, UCB) to pick the next point. It is GPs all the way down, and it is genuinely the right tool for the job here.

# Requires scikit-optimize; reuses `pipe` from the grid-search example above.
from skopt import BayesSearchCV
from skopt.space import Real

bayes = BayesSearchCV(
    pipe,
    {"svc__gamma": Real(1e-4, 1e2, prior="log-uniform"),
     "svc__C":     Real(1e-2, 1e4, prior="log-uniform")},
    n_iter=30, cv=5, n_jobs=-1,
)
bayes.fit(X_train, y_train)

Thirty Bayesian iterations regularly match or beat a hundred grid points. The serial nature of BO is its downside — grid search is embarrassingly parallel, BO is sequential — but on a single machine BO wins.

Fault diagnosis: the four kernel pathologies#

Kernel methods fail in four characteristic ways. If you can tell them apart from the symptoms, you can fix them in minutes; if you cannot, you waste days.

Pathology 1: overfit#

Symptoms. Train accuracy near 1.0; test accuracy collapses. The Gram matrix looks nearly diagonal — every training point only resembles itself.

Causes. $\gamma$ too large (RBF bandwidth too narrow); polynomial degree too high; $C$ too large in SVM (regularization too weak); GP noise variance pinned at zero.

Fixes. Halve $\gamma$ until train and test accuracy come together. Cap polynomial degree at 3. Reduce $C$ by an order of magnitude. Add a small noise term to the GP. Gather more training data: overfit kernels are starved kernels.

Pathology 2: underfit#

Symptoms. Both train and test accuracy stuck low. The model predicts close to the marginal mean on every input.

Causes. Linear kernel on a nonlinear problem; RBF $\gamma$ way too small; $C$ way too small.

Fixes. Climb the expressivity ladder: linear → polynomial → RBF → DKL. Multiply $\gamma$ by 10 until you start to see variation in predictions. Multiply $C$ by 10. Verify the kernel actually responds to input differences: print $K(x_1, x_2)$ for two different training points and confirm it is not just $K(x_1, x_1)$.

Pathology 3: numerical instability#

Symptoms. GP fit raises “matrix is singular” or “Cholesky failed”. Kernel PCA returns negative eigenvalues. SVM solver does not converge.

Causes. The Gram matrix is rank-deficient or ill-conditioned. Most often: duplicate or near-duplicate training points; a custom kernel that is not actually positive-definite; an extreme $\gamma$ that pushes off-diagonal entries to machine zero.

Fixes.

Add jitter: K = K + 1e-6 * np.eye(n). The single most effective fix.
Standardise features: distance-based kernels collapse when features are on different scales.
Drop near-duplicate rows before fitting.
Use float64. Never fit a GP in float32 unless you know exactly what you are doing.
Verify positive-definiteness of custom kernels: np.all(np.linalg.eigvalsh(K) >= -1e-8).
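The jitter fix and the positive-definiteness check from the list above can be sketched on a deliberately broken Gram matrix. The data and the five duplicated rows are made up for illustration; whether the un-jittered Cholesky actually fails depends on rounding, so the code reports either outcome.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """RBF Gram matrix exp(-gamma * ||x - y||^2) on the rows of X."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
X = np.vstack([X, X[:5]])                  # append 5 exact duplicate rows
K = rbf_gram(X)

eigs = np.linalg.eigvalsh(K)
print(f"min eigenvalue before jitter: {eigs.min():.2e}")

try:
    np.linalg.cholesky(K)
    print("Cholesky succeeded without jitter (rounding happened to cooperate)")
except np.linalg.LinAlgError:
    print("Cholesky failed without jitter")

K_jit = K + 1e-6 * np.eye(K.shape[0])      # the jitter fix
L = np.linalg.cholesky(K_jit)              # now well-defined
is_psd = bool(np.all(np.linalg.eigvalsh(K_jit) >= -1e-8))
print(f"positive semi-definite after jitter: {is_psd}")
```

Jitter of `1e-6` lifts every eigenvalue by the same amount, which is enough to make the factorisation safe while changing predictions only marginally at this scale.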

Pathology 4: training is too slow#

Symptoms. SVM on $10^4$ samples runs for hours. GP regression dies past a few thousand samples. Memory blows up before training even starts.

Causes. $O(n^2)$ memory and $O(n^3)$ compute are not a typo. A $50\,000 \times 50\,000$ kernel matrix in float64 is 20 GB.
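The memory figure is simple arithmetic, and worth running before allocating anything. The helper below is a hypothetical convenience function that assumes a dense matrix and decimal gigabytes.

```python
def gram_memory_gb(n, bytes_per_entry=8):
    """Memory for a dense n x n Gram matrix, in decimal gigabytes (1e9 bytes)."""
    return n * n * bytes_per_entry / 1e9

for n in (1_000, 10_000, 50_000, 100_000):
    print(f"n = {n:>7,}: {gram_memory_gb(n):8.3f} GB in float64")
```

Switching to float32 (`bytes_per_entry=4`) halves the footprint but, as noted under Pathology 3, trades memory for numerical trouble.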

Fixes.

Use linear kernels whenever possible: LinearSVC and SGD scale to millions of samples trivially.
Nystrom approximation for nonlinear kernels: sklearn.kernel_approximation.Nystroem (Part 7).
Random Fourier features for stationary kernels: explicit $D$-dimensional features, error decays as $1/\sqrt{D}$.
Sparse / inducing-point GPs (gpytorch.models.ApproximateGP): scale GPs to $10^5$ and beyond.
When $n \gtrsim 10^5$ and inputs are raw images / audio / text: switch to deep learning or to DKL.
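Of the fixes above, random Fourier features are the easiest to sketch from scratch. The version below assumes the kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, whose spectral density is Gaussian with variance $2\gamma$ per dimension; the function name and the feature counts are illustrative.

```python
import numpy as np

def rff_features(X, num_features, gamma, rng):
    """Random Fourier features for k(x, y) = exp(-gamma * ||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))  # frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)                # phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
gamma = 0.1
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-gamma * d2)

errors = {}
for D in (50, 500, 5000):
    Z = rff_features(X, D, gamma, rng)
    errors[D] = np.abs(Z @ Z.T - K_exact).max()   # worst-case entry error
    print(f"D = {D:>4}: max |Z Z^T - K| = {errors[D]:.3f}")
```

Once `Z` is in hand, any linear model trained on it approximates the corresponding kernel machine at a cost that grows linearly in $n$, which is the whole attraction.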

Diagnosing pathologies from the Gram matrix#

The Gram matrix tells you which pathology you have before any metric does. Eyeball it, plot the eigenvalue spectrum, check the condition number — all three are 5-line tasks.

Four pathological Gram matrices: overfit, underfit, ill-conditioned, healthy

The visual cheatsheet.

Identity-like Gram (bright diagonal, dark off-diagonal): $\gamma$ too large, overfit. Eigenvalue spectrum: all eigenvalues near 1, no decay.
Uniform Gram (everything bright): $\gamma$ too small, underfit. Eigenvalue spectrum: one huge eigenvalue and rest tiny — rank-1 behaviour.
Visible block stripes (rows that look identical to other rows): near-duplicate inputs, ill-conditioned. Eigenvalue spectrum: a sharp cliff to zero.
Smooth gradient with block structure that matches the labels: healthy. Eigenvalue spectrum: smooth geometric decay.

A 10-line diagnostic function pays for itself a hundred times over:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
import numpy as np

def diagnose_gram(K, name="K"):
    eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
    eigs = np.clip(eigs, 1e-15, None)
    rank = int(np.sum(eigs > 1e-10))
    cond = eigs[0] / eigs[-1]
    diag_mean = np.diag(K).mean()
    off_mean = (K.sum() - np.trace(K)) / (K.shape[0] * (K.shape[0] - 1))
    print(f"[{name}] shape={K.shape} rank={rank} cond={cond:.2e} "
          f"diag_mean={diag_mean:.3f} off_mean={off_mean:.3f}")
    if off_mean < 0.05 * diag_mean:
        print("  -> Gram is near-diagonal: gamma likely TOO LARGE (overfit).")
    if off_mean > 0.95 * diag_mean:
        print("  -> Gram is near-uniform: gamma likely TOO SMALL (underfit).")
    if cond > 1e10:
        print("  -> Condition number huge: add jitter or drop duplicates.")

Run it before the first model fit, after every hyperparameter change, and especially after a custom-kernel implementation.

The modern relevance of kernels#

Kernels are not legacy; they are an active research frontier in 2026. Five areas where the kernel community is still pushing forward:

Small-data regimes. Many scientific and industrial tasks have $n < 10^3$ labelled samples and never will have more: protein property prediction, materials discovery, clinical outcome prediction, semiconductor yield modelling. Deep nets fall apart here; kernels and GPs are the workhorses. The 2020s have seen huge investment in physics-informed kernels that bake in conservation laws.

Bayesian optimization. Hyperparameter sweeps, neural architecture searches, and chemistry optimizations that run on BoTorch, Ax, or scikit-optimize’s GP-based optimizers are using a Gaussian process internally. (Optuna is a partial exception: its default sampler is the tree-structured Parzen estimator, with GP-based samplers available as options.) BO is the killer app of GPs in the deep-learning era: kernels tuning kernels tuning kernels, all the way up to GPT.

Scientific ML. Surrogate models for PDE solvers, climate emulators, and molecular dynamics rely on GPs because they need calibrated uncertainty and they do not have millions of simulations to train on. Deep operator networks and Fourier neural operators are taking some ground here, but GPs remain the bar to beat for sample efficiency.

Neural tangent kernels (NTK). Jacot, Gabriel, and Hongler showed in 2018 that an infinite-width neural network trained with gradient descent behaves exactly like a kernel method with a specific (computable) kernel. This is one of the most surprising results in modern ML theory: the limit of “deep learning” is a kernel method, and the NTK is a window into why deep networks generalise. The practical takeaway is more philosophical than algorithmic — finite networks are not actually NTKs — but the theory is one of the cleanest bridges between the two communities.

Conformal prediction with kernels. Distribution-free uncertainty quantification on top of any base predictor — including kernel models — is the cleanest non-Bayesian way to get calibrated intervals in 2026.
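As a rough illustration of the last item, split conformal prediction wraps around a kernel ridge regressor in a dozen lines. Everything here is an assumption made for the sketch: the synthetic sine data, the RBF kernel with $\gamma = 1$, the ridge penalty `lam`, and the 90% target coverage.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """RBF kernel between the rows of A and the rows of B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.uniform(-3.0, 3.0, size=(n, 1))
    return X, np.sin(X[:, 0]) + 0.2 * rng.normal(size=n)

X_fit, y_fit = make_data(200)      # used to fit the kernel ridge model
X_cal, y_cal = make_data(200)      # held-out calibration split
X_test, y_test = make_data(1000)

# Kernel ridge regression: alpha = (K + lam * I)^{-1} y
lam = 1e-1
alpha = np.linalg.solve(rbf(X_fit, X_fit) + lam * np.eye(len(X_fit)), y_fit)

def predict(X):
    return rbf(X, X_fit) @ alpha

# Split conformal: take a finite-sample-corrected quantile of |residuals|.
miscoverage = 0.1
scores = np.sort(np.abs(y_cal - predict(X_cal)))
k = int(np.ceil((len(scores) + 1) * (1 - miscoverage)))
q = scores[k - 1]                  # half-width of every prediction interval

coverage = np.mean(np.abs(y_test - predict(X_test)) <= q)
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

The intervals `predict(x) ± q` have constant width, unlike a GP's input-dependent variance; the guarantee is marginal coverage under exchangeability, with no distributional assumption on the noise.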

The 5-step kernel decision flowchart#

Compress everything in this series into one flowchart that fits on a single screen.

The five-step kernel method decision flowchart

The flowchart in words:

Step 1: Frame the problem. Is the task linearly separable (does a linear baseline already work)? If yes, ship a linear kernel — fast, interpretable, hard to beat on sparse high-d data like text. If no, continue.

Step 2: Identify data type. Time series with strong seasonality? Sum a Periodic kernel (seasonality) with an RBF (trend) and a WhiteNoise (residual). If the seasonal pattern itself drifts over time, multiply instead: Matern * Periodic + WhiteNoise is the standard locally periodic recipe in the GP community.

Step 3: Fine smoothness control. If you are doing GP regression and you care about the function’s differentiability class, use Matern with $\nu \in \{1/2, 3/2, 5/2\}$. Default to $\nu = 5/2$ if you have no specific belief: twice continuously differentiable, the sweet spot for most physical signals.

Step 4: High-dim sparse with known interactions. Text, n-grams, gene-gene interactions, GWAS. Start with linear; if you need to model second-order interactions explicitly, use a polynomial kernel of degree 2 or 3 — never higher.

Step 5: Default and scale. RBF kernel with hyperparameters tuned via log-grid cross-validation (or marginal likelihood if GP). Above $n \sim 10^4$, switch to Nystrom or RFF for SVMs and to sparse / inducing-point GPs for probabilistic models. Above $n \sim 10^6$ on raw modalities, switch to a deep network or to DKL with a deep encoder.
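The kernels named in Steps 2 and 3 have short closed forms. The sketch below writes them as functions of the distance $r = |t - t'|$ with unit signal variance; the function names, the period, the length scales and the noise level are illustrative assumptions, not recommended settings.

```python
import numpy as np

def matern(r, nu, lengthscale=1.0):
    """Matern kernel as a function of distance r, for nu in {0.5, 1.5, 2.5}."""
    s = np.sqrt(2.0 * nu) * r / lengthscale
    if nu == 0.5:
        return np.exp(-s)
    if nu == 1.5:
        return (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        return (1.0 + s + s**2 / 3.0) * np.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} are implemented here")

def periodic(r, period, lengthscale=1.0):
    """Periodic (exp-sine-squared) kernel."""
    return np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / lengthscale**2)

def rbf(r, lengthscale=1.0):
    return np.exp(-0.5 * (r / lengthscale) ** 2)

t = np.linspace(0.0, 10.0, 60)
r = np.abs(t[:, None] - t[None, :])

# Step 2: periodic (seasonality) + RBF (trend) + white noise (residual).
K_season = periodic(r, period=1.0) + rbf(r, lengthscale=5.0) + 0.1 * np.eye(len(t))
# Step 3 default: Matern with nu = 5/2.
K_matern = matern(r, nu=2.5)

print("min eigenvalue, seasonal kernel:", np.linalg.eigvalsh(K_season).min())
print("min eigenvalue, Matern 5/2:     ", np.linalg.eigvalsh(K_matern).min())
```

Sums and products of valid kernels are valid kernels, so the composite stays positive-definite; the white-noise term additionally bounds its smallest eigenvalue away from zero.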

Print the flowchart, tape it to the wall, never argue about kernel choice in code review again.

Series concept map#

The eight parts of this series are not a heap of independent topics — they form a directed graph from theory to practice.

Concept map of the eight kernel-methods series parts

Parts 1-4 (theory and catalogue): what kernels are and why they exist. The linear ceiling motivates the kernel trick; Mercer and the RKHS give it a rigorous home; the kernel catalogue is the actual menu.
Parts 5-7 (algorithms): how to compute with kernels. The representer theorem reduces infinite optimisation to a finite linear algebra problem; GPs lift the picture into Bayesian space; Nystrom and RFF rescue you from $O(n^3)$.
Part 8 (synthesis): how to decide. DKL bridges to deep nets; the decision flowchart turns the menu into a procedure; the concept map ties the eight parts back together.

Three themes thread through all eight:

Theme A: theory drives practice. Positive-definiteness is not a footnote — every algorithm in Parts 5-7 exists because the Gram matrix has the right spectral properties. When you skip the math, you skip the diagnosis.

Theme B: kernels are a modelling decision, not a numerical trick. Choosing RBF over Matern is a statement about the smoothness class your data belongs to. Choosing periodic plus RBF is a statement about the temporal structure. Kernels make these modelling choices explicit; deep nets often hide them inside architecture choices.

Theme C: the frontier is hybrid. DKL, NTK, GP-augmented BO, physics-informed kernels — the most active research mixes kernels with deep learning rather than picking one camp. The next decade of ML will keep blurring the line.

What’s next#

You have finished the eight-part journey. Where to go from here depends on which direction pulled you in.

For Bayesian optimization specifically: Rasmussen and Williams Chapter 5; Frazier’s 2018 tutorial “A Tutorial on Bayesian Optimization” (arXiv:1807.02811); the BoTorch documentation. If you ever tune a deep-learning hyperparameter sweep with Ax or BoTorch, you are already doing this, and now you know the kernel underneath.

For scientific ML and physics-informed kernels: Raissi, Perdikaris, Karniadakis “Physics-Informed Neural Networks” and the follow-up work on GP-based PDE solvers. The Earth Science community has built large GP emulators of climate models.

For deeper kernel theory: Steinwart and Christmann Support Vector Machines (2008); Berlinet and Thomas-Agnan Reproducing Kernel Hilbert Spaces in Probability and Statistics (2004). These are reference textbooks, not weekend reading, but they are the canonical sources for every theorem we cited.

For NTK and the deep-kernel bridge: Jacot, Gabriel, Hongler “Neural Tangent Kernel: Convergence and Generalization in Neural Networks” (NeurIPS 2018); Arora et al. “On Exact Computation with an Infinitely Wide Neural Net” (NeurIPS 2019). These papers are the Rosetta Stone between the two communities.

For practical Gaussian processes: GPyTorch tutorials (the ExactGP, ApproximateGP, DeepKernelLearning examples) and GPflow’s getting-started notebooks. Both are excellent; pick the one that matches your PyTorch or TensorFlow preference.

A challenge. Take one project you currently solve with XGBoost or a small neural network. Reframe it as a kernel problem: pick a kernel using the 5-step flowchart, run the diagnostic, tune on a log grid. Compare. The result will be one of three things — better, worse, or about the same — and in all three cases you will have learned something the rest of your ML practice would not have taught you.

Kernels are not the past of machine learning. They are the part of machine learning that knows what it is doing.

References#

Wilson, Hu, Salakhutdinov, Xing. Deep Kernel Learning (AISTATS, 2016). The DKL paper.
Rasmussen and Williams. Gaussian Processes for Machine Learning (MIT Press, 2006). The GP bible; free PDF.
Hofmann, Schölkopf, Smola. Kernel Methods in Machine Learning (Annals of Statistics, 2008). Comprehensive survey.
Jacot, Gabriel, Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks (NeurIPS, 2018). The NTK paper.
Frazier. A Tutorial on Bayesian Optimization (2018). The canonical introduction to BO.
Gardner, Pleiss, Bindel, Weinberger, Wilson. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration (NeurIPS, 2018). The library underlying every modern GP example here.

This is Part 8 (final) of Kernel Methods (8 parts). Previous: Part 7 — Large-Scale Kernels (Nystrom + RFF)

Series complete — see the full Kernel Methods index .
