Series · Kernel Methods · Chapter 4

Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More

A tour of the kernels you'll actually use: RBF (Gaussian), polynomial, linear, Matern, periodic, sigmoid. When to pick which, hyperparameter intuition, and how kernels combine.

You type SVC(kernel='rbf') in scikit-learn for the first time. What did you set gamma to? 'scale'? 'auto'? You scrolled past those defaults without thinking. Three months later your model is overfitting, your Gram matrix looks like the identity, and you have no idea which knob is wrong. Most “kernel tuning” debt is really kernel choice debt — you picked the default kernel for the wrong reason, and now no amount of grid search will save you.

Six common kernels, six characters of decision boundary

The previous three parts built up the why and the theory: linear methods hit a ceiling (Part 1), positive-definite kernels lift you off it (Part 2), and the RKHS makes the lifted space a real Hilbert space (Part 3). This part is the menu. Six kernel families, what each one assumes about your data, and the dirty little rules of thumb that nobody puts in the docstring. Each section gives you the math, the hyperparameter that actually matters, a working scikit-learn snippet, and the failure mode you should expect when you misuse it.

RBF (Gaussian) kernel: the default king#

$$K(x, y) = \exp\!\left(-\gamma \,\|x - y\|^2\right) \quad \text{or equivalently} \quad \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right),$$

is what 'rbf' in scikit-learn evaluates to. The two forms relate by $\gamma = 1/(2\sigma^2).$ Textbook people prefer $\sigma$ (a length scale, same units as the data); library people prefer $\gamma$ (an inverse-squared-length, gets bigger when you want sharper). Pick one and stick with it for an afternoon.

RBF kernel shape under different gamma values

Why it’s the default. Three properties combine to make RBF the kernel you reach for when you don’t know better. First, it’s universal: given enough data and the right bandwidth, it can approximate any continuous function on a compact set arbitrarily well. Second, it’s infinitely differentiable, so the implied function class is very smooth. Third, the feature space is infinite-dimensional — every $\lambda_k$ in the Mercer expansion is positive — so you cannot exhaust the expressivity by adding more data.

The one knob: $\gamma$ . Almost all RBF tuning is tuning $\gamma$ . Too large and each training point is an island; the kernel matrix is nearly identity; train accuracy hits 1, test accuracy collapses. Too small and every pair of points looks identical; the kernel is essentially constant; the model underfits to the global mean. The two failure shapes are easy to recognize once you have seen them, but the gap between “too large” and “too small” is often only a single order of magnitude.

$$\sigma_0 = \mathrm{median}(\|x_i - x_j\|), \qquad \gamma_0 = \frac{1}{2 \sigma_0^2}.$$

It works because it puts the kernel right in the middle of the data’s distance distribution: not so wide that everything looks the same, not so narrow that everything looks different. Then you log-grid one decade either side and let CV pick.

In code, the median heuristic is three lines:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
import numpy as np
from sklearn.metrics import pairwise_distances

def median_gamma(X, sample=2000, random_state=0):
    """Median heuristic for RBF gamma. Subsample for O(n^2) safety."""
    rng = np.random.default_rng(random_state)
    if len(X) > sample:
        X = X[rng.choice(len(X), sample, replace=False)]
    d = pairwise_distances(X)
    sigma = np.median(d[d > 0])
    return 1.0 / (2.0 * sigma ** 2)

Worked example: RBF SVM on make_moons. The two-moons dataset is the canonical non-linearly-separable toy. Linear SVM caps out around 88% accuracy; RBF cleans it up to 99%+.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=2000, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gamma0 = median_gamma(X_tr)
grid = {"svc__C": [0.1, 1, 10, 100],
        "svc__gamma": [gamma0 * 10 ** k for k in (-1, -0.5, 0, 0.5, 1)]}

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
gs = GridSearchCV(pipe, grid, cv=5, n_jobs=-1).fit(X_tr, y_tr)
print("best:", gs.best_params_, "test acc:", gs.score(X_te, y_te))
# best: {'svc__C': 10, 'svc__gamma': ~gamma0} test acc: 0.993

Two observations worth keeping: (1) the winning $\gamma$ in the grid is almost always within half a decade of the median-heuristic seed, which is why grid-searching three decades is wasted compute; (2) the winning $C$ depends on noise level — for low-noise data $C$ wants to be large (margins should be tight), for high-noise data $C$ wants to be small (margins should be loose).

Worked example: GP regression with RBF. The same kernel powers Gaussian process regression. Here we fit a noisy sinusoid; the GP returns both a posterior mean and a posterior standard deviation, which is the killer feature relative to SVM.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(len(X))

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)
print("learned kernel:", gp.kernel_)
# e.g. 0.84**2 * RBF(length_scale=1.21) + WhiteKernel(noise_level=0.011)

The marginal-likelihood optimizer learns length_scale ~ 1.2 and noise_level ~ 0.01, both close to the truth. Re-fit with length_scale=0.05 (way too small) and you will see the posterior interpolate every training point exactly while the predictive variance explodes between points — the classic “GP overfitting” diagnostic.

sklearn note. gamma='scale' evaluates to $1 / (n_{\text{features}} \cdot \mathrm{Var}(X))$ and is fine for feature-standardised data. gamma='auto' evaluates to $1/n_{\text{features}}$ and is fine for absolutely nothing — it ignores the actual data spread. Don’t use it.

When not to reach for RBF. Three cases that I see junior practitioners get wrong every time. (1) Very high-dimensional sparse data — text TF-IDF, one-hot categoricals — where the median pairwise distance is uninformative because every pair is already nearly orthogonal. Linear wins. (2) Time series with known periodicity. RBF on the time index is a terrible model for seasonality; the kernel does not know that day 366 is similar to day 1. Use a periodic kernel. (3) Genuinely discrete inputs — strings, graphs, trees. RBF on a flattened representation throws away the structure that makes the data interesting; use a specialised kernel.

Polynomial kernel#

$$K(x, y) = (\gamma\, \langle x, y \rangle + c)^d.$$

The polynomial kernel corresponds explicitly to the feature map of all monomials up to degree $d$ . For $d=2,$ that’s pairwise products $x_i x_j;$ for $d=3,$ triples; and so on. The feature space is finite-dimensional — $\binom{d_{\text{in}} + d}{d}$ — which is huge but bounded, unlike RBF’s infinite spectrum.

Polynomial kernel for degrees 1, 2, 3, 5

$$(x \cdot y + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2.$$ $$\phi(x) = (1,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2) \in \mathbb{R}^6.$$

You can verify numerically:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[3.0, 0.5]])
phi = PolynomialFeatures(degree=2, include_bias=True).fit_transform
# rescale to match (1 + <x,y>)^2 inner product
scale = np.array([1, np.sqrt(2), np.sqrt(2), 1, 1, np.sqrt(2)])
print((phi(x)[0] * scale) @ (phi(y)[0] * scale))   # explicit feature inner product
print(polynomial_kernel(x, y, degree=2, gamma=1, coef0=1)[0, 0])   # kernel value
# both print 36.0

The two numbers agree to machine precision. This is the “kernel trick” in its simplest possible form: a 6-dimensional inner product computed in time $O(d_{\text{in}}) = O(2)$ instead of $O(6).$ The constant-factor win is invisible here but matters at $d=5$ on $\mathbb{R}^{100},$ where the explicit feature map has 96 million coordinates.

KernelRidge(poly) vs Ridge(PolynomialFeatures). Mathematically equivalent, computationally different:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
from sklearn.datasets import fetch_california_housing
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
import time, numpy as np

X, y = fetch_california_housing(return_X_y=True)
X = StandardScaler().fit_transform(X)[:4000]
y = y[:4000]

t0 = time.time()
KernelRidge(kernel="poly", degree=2, gamma=1, coef0=1, alpha=1.0).fit(X, y)
print("kernel ridge:", time.time() - t0, "s")

t0 = time.time()
make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(X, y)
print("explicit poly + ridge:", time.time() - t0, "s")

Kernel ridge dominates when $n < d_{\text{out}}$ (here $d_{\text{out}} = \binom{8+2}{2} = 45,$ so explicit features win); explicit features dominate when the expanded dimension is small. The crossover formula is roughly $n^3 \approx n \cdot d_{\text{out}}^2;$ above that, switch to explicit features.

When to use it. Sparse high-dimensional data where you have named interactions in mind. The two paradigm cases:

  • NLP with bigrams or trigrams. A polynomial-2 kernel over word-count vectors essentially counts shared word pairs.
  • Genomics with epistasis. Pairwise SNP interactions are the substance of many GWAS analyses.

When not to use it. Dense low-dimensional data with smooth structure. RBF wins almost every time here, because polynomials of moderate degree are too rigid to wrap around continuous decision boundaries, and high-degree polynomials oscillate violently.

Hyperparameter cheats.

  • Degree $d$ . Stay in $\{2, 3\}.$ Degree-5+ polynomials almost always overfit (the monomial basis is too high-leverage).
  • $\gamma$ . Scales the inner product. Polynomial kernels are brutally sensitive to feature magnitudes — always standardise first, or $\gamma\langle x, y\rangle$ blows up on the high-degree term.
  • $c$ . The free coefficient. $c=0$ gives only top-order monomials (pure degree-$d$ ). $c=1$ mixes all orders 0 through $d.$ The default $c=1$ is almost always what you want.
comparisonpolynomial-2RBFlinear
sklearn alias'poly', degree=2'rbf''linear'
key knob$d, \gamma, c$$\gamma$$C$ only
feature dim$\binom{d_{\text{in}}+d}{d}$$\infty$$d_{\text{in}}$
smoothness$d$ -times differentiable$C^\infty$$C^\infty$ but linear
best fornamed interactionssmooth densehigh-d sparse
typical failurehigh-degree blowupidentity Gramunderfitting

A debugging cautionary tale. I once spent a week tuning a polynomial-3 SVM on a “categorical interactions” model that refused to converge. The fix turned out to be standardisation — one feature had a range of $[0, 10^4]$ while the rest were in $[-1, 1].$ At degree 3 the cube of that one feature dominated the kernel value by twelve orders of magnitude, so the Gram matrix was effectively rank-1 and the QP solver thrashed. The lesson is universal: always standardise before a polynomial kernel; the kernel sees the scale, even when it does not see the meaning.

Linear kernel: the “no kernel” kernel#

$$K(x, y) = \langle x, y \rangle.$$

The trivial kernel: no feature map, $O(d)$ per evaluation, equivalent to running the algorithm directly on the raw data. Why is it even listed? Because for some data shapes it is the right answer, and switching to RBF buys you nothing but tuning headaches.

When linear is right.

  • Linearly separable data. Often more common than you’d guess once you’ve already engineered features.
  • Very high-dimensional sparse data. Text (TF-IDF, bag-of-words), gene expression matrices, one-hot encoded categoricals at scale. In high $d,$ random unit vectors are nearly orthogonal — so the data already looks “spread out enough” that an RBF kernel can’t help.
  • As a baseline. Always fit a linear SVM or linear ridge regression before you reach for a nonlinear kernel. If linear wins, take the win — it’s faster to train, faster to predict, faster to interpret.

Worked example: LinearSVC vs SVC(kernel='linear') vs SGD. Three sklearn classes solve the same linear-SVM optimization but with very different scaling profiles. Run them on the 20-newsgroups dataset (TF-IDF, ~11k samples, ~130k features):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=130000).fit_transform(data.data)
y = data.target

for name, clf in [
    ("SVC(linear)", SVC(kernel="linear", C=1.0)),         # libsvm, dual
    ("LinearSVC",   LinearSVC(C=1.0, dual="auto", max_iter=2000)),
    ("SGDClassifier", SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20)),
]:
    t0 = time.time()
    clf.fit(X, y)
    print(f"{name:18s}: {time.time() - t0:.1f}s")

On a typical laptop you should see roughly: SVC(linear) 90 s, LinearSVC 4 s, SGDClassifier 1.5 s. The kernel-based path (SVC) constructs an $n \times n$ Gram matrix and pays $O(n^2)$ memory plus $O(n^2 d)$ time; the explicit linear path (LinearSVC, SGD) works directly on the sparse feature matrix and pays $O(\text{nnz})$ per pass.

A rule of thumb. For text classification, linear SVMs often beat RBF SVMs on accuracy and run 100× faster. The curse of dimensionality is on your side here; embrace it.

Hyperparameter cheats.

  • $C$ only. No $\gamma,$ no $d,$ no $c.$ Log-grid $\{10^{-2}, 10^{-1}, 1, 10, 10^2\}$ and you are done.
  • Always standardize (or TfidfVectorizer already gives you unit-norm rows). Linear SVM is invariant to feature scale only after the optimizer converges; bad scaling slows convergence by orders of magnitude.
  • dual='auto' in LinearSVC picks primal when $n > d$ and dual when $n < d.$ Always faster than the libsvm SVC(kernel='linear').

Why the SGD path can beat coordinate descent. SGDClassifier with loss='hinge' solves the same primal as LinearSVC but with stochastic gradient updates. For very large $n$ (millions of documents) SGD’s per-step cost is $O(\text{nnz of one sample})$ instead of $O(\text{nnz of all})$ , so a single pass can outperform a full coordinate-descent epoch. The price is hyperparameter sensitivity — alpha, learning_rate, and max_iter all matter — and for moderate $n$ in the tens of thousands LinearSVC is faster and more accurate. Use SGD when you outgrow LinearSVC’s memory, not before.

Matern kernel: the GP workhorse#

$$K_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\!\nu} \! \mathcal{K}_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right), \qquad r = \|x - y\|,$$

where $\mathcal{K}_\nu$ is the modified Bessel function of the second kind and $\Gamma$ the Gamma function. Looks scary; in practice nobody computes it by hand because it has a tunable smoothness parameter $\nu$ and three special cases that close-form simplify:

  • $\nu = 1/2:$ exponential kernel, $K = \exp(-r/\ell).$ Sample paths are continuous but nowhere differentiable. Use for rough functions.
  • $\nu = 3/2:$ once-differentiable. Geostatistics default. Closed form $K = (1 + \sqrt{3}r/\ell)\exp(-\sqrt{3}r/\ell).$
  • $\nu = 5/2:$ twice-differentiable. The GP regression workhorse. Closed form $K = (1 + \sqrt{5}r/\ell + 5 r^2/(3\ell^2))\exp(-\sqrt{5}r/\ell).$
  • $\nu \to \infty:$ recovers RBF. Infinitely smooth.

Matern kernel sample paths under different nu values

Why use Matern over RBF. RBF’s infinite smoothness is unrealistically strong for almost every real function. Stock prices, sensor readings, geological measurements — none of them are infinitely differentiable. A Matern-$5/2$ kernel admits sample paths that have a second derivative but no third, which matches reality much more closely. The classic GP regression failure mode — “my posterior is too confident, my length scales blow up, my hyperparameter optimiser diverges” — is often cured by switching RBF $\to$ Matern-$5/2.$

Choosing $\nu$ in three sentences.

  1. If you suspect the underlying function has kinks or sudden changes (financial returns, segment boundaries), use $\nu = 1/2.$
  2. If the function is “physical” (temperature, terrain, sensor drift), use $\nu = 5/2$ — second derivatives exist, third do not.
  3. Use $\nu = 3/2$ only if you have a specific reason (it is the standard in geostatistics, less so elsewhere).

Worked example: Matern vs RBF on noisy data. Fit both kernels to the same noisy 1D function and compare predictive performance out of sample.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)
y_true = np.sin(X).ravel() + 0.3 * np.sin(3 * X).ravel()
y = y_true + 0.15 * rng.standard_normal(len(X))
X_te = np.linspace(0, 10, 500).reshape(-1, 1)
y_te = np.sin(X_te).ravel() + 0.3 * np.sin(3 * X_te).ravel()

for name, k_base in [("RBF", RBF(1.0)), ("Matern-5/2", Matern(1.0, nu=2.5))]:
    gp = GaussianProcessRegressor(
        k_base + WhiteKernel(0.01), n_restarts_optimizer=5, random_state=0
    ).fit(X, y)
    mse = mean_squared_error(y_te, gp.predict(X_te))
    print(f"{name:12s} MSE={mse:.4f}, kernel={gp.kernel_}")

On rough-but-not-jagged data like this, Matern-$5/2$ typically posts 5-15% lower MSE than RBF, with calibrated predictive variances. On truly smooth data (analytic functions, fluid simulations), RBF wins by a hair. The rule “default to Matern-$5/2,$ verify with marginal likelihood” almost never goes wrong.

scikit-learn API. sklearn.gaussian_process.kernels.Matern(length_scale=1.0, nu=1.5). Note nu is a fixed parameter at construction time, not learned by fit(). You pick $\nu$ from $\{1/2, 3/2, 5/2\}$ by domain knowledge; length_scale and the noise variance are learned.

comparisonMatern-$1/2$Matern-$3/2$Matern-$5/2$RBF (Matern-$\infty$ )
differentiability012$\infty$
sample pathscontinuous, jaggedonce-smoothtwice-smoothanalytic
typical usefinance, kinksgeostatsGP defaultsmooth physics
sklearn nu0.51.52.5use RBF directly
length-scale rolesame as RBF $\ell$samesamesame

A subtle gotcha: noise vs. smoothness confusion. When you switch from RBF to Matern-$1/2$ and your fit suddenly looks worse, the usual culprit is that you have absorbed measurement noise into kernel roughness. Matern-$1/2$ says the function itself is rough; WhiteKernel says the function is smooth but you observe it through noise. Visually the two can look identical on a single sample but they extrapolate completely differently — Matern-$1/2$ gives jagged predictions, the noisy-smooth model gives confident smooth predictions plus a noise envelope. If you are unsure, fit both and compare marginal likelihoods; sklearn returns this as gp.log_marginal_likelihood(gp.kernel_.theta).

Periodic and spectral kernels#

$$K(x, y) = \exp\!\left(-\frac{2 \sin^2(\pi \|x - y\| / p)}{\ell^2}\right),$$

captures strict periodicity with period $p.$ Two inputs at distance exactly $p$ have $K = 1$ (perfectly correlated); two at distance $p/2$ have $K$ at its minimum. The $\ell$ parameter controls how tightly the periodicity holds within a single cycle.

Periodic kernel fitting seasonal data

When to use. Anywhere you have a known period: temperature (yearly), electricity load (daily + weekly), retail sales (weekly + yearly), audio pitch tracking, EEG rhythms.

Spectral mixture kernels generalise this: a sum of Gaussians in the frequency domain corresponds to a complicated, possibly quasi-periodic kernel in the input domain. The Wilson–Adams 2013 paper showed these can discover unknown periodicities automatically. Useful when you suspect periodicity but don’t know the period.

Worked example: Mauna Loa CO2. This is the canonical demonstration of compositional kernels. The atmospheric CO2 record from 1958 onward has three obvious components — a smooth long-term trend, a yearly seasonal cycle, and short-term measurement noise — and a fourth less-obvious one (medium-term decadal wiggles from El Niño-class events). We will fit a sum of four kernels and look at what the optimizer learns.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, WhiteKernel, ExpSineSquared, RationalQuadratic, ConstantKernel,
)

co2 = fetch_openml(data_id=41187, as_frame=True)
co2_df = (co2.frame
            .assign(date=lambda d: d["year"] + (d["month"] - 1)/12 + (d["day"] - 1)/365)
            .groupby("date", as_index=False)["co2"].mean())
X = co2_df["date"].to_numpy().reshape(-1, 1)
y = co2_df["co2"].to_numpy()
X_mean, y_mean = X.mean(), y.mean()
Xc, yc = X - X_mean, y - y_mean

# long-term trend (smooth, decades-scale)
k_long  = ConstantKernel(50.0) * RBF(length_scale=50.0)
# yearly periodic component that decays with distance
k_seas  = (ConstantKernel(2.0)
           * RBF(length_scale=100.0)
           * ExpSineSquared(length_scale=1.0, periodicity=1.0,
                            periodicity_bounds="fixed"))
# medium-term irregularities
k_med   = ConstantKernel(0.5) * RationalQuadratic(length_scale=1.0, alpha=1.0)
# noise
k_noise = WhiteKernel(noise_level=0.05)

kernel = k_long + k_seas + k_med + k_noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=False,
                              n_restarts_optimizer=2, random_state=0).fit(Xc, yc)
print(gp.kernel_)

# extrapolate 20 years past the training data
X_future = np.linspace(X.min(), X.max() + 20, 2000).reshape(-1, 1)
mu, sd = gp.predict(X_future - X_mean, return_std=True)
mu = mu + y_mean

The optimizer learns a long-term RBF length scale on the order of 50 years (the trend is smooth), a periodic component with period exactly 1 year, a yearly-cycle decay length on the order of 100 years (cycles persist), and noise around 0.2 ppm (matching the measurement precision). The 20-year extrapolation produces the famous Rasmussen–Williams figure: a confident extension of the upward trend, the yearly cycle continuing, and uncertainty bands that widen sub-linearly because the model has learned the structure, not memorized the data.

The remarkable thing is that swapping out any one of the four kernels degrades the fit visibly. Drop k_seas and the model cannot extrapolate seasonality. Drop k_med and the residuals show systematic decadal bias. Drop k_long and the trend wanders. Each kernel encodes one prior, and the prior is literal — readable from the kernel expression.

Hyperparameter cheats for periodic kernels.

  • periodicity should usually be _bounds='fixed'. If you know the period (1 day, 1 year), set it and freeze it. Letting the optimizer wander finds spurious periods.
  • Multiply by an RBF. Pure ExpSineSquared enforces strict periodicity forever, which is too rigid for real data. RBF * ExpSineSquared gives a quasi-periodic kernel that lets the cycle drift slowly.
  • Watch the length scale $\ell$ inside ExpSineSquared. Small $\ell$ means a spiky periodic structure (the function is very different at distance $p/2$ ); large $\ell$ means a smooth sinusoidal cycle. The sklearn default length_scale=1.0 is reasonable on standardized data.

Sigmoid kernel: a warning#

$$K(x, y) = \tanh(\gamma\, \langle x, y \rangle + c).$$

Modelled after a neural-network activation, the sigmoid kernel has one fatal flaw: it is not always positive definite. Only for certain ranges of $\gamma$ and $c$ does the resulting Gram matrix stay PSD; outside that range, SVM solvers fail, kernel PCA returns negative eigenvalues, and GP regression silently misbehaves.

Code-verified counter-example. Generate 50 random 5D points and compute the Gram matrix for several $(\gamma, c)$ settings:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))

for gamma, c in [(0.1, 1.0), (1.0, 1.0), (1.0, -1.0), (10.0, 0.0)]:
    K = sigmoid_kernel(X, gamma=gamma, coef0=c)
    eig_min = np.linalg.eigvalsh(K).min()
    label = "PSD" if eig_min > -1e-8 else "NOT PSD"
    print(f"gamma={gamma:>4}, c={c:>4} -> min eig = {eig_min:+.4f}  ({label})")

# gamma= 0.1, c= 1.0 -> min eig = +0.0021  (PSD)
# gamma= 1.0, c= 1.0 -> min eig = -2.4318  (NOT PSD)
# gamma= 1.0, c=-1.0 -> min eig = -0.8517  (NOT PSD)
# gamma=  10, c= 0.0 -> min eig = -7.1490  (NOT PSD)

Only the first setting is safely PSD. The others have negative eigenvalues big enough to break libsvm’s QP solver, and you will not get a clean error message — you will get a model that “trains” but whose predictions are garbage. The theoretical PSD region (Lin & Lin 2003) is roughly $\gamma > 0$ together with $c \le 0,$ but even there it is conditional, not unconditional.

If you find sigmoid kernels in a paper from 2002 — sure, they were popular before the deep-learning era as the “neural-network kernel”. In 2026, if you want neural-network-style nonlinearity, train a neural network. The sigmoid kernel is on this list mainly so you recognise it in legacy code and know to be suspicious.

Combining kernels: build with addition and multiplication#

The genuinely magical fact about kernels is that the PSD property is closed under several natural operations:

  • Sum. $K_1 + K_2$ is PSD. Models the data as having two separate sources of structure. Decision boundaries from each kernel add.
  • Product. $K_1 \cdot K_2$ is PSD. Models interaction between two sources of structure. Both kernels must agree for the product to be high.
  • Scaling. $c \cdot K$ for $c > 0$ is PSD. Just changes the variance.
  • Mapping. $K(\phi(x), \phi(y))$ is PSD for any $\phi.$

This means you can build a kernel like LEGO to match your prior knowledge of the data’s structure.

Composing RBF and periodic kernels to fit a CO2-like signal

The canonical example revisited: Mauna Loa CO2. We already wrote the four-kernel CO2 sum above. Here is the grammar that produced it, applied to one more example so the pattern sinks in. Imagine forecasting daily store sales. Your data has:

  • a long-term upward trend (the chain is growing),
  • a weekly cycle (weekends spike),
  • a yearly cycle (Christmas spikes),
  • short-term promotional bursts,
  • noise.

The kernel writes itself:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
from sklearn.gaussian_process.kernels import (
    RBF, WhiteKernel, ExpSineSquared, ConstantKernel as C,
)

k_trend  = C(1.0) * RBF(length_scale=365.0)
k_week   = C(0.5) * RBF(length_scale=400.0) * ExpSineSquared(1.0, periodicity=7,    periodicity_bounds="fixed")
k_year   = C(1.0) * RBF(length_scale=800.0) * ExpSineSquared(1.0, periodicity=365,  periodicity_bounds="fixed")
k_burst  = C(0.3) * RBF(length_scale=10.0)
k_noise  = WhiteKernel(noise_level=0.1)

kernel = k_trend + k_week + k_year + k_burst + k_noise

This compositional view is why GPs are so much more interpretable than neural networks — the structure of your kernel literally tells you what kind of structure the model is allowed to fit.

operationkernel formmeaningexample
sum$K_1 + K_2$independent additive componentstrend + seasonality
product$K_1 \cdot K_2$interaction, both must agreeseasonality that decays with time
scaling$c \cdot K$change output variancetune signal vs noise
input warping$K(\phi(x), \phi(y))$new feature representationRBF on log-prices

Common composition pitfalls. Three things go wrong even after you understand the grammar. First, parameter explosion: a sum of four kernels with two hyperparameters each has eight parameters to optimize jointly, and n_restarts_optimizer=2 is rarely enough — bump it to 5 or 10 and accept the wall-clock cost. Second, unidentifiability: two RBFs with very different length scales can trade off mass in ways the optimizer cannot resolve, especially when one length scale dwarfs the data range; freeze the long-range one with a hand-picked value. Third, missing bounds: by default sklearn searches length_scale over $[10^{-5}, 10^5],$ which is fine for standardized data but absurd if your raw inputs span seconds-to-years; always set explicit length_scale_bounds when the data scale is unusual.

A kernel-selection decision tree#

The single most asked question is “which kernel should I try first?” Here is the decision tree that captures 90% of practice.

Kernel selection decision tree

Step 0: always fit a linear baseline first. If linear is good enough, you are done.

Step 1: classify your data shape.

  • Spatial / smooth and dense. Coordinates, sensor measurements, image features. → RBF or Matern-$5/2.$
  • Temporal with seasonality. Time series, audio, weather. → Sum of RBF (trend) + Periodic (cycle) + WhiteNoise.
  • High-d sparse. Text, one-hot, gene expression. → Linear.
  • Known interactions. NLP n-grams, GWAS epistasis. → Polynomial degree 2 or 3.
  • Categorical / structured. Strings, trees, graphs. → A specialised kernel (string kernel, graph kernel, tree kernel) — these are a whole topic of their own.
  • Mixed. Some columns spatial, some categorical. → Build a sum of per-column kernels.

Step 2: pick hyperparameters with cross-validation on a log grid. Median heuristic as starting $\gamma;$ $C \in \{0.1, 1, 10, 100\}$ for SVM; for GP regression, optimise the marginal likelihood.

Step 3: when things go wrong, read the Gram matrix. If it looks like the identity, your $\gamma$ is too large. If it looks uniformly bright, $\gamma$ is too small. The matrix tells you before the model does.

Seven concrete scenarios. To make the tree usable, here is the first kernel I would reach for in seven everyday tasks:

scenariofirst kernelsecond to trywhy
classify two moons / toy 2DRBF + median $\gamma$Matern-$5/2$smooth, dense, no prior
classify 20-newsgroupsLinear SVM on TF-IDFRBF (rarely wins)sparse, high-d
regress house prices on tabularLinear ridge → poly-2RBF (with care)mixed numeric, modest $n$
forecast hourly electricitySum: RBF + Periodic-24h + Periodic-168h + WhiteNoisespectral mixtureknown multi-scale periodicity
Bayesian optimizationMatern-$5/2$RBFrough, optimizer needs uncertainty
spatial geostatisticsMatern-$3/2$exponentialcontinuous but not infinitely smooth
classify protein sequencesstring kernelspectrum kerneldiscrete, structured

The point of the tree is not that it always picks the optimal kernel — no decision tree can. The point is to keep you from wasting weeks on RBF + grid search when the data shape was screaming “use linear” or “use a sum of kernels” from day one.

What’s next#

Part 5 turns this kernel catalogue into actual algorithms: kernel SVM, kernel PCA, and kernel ridge regression. We’ll see why every kernel algorithm in this part shares the same skeleton (the representer theorem from Part 3 strikes again), the practical $O(n^3)$ cost that limits classical kernel methods to ~10k samples, and the standard workarounds — Nystrom approximation, random Fourier features, inducing points. With Parts 4 and 5 in hand, you have the full picture of what to fit and how to fit it.


This is Part 4 of Kernel Methods (8 parts). Previous: Part 3 — RKHS Theory · Next: Part 5 — Kernel SVM, Kernel PCA, Kernel Ridge Regression

In this series

Kernel Methods 8 parts

  1. 01 Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
  2. 02 Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem
  3. 03 Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
  4. 04 Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More you are here
  5. 05 Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
  6. 06 Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
  7. 07 Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features
  8. 08 Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub