
Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
A tour of the kernels you'll actually use: RBF (Gaussian), polynomial, linear, Matern, periodic, sigmoid. When to pick which, hyperparameter intuition, and how kernels combine.
You type SVC(kernel='rbf') in scikit-learn for the first time. What did you set gamma to? 'scale'? 'auto'? You scrolled past those defaults without thinking. Three months later your model is overfitting, your Gram matrix looks like the identity, and you have no idea which knob is wrong. Most “kernel tuning” debt is really kernel choice debt — you picked the default kernel for the wrong reason, and now no amount of grid search will save you.

The previous three parts built up the why and the theory: linear methods hit a ceiling (Part 1), positive-definite kernels lift you off it (Part 2), and the RKHS makes the lifted space a real Hilbert space (Part 3). This part is the menu. Six kernel families, what each one assumes about your data, and the dirty little rules of thumb that nobody puts in the docstring. Each section gives you the math, the hyperparameter that actually matters, a working scikit-learn snippet, and the failure mode you should expect when you misuse it.
RBF (Gaussian) kernel: the default king#
The RBF kernel,
$$K(x, y) = \exp\!\left(-\gamma \,\|x - y\|^2\right) \quad \text{or equivalently} \quad \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right),$$
is what 'rbf' in scikit-learn evaluates to. The two forms relate by $\gamma = 1/(2\sigma^2).$ Textbook people prefer $\sigma$ (a length scale, same units as the data); library people prefer $\gamma$ (an inverse-squared-length, gets bigger when you want sharper). Pick one and stick with it for an afternoon.

Why it’s the default. Three properties combine to make RBF the kernel you reach for when you don’t know better. First, it’s universal: given enough data and the right bandwidth, it can approximate any continuous function on a compact set arbitrarily well. Second, it’s infinitely differentiable, so the implied function class is very smooth. Third, the feature space is infinite-dimensional — every $\lambda_k$ in the Mercer expansion is positive — so you cannot exhaust the expressivity by adding more data.
The one knob: $\gamma$. Almost all RBF tuning is tuning $\gamma$. Too large and each training point is an island; the kernel matrix is nearly identity; train accuracy hits 1, test accuracy collapses. Too small and every pair of points looks identical; the kernel is essentially constant; the model underfits to the global mean. The two failure shapes are easy to recognize once you have seen them, but the gap between “too large” and “too small” is often only a single order of magnitude.
The standard starting point is the median heuristic: set the length scale to the median pairwise distance in the training data,
$$\sigma_0 = \mathrm{median}(\|x_i - x_j\|), \qquad \gamma_0 = \frac{1}{2 \sigma_0^2}.$$
It works because it puts the kernel right in the middle of the data’s distance distribution: not so wide that everything looks the same, not so narrow that everything looks different. Then you log-grid one decade either side of $\gamma_0$ and let CV pick.
In code, the median heuristic is three lines:
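A minimal sketch of that computation, assuming dense numeric features and SciPy's pdist for the pairwise distances. The helper name median_heuristic_gamma and the toy data are illustrative, not from the original listing.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X):
    """Return (sigma0, gamma0) from the median pairwise Euclidean distance."""
    sigma0 = np.median(pdist(X))            # pdist visits each unordered pair once
    return sigma0, 1.0 / (2.0 * sigma0 ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # toy data standing in for your features
sigma0, gamma0 = median_heuristic_gamma(X)
print(f"sigma0={sigma0:.3f}  gamma0={gamma0:.4f}")
```

For large $n$ the full pairwise computation is $O(n^2)$; taking the median over a random subsample of a few thousand rows is usually enough for a seed value.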
Worked example: RBF SVM on make_moons. The two-moons dataset is the canonical non-linearly-separable toy. Linear SVM caps out around 88% accuracy; RBF cleans it up to 99%+.
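A sketch of this experiment, assuming 600 points at noise=0.1, a 70/30 split, and a grid spanning one decade either side of the median-heuristic seed. These settings are assumptions, so the exact accuracies you see may differ from the figures quoted.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=600, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear baseline first.
linear = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X_tr, y_tr)

# Median-heuristic seed computed on the standardized training data.
Z_tr = StandardScaler().fit_transform(X_tr)
gamma0 = 1.0 / (2.0 * np.median(pdist(Z_tr)) ** 2)

param_grid = {
    "svc__gamma": gamma0 * np.logspace(-1, 1, 5),   # one decade either side of the seed
    "svc__C": [0.1, 1, 10, 100],
}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")), param_grid, cv=5)
search.fit(X_tr, y_tr)

acc_linear = linear.score(X_te, y_te)
acc_rbf = search.score(X_te, y_te)
print(f"linear: {acc_linear:.3f}  rbf: {acc_rbf:.3f}  best: {search.best_params_}")
```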
Two observations worth keeping: (1) the winning $\gamma$ in the grid is almost always within half a decade of the median-heuristic seed, which is why grid-searching three decades is wasted compute; (2) the winning $C$ depends on noise level — for low-noise data $C$ wants to be large (margins should be tight), for high-noise data $C$ wants to be small (margins should be loose).
Worked example: GP regression with RBF. The same kernel powers Gaussian process regression. Here we fit a noisy sinusoid; the GP returns both a posterior mean and a posterior standard deviation, which is the killer feature relative to SVM.
The marginal-likelihood optimizer learns length_scale ~ 1.2 and noise_level ~ 0.01, both close to the truth. Re-fit with length_scale=0.05 (way too small) and you will see the posterior interpolate every training point exactly while the predictive variance explodes between points — the classic “GP overfitting” diagnostic.
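A sketch of this kind of fit, assuming a sin(x) target on [0, 10] with noise standard deviation 0.1. The data, seed, and starting hyperparameters are assumptions, so the learned values need not match the numbers quoted above. The second model freezes a deliberately tiny length scale (optimizer=None) to reproduce the overfitting diagnostic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=40))[:, None]
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)
X_grid = np.linspace(0, 10, 200)[:, None]

# Signal kernel plus an explicit noise term; hyperparameters are fit by marginal likelihood.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2, random_state=0)
gp.fit(X, y)
mean, std = gp.predict(X_grid, return_std=True)
print("learned kernel:", gp.kernel_)

# Same data, length scale frozen far too small: no hyperparameter optimization at all.
gp_narrow = GaussianProcessRegressor(
    kernel=1.0 * RBF(length_scale=0.05), optimizer=None, alpha=1e-6
).fit(X, y)
_, std_narrow = gp_narrow.predict(X_grid, return_std=True)
print("max predictive std  fitted:", std.max(), " narrow:", std_narrow.max())
```

Plotting mean ± 2·std for both models on X_grid makes the contrast visible: the narrow model's uncertainty falls back to the prior between training points.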
sklearn note. gamma='scale' evaluates to $1 / (n_{\text{features}} \cdot \mathrm{Var}(X))$ and is fine for feature-standardised data. gamma='auto' evaluates to $1/n_{\text{features}}$ and is fine for absolutely nothing: it ignores the actual data spread. Don’t use it.
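To see what the two string defaults resolve to on a concrete dataset, here is a small check. It assumes dense input and compares a model fit with gamma='scale' against one fit with the hand-computed value; the dataset is an arbitrary toy choice.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

gamma_scale = 1.0 / (X.shape[1] * X.var())   # variance taken over all entries of X
gamma_auto = 1.0 / X.shape[1]                # depends only on the number of columns

by_name = SVC(kernel="rbf", gamma="scale").fit(X, y)
by_value = SVC(kernel="rbf", gamma=gamma_scale).fit(X, y)
same = np.allclose(by_name.decision_function(X), by_value.decision_function(X))
print(f"scale={gamma_scale:.4f}  auto={gamma_auto:.4f}  identical models: {same}")
```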
When not to reach for RBF. Three cases that I see junior practitioners get wrong every time. (1) Very high-dimensional sparse data — text TF-IDF, one-hot categoricals — where the median pairwise distance is uninformative because every pair is already nearly orthogonal. Linear wins. (2) Time series with known periodicity. RBF on the time index is a terrible model for seasonality; the kernel does not know that day 366 is similar to day 1. Use a periodic kernel. (3) Genuinely discrete inputs — strings, graphs, trees. RBF on a flattened representation throws away the structure that makes the data interesting; use a specialised kernel.
Polynomial kernel#
$$K(x, y) = (\gamma\, \langle x, y \rangle + c)^d.$$
The polynomial kernel corresponds explicitly to the feature map of all monomials up to degree $d$. For $d=2,$ that’s pairwise products $x_i x_j;$ for $d=3,$ triples; and so on. The feature space is finite-dimensional, with $\binom{d_{\text{in}} + d}{d}$ coordinates, which is huge but bounded, unlike RBF’s infinite spectrum.

You can verify numerically:
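A minimal sketch of that check for $d_{\text{in}} = 2,$ $d = 2,$ $\gamma = 1,$ $c = 1.$ The explicit map phi below is the standard degree-2 expansion with $\sqrt{2}$ weights on the linear and cross terms; the two test vectors are arbitrary.

```python
import numpy as np

def phi(v):
    """Explicit feature map for (<x, y> + 1)^2 on R^2: six coordinates."""
    x1, x2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 1.5])

k_trick = (x @ y + 1.0) ** 2        # kernel trick: one 2-D dot product, then a square
k_explicit = phi(x) @ phi(y)        # explicit inner product in the 6-D feature space
print(k_trick, k_explicit, abs(k_trick - k_explicit))
```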
The two numbers agree to machine precision. This is the “kernel trick” in its simplest possible form: a 6-dimensional inner product computed in time $O(d_{\text{in}}) = O(2)$ instead of $O(6).$ The constant-factor win is invisible here but matters at $d=5$ on $\mathbb{R}^{100},$ where the explicit feature map has 96 million coordinates.
KernelRidge(poly) vs Ridge(PolynomialFeatures). Same function class (the two differ only in how the monomials are weighted inside the ridge penalty), computationally different:
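A sketch of the comparison, assuming 8 input features, degree 2, and a hypothetical quadratic target. The two models are not expected to give bit-identical predictions because PolynomialFeatures does not apply the kernel's monomial weights, but with a small alpha both should fit a quadratic target closely.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
# Hypothetical target: a quadratic in the inputs plus a little noise.
y = X[:, 0] * X[:, 1] - 0.5 * X[:, 2] ** 2 + X[:, 3] + rng.normal(scale=0.05, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# Kernel path: solves an n x n linear system (n = 200 here).
kr = KernelRidge(kernel="poly", degree=2, gamma=1.0, coef0=1.0, alpha=1e-3).fit(X_tr, y_tr)

# Explicit path: expand to all monomials up to degree 2, then solve in feature space.
explicit = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1e-3)).fit(X_tr, y_tr)

n_expanded = explicit.named_steps["polynomialfeatures"].n_output_features_
r2_kernel = kr.score(X_te, y_te)
r2_explicit = explicit.score(X_te, y_te)
print(f"expanded dim: {n_expanded}  R2 kernel: {r2_kernel:.4f}  R2 explicit: {r2_explicit:.4f}")
```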
Kernel ridge wins when $n < d_{\text{out}};$ explicit features win when the expanded dimension is small relative to $n.$ Here $d_{\text{out}} = \binom{8+2}{2} = 45$ is below the sample count, so explicit features win. The crossover sits roughly where the kernel solve’s $O(n^3)$ matches the explicit solve’s $O(n \cdot d_{\text{out}}^2),$ that is, at $n \approx d_{\text{out}};$ once $n$ exceeds that, switch to explicit features.
When to use it. Sparse high-dimensional data where you have named interactions in mind. The two paradigm cases:
- NLP with bigrams or trigrams. A polynomial-2 kernel over word-count vectors essentially counts shared word pairs.
- Genomics with epistasis. Pairwise SNP interactions are the substance of many GWAS analyses.
When not to use it. Dense low-dimensional data with smooth structure. RBF wins almost every time here, because polynomials of moderate degree are too rigid to wrap around continuous decision boundaries, and high-degree polynomials oscillate violently.
Hyperparameter cheats.
- Degree $d$. Stay in $\{2, 3\}.$ Degree-5+ polynomials almost always overfit (the monomial basis is too high-leverage).
- $\gamma$. Scales the inner product. Polynomial kernels are brutally sensitive to feature magnitudes, so always standardise first, or $\gamma\langle x, y\rangle$ blows up on the high-degree term.
- $c$. The free coefficient. $c=0$ gives only top-order monomials (pure degree-$d$). $c=1$ mixes all orders 0 through $d.$ Setting $c=1$ is almost always what you want, but check the default you are getting: sklearn’s polynomial_kernel and KernelRidge default to coef0=1, while SVC defaults to coef0=0.0.
| comparison | polynomial-2 | RBF | linear |
|---|---|---|---|
| sklearn alias | 'poly', degree=2 | 'rbf' | 'linear' |
| key knob | $d, \gamma, c$ | $\gamma$ | $C$ only |
| feature dim | $\binom{d_{\text{in}}+d}{d}$ | $\infty$ | $d_{\text{in}}$ |
| smoothness | $C^\infty$ (derivatives above order $d$ vanish) | $C^\infty$ | $C^\infty$ but linear |
| best for | named interactions | smooth dense | high-d sparse |
| typical failure | high-degree blowup | identity Gram | underfitting |
A debugging cautionary tale. I once spent a week tuning a polynomial-3 SVM on a “categorical interactions” model that refused to converge. The fix turned out to be standardisation — one feature had a range of $[0, 10^4]$ while the rest were in $[-1, 1].$ At degree 3 the cube of that one feature dominated the kernel value by twelve orders of magnitude, so the Gram matrix was effectively rank-1 and the QP solver thrashed. The lesson is universal: always standardise before a polynomial kernel; the kernel sees the scale, even when it does not see the meaning.
Linear kernel: the “no kernel” kernel#
$$K(x, y) = \langle x, y \rangle.$$
The trivial kernel: no feature map, $O(d)$ per evaluation, equivalent to running the algorithm directly on the raw data. Why is it even listed? Because for some data shapes it is the right answer, and switching to RBF buys you nothing but tuning headaches.
When linear is right.
- Linearly separable data. Often more common than you’d guess once you’ve already engineered features.
- Very high-dimensional sparse data. Text (TF-IDF, bag-of-words), gene expression matrices, one-hot encoded categoricals at scale. In high $d,$ random unit vectors are nearly orthogonal — so the data already looks “spread out enough” that an RBF kernel can’t help.
- As a baseline. Always fit a linear SVM or linear ridge regression before you reach for a nonlinear kernel. If linear wins, take the win — it’s faster to train, faster to predict, faster to interpret.
Worked example: LinearSVC vs SVC(kernel='linear') vs SGD. Three sklearn classes solve the same linear-SVM optimization but with very different scaling profiles. Run them on the 20-newsgroups dataset (TF-IDF, ~11k samples, ~130k features):
On a typical laptop you should see roughly: SVC(linear) 90 s, LinearSVC 4 s, SGDClassifier 1.5 s. The kernel-based path (SVC) constructs an $n \times n$ Gram matrix and pays $O(n^2)$ memory plus $O(n^2 d)$ time; the explicit linear path (LinearSVC, SGD) works directly on the sparse feature matrix and pays $O(\text{nnz})$ per pass.
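A scaled-down sketch of the timing comparison that runs offline. It assumes a synthetic sparse matrix (1,500 rows, 10,000 columns) in place of the 20-newsgroups TF-IDF features, with labels from a hypothetical random linear rule, so the absolute timings will be far smaller than the figures quoted and only the harness itself carries over.

```python
import time
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
n, d = 1500, 10000
X = sp.random(n, d, density=0.005, format="csr", random_state=0)
w_true = rng.normal(size=d)                   # hypothetical ground-truth weights
y = (X @ w_true > 0).astype(int)

models = {
    "SVC(kernel='linear')": SVC(kernel="linear", C=1.0),
    "LinearSVC": LinearSVC(C=1.0, dual=True),            # n < d here, so solve the dual
    "SGDClassifier(hinge)": SGDClassifier(loss="hinge", random_state=0),
}
timings, train_acc = {}, {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)                           # all three accept sparse CSR input
    timings[name] = time.perf_counter() - start
    train_acc[name] = model.score(X, y)
    print(f"{name:22s} {timings[name]:7.3f} s   train acc {train_acc[name]:.3f}")
```

To reproduce the real experiment, swap the synthetic X and y for TfidfVectorizer output on fetch_20newsgroups and score on the held-out test split instead of the training set.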
A rule of thumb. For text classification, linear SVMs often beat RBF SVMs on accuracy and run 100× faster. The curse of dimensionality is on your side here; embrace it.
Hyperparameter cheats.
- $C$ only. No $\gamma,$ no $d,$ no $c.$ Log-grid $\{10^{-2}, 10^{-1}, 1, 10, 10^2\}$ and you are done.
- Always standardize (or note that TfidfVectorizer already gives you unit-norm rows). A regularized linear SVM is not invariant to feature scale, and bad scaling also slows convergence by orders of magnitude.
- dual='auto' in LinearSVC picks primal when $n > d$ and dual when $n < d.$ Either way it is faster than the libsvm-backed SVC(kernel='linear').
Why the SGD path can beat coordinate descent. SGDClassifier with loss='hinge' solves the same primal as LinearSVC but with stochastic gradient updates. For very large $n$ (millions of documents) SGD’s per-step cost is $O(\text{nnz of one sample})$ instead of $O(\text{nnz of all}),$ so a single pass can outperform a full coordinate-descent epoch. The price is hyperparameter sensitivity (alpha, learning_rate, and max_iter all matter), and for moderate $n$ in the tens of thousands LinearSVC is faster and more accurate. Use SGD when you outgrow LinearSVC’s memory, not before.
Matern kernel: the GP workhorse#
$$K_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\!\nu} \! \mathcal{K}_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right), \qquad r = \|x - y\|,$$
where $\mathcal{K}_\nu$ is the modified Bessel function of the second kind and $\Gamma$ the Gamma function. It looks scary, but in practice nobody evaluates the Bessel form by hand: $\nu$ is a tunable smoothness parameter, and the values people actually use simplify to closed forms:
- $\nu = 1/2:$ exponential kernel, $K = \exp(-r/\ell).$ Sample paths are continuous but nowhere differentiable. Use for rough functions.
- $\nu = 3/2:$ once-differentiable. Geostatistics default. Closed form $K = (1 + \sqrt{3}r/\ell)\exp(-\sqrt{3}r/\ell).$
- $\nu = 5/2:$ twice-differentiable. The GP regression workhorse. Closed form $K = (1 + \sqrt{5}r/\ell + 5 r^2/(3\ell^2))\exp(-\sqrt{5}r/\ell).$
- $\nu \to \infty:$ recovers RBF. Infinitely smooth.
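The closed forms above can be checked against scikit-learn's Matern kernel. This is a small sketch on a 1-D grid with an arbitrary length scale; it also confirms that nu=np.inf coincides with sklearn's RBF.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern

ell = 0.7
x = np.linspace(0.0, 3.0, 50)[:, None]
r = np.abs(x - x.T)                           # pairwise distances on the 1-D grid

closed_forms = {
    0.5: np.exp(-r / ell),
    1.5: (1 + np.sqrt(3) * r / ell) * np.exp(-np.sqrt(3) * r / ell),
    2.5: (1 + np.sqrt(5) * r / ell + 5 * r**2 / (3 * ell**2)) * np.exp(-np.sqrt(5) * r / ell),
}
max_err = {
    nu: np.max(np.abs(Matern(length_scale=ell, nu=nu)(x) - K))
    for nu, K in closed_forms.items()
}
print("max abs error vs sklearn:", max_err)

# The nu -> infinity limit is the RBF kernel exp(-r^2 / (2 ell^2)).
gap_to_rbf = np.max(np.abs(Matern(length_scale=ell, nu=np.inf)(x) - RBF(length_scale=ell)(x)))
print("Matern(nu=inf) vs RBF:", gap_to_rbf)
```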

Why use Matern over RBF. RBF’s infinite smoothness is unrealistically strong for almost every real function. Stock prices, sensor readings, geological measurements — none of them are infinitely differentiable. A Matern-$5/2$ kernel admits sample paths that have a second derivative but no third, which matches reality much more closely. The classic GP regression failure mode — “my posterior is too confident, my length scales blow up, my hyperparameter optimiser diverges” — is often cured by switching RBF $\to$ Matern-$5/2.$
Choosing $\nu$ in three sentences.
- If you suspect the underlying function has kinks or sudden changes (financial returns, segment boundaries), use $\nu = 1/2.$
- If the function is “physical” (temperature, terrain, sensor drift), use $\nu = 5/2$ — second derivatives exist, third do not.
- Use $\nu = 3/2$ only if you have a specific reason (it is the standard in geostatistics, less so elsewhere).
Worked example: Matern vs RBF on noisy data. Fit both kernels to the same noisy 1D function and compare predictive performance out of sample.
On rough-but-not-jagged data like this, Matern-$5/2$ typically posts 5-15% lower MSE than RBF, with calibrated predictive variances. On truly smooth data (analytic functions, fluid simulations), RBF wins by a hair. The rule “default to Matern-$5/2,$ verify with marginal likelihood” almost never goes wrong.
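A sketch of such a comparison. The target function (a smooth wave plus one kink), the noise level, and the seed are assumptions made for illustration, so which kernel comes out ahead and by how much may differ from the figures quoted above. The marginal likelihood is recorded alongside the test MSE, since that is the check the rule of thumb recommends.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel
from sklearn.metrics import mean_squared_error

def f(x):
    """Hypothetical target: smooth oscillation plus a single kink at x = 2.5."""
    return np.sin(2 * x) + 0.5 * np.abs(x - 2.5)

rng = np.random.default_rng(0)
X_tr = np.sort(rng.uniform(0, 5, size=60))[:, None]
y_tr = f(X_tr).ravel() + rng.normal(scale=0.1, size=60)
X_te = np.linspace(0, 5, 300)[:, None]
y_te = f(X_te).ravel()                        # noise-free truth for scoring

results = {}
for name, base in {"RBF": RBF(length_scale=1.0),
                   "Matern-5/2": Matern(length_scale=1.0, nu=2.5)}.items():
    kernel = 1.0 * base + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  n_restarts_optimizer=2, random_state=0).fit(X_tr, y_tr)
    results[name] = {
        "mse": mean_squared_error(y_te, gp.predict(X_te)),
        "lml": gp.log_marginal_likelihood_value_,
    }
    print(f"{name:11s} test MSE {results[name]['mse']:.5f}   log-ML {results[name]['lml']:.2f}")
```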
scikit-learn API. sklearn.gaussian_process.kernels.Matern(length_scale=1.0, nu=1.5). Note nu is a fixed parameter at construction time, not learned by fit(). You pick $\nu$ from $\{1/2, 3/2, 5/2\}$ by domain knowledge; length_scale and the noise variance are learned.
| comparison | Matern-$1/2$ | Matern-$3/2$ | Matern-$5/2$ | RBF (Matern-$\infty$) |
|---|---|---|---|---|
| differentiability | 0 | 1 | 2 | $\infty$ |
| sample paths | continuous, jagged | once-smooth | twice-smooth | analytic |
| typical use | finance, kinks | geostats | GP default | smooth physics |
| sklearn nu | 0.5 | 1.5 | 2.5 | use RBF directly |
| length-scale role | same as RBF $\ell$ | same | same | same |
A subtle gotcha: noise vs. smoothness confusion. When you switch from RBF to Matern-$1/2$ and your fit suddenly looks worse, the usual culprit is that you have absorbed measurement noise into kernel roughness. Matern-$1/2$ says the function itself is rough; WhiteKernel says the function is smooth but you observe it through noise. Visually the two can look identical on a single sample but they extrapolate completely differently: Matern-$1/2$ gives jagged predictions, the noisy-smooth model gives confident smooth predictions plus a noise envelope. If you are unsure, fit both and compare marginal likelihoods; sklearn returns this as gp.log_marginal_likelihood(gp.kernel_.theta).
Periodic and spectral kernels#
The periodic kernel (ExpSineSquared in scikit-learn),
$$K(x, y) = \exp\!\left(-\frac{2 \sin^2(\pi \|x - y\| / p)}{\ell^2}\right),$$
captures strict periodicity with period $p.$ Two inputs at distance exactly $p$ have $K = 1$ (perfectly correlated); two at distance $p/2$ have $K$ at its minimum, $\exp(-2/\ell^2).$ The $\ell$ parameter controls how tightly the periodicity holds within a single cycle.
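A quick numeric check of those two statements using scikit-learn's ExpSineSquared. The period of 7 and length scale of 1 are arbitrary illustrative values.

```python
import numpy as np
from sklearn.gaussian_process.kernels import ExpSineSquared

p, ell = 7.0, 1.0
kernel = ExpSineSquared(length_scale=ell, periodicity=p)

x0 = np.array([[0.0]])
offsets = np.array([[p / 2], [p], [2 * p]])   # half a period, one period, two periods
k_half, k_one, k_two = kernel(x0, offsets).ravel()

print("K at p/2:", k_half, " (closed form:", np.exp(-2 / ell**2), ")")
print("K at p:  ", k_one)
print("K at 2p: ", k_two)
```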

When to use. Anywhere you have a known period: temperature (yearly), electricity load (daily + weekly), retail sales (weekly + yearly), audio pitch tracking, EEG rhythms.
Spectral mixture kernels generalise this: a sum of Gaussians in the frequency domain corresponds to a complicated, possibly quasi-periodic kernel in the input domain. The Wilson–Adams 2013 paper showed these can discover unknown periodicities automatically. Useful when you suspect periodicity but don’t know the period.
Worked example: Mauna Loa CO2. This is the canonical demonstration of compositional kernels. The atmospheric CO2 record from 1958 onward has three obvious components — a smooth long-term trend, a yearly seasonal cycle, and short-term measurement noise — and a fourth less-obvious one (medium-term decadal wiggles from El Niño-class events). We will fit a sum of four kernels and look at what the optimizer learns.
The optimizer learns a long-term RBF length scale on the order of 50 years (the trend is smooth), a periodic component with period exactly 1 year, a yearly-cycle decay length on the order of 100 years (cycles persist), and noise around 0.2 ppm (matching the measurement precision). The 20-year extrapolation produces the famous Rasmussen–Williams figure: a confident extension of the upward trend, the yearly cycle continuing, and uncertainty bands that widen sub-linearly because the model has learned the structure, not memorized the data.
The remarkable thing is that swapping out any one of the four kernels degrades the fit visibly. Drop k_seas and the model cannot extrapolate seasonality. Drop k_med and the residuals show systematic decadal bias. Drop k_long and the trend wanders. Each kernel encodes one prior, and the prior is literal — readable from the kernel expression.
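A sketch of the four-kernel construction that runs offline. It assumes a synthetic monthly series (trend, yearly cycle, slow wiggle, noise) as a stand-in for the real CO2 record, and RationalQuadratic for the medium-term term; both are assumptions, as are all starting values, so the learned hyperparameters will not match the ones described above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel)

# Synthetic stand-in: 20 years of monthly values, time measured in years.
rng = np.random.default_rng(0)
t = np.linspace(0, 20, 240, endpoint=False)[:, None]
y = (315 + 1.3 * t + 0.012 * t**2 + 3.0 * np.sin(2 * np.pi * t)
     + 0.5 * np.sin(2 * np.pi * t / 3.7)).ravel() + rng.normal(scale=0.2, size=len(t))

k_long = 50.0**2 * RBF(length_scale=50.0)                          # smooth long-term trend
k_seas = 2.0**2 * RBF(length_scale=100.0) * ExpSineSquared(        # yearly cycle, slow decay
    length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed")
k_med = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)    # medium-term wiggles
k_noise = WhiteKernel(noise_level=0.1**2)                          # measurement noise
kernel = k_long + k_seas + k_med + k_noise

gp = GaussianProcessRegressor(kernel=kernel, random_state=0)
gp.fit(t, y - y.mean())                        # center the target by hand

t_future = np.linspace(20, 25, 60, endpoint=False)[:, None]       # 5-year extrapolation
mean, std = gp.predict(t_future, return_std=True)
mean = mean + y.mean()
print(gp.kernel_)
```

For the real experiment, replace t and y with the monthly Mauna Loa averages; dropping one of the four terms from kernel and refitting is a direct way to try the ablations described above.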
Hyperparameter cheats for periodic kernels.
- periodicity should usually be frozen with periodicity_bounds='fixed'. If you know the period (1 day, 1 year), set it and freeze it. Letting the optimizer wander finds spurious periods.
- Multiply by an RBF. Pure ExpSineSquared enforces strict periodicity forever, which is too rigid for real data. RBF * ExpSineSquared gives a quasi-periodic kernel that lets the cycle drift slowly.
- Watch the length scale $\ell$ inside ExpSineSquared. Small $\ell$ means a spiky periodic structure (the function is very different at distance $p/2$); large $\ell$ means a smooth sinusoidal cycle. The sklearn default length_scale=1.0 is reasonable on standardized data.
Sigmoid kernel: a warning#
$$K(x, y) = \tanh(\gamma\, \langle x, y \rangle + c).$$
Modelled after a neural-network activation, the sigmoid kernel has one fatal flaw: it is not always positive definite. Only for certain ranges of $\gamma$ and $c$ does the resulting Gram matrix stay PSD; outside that range, SVM solvers fail, kernel PCA returns negative eigenvalues, and GP regression silently misbehaves.
Code-verified counter-example. Generate 50 random 5D points and compute the Gram matrix for several $(\gamma, c)$ settings:
Only the first setting is safely PSD. The others have negative eigenvalues big enough to break libsvm’s QP solver, and you will not get a clean error message — you will get a model that “trains” but whose predictions are garbage. The theoretical PSD region (Lin & Lin 2003) is roughly $\gamma > 0$ together with $c \le 0,$ but even there it is conditional, not unconditional.
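A check along these lines can be sketched as follows. The $(\gamma, c)$ values and the random seed here are illustrative assumptions rather than the settings discussed above, so which of them come out PSD may differ; the point is the diagnostic itself, the smallest eigenvalue of the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                  # 50 random 5-D points
G = X @ X.T                                   # linear Gram matrix <x_i, x_j>

settings = [(0.001, 0.0), (0.01, -1.0), (0.1, 1.0), (1.0, 1.0)]
min_eigs = {}
for gamma, c in settings:
    K = np.tanh(gamma * G + c)                # sigmoid "kernel" matrix
    min_eigs[(gamma, c)] = np.linalg.eigvalsh(K).min()   # eigvalsh: K is symmetric
    print(f"gamma={gamma:<6} c={c:<5} min eigenvalue = {min_eigs[(gamma, c)]:.3e}")
```

A clearly negative minimum eigenvalue means the matrix is not PSD. One case needs no eigen-solver at all: with $c < 0$ and small $\gamma$ the diagonal entries $\tanh(\gamma\|x\|^2 + c)$ are themselves negative, which already rules out PSD.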
If you find sigmoid kernels in a paper from 2002 — sure, they were popular before the deep-learning era as the “neural-network kernel”. In 2026, if you want neural-network-style nonlinearity, train a neural network. The sigmoid kernel is on this list mainly so you recognise it in legacy code and know to be suspicious.
Combining kernels: build with addition and multiplication#
The genuinely magical fact about kernels is that the PSD property is closed under several natural operations:
- Sum. $K_1 + K_2$ is PSD. Models the data as having two separate sources of structure. Decision boundaries from each kernel add.
- Product. $K_1 \cdot K_2$ is PSD. Models interaction between two sources of structure. Both kernels must agree for the product to be high.
- Scaling. $c \cdot K$ for $c > 0$ is PSD. Just changes the variance.
- Mapping. $K(\phi(x), \phi(y))$ is PSD for any $\phi.$
This means you can build a kernel like LEGO to match your prior knowledge of the data’s structure.
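The closure rules are easy to check numerically. The sketch below builds an RBF and a degree-2 polynomial Gram matrix on random points and confirms that the smallest eigenvalue stays non-negative (up to floating-point noise) under each operation; the choice of kernels and of tanh as the input map is arbitrary.

```python
import numpy as np

def rbf_gram(X, gamma):
    """RBF Gram matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def min_eig(K):
    """Smallest eigenvalue of the symmetrized matrix."""
    return np.linalg.eigvalsh((K + K.T) / 2).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
K1 = rbf_gram(X, gamma=0.5)                   # RBF
K2 = (X @ X.T + 1.0) ** 2                     # polynomial, degree 2, c = 1

checks = {
    "sum": min_eig(K1 + K2),
    "product": min_eig(K1 * K2),              # elementwise (Schur) product
    "scaling": min_eig(3.0 * K1),
    "mapping": min_eig(rbf_gram(np.tanh(X), gamma=0.5)),   # phi = tanh on the inputs
}
for name, value in checks.items():
    print(f"{name:8s} min eigenvalue = {value:.3e}")
```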

The canonical example revisited: Mauna Loa CO2. We already wrote the four-kernel CO2 sum above. Here is the grammar that produced it, applied to one more example so the pattern sinks in. Imagine forecasting daily store sales. Your data has:
- a long-term upward trend (the chain is growing),
- a weekly cycle (weekends spike),
- a yearly cycle (Christmas spikes),
- short-term promotional bursts,
- noise.
The kernel writes itself:
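One way to write it down with scikit-learn's kernel classes, assuming time is measured in days. Every numeric value is an illustrative starting guess, and the component names are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

k_trend = 1.0 * RBF(length_scale=365.0)                        # long-term growth
k_week = 1.0 * ExpSineSquared(length_scale=1.0, periodicity=7.0,
                              periodicity_bounds="fixed")      # weekly cycle
k_year = 1.0 * ExpSineSquared(length_scale=1.0, periodicity=365.25,
                              periodicity_bounds="fixed")      # yearly cycle
k_promo = 1.0 * RBF(length_scale=3.0)                          # short promotional bursts
k_noise = WhiteKernel(noise_level=0.1)                         # observation noise

kernel = k_trend + k_week + k_year + k_promo + k_noise

days = np.arange(0, 60, dtype=float)[:, None]
K = kernel(days)                              # 60 x 60 prior covariance over two months
print(kernel)
print("prior covariance shape:", K.shape)
```

Passing kernel to GaussianProcessRegressor then fits the free amplitudes and length scales by marginal likelihood while the two periods stay frozen.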
This compositional view is why GPs are so much more interpretable than neural networks — the structure of your kernel literally tells you what kind of structure the model is allowed to fit.
| operation | kernel form | meaning | example |
|---|---|---|---|
| sum | $K_1 + K_2$ | independent additive components | trend + seasonality |
| product | $K_1 \cdot K_2$ | interaction, both must agree | seasonality that decays with time |
| scaling | $c \cdot K$ | change output variance | tune signal vs noise |
| input warping | $K(\phi(x), \phi(y))$ | new feature representation | RBF on log-prices |
Common composition pitfalls. Three things go wrong even after you understand the grammar. First, parameter explosion: a sum of four kernels with two hyperparameters each has eight parameters to optimize jointly, and n_restarts_optimizer=2 is rarely enough; bump it to 5 or 10 and accept the wall-clock cost. Second, unidentifiability: two RBFs with very different length scales can trade off mass in ways the optimizer cannot resolve, especially when one length scale dwarfs the data range; freeze the long-range one with a hand-picked value. Third, missing bounds: by default sklearn searches length_scale over $[10^{-5}, 10^5],$ which is fine for standardized data but absurd if your raw inputs span seconds-to-years; always set explicit length_scale_bounds when the data scale is unusual.
A kernel-selection decision tree#
The single most asked question is “which kernel should I try first?” Here is the decision tree that captures 90% of practice.

Step 0: always fit a linear baseline first. If linear is good enough, you are done.
Step 1: classify your data shape.
- Spatial / smooth and dense. Coordinates, sensor measurements, image features. → RBF or Matern-$5/2.$
- Temporal with seasonality. Time series, audio, weather. → Sum of RBF (trend) + Periodic (cycle) + WhiteNoise.
- High-d sparse. Text, one-hot, gene expression. → Linear.
- Known interactions. NLP n-grams, GWAS epistasis. → Polynomial degree 2 or 3.
- Categorical / structured. Strings, trees, graphs. → A specialised kernel (string kernel, graph kernel, tree kernel) — these are a whole topic of their own.
- Mixed. Some columns spatial, some categorical. → Build a sum of per-column kernels.
Step 2: pick hyperparameters with cross-validation on a log grid. Median heuristic as starting $\gamma;$ $C \in \{0.1, 1, 10, 100\}$ for SVM; for GP regression, optimise the marginal likelihood.
Step 3: when things go wrong, read the Gram matrix. If it looks like the identity, your $\gamma$ is too large. If it looks uniformly bright, $\gamma$ is too small. The matrix tells you before the model does.
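A small sketch of that diagnostic, assuming an RBF kernel and random toy data: the mean off-diagonal entry of the Gram matrix is near 0 when it looks like the identity and near 1 when it is uniformly bright. The factor of 1000 either side of the median-heuristic seed is chosen only to make the two failure shapes obvious.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
D2 = squareform(pdist(X, metric="sqeuclidean"))        # squared pairwise distances
gamma0 = 1.0 / (2.0 * np.median(pdist(X)) ** 2)        # median-heuristic seed

off_diag = ~np.eye(len(X), dtype=bool)
mean_off = {}
for label, gamma in [("too small", gamma0 / 1000),
                     ("seed", gamma0),
                     ("too large", gamma0 * 1000)]:
    K = np.exp(-gamma * D2)
    mean_off[label] = K[off_diag].mean()
    print(f"{label:9s} gamma={gamma:.4g}  mean off-diagonal K = {mean_off[label]:.4f}")
```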
Seven concrete scenarios. To make the tree usable, here is the first kernel I would reach for in seven everyday tasks:
| scenario | first kernel | second to try | why |
|---|---|---|---|
| classify two moons / toy 2D | RBF + median $\gamma$ | Matern-$5/2$ | smooth, dense, no prior |
| classify 20-newsgroups | Linear SVM on TF-IDF | RBF (rarely wins) | sparse, high-d |
| regress house prices on tabular | Linear ridge → poly-2 | RBF (with care) | mixed numeric, modest $n$ |
| forecast hourly electricity | Sum: RBF + Periodic-24h + Periodic-168h + WhiteNoise | spectral mixture | known multi-scale periodicity |
| Bayesian optimization | Matern-$5/2$ | RBF | rough, optimizer needs uncertainty |
| spatial geostatistics | Matern-$3/2$ | exponential | continuous but not infinitely smooth |
| classify protein sequences | string kernel | spectrum kernel | discrete, structured |
The point of the tree is not that it always picks the optimal kernel — no decision tree can. The point is to keep you from wasting weeks on RBF + grid search when the data shape was screaming “use linear” or “use a sum of kernels” from day one.
What’s next#
Part 5 turns this kernel catalogue into actual algorithms: kernel SVM, kernel PCA, and kernel ridge regression. We’ll see why every kernel algorithm in this part shares the same skeleton (the representer theorem from Part 3 strikes again), the practical $O(n^3)$ cost that limits classical kernel methods to ~10k samples, and the standard workarounds — Nystrom approximation, random Fourier features, inducing points. With Parts 4 and 5 in hand, you have the full picture of what to fit and how to fit it.
This is Part 4 of Kernel Methods (8 parts). Previous: Part 3 — RKHS Theory · Next: Part 5 — Kernel SVM, Kernel PCA, Kernel Ridge Regression