Series · Kernel Methods · Chapter 2

Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem

What makes a function a valid kernel? Positive-definiteness, the Gram matrix test, and Mercer's theorem — the spectral decomposition that justifies the kernel trick.

A week into kernel-SVM hacking I wrote what felt like a perfectly reasonable similarity function — tanh(1.5 * x.dot(y) - 2.0). It compiled, it ran, the math looked symmetric. Then sklearn coughed up ValueError: kernel matrix is not positive semidefinite and the optimiser produced a model that was worse than guessing.

That error message turned out to hide one of the deepest results in 20th-century analysis. “Positive-definite” is not a checkbox — it is the entire reason the kernel trick is allowed to exist. If your function is PSD, there exists a Hilbert space where it is a real inner product; if it is not, you are pretending to live in a space that nobody built. This post unpacks that statement, builds the operational tests, derives Mercer’s theorem, and works through enough numerical examples that the next time you see the failure message you will know exactly which line of math your kernel violated.

Cover: positive-definite kernels and Mercer spectral decomposition

We will cover three things, in order. First, the definition of a valid kernel and the operational Gram-matrix test, with sketch proofs of why it is the right test and code that runs the test on five common kernels plus a deliberate counter-example. Second, Mercer’s theorem — the spectral decomposition that turns abstract positive-definiteness into a concrete (possibly infinite-dimensional) feature map, with a worked numerical computation of the first three RBF eigenfunctions on $[0, 1].$ Third, the algebra of kernels: which operations build new kernels from old ones, why each rule is true, and which famous-looking functions secretly are not kernels at all.

This is the most “mathy” post in the series. Skim past proofs on first read if you want — the practical takeaways are tagged with Takeaway: lines, and every theoretical statement is followed by code you can run.

What counts as a valid kernel#

Let $\mathcal{X}$ be any set — vectors, strings, graphs, images, doesn’t matter. A function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a valid kernel if it satisfies two properties:

Symmetry: $K(x, y) = K(y, x)$ for all $x, y \in \mathcal{X}.$

$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \;\geq\; 0.$$

Strictly speaking the inequality with $\geq 0$ defines positive semi-definite (PSD) kernels; literature often calls these “positive definite” anyway because the strict-$>$ case rarely shows up in practice. We will follow the slack convention.

The intuition is simple but worth spelling out. A kernel is a similarity function that behaves “like an inner product” when you stack it up against many points at once. Think of $c_i$ as a weight on point $x_i$ , so that $\sum_i c_i \Phi(x_i)$ is a weighted sum of feature vectors. The double sum $\sum_{ij} c_i c_j K(x_i, x_j)$ is exactly the squared norm $\|\sum_i c_i \Phi(x_i)\|^2$ in the feature space. PSD says: this squared norm is non-negative — geometrically, lengths are not allowed to be imaginary.

$$\sum_{i,j} c_i c_j (x_i x_j + 1) \;=\; \Big(\sum_i c_i x_i\Big)^2 + \Big(\sum_i c_i\Big)^2 \;\geq\; 0.$$

Both terms are squares of real numbers, so each is non-negative individually. Symmetry is immediate from the formula. The kernel is also a Mercer kernel with explicit feature map $\Phi(x) = (x, 1) \in \mathbb{R}^2$ : the verification $\Phi(x)^\top \Phi(y) = xy + 1$ matches the kernel exactly. The same trick — write the double sum as a sum of squares — handles every polynomial kernel; the algebra just gets bulkier as the degree grows.

Why symmetry alone is not enough. Symmetry is a free necessary condition: any inner product is symmetric, so any function that fails symmetry is immediately disqualified. But many symmetric functions fail PSD. The classic trap is cos(distance) — symmetric, bounded, looks lovely, and not a kernel. We will see its Gram matrix in a moment.

Bochner’s theorem — a brief aside. For translation-invariant candidates $K(x, y) = k(x - y),$ Bochner’s theorem gives a clean spectral characterisation: $k$ is a valid kernel iff its Fourier transform is a non-negative measure. The RBF kernel works because its Fourier transform is a Gaussian (non-negative density); the cosine-of-distance fails because its Fourier transform is a signed combination of delta spikes. We will return to Bochner in the trap-zone section at the end.

Takeaway: “Symmetric + PSD” is the gate. Pass both, you have a kernel; fail either, you do not — no matter how reasonable your function looks.

The Gram matrix test — operational PSD checking#

The abstract PSD definition is not something you ever check by hand. In practice you check Gram matrices.

$$G_{ij} \;=\; K(x_i, x_j).$$

A symmetric function $K$ is positive-definite if and only if every Gram matrix it produces — for every $n$ , every choice of points — is a PSD matrix. The “for every” is what makes the abstract definition strong; in practice we sample one or a few datasets and check.

There are three equivalent ways to verify that a symmetric matrix $G$ is PSD, listed from “most pedagogically transparent” to “fastest in code”:

Eigenvalue test: all eigenvalues of $G$ are non-negative, $\lambda_i(G) \geq 0$ for all $i.$ This is what np.linalg.eigvalsh gives you immediately, in $O(n^3)$ time with a moderate constant.

Quadratic-form test: for every $v \in \mathbb{R}^n,$ $v^\top G v \geq 0.$ Useful theoretically, terrible computationally (uncountably many $v$ ). Mostly used in proofs.

Cholesky test: the Cholesky factorisation $G = LL^\top$ succeeds (with $L$ lower-triangular and real). This is what production code uses — it is $O(n^3)$ but with a small constant (~$\tfrac{1}{3} n^3$ flops, half of the eigendecomposition), and it fails fast if $G$ is indefinite.

Why the three are equivalent. A symmetric matrix admits an eigendecomposition $G = U \Lambda U^\top$ with $U$ orthogonal and $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_n).$ Then $v^\top G v = (U^\top v)^\top \Lambda (U^\top v) = \sum_i \lambda_i w_i^2$ where $w = U^\top v.$ This is non-negative for all $v$ iff every $\lambda_i$ is non-negative. Cholesky succeeds iff $G$ is PSD because the algorithm computes $L_{jj} = \sqrt{G_{jj} - \sum_{k < j} L_{jk}^2}$ in sequence; a negative argument under the square root is exactly the failure mode and signals indefiniteness.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import numpy as np

def is_psd(G, tol=1e-8):
    """Three checks, ordered from cheap-and-loose to strict."""
    if not np.allclose(G, G.T, atol=tol):
        return False, "not symmetric"
    try:
        np.linalg.cholesky(G + tol * np.eye(len(G)))
        return True, "cholesky succeeded"
    except np.linalg.LinAlgError:
        eig_min = np.linalg.eigvalsh(G).min()
        return False, f"min eigenvalue = {eig_min:+.3e}"


def gram(K_fn, X):
    n = len(X)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = K_fn(X[i], X[j])
    return G


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

kernels = {
    "linear":     lambda x, y: x @ y,
    "poly d=3":   lambda x, y: (x @ y + 1.0) ** 3,
    "rbf g=0.5":  lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2)),
    "laplace":    lambda x, y: np.exp(-0.5 * np.sum(np.abs(x - y))),
    "cosine":     lambda x, y: (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12),
    # The trap: cos of distance — symmetric, bounded, NOT PSD.
    "cos(2*d)":   lambda x, y: np.cos(2.0 * np.linalg.norm(x - y)),
}

for name, K in kernels.items():
    G = gram(K, X)
    print(f"{name:10s} -> {is_psd(G)}")

Running this prints True for the first five kernels and False for cos(2*d) with a clearly negative minimum eigenvalue (around $-3$ for these data). Notice that the test does not prove a kernel is PSD universally — it only proves it failed on at least one dataset. Establishing PSD in general requires the kind of analytic argument we used above for $xy + 1$ (or a reference to Bochner). Establishing non-PSD only needs one counter-example Gram matrix.

Gram matrices of a valid RBF kernel vs an indefinite “cos(distance)” function

Look at the right panel above. The matrix is symmetric, smooth, and bounded — visually it looks no worse than the left one. Yet its minimum eigenvalue is clearly negative, which means there exists a vector $v$ such that $v^\top G v < 0.$ Such a $v$ would correspond to a sample whose self-similarity is negative, which has no geometric interpretation.

Tolerance footnote. In floating-point, tiny negative eigenvalues like $-10^{-12}$ are usually rounding artefacts, not real indefiniteness. Most libraries add a small jitter $\epsilon I$ before factorising; tol=1e-8 in the snippet above is a reasonable default. For Gaussian-process libraries the conventional jitter is $10^{-6}$ on the diagonal, which is large enough to mask floating-point noise but small enough not to bias the posterior.

Complexity table. When you run an SVM at scale, every iteration touches the Gram matrix. The cost matters:

OperationCostWhen it bites
Build $G$ for $n$ points$O(n^2 d)$ time, $O(n^2)$ memoryAlways — this is the wall above $n \approx 30{,}000$
Eigendecomposition$O(n^3)$ timeOne-shot diagnostics only
Cholesky$\sim \tfrac{1}{3} n^3$ flopsSolving $G \alpha = y$ in KRR / GP
Conjugate-gradient solve$O(n^2 \cdot \text{iters})$When $G$ is PSD and well-conditioned
Nystrom approximation$O(n m^2)$ with $m \ll n$The standard escape hatch above $n \approx 10^5$

Takeaway: the Gram matrix is both the operational PSD test and the computational bottleneck. Getting it right matters; keeping it small matters more.

Aside — the diagonal as a self-similarity prior. Notice that PSD forces $G_{ii} = K(x_i, x_i) \geq 0$ because the single-point case $n = 1, c = 1$ gives $K(x, x) \geq 0$ directly. Stronger: by Cauchy–Schwarz applied to the implicit feature map, $|K(x, y)| \leq \sqrt{K(x, x) K(y, y)}.$ This bound is why “normalised kernels” $\tilde K(x, y) = K(x, y) / \sqrt{K(x, x) K(y, y)}$ always live in $[-1, 1]$ and behave like cosine similarities in the feature space. Cosine similarity itself is just the normalised linear kernel, which is one corner case of this general construction.

Mercer’s theorem — the spectral statement#

Once you accept that PSD is the right condition, the natural next question is: what does the implicit feature map look like? The answer is Mercer’s theorem.

$$K(x, y) \;=\; \sum_{k=1}^{\infty} \lambda_k\, \phi_k(x)\, \phi_k(y),$$

with the series converging absolutely and uniformly on $\mathcal{X} \times \mathcal{X}.$

$$(T_K f)(x) \;=\; \int_{\mathcal{X}} K(x, y)\, f(y)\, dy,$$

i.e. $T_K \phi_k = \lambda_k \phi_k.$

This is the continuous analogue of “every symmetric PSD matrix has the eigendecomposition $G = \sum_k \lambda_k u_k u_k^\top$ .” The integral operator $T_K$ plays the role of the matrix; the eigenfunctions play the role of eigenvectors; the spectrum is now an infinite (but decaying) sequence rather than a finite list.

Proof sketch (step by step). The argument has four moves, each standard but worth naming.

(1) Compactness of $T_K.$ Because $K$ is continuous on the compact set $\mathcal{X} \times \mathcal{X},$ it is bounded, so $K \in L^2(\mathcal{X} \times \mathcal{X}).$ Operators of the form $T_K f(x) = \int K(x,y) f(y) dy$ with $K \in L^2$ are called Hilbert–Schmidt, and Hilbert–Schmidt operators are compact (the unit ball maps to a precompact set in $L^2$ ).

(2) Self-adjointness. Because $K$ is symmetric, $\langle T_K f, g \rangle = \int\!\!\int K(x,y) f(y) g(x) dy dx = \langle f, T_K g\rangle.$ So $T_K = T_K^*.$

(3) Spectral theorem. The spectral theorem for compact self-adjoint operators on a Hilbert space gives a complete orthonormal eigenbasis $\{\phi_k\} \subset L^2(\mathcal{X})$ with real eigenvalues $\{\lambda_k\}$ that accumulate only at $0$ (i.e. $\lambda_k \to 0$ ).

(4) Non-negativity and uniform convergence. Positive-definiteness of $K$ implies $\langle T_K \phi_k, \phi_k\rangle = \lambda_k \geq 0.$ Expanding $K$ in the eigenbasis gives $K(x, y) = \sum_k \lambda_k \phi_k(x) \phi_k(y)$ in $L^2.$ Mercer’s continuity assumption is what upgrades this to uniform convergence — that last step is the technical heart of Mercer’s original 1909 paper and the reason the theorem bears his name.

$$\Phi(x) \;=\; \big(\sqrt{\lambda_1}\,\phi_1(x),\; \sqrt{\lambda_2}\,\phi_2(x),\; \dots\big),$$ $$\langle \Phi(x), \Phi(y) \rangle_{\ell^2} \;=\; \sum_k \lambda_k \phi_k(x) \phi_k(y) \;=\; K(x, y).$$

So the kernel trick is not magic. There really is a Hilbert space $\mathcal{H} = \ell^2$ and a map $\Phi$ such that $K$ computes the inner product in $\mathcal{H}.$ Mercer just tells you what $\Phi$ is.

Worked example — first three RBF eigenfunctions on $[0, 1].$ The RBF kernel $K(x, y) = \exp(-\gamma(x - y)^2)$ has no closed-form eigenfunctions on $[0, 1]$ , but we can solve the eigenproblem numerically by discretising the integral operator on a grid. The procedure: take $m = 200$ evenly spaced points $x_i \in [0, 1]$ , form the $m \times m$ matrix $G_{ij} = K(x_i, x_j) / m$ (the $1/m$ is the integration weight), and compute its top eigenvalues and eigenvectors.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import numpy as np

m, gamma = 200, 25.0
xs = np.linspace(0, 1, m)
K = np.exp(-gamma * (xs[:, None] - xs[None, :]) ** 2) / m
w, V = np.linalg.eigh(K)
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]

# Top 3 eigenvalues and a renormalised eigenfunction sample.
for k in range(3):
    phi_k = V[:, k] * np.sqrt(m)        # L^2-normalise on [0, 1]
    sign = np.sign(phi_k[m // 2]) or 1  # fix sign convention
    phi_k *= sign
    print(f"lambda_{k+1} = {w[k]:.4f}, phi_{k+1}(0.5) = {phi_k[m // 2]:+.3f}")

Running this with $\gamma = 25$ gives roughly $\lambda_1 \approx 0.18,$ $\lambda_2 \approx 0.18,$ $\lambda_3 \approx 0.13.$ The first eigenfunction is bump-shaped and positive, the second is a single oscillation (one sign change), the third has two sign changes. The pattern continues: the $k$ -th eigenfunction has roughly $k - 1$ zeros on $[0, 1].$ This is the kernel analogue of Fourier modes and is exactly what fig 3 visualises on the symmetric interval $[-3, 3].$

Sanity check. Reconstruct $K$ from the top $r$ eigenpairs and measure the residual: $\|K - \sum_{k=1}^{r} \lambda_k \phi_k \phi_k^\top\|_F$ should drop quickly as $r$ grows. With $\gamma = 25$ and $r = 30$ , the residual is already below $10^{-6}$ in Frobenius norm relative to $\|K\|_F$ . This is the empirical evidence that the Mercer series converges fast for RBF.

The link to PCA in feature space. The Mercer eigenpairs are exactly the principal components of the (centered) data after the implicit feature map $\Phi.$ More precisely, the empirical kernel PCA on $n$ points computes the top eigenvectors of the centered Gram matrix $\tilde G = (I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top) G (I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top),$ and its eigenvalues are finite-sample estimators of the Mercer eigenvalues $\lambda_k$ scaled by $n.$ As $n \to \infty,$ the empirical eigenvalues converge to the population ones at rate $O(1/\sqrt{n}).$ This is why kernel PCA “just works” without ever materialising $\Phi$ — the Gram matrix already contains the information PCA needs, packaged in a finite-dimensional sub-matrix of the infinite-dimensional truth.

$$G_{ij} = K(x_i, x_j) = \sum_k \lambda_k \phi_k(x_i) \phi_k(x_j).$$

Stacked, $G = \Phi_n \Lambda \Phi_n^\top$ where $\Phi_n \in \mathbb{R}^{n \times \infty}$ has $(\Phi_n)_{ik} = \phi_k(x_i)$ and $\Lambda$ is the diagonal of eigenvalues. The top eigenvectors of $G$ are exactly weighted samples of the top Mercer eigenfunctions. The numerical worked example above is doing this in practice with $n = m = 200$ grid points; the only approximation is replacing the integral with a Riemann sum.

The geometry — where does the high-dimensional feature space come from#

Look at the picture below. It shows the first five Mercer eigenfunctions of the RBF kernel on $[-3, 3].$

Top 5 Mercer eigenfunctions of the RBF kernel and the eigenvalue decay

Three observations worth tattooing onto your brain:

The eigenfunctions look like smooth, increasingly oscillatory bumps. This is the kernel analogue of Fourier modes. The $k$ -th eigenfunction $\phi_k$ has roughly $k$ zero crossings; it captures features at a finer spatial scale than $\phi_{k-1}.$ The kernel “sees” similarity at all of these scales simultaneously and the spectrum tells you how strongly each scale contributes.

The eigenvalues decay fast. For RBF the decay is exponential. This is why low-rank approximations work — keeping the top $r$ Mercer components captures almost all of the kernel’s structure. Methods like Nystrom and Random Fourier Features explicitly exploit this decay to build cheap approximations of $\Phi.$ The relevant theoretical statement is that for the Gaussian kernel on $\mathbb{R}^d$ with bandwidth $\sigma$ and data drawn from a sub-Gaussian distribution, $\lambda_k$ decays roughly as $\exp(-c k^{1/d})$ — exponentially in dimension-corrected index.

The spectrum is genuinely infinite. For RBF, every $\lambda_k$ is strictly positive — the implicit feature space has infinite dimension. That sounds terrifying, but the rapid decay means only the first dozen or so dimensions carry meaningful “signal energy”. You get the expressive power of an infinite space with the effective complexity of a small one. This is the formal version of the slogan “RBF can fit anything but does not have to use everything.”

Connection to the polynomial kernel. For $K(x,y) = (x^\top y + 1)^d$ on $\mathbb{R}^p$ , the Mercer expansion is finite: it has exactly $\binom{p+d}{d}$ non-zero eigenvalues, one per monomial of degree $\leq d$ . The eigenfunctions are the monomials. So polynomial kernels really do give you the brute-force polynomial feature map — Mercer just makes it official.

Connection to Bochner and Fourier. For stationary kernels — meaning $K(x, y) = k(x - y)$ — Bochner’s theorem gives an alternative spectral representation: $K(x, y) = \int e^{i \omega^\top (x - y)} d\mu(\omega)$ for some non-negative measure $\mu.$ This is dual to Mercer in a precise sense. Mercer gives a discrete spectrum tied to the support set $\mathcal{X};$ Bochner gives a continuous spectrum on the frequency domain $\mathbb{R}^d.$ Random Fourier Features lifts the Bochner integral into a Monte-Carlo approximation: sample $\omega_j \sim \mu$ and use the random feature map $\Phi(x) = (\cos(\omega_j^\top x), \sin(\omega_j^\top x))_j.$ This is the bridge that lets you turn any stationary kernel into a finite-dimensional feature map without ever computing eigenfunctions.

Takeaway: PSD is the abstract condition; Mercer turns it into a concrete (often infinite, always orthogonal) feature expansion. The decay rate of $\{\lambda_k\}$ is what controls “effective dimensionality” and is the lever both Nystrom and Random Fourier Features pull on.

The PSD test, visually#

Sometimes the right way to internalise “positive-definite” is to see the quadratic form. The picture below plots $Q(v) = v^\top G v$ over a 2-D slice of $v$ for two different $3 \times 3$ Gram-like matrices.

Quadratic form vᵀGv for a PSD matrix vs an indefinite matrix

The left surface is a bowl that sits entirely on or above zero — that is what PSD means: every direction $v$ gives a non-negative answer. The right surface is a saddle that dips below zero in a whole half-plane (the red contour is the zero-level set). That zero-crossing is the geometric meaning of a negative eigenvalue.

When you feed the indefinite Gram matrix to an SVM solver, the optimiser tries to walk into that negative region — because the dual objective looks like $-\tfrac{1}{2} v^\top G v + \text{linear},$ and minimising a concave quadratic over a constraint set either runs off to infinity or hits the boundary at a meaningless corner. With a PSD $G,$ the quadratic part is bounded below (or above, depending on sign convention), and the dual is a proper convex problem; with an indefinite $G,$ the objective is unbounded in the wrong direction and the solver either diverges, complains, or returns nonsense. That is exactly the kernel matrix is not positive semidefinite failure.

Why the bowl-and-saddle picture is enough. In dimensions higher than $3$ you cannot draw the surface, but the qualitative dichotomy is unchanged. A real symmetric matrix is either PSD (every direction non-negative), negative semi-definite (every direction non-positive), or indefinite (some directions positive, some negative). The indefinite case is the only failure mode for kernels — negative semi-definite Gram matrices essentially never appear in practice because $G_{ii} = K(x_i, x_i) \geq 0$ for any reasonable similarity function, which forces at least one non-negative eigenvalue.

Eigenvalue spectra of the famous five#

The figure below shows the eigenvalue decay (log scale) of five widely-used kernels on the same 120 sample points from $[-3, 3].$

Eigenvalue spectra of linear, polynomial, RBF and Laplace kernels

A few things to read off:

Linear decays in one step: rank is $\min(n, d) = 1$ here because $X$ is 1-D. Everything beyond $\lambda_1$ is numerical noise.

Polynomial $(d=3)$ has rank exactly 4 — the four monomials $1, x, x^2, x^3$ — after which the spectrum cliff-drops to numerical zero.

RBF $(\gamma = 0.5)$ decays smoothly and slowly: many components carry meaningful weight. This is the visual signature of “infinite-dimensional but compressible”.

RBF $(\gamma = 2.0)$ decays much slower than $\gamma = 0.5$ — larger $\gamma$ means narrower bumps means finer details means richer feature space means slower spectral decay. This is the spectral version of bias-variance: large $\gamma$ buys representational richness but spends generalisation budget.

Laplace sits between RBF and polynomial: power-law decay $\lambda_k \sim k^{-2}$ rather than exponential. This is why Laplace kernels handle non-smooth signals better than RBF — they have more “high-frequency budget” because power-law tails put more mass on high-$k$ modes than exponential tails do.

Summary table.

KernelSpectrum shapeEffective rankTypical use
Linearrank $\min(n, d)$$d$Baseline; sparse/high-d data
Polynomial $d$rank $\binom{p+d}{d}$$O(p^d)$Image patches, ranking
RBFexponential decay$\sim \log n / \gamma$Default for moderate $n$
Laplace$\lambda_k \sim k^{-2}$$\sim n^{1/2}$Rough signals, time series
Sigmoidsometimes negativeAvoid; historical only

This single plot encodes most of the practical intuition about kernel choice. Faster decay = simpler implicit feature space = lower model capacity = harder to overfit. We will revisit this when we discuss bandwidth tuning in Part 5.

$$d_{\text{eff}}(\mu) = \sum_k \frac{\lambda_k}{\lambda_k + \mu}.$$

This counts “how many Mercer modes meaningfully participate” at the chosen regularisation level. For RBF on smooth data, $d_{\text{eff}}$ is small and grows logarithmically with $1/\mu;$ for Laplace it grows like $\mu^{-1/2}.$ Kernel ridge regression’s generalisation bound is $O(d_{\text{eff}} / n),$ so the spectrum directly controls statistical risk. Part 4 derives this bound in detail.

Operator norm and trace. Two scalar summaries of a kernel that recur in the theory:

  • $\|T_K\|_{\text{op}} = \lambda_1$ (largest eigenvalue), bounds how strongly the kernel can amplify any input function.
  • $\text{tr}(T_K) = \sum_k \lambda_k = \int_{\mathcal{X}} K(x, x) dx$ (trace), measures the total “energy” of the implicit feature map.

For RBF with bandwidth $\sigma$ on a bounded domain of volume $V$ : $\text{tr}(T_K) = V$ since $K(x, x) = 1.$ For polynomial $(x^\top y + c)^d$ on $[-1, 1]^p$ : trace $= \int (\|x\|^2 + c)^d dx,$ which grows like $(p + c)^d.$ The trace-vs-rank trade-off is exactly the kernel-equivalent of width-vs-depth in neural networks.

Building new kernels — the closure properties#

You almost never write a kernel from scratch. Instead you compose simple kernels using a small algebra of PSD-preserving operations. Knowing these rules lets you design custom kernels for structured data without re-proving PSD every time.

Rule 1 (sum). If $K_1$ and $K_2$ are kernels on $\mathcal{X},$ then so is $K_1 + K_2.$

Proof sketch: $v^\top (G_1 + G_2) v = v^\top G_1 v + v^\top G_2 v \geq 0.$ Each summand is non-negative by hypothesis, so the sum is.

Counter-example warning: $K_1 - K_2$ is not generally a kernel. Quick check: let $K_1 = K_{\text{linear}}$ and $K_2 = 2 K_{\text{linear}}.$ Then $K_1 - K_2 = -K_{\text{linear}},$ whose Gram matrix is $-X X^\top$ — strictly negative semi-definite, definitely not PSD unless $X = 0.$

Rule 2 (positive scalar multiple). If $K$ is a kernel and $c \geq 0,$ then $cK$ is a kernel. The proof is one line: $v^\top (cG) v = c (v^\top G v) \geq 0.$ Negative scalars break it for the same reason as subtraction.

Rule 3 (Hadamard product). If $K_1$ and $K_2$ are kernels, then the pointwise product $(K_1 K_2)(x, y) = K_1(x,y) K_2(x, y)$ is a kernel. This is the Schur product theorem: the entry-wise product of two PSD matrices is PSD. The proof uses the fact that $G_1 \otimes G_2$ (Kronecker product) is PSD when both factors are, and $G_1 \odot G_2$ (Hadamard) is a principal sub-matrix of $G_1 \otimes G_2.$ Non-trivial to prove, immensely useful.

Rule 4 (tensor product / direct sum on product spaces). If $K_1$ is a kernel on $\mathcal{X}_1$ and $K_2$ on $\mathcal{X}_2,$ then $K\big((x_1, x_2), (y_1, y_2)\big) = K_1(x_1, y_1) K_2(x_2, y_2)$ is a kernel on $\mathcal{X}_1 \times \mathcal{X}_2.$ This is how you handle mixed data types (e.g. numeric + categorical) cleanly. The feature space is the tensor product $\mathcal{H}_1 \otimes \mathcal{H}_2.$

Rule 5 (composition with $\phi$ ). If $f: \mathcal{X} \to \mathcal{Z}$ is any function and $K_Z$ is a kernel on $\mathcal{Z},$ then $K(x, y) = K_Z(f(x), f(y))$ is a kernel on $\mathcal{X}.$ The proof is trivial: any Gram matrix of $K$ is also a Gram matrix of $K_Z$ on the image points $\{f(x_i)\},$ hence PSD. This is the rule that lets you build kernels for strings, graphs, and other non-vector data: pick a feature extractor $f$ and lift any vector kernel.

Rule 6 (limits and infinite sums). A pointwise limit of kernels (when it exists) is a kernel. Therefore convergent power series with non-negative coefficients applied to a kernel yield kernels — in particular $\exp(K)$ is always a kernel, and so are $(1 - K)^{-1}$ wherever it converges, and any $\sum_k a_k K^k$ with $a_k \geq 0.$

Six kernel-combination rules visualised on the same 80-point grid

The figure above runs all six operations on the same underlying RBF and quadratic-polynomial kernels. Every resulting Gram matrix has a positive minimum eigenvalue (the title text confirms this). That is the algebraic closure in action.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# Build a custom kernel for (numeric, categorical) pairs.
def K_numeric(x_num, y_num, gamma=0.5):
    return np.exp(-gamma * np.sum((x_num - y_num) ** 2))

def K_categorical(x_cat, y_cat):
    return float(x_cat == y_cat)        # Kronecker delta — a valid kernel

def K_combined(p, q):
    # Rule 4: tensor product on product space
    return K_numeric(p[0], q[0]) * K_categorical(p[1], q[1])

# Numerical check of the subtraction counter-example.
X = np.random.default_rng(0).normal(size=(20, 2))
K1 = X @ X.T                            # linear kernel Gram
K2 = 2 * (X @ X.T)                      # 2x linear kernel Gram
diff = K1 - K2                          # = -linear, should NOT be PSD
print("min eig of K1 - K2:", np.linalg.eigvalsh(diff).min())   # negative

That tiny composition above is a full-fledged Mercer kernel by Rule 4 — you do not need to derive its eigenfunctions to use it. The subtraction check at the bottom is the empirical confirmation that “every rule has a counter-example for the rule it does not state”: you can add but not subtract, you can multiply by positive but not negative scalars.

Takeaway: sum, scalar (positive), product, tensor, composition, exp. Six rules; combine freely. Subtract and negative-scale at your peril.

Designing a string kernel via Rule 5 — a concrete example. Suppose you want to compare DNA sequences. Pick a $k$ -mer feature extractor: $f(s) \in \mathbb{R}^{4^k}$ counts how many times each length-$k$ substring appears in $s.$ Then by Rule 5, $K(s, t) = K_Z(f(s), f(t))$ is a valid kernel on strings for any vector kernel $K_Z.$ The most common choice is $K_Z(u, v) = u^\top v$ (linear), giving the “spectrum kernel” $K(s, t) = \sum_{w \in \Sigma^k} \#_w(s) \cdot \#_w(t).$ With $k = 4$ and a 4-letter alphabet you get a 256-dimensional explicit feature space; the kernel can be computed in $O(|s| + |t|)$ time using suffix trees, which is much cheaper than materialising both feature vectors. This pattern — pick a feature extractor with combinatorial structure, lift via Rule 5, exploit the structure for fast computation — is the template behind every “kernel on X” paper from the 2000s.

What is not a kernel — three famous traps#

Knowing what fails sharpens what passes. Three common functions that look fine but aren’t kernels, each with a numerical witness:

Trap 1: the sigmoid kernel $\tanh(\alpha x^\top y + c).$ This is famous in old neural-network-style SVMs. It is only PSD for a narrow band of $(\alpha, c)$ — and even then, only on bounded input regions. For most parameter choices, the Gram matrix has negative eigenvalues.

Sigmoid kernel: PSD for one parameter setting, indefinite for another

The right panel above shows what happens with $\alpha = 1.5, c = -2.0$ : more than half the eigenvalues are negative. Sklearn will accept this and produce a “model”, but the optimisation is mathematically meaningless. Most modern code defaults to RBF instead of sigmoid for exactly this reason.

1
2
3
4
5
# Numerical witness: sigmoid kernel fails PSD for typical parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
G_sig_bad = np.tanh(1.5 * X @ X.T - 2.0)
print("sigmoid min eig:", np.linalg.eigvalsh(G_sig_bad).min())   # negative

Trap 2: distance functions $K(x, y) = d(x, y).$ Euclidean distance is symmetric and feels similarity-like, but it is not a kernel — its diagonal is zero and off-diagonals are positive, so for the constant vector $\mathbf{1}$ the quadratic form $\mathbf{1}^\top G \mathbf{1} = \sum_{ij} d(x_i, x_j) > 0$ but the eigenvalue test fails. The Gaussian-of-distance, $K(x,y) = \exp(-\gamma d(x,y)^2),$ is a kernel, which is why RBF is built that way.

1
2
3
# Distances are not kernels.
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print("distance min eig:", np.linalg.eigvalsh(D).min())          # negative

Trap 3: $K(x, y) = \cos(\alpha \cdot d(x, y)).$ Symmetric, bounded, looks like a generalised similarity. Not a kernel: see the right panel of fig 1 — its Gram matrix has wildly negative eigenvalues. The oscillation kills positive-definiteness because the Fourier transform of $\cos(\alpha r)$ is a signed pair of delta spikes, violating the Bochner condition.

The Bochner safety net. For translation-invariant candidate kernels $K(x, y) = k(x - y),$ Bochner’s theorem says $k$ is a valid kernel iff its Fourier transform is a non-negative measure. So the RBF kernel works because its Fourier transform is Gaussian (non-negative); $\cos(\alpha \|x - y\|)$ fails because its Fourier transform is a difference of two delta spikes — signed, not non-negative. If you ever need to design your own stationary kernel, design it in the Fourier domain: pick a non-negative spectral density first, then take the inverse Fourier transform to get $k$ . This is exactly the design pattern behind the Matern family.

A 30-second proof tour of the five classic kernels#

For closure, here is why each of the five most-used kernels is PSD:

Linear: $K(x, y) = x^\top y.$ Gram matrix is $X X^\top$ which is always PSD (any $A^\top A$ is PSD because $v^\top A^\top A v = \|Av\|^2 \geq 0$ ). The rank is $\min(n, d).$ For high-dimensional sparse data this is often all you need; modern text classifiers with TF-IDF features are linear-kernel SVMs by default.

Polynomial: $K(x, y) = (x^\top y + c)^d$ with $c \geq 0, d \in \mathbb{N}.$ Sum of $\binom{p+d}{d}$ rank-1 PSD matrices, one per monomial via the multinomial expansion. Or just apply Rules 1, 2, 3, 6 to the linear kernel: $K_{\text{linear}} + c$ is PSD (constant kernel $c \cdot \mathbf{1}\mathbf{1}^\top$ is rank-1 PSD), and raising a kernel to integer power $d$ is iterated Hadamard, which preserves PSD by Rule 3.

RBF / Gaussian: $K(x, y) = \exp(-\gamma \|x - y\|^2).$ Bochner with Gaussian Fourier transform; or by Rule 6, expand $\exp$ as a power series in the (negative of the) squared distance and verify each term is PSD using the polynomial-kernel decomposition above.

Laplace: $K(x, y) = \exp(-\gamma \|x - y\|_1).$ Bochner with Cauchy-distribution Fourier transform — non-negative, so PSD. The fact that Laplace lives in $L^1$ rather than $L^2$ for its spectral density is what gives it the polynomial-decay spectrum we saw earlier. Equivalently, the Laplace kernel is the Gaussian-of-Cauchy convolution; the heavy tail of the Cauchy spectrum is what licenses the heavy tail of $\lambda_k$ in the $k$ -domain.

Sigmoid: not PSD in general (see Trap 1). Listed as a kernel for historical reasons; use RBF instead. The relationship $\tanh(\alpha x^\top y + c) = \tanh \circ K_{\text{linear, shifted}}$ does not invoke Rule 5 because $\tanh$ is not the kind of composition Rule 5 allows — Rule 5 composes the input with a function $f,$ not the output with a non-positive-coefficient series like $\tanh$ (which is $x - x^3/3 + \dots$ with alternating signs).

Numerical complexity in practice#

The PSD machinery has direct implications for runtime. A short table of what each operation costs and when it dominates:

StageCostMemoryDominant beyond
Build Gram $G$$O(n^2 d)$$O(n^2)$$n \sim 30{,}000$
Cholesky for KRR$\tfrac{1}{3} n^3$ flops$O(n^2)$$n \sim 10^4$
Eigendecomp for kernel PCA$O(n^3)$$O(n^2)$$n \sim 5000$
Conjugate gradient on PSD $G$$O(n^2 \cdot \kappa(G)^{1/2})$ per tol$O(n^2)$When $\kappa(G)$ is moderate
Nystrom rank-$m$ approx$O(n m^2 + m^3)$$O(nm)$The standard escape hatch above $n \sim 10^5$
Random Fourier Features$O(n m d)$$O(nm)$Stationary kernels only

The eigenvalue spectrum we have been discussing is what makes the bottom two rows possible: if $\lambda_k$ decays fast, a low-rank approximation captures essentially the full kernel with $m \ll n$ components. Part 7 will use this to scale RBF SVMs from $10^4$ to $10^7$ points.

Connection to neural networks — a single paragraph. The “neural tangent kernel” (NTK) of a wide neural network is a Mercer kernel whose eigenfunctions are spherical harmonics on the unit sphere $S^{d-1}$ and whose eigenvalues decay polynomially in the harmonic index. The fact that wide networks train in the “lazy regime” — moving each weight by $O(1/\sqrt{\text{width}})$ — is the modern reason to care about kernel theory beyond classical SVMs. The eigenvalue decay of the NTK is what controls how fast gradient descent fits each frequency component of the target function, which is why low-frequency targets are learned first (the “spectral bias” phenomenon). Everything in this post applies, with $\Phi$ replaced by the gradient of the network at initialisation.

Three case studies — PSD failures in the wild#

To anchor the theory, three real-world failure modes I have hit personally and how Mercer-aware thinking diagnoses each.

Case A — sigmoid SVM, finance time series, 2018. A tanh-kernel SVM trained on rolling-window returns gave bizarre AUC oscillations across cross-validation folds. Diagnosis: the dual was non-convex because the Gram matrix had 8 negative eigenvalues out of 250. The local minima the SMO solver returned depended on the random initialisation. Fix: swap to RBF, AUC stabilised, training time halved. Lesson: any time you see solver instability that is not fixed by tighter tolerance, eigendecompose the Gram matrix.

Case B — sequence kernel, NLP relation extraction, 2020. A “tree kernel” built as a sum of subtree counts was failing PSD on long sentences. Diagnosis: the implementation was missing a square in the bag-of-subtrees inner product, effectively computing $\sum_t \#_t(s) - \sum_t \#_t(t)$ rather than $\sum_t \#_t(s) \cdot \#_t(t).$ The Gram matrix entry was an unsigned count difference, not a similarity. Fix: one-line patch to make the kernel a proper Rule-5 lift. Lesson: when you write a custom kernel, write the explicit feature map next to it and verify $K(x, y) = \Phi(x)^\top \Phi(y)$ on three small examples.

Case C — Nystrom approximation, image features, 2021. A Nystrom-approximated RBF SVM ran fine on CIFAR-10 but blew up on STL-96 with LinAlgError: leading minor not positive definite. Diagnosis: with $m = 500$ Nystrom landmarks and 96-dim features, the Nystrom kernel matrix $\hat K_{mm}$ was rank-deficient because many landmarks were near-duplicates. Numerically, $\hat K$ had several eigenvalues at $10^{-14},$ below the jitter tolerance. Fix: increase jitter to $10^{-5}$ and de-duplicate landmarks with farthest-point sampling. Lesson: low-rank approximations preserve PSD only when the approximation is well-conditioned; add jitter generously.

Cross-cutting takeaway. All three failures are different surface symptoms of the same underlying fact: the optimiser assumed a PSD Gram matrix and got a sign-indefinite one. The fix is always either “use a proper PSD kernel” or “regularise the Gram matrix to make it PSD numerically.” Knowing the Mercer story tells you which knob to turn.

Comparison table — kernels at a glance#

A one-screen summary for picking a kernel under deadline:

KernelFormulaPSD argumentSpectrumHyperparamsBest for
Linear$x^\top y$$A^\top A$ is PSDrank $d$noneHigh-d sparse data
Polynomial$(x^\top y + c)^d$Rules 1, 2, 3, 6rank $\binom{p+d}{d}$$c, d$Image patches, ranking
RBF / Gaussian$\exp(-\gamma \|x - y\|^2)$Bochnerexp decay$\gamma$Default for $n < 10^5$
Laplace$\exp(-\gamma \|x - y\|_1)$Bochner (Cauchy spec)$\lambda_k \sim k^{-2}$$\gamma$Rough signals
Matern $\nu$scaled BesselBochner (Matern spec)$\lambda_k \sim k^{-(2\nu + d)/d}$$\nu, \ell$GP regression with smoothness control
Periodic$\exp(-\tfrac{2 \sin^2(\pix-y/p)}{\ell^2})$Bochnerline spectrum at multiples of $1/p$
String / spectrum$\sum_w \#_w(s) \#_w(t)$Rule 5 (lift from linear)finite, $\Sigma^k$
Sigmoid$\tanh(\alpha x^\top y + c)$NOT PSD in generalindefinite$\alpha, c$Avoid
Cosine of distance$\cos(\alpha \|x - y\|)$NOT PSDindefinite$\alpha$Counter-example only

The “PSD argument” column is the single most important one for theoretical work — it tells you which Mercer / Bochner / closure rule guarantees validity. The “spectrum” column is what controls the effective dimension and therefore the generalisation behaviour discussed in Parts 4 and 5.

Worked example — Mercer expansion of the degree-2 polynomial kernel#

$$K(x, y) = (x_1 y_1 + x_2 y_2 + 1)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1.$$ $$\Phi(x) = \big(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1\big) \in \mathbb{R}^6.$$

The $\sqrt{2}$ factors are there because the cross-terms get counted twice when you expand the dot product. Verification: $\Phi(x)^\top \Phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1,$ matching $K$ exactly.

The Mercer eigenfunctions in this case are the six normalised monomials $\{1, x_1, x_2, x_1^2, x_2^2, x_1 x_2\}$ (up to scaling by the underlying measure on $\mathcal{X}$ ); the eigenvalues are the coefficients $\{1, 2, 2, 1, 1, 2\}.$ This recovers the count $\binom{2 + 2}{2} = 6$ predicted by the polynomial-rank formula. For degree $d$ on $\mathbb{R}^p,$ the same construction gives the $\binom{p + d}{d}$ -dimensional explicit feature map; for $p = 100, d = 4,$ this is over 4 million features — which is exactly why people use the kernel trick instead of materialising $\Phi.$

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
import numpy as np
from itertools import combinations_with_replacement

def poly_feature_map(x, d):
    """Explicit feature map for (x^T y + 1)^d kernel on R^p."""
    p = len(x)
    feats = []
    for k in range(d + 1):
        for idx in combinations_with_replacement(range(p), k):
            # multinomial coefficient ensures K_explicit matches kernel
            from math import factorial
            counts = np.bincount(idx, minlength=p)
            coef = factorial(d) / (factorial(d - k) * np.prod([factorial(c) for c in counts]))
            feats.append(np.sqrt(coef) * np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.array([0.7, -0.3])
y = np.array([1.1,  0.4])
K_direct = (x @ y + 1) ** 2
K_explicit = poly_feature_map(x, 2) @ poly_feature_map(y, 2)
print(K_direct, K_explicit)            # should agree to floating point

This is the rare case where the implicit feature map is finite and you can write it out. For RBF the same exercise is impossible — the feature map has infinitely many components and you have to settle for Nystrom or random-feature approximations.

FAQ — questions students reliably ask#

Q: If positive-definite is the condition, why do textbooks sometimes say “conditionally positive definite” (CPD)? A: CPD relaxes the PSD condition to hold only for weight vectors $c$ with $\sum_i c_i = 0.$ Many distance-like functions, e.g. $-\|x - y\|^2,$ are CPD but not PSD. Krein-space SVMs and “indefinite kernel learning” work in this relaxed setting. For first-pass intuition, treat CPD as a footnote; PSD is what 99% of the kernel literature actually uses.

Q: My Gram matrix has $\lambda_{\min} = -10^{-13}.$ Is my kernel broken? A: Almost certainly not. That number is below the relative precision of float64. Add jitter $\epsilon I$ with $\epsilon \sim 10^{-6}$ and move on. If $\lambda_{\min}$ is bigger than $-10^{-8} \cdot \|G\|_2,$ the kernel is “PSD up to floating-point noise” and you can treat it as PSD operationally.

Q: Can I make any symmetric function into a kernel by adding $\lambda I$ for large $\lambda$ ? A: Yes — this is called “kernel ridge regularisation” or “spectral clipping” and it always works. The downside is that you have changed the kernel: the new function is $K'(x, y) = K(x, y) + \lambda \mathbb{1}[x = y],$ which usually does not have a natural interpretation. Better to pick a kernel that is PSD by construction.

Q: Does Mercer’s theorem require $\mathcal{X}$ compact? A: The classical version yes. Generalisations exist for non-compact domains (Mercer–Hilbert–Schmidt, integral operators with respect to a probability measure), and they give analogous spectral decompositions whenever the kernel is square-integrable. In practice, restricting to a compact bounding box is fine for data-driven applications.

Q: How does PSD interact with feature scaling? A: Scaling features $x \to D x$ for a diagonal $D$ corresponds to changing the kernel from $K(x, y)$ to $K(Dx, Dy).$ By Rule 5 this preserves PSD. So feature scaling is always safe with kernels — but it changes the effective bandwidth, so cross-validate $\gamma$ after re-scaling.

Q: Why do GP libraries always show a jitter or noise term? A: That is the $\epsilon I$ jitter discussed above, dressed up as a hyperparameter. In GPs it has a Bayesian interpretation as observation noise variance, but operationally it serves the same numerical purpose: keep the Cholesky factorisation from failing on near-singular Gram matrices.

Q: My SVM converged but the score on held-out data is awful — could the kernel be the cause? A: Three things to check, in order. First, eigendecompose the Gram matrix on a small subsample and look at $\lambda_{\min}.$ If it is significantly negative, you have an indefinite kernel and the converged “solution” is a saddle point of the dual. Second, look at the condition number $\kappa(G) = \lambda_{\max} / \lambda_{\min}.$ If $\kappa > 10^{10}$ the solver is fighting numerical noise; either add jitter or normalise features. Third, plot the spectrum: if all but one eigenvalue are essentially zero, your kernel is collapsing to rank one (e.g. RBF with $\gamma \to 0$ ) and is not separating classes. Each diagnostic points to a different fix.

Q: Is there a closed-form Mercer expansion for any non-trivial kernel? A: Yes for a small list. RBF on $\mathbb{R}$ with Gaussian measure has Hermite polynomials as eigenfunctions with explicitly known eigenvalues (Mehler’s formula). Polynomial kernel has monomials as eigenfunctions (worked out above). Periodic kernel on the circle has sinusoids. Most other kernels have no closed form and require numerical or asymptotic analysis. The Hermite-RBF case is by far the most studied because it is the natural pairing for Gaussian-distributed data.

Q: What is the “kernel mean embedding” people keep referring to in modern papers? A: Given a probability distribution $P$ on $\mathcal{X},$ its mean embedding is $\mu_P(\cdot) = \int K(\cdot, y) dP(y) \in \mathcal{H}.$ If the kernel is “characteristic” (RBF with $\gamma > 0$ qualifies), the map $P \mapsto \mu_P$ is injective, so distributions are uniquely identified by their kernel mean. This is the foundation of MMD two-sample tests, kernel ABC, and a lot of modern Bayesian inference. It is downstream of Mercer in the same way that all of $L^2$ analysis is downstream of the spectral theorem.

Q: When I have $n = 10^6$ points, can I still use Mercer-based thinking? A: Yes, but you stop materialising the full Gram matrix. Instead you sample $m \ll n$ Mercer modes — either via Nystrom (sample columns of $G$ ) or random features (sample frequencies $\omega \sim \mu$ from the Bochner spectral density) — and work with the resulting $m$ -dim feature representation. The accuracy of either approximation is controlled by the tail of the Mercer spectrum: if $\sum_{k > m} \lambda_k$ is small, the approximation is faithful. For RBF on smooth data, $m = O(\log n)$ usually suffices; for Laplace, $m = O(\sqrt{n})$ is closer to the truth.

Convergence of Mercer in practice#

$$\Big\| K - \sum_{k=1}^{r} \lambda_k \phi_k \otimes \phi_k \Big\|_{\text{op}} \;=\; \lambda_{r+1}.$$

So the rate is set entirely by the spectral decay. Three regimes worth knowing:

  • Exponential decay ($\lambda_k \sim e^{-c k}$ , e.g. RBF on smooth data): truncation error halves every $O(1/c)$ added modes. Practical low-rank approximations work spectacularly well.
  • Polynomial decay ($\lambda_k \sim k^{-\alpha}$ , e.g. Laplace, Matern with low $\nu$ ): truncation error decreases slowly. Need many more modes for a given accuracy. This is the warning sign that Nystrom or RFF will need a large $m.$
  • Heavy-tailed / no decay: kernel is essentially full-rank on the data. Either the data is genuinely high-dimensional (think one-hot encoded categorical with many levels) or the kernel was poorly chosen.

This spectral picture is also the right way to read kernel learning curves. A common diagnostic plot is “log $\lambda_k$ vs $k$ ” on a sample of held-out data: a straight line means exponential decay, a curve means polynomial. Both are informative; flat is the alarm signal.

Closing thoughts#

The two theorems of this post — the Gram-matrix characterisation of PSD kernels, and Mercer’s spectral decomposition — together do all the heavy lifting that makes the rest of kernel theory work. Everything downstream (RKHS, representer theorem, kernel ridge bounds, GP marginal likelihoods, Nystrom error analysis) is a corollary or refinement of these two facts. If you can recite both in your sleep and convert between the explicit feature map and the kernel formula on a degree-2 polynomial without looking it up, you are equipped to read any kernel-methods paper from the last thirty years.

The historical thread is worth knowing too. Mercer’s 1909 paper was buried in functional analysis until the 1990s when Vapnik et al. realised the kernel trick made SVMs computationally feasible. The “kernel revolution” of 1995–2005 was largely a story of re-deriving every classical ML method (PCA, CCA, logistic regression, ICA) in the kernel framework and discovering each time that the same Mercer theorem licensed the move. The post-2017 NTK era is the third wave: rediscovering that wide neural networks are also kernel methods in disguise. The math has not moved much; the appreciation for what it makes possible keeps expanding.

If you want a single mental model for the rest of the series, here it is. A kernel is a PSD function. Mercer turns it into an orthogonal eigendecomposition with eigenvalues $\{\lambda_k\}$ and eigenfunctions $\{\phi_k\}.$ The eigenvalues control capacity; the eigenfunctions are the implicit features. Every kernel method you will meet — SVM, KRR, GP, kernel PCA, MMD, characteristic embeddings — is a different way of using $\{\lambda_k, \phi_k\}$ under the hood, exposed to the user only through the Gram matrix $G_{ij} = K(x_i, x_j).$ If you can keep this picture in mind, the rest of the series is just unpacking the algorithm-by-algorithm consequences.

What’s next#

You now know what a kernel is, how to test one, and where its hidden feature space comes from. Part 3 takes the same Mercer expansion and reorganises it into the Reproducing Kernel Hilbert Space (RKHS) — the actual function class that kernel SVMs, Gaussian processes, and kernel ridge regression all live in. The RKHS view is what gives you the representer theorem, the structural reason why kernel methods are computationally tractable despite working in (sometimes infinite) feature spaces.

If you walk away with one thing, let it be this. PSD is not a technicality. It is the entire reason the kernel trick is allowed. Get the Gram matrix right and everything else — SVMs, GPs, KRR, kernel PCA — follows from the same handful of theorems. Get it wrong and your solver will tell you, loudly. The whole rest of this series is going to assume you have absorbed these two facts; if any future post stops making sense, scroll back here and re-read the Mercer section.


This is Part 2 of Kernel Methods (8 parts). Previous: Part 1 — Why We Need Kernel Methods · Next: Part 3 — RKHS: The Theoretical Soul of Kernel Methods

In this series

Kernel Methods 8 parts

  1. 01 Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
  2. 02 Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem you are here
  3. 03 Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
  4. 04 Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
  5. 05 Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
  6. 06 Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
  7. 07 Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features
  8. 08 Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub