
Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
Reproducing Kernel Hilbert Space — the function space where kernel methods live. The reproducing property, the representer theorem, and why finite-data optimization works in infinite dimensions.
If your eyes glaze over the moment a lecturer writes “RKHS” on the board, this part of the series is for you. RKHS is not a club of three intimidating letters — it is a function space, and once you see what lives inside it, kernel methods stop feeling like magic and start feeling like linear algebra you already know.

The aim of this article is small and focused. We want to answer three questions: what function space does a kernel define, what can you do inside it (the reproducing property and the representer theorem), and why that gives you a finite optimisation problem even when the space is infinite-dimensional. By the end you should be able to read a paper that says “the optimal $f$ lies in $\mathcal{H}_K$ ” without flinching.
Hilbert space in five minutes#
Before we can talk about a reproducing kernel Hilbert space, we need the plain version. A Hilbert space is a vector space with two extra layers on top.
Layer 1: inner product. A Hilbert space $\mathcal{H}$ has an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ — a bilinear, symmetric, positive-definite function that takes two vectors and returns a scalar. The inner product gives you length ($\|f\| = \sqrt{\langle f, f\rangle}$ ), angle ($\cos\theta = \langle f, g\rangle / (\|f\|\|g\|)$ ), and orthogonality ($f \perp g \iff \langle f, g\rangle = 0$ ). Everything geometric you know in $\mathbb{R}^d$ still works.
Layer 2: completeness. Every Cauchy sequence in $\mathcal{H}$ converges to a limit inside $\mathcal{H}$ . This sounds technical but it is what separates a Hilbert space from a merely “inner-product space” — completeness means you can take limits without falling out of the space, which lets you talk about infinite sums, integrals, and infinite-dimensional projections without anxiety.
The two canonical examples sit at opposite extremes of dimension:
- Finite-dimensional: $\mathbb{R}^d$ with the dot product $\langle u, v\rangle = u^\top v$ . Familiar.
- Infinite-dimensional ($L^2$ ): $L^2([a,b])$ , the space of square-integrable real-valued functions on $[a, b]$ , with the inner product $\langle f, g\rangle_{L^2} = \int_a^b f(x) g(x)\, dx$ . Vectors here are functions; their inner product is an integral.
- Infinite-dimensional ($\ell^2$ ): $\ell^2(\mathbb{N})$ , the space of square-summable real sequences $a = (a_1, a_2, \dots)$ with $\sum_k a_k^2 < \infty$ , with the inner product $\langle a, b\rangle_{\ell^2} = \sum_{k=1}^\infty a_k b_k$ . Vectors here are infinite sequences; their inner product is an infinite sum.
The two infinite-dimensional examples are not just academic. $L^2$ is the natural home of Fourier analysis and quantum wavefunctions. $\ell^2$ is what you get if you expand any $L^2$ function in a fixed orthonormal basis and look at its coefficient sequence. By Parseval, the map “function to coefficients” is an isometric isomorphism $L^2 \cong \ell^2$ — they are the same Hilbert space wearing different clothes.
Completeness proof sketch (why $\ell^2$ is complete). Let $a^{(n)}$ be a Cauchy sequence in $\ell^2$ . Each coordinate $(a^{(n)}_k)_n$ is then Cauchy in $\mathbb{R}$ (because $|a^{(n)}_k - a^{(m)}_k| \leq \|a^{(n)} - a^{(m)}\|_{\ell^2}$ ), so converges to some $a_k$ . Set $a = (a_k)_k$ . A short $\varepsilon$ -argument plus Fatou’s lemma shows $a \in \ell^2$ and $a^{(n)} \to a$ in $\ell^2$ . The same template works for $L^2$ — Cauchy sequences pin down a pointwise-almost-everywhere limit which is then shown to live in the space.
The picture is worth a paragraph. The left panel is $\mathbb{R}^2$ with two arrows; the inner product is their dot product, and you can read off the projection $\mathrm{proj}_u v = (\langle u, v\rangle / \|u\|^2) u$ as a third arrow. The right panel does exactly the same thing but with functions: the two coloured curves are $f$ and $g$ , the shaded area is $f(x) g(x)$ , and the inner product $\langle f, g\rangle$ is the signed area under that product curve. Same geometry, infinite dimensions.
Tiny numerical demo of the $L^2$ inner product. A 20-line snippet to make the abstraction concrete:
| |
You should view the integral approximation $\sum_i u_i v_i \cdot \Delta x$ as the bridge from $\ell^2$ to $L^2$ : increase the resolution and the $\ell^2$ inner product of the discretised functions converges to the $L^2$ inner product of the originals. This is exactly the isometry $\ell^2 \cong L^2$ in numerical form.
Why we care. Kernel methods will live inside a particular Hilbert space — a space whose “vectors” are real-valued functions on whatever set $\mathcal{X}$ your data lives on. The inner product on that space will turn out to be exactly the kernel $K$ you choose. The completeness condition is what guarantees that limits of finite sums of kernels (which is what kernel ridge regression actually outputs) are still legitimate members of the space.
A tiny mental model: think of a Hilbert space as “$\mathbb{R}^\infty$ with an inner product and no convergence pathology.” If that sentence settles in your gut, you are ready for what follows.
| Property | $\mathbb{R}^d$ | $\ell^2$ | $L^2([a,b])$ | $\mathcal{H}_K$ (RKHS) |
|---|---|---|---|---|
| Vectors | $d$ -tuples | infinite sequences | functions (eq. classes) | functions (pointwise) |
| Inner product | $u^\top v$ | $\sum_k a_k b_k$ | $\int f g \, dx$ | $\langle f, g\rangle_{\mathcal{H}_K}$ |
| Dimension | $d$ | $\infty$ | $\infty$ | depends on $K$ |
| Point eval $f \mapsto f(x)$ | trivial | n/a | not defined | continuous |
That last row is what makes the RKHS special, and it is the next section.
What is RKHS: the reproducing property#
A Hilbert space is a generic construction; an RKHS is a Hilbert space of functions with one extra property — the reproducing property — that makes function evaluation behave like an inner product.
Here is the formal statement. A Hilbert space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$ is a reproducing kernel Hilbert space if there exists a function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ — the reproducing kernel — such that for every $x \in \mathcal{X}$ and every $f \in \mathcal{H}$ ,
$$f(x) \;=\; \langle f, \; K(\cdot, x)\rangle_{\mathcal{H}}.$$Read that equation very carefully. The notation $K(\cdot, x)$ means: fix the second slot to $x$ , and treat the result as a function of the first slot — so $K(\cdot, x)$ is itself an element of $\mathcal{H}$ . It is the “evaluation feature” at $x$ . The claim is that evaluating any $f$ at $x$ is the same thing as taking the inner product of $f$ with that evaluation feature.
Why is this strange? Function evaluation $f \mapsto f(x)$ is usually a fragile operation. For a generic $L^2$ function, $f(x)$ at a single point is not even well-defined — you can change $f$ on a measure-zero set without changing it as an $L^2$ vector. RKHS demands that the space be small enough that pointwise evaluation is not only well-defined, but is a continuous linear functional. The Riesz representation theorem then says every continuous linear functional is given by an inner product with some element, and that element is exactly $K(\cdot, x)$ . RKHS = “Hilbert space where point evaluation is continuous.”
The picture. I find the easiest way to see it is concretely.

Left panel: the blue curve is some $f \in \mathcal{H}_K$ constructed as a sum of three RBF kernels centred at $\{-1.6, 0.2, 1.4\}$ with chosen coefficients. The red curve is the kernel slice $K(\cdot, x_0)$ at $x_0 = 0.6$ — a single Gaussian bump centred there. The green dot marks $f(x_0)$ , the value the blue curve takes at $x_0$ . Middle panel: the equation says exactly that the green dot equals the inner product of blue and red. Right panel verifies it numerically — by the structure of $f$ , the inner product reduces to $\sum_i \alpha_i K(x_i, x_0)$ , which equals the green dot.
Worked example: the linear kernel $K(x,y) = xy$ on $\mathbb{R}$ , end-to-end. Take $\mathcal{X} = \mathbb{R}$ and $K(x, y) = xy$ . We will derive $\mathcal{H}_K$ , its inner product, and verify the reproducing identity by hand.
The kernel slice at $y$ is $K(\cdot, y): z \mapsto zy$ — a linear function through the origin with slope $y$ . So every kernel slice is of the form $f_w(z) = wz$ for some slope $w \in \mathbb{R}$ . Finite linear combinations $\sum_i \alpha_i K(\cdot, x_i)$ are $z \mapsto (\sum_i \alpha_i x_i) z$ , again a linear-through-origin function with slope $w = \sum_i \alpha_i x_i$ . The pre-Hilbert space $\mathcal{H}_0$ is therefore $\{f_w : w \in \mathbb{R}\} \cong \mathbb{R}$ , and it is already complete (it is one-dimensional, so trivially so) — no completion is needed. $\mathcal{H}_K = \mathcal{H}_0$ .
The induced inner product is
$$\langle f_w, f_v\rangle_{\mathcal{H}_K} \;=\; \langle \alpha K(\cdot, w), \beta K(\cdot, v)\rangle_{\mathcal{H}_0} \;=\; \alpha \beta K(w, v) \;=\; \alpha \beta \, w v.$$Choosing $\alpha = 1, x_1 = w$ (so $f_w = K(\cdot, w)$ ) and similarly $\beta = 1, y_1 = v$ gives $\langle f_w, f_v\rangle = wv$ — just multiplication of slopes. Now verify reproducing: $\langle f_w, K(\cdot, x)\rangle = \langle f_w, f_x\rangle = wx = f_w(x)$ . ✓
This collapses to the punchline: the RKHS of $K(x,y) = xy$ is $\mathbb{R}$ itself, where each real number $w$ is identified with the linear function $z \mapsto wz$ . The same logic extended to $\mathbb{R}^d$ gives $\mathcal{H}_K \cong \mathbb{R}^d$ for $K(x,y) = x^\top y$ , with $\langle f_w, f_v\rangle = w^\top v$ . The “infinite-dimensional Hilbert space” is just $\mathbb{R}^d$ wearing a function-space costume.
Why this matters operationally. Once you accept the reproducing property, you can do something powerful: replace any pointwise evaluation in a derivation with an inner product. That is the move that turns “what is $f$ ?” (a function-space search) into “what is the vector $\sum_i \alpha_i K(\cdot, x_i)$ ?” (a finite-dimensional search). Hold on to this — it is the single most important fact in this article.
Constructing the RKHS from the kernel#
We have been talking as if “the RKHS of a kernel $K$ ” obviously exists. Let’s actually build it. The construction is short and concrete and worth doing once in your life.
Step 1: start with kernel slices. Pick any finite set of points $\{x_1, \dots, x_n\} \subset \mathcal{X}$ . Consider the $n$ functions $K(\cdot, x_1), \dots, K(\cdot, x_n)$ . These are our basis vectors.
Step 2: form finite linear combinations. Define
$$\mathcal{H}_0 \;=\; \left\{ \sum_{i=1}^n \alpha_i K(\cdot, x_i) \;:\; n \in \mathbb{N},\; x_i \in \mathcal{X},\; \alpha_i \in \mathbb{R}\right\}.$$This is a vector space — a sum of two such combinations is another such combination. We are not yet a Hilbert space; we have no inner product and no completeness.
Step 3: define the inner product. For two elements $f = \sum_i \alpha_i K(\cdot, x_i)$ and $g = \sum_j \beta_j K(\cdot, y_j)$ of $\mathcal{H}_0$ , define
$$\langle f, g\rangle_{\mathcal{H}_0} \;=\; \sum_{i,j} \alpha_i \beta_j \, K(x_i, y_j).$$This is well-defined (does not depend on how you write $f$ as a combination) precisely because $K$ is positive definite — that is exactly the condition needed for this bilinear form to be a valid inner product. Symmetry of $K$ gives bilinearity and symmetry; positive-definiteness gives $\langle f, f\rangle \geq 0$ with equality only when $f \equiv 0$ .
Step 4: check the reproducing property holds on $\mathcal{H}_0$ . Take $f = \sum_i \alpha_i K(\cdot, x_i)$ and any $x \in \mathcal{X}$ . Then
$$\langle f, K(\cdot, x)\rangle_{\mathcal{H}_0} \;=\; \sum_i \alpha_i K(x_i, x) \;=\; f(x).$$It works by construction.
Step 5: complete it. $\mathcal{H}_0$ is an inner-product space but is not complete. There exist Cauchy sequences of finite combinations whose pointwise limits are functions not expressible as finite combinations — for an RBF kernel, this includes most smooth functions on $\mathcal{X}$ . The standard analysis move is to add in all such limits: define $\mathcal{H}_K$ as the completion of $\mathcal{H}_0$ with respect to the norm $\|f\|_{\mathcal{H}_0} = \sqrt{\langle f, f\rangle}$ . The inner product and the reproducing property both extend by continuity to $\mathcal{H}_K$ .
Complexity of the construction. Although the limit space $\mathcal{H}_K$ is infinite-dimensional in general, every element you ever construct from data is in $\mathcal{H}_0$ — a finite combination of $n$ kernel slices. Inner-product computation between two such elements is $O(n^2)$ kernel evaluations; norm computation is the same; evaluation of $f$ at a new point $x$ is $O(n)$ . The infinite-dimensional space lives only in the theoretical proof; the runtime is always polynomial in $n$ .
Python: simulating a finite-rank RKHS. Below we build the pre-Hilbert space $\mathcal{H}_0$ for an RBF kernel over five fixed centres and numerically verify the reproducing property.
| |
Run it and the two numbers agree to machine precision. The reproducing identity is not a metaphor; it is a literal equality that holds in code.

The figure walks through the three constructive steps for an RBF kernel. Left: the kernel slices at three points are smooth bumps centred at those points. Middle: linear combinations of those bumps form a small, finite-dimensional space $\mathcal{H}_0$ of “spike-like” functions. Right: by taking finer grids and more bumps, Cauchy sequences in $\mathcal{H}_0$ converge to arbitrary smooth functions in $\mathcal{H}_K$ — the limit is the full RKHS.
The theorem that wraps this up is Moore–Aronszajn: every positive-definite kernel $K$ defines a unique RKHS $\mathcal{H}_K$ for which $K$ is the reproducing kernel, and vice versa. The correspondence is one-to-one — kernels and RKHSs are two sides of the same coin.
RKHS dimension catalogue.
| Kernel | $\mathcal{X}$ | $\dim \mathcal{H}_K$ |
|---|---|---|
| Linear $x^\top y$ | $\mathbb{R}^d$ | $d$ |
| Polynomial degree $p$ | $\mathbb{R}^d$ | $\binom{d+p}{p}$ |
| RBF $\exp(-\gamma\|x-y\|^2)$ | $\mathbb{R}^d$ | $\infty$ |
| Matern $\nu = 3/2$ | $\mathbb{R}^d$ | $\infty$ |
| Periodic | $\mathbb{R}$ | $\infty$ |
The kernel chooses both the function space and its size; you do not pick them independently.
Mercer’s view versus the reproducing view#
We now have two seemingly different ways to talk about what is “inside” a kernel:
- Mercer’s view (Part 2): $K(x, y) = \sum_k \lambda_k \phi_k(x) \phi_k(y)$ , with the implicit feature map $\phi(x) = (\sqrt{\lambda_k}\,\phi_k(x))_k$ .
- Reproducing view (this part): $K$ is the reproducing kernel of a function space $\mathcal{H}_K$ , and $f(x) = \langle f, K(\cdot, x)\rangle$ .
Are these really the same object? Yes — and the bridge is elegant.
Bridge. Given Mercer’s decomposition, the natural feature map $\phi: \mathcal{X} \to \ell^2$ sends $x$ to the infinite sequence $(\sqrt{\lambda_k}\,\phi_k(x))_k$ . The map from features to functions sends a sequence $a = (a_k)_k \in \ell^2$ to the function $f_a(x) = \sum_k a_k \sqrt{\lambda_k}\, \phi_k(x)$ . The RKHS $\mathcal{H}_K$ is the image of this map — all functions of the form $f_a$ for $a \in \ell^2$ — with norm $\|f_a\|_{\mathcal{H}_K} = \|a\|_{\ell^2}$ . The reproducing property is then a one-line computation:
$$\langle f_a, K(\cdot, x)\rangle_{\mathcal{H}_K} \;=\; \langle a, \phi(x)\rangle_{\ell^2} \;=\; \sum_k a_k \sqrt{\lambda_k}\,\phi_k(x) \;=\; f_a(x). \;\;\checkmark$$So the Mercer feature map and the RKHS are linked by an isometric isomorphism: every $f \in \mathcal{H}_K$ corresponds to a coefficient vector $a \in \ell^2$ , and the RKHS inner product corresponds to the $\ell^2$ inner product on coefficients. Mercer gives you a basis for the RKHS; the reproducing view gives you an evaluation rule.
The isomorphism, drawn. Here is the picture worth memorising:

Figure: Mercer view and reproducing view are two parameterisations of the same RKHS — the inner products on each end are equal.
The double bars mean “equal in value”. You can compute the RKHS inner product of two functions by computing the $\ell^2$ inner product of their coefficient sequences and vice versa. The two pictures are not in tension — they live on opposite sides of the same isometry.
Worked example: degree-2 polynomial kernel on $\mathbb{R}^2$ . Take $K(x, y) = (x^\top y)^2$ on $\mathcal{X} = \mathbb{R}^2$ . Expanding,
$$K(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \phi(x)^\top \phi(y),$$where $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^\top \in \mathbb{R}^3$ . The Mercer eigenfunctions are $\phi_1 = x_1^2$ , $\phi_2 = \sqrt{2}\, x_1 x_2$ , $\phi_3 = x_2^2$ , all with eigenvalue $\lambda = 1$ . The RKHS $\mathcal{H}_K$ is the 3-dimensional space of homogeneous quadratics
$$\mathcal{H}_K = \{f_a(x) = a_1 x_1^2 + a_2 \sqrt{2}\, x_1 x_2 + a_3 x_2^2 : a \in \mathbb{R}^3\},$$with $\|f_a\|_{\mathcal{H}_K}^2 = a_1^2 + a_2^2 + a_3^2 = \|a\|_{\ell^2}^2$ . Reproducing check: $\langle f_a, K(\cdot, x)\rangle = \langle f_a, f_{\phi(x)}\rangle = a^\top \phi(x) = f_a(x)$ . ✓ Both views agree, and you can compute in whichever is more convenient — the basis view is short here because $\dim \mathcal{H}_K = 3$ .
Which view to use when?
- Use Mercer when you want to think geometrically about the implicit feature space — eigenvalue decay, spectral approximations (Nystrom, random Fourier features), or analysing what frequencies a kernel can represent.
- Use the reproducing view when you want to prove things about the optimisation problem — existence of minimisers, representer theorems, generalisation bounds. The inner-product calculus is what makes these proofs short.
They are not competitors; they are different lenses on the same Hilbert space. A working researcher freely switches between them mid-derivation.
The Representer Theorem#
Here is the result that everything in this article has been building towards. It is one of those rare statements that is short, easy to prove, and immediately changes what you can do with kernels.
Theorem (Representer, Kimeldorf–Wahba 1971; Schölkopf–Herbrich–Smola 2001). Let $\mathcal{H}_K$ be the RKHS of a kernel $K$ , let $\{(x_i, y_i)\}_{i=1}^n$ be a finite training set, and let $L: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be any function (the empirical loss) and $\Omega: [0, \infty) \to \mathbb{R}$ be a strictly increasing function (the regulariser). Then any minimiser $f^*$ of
$$\min_{f \in \mathcal{H}_K} \;L\bigl(f(x_1), \dots, f(x_n)\bigr) \;+\; \Omega\bigl(\|f\|_{\mathcal{H}_K}\bigr)$$admits the representation
$$f^*(x) \;=\; \sum_{i=1}^n \alpha_i \, K(x_i, x)$$for some $\alpha \in \mathbb{R}^n$ .
In words: no matter how exotic the loss, no matter how high-dimensional or infinite-dimensional $\mathcal{H}_K$ is, the optimiser is always a finite linear combination of training-data kernel slices.
Full proof by orthogonality. Let $V = \mathrm{span}\{K(\cdot, x_i)\}_{i=1}^n \subset \mathcal{H}_K$ — a finite-dimensional subspace of size at most $n$ . Since $V$ is closed (finite-dim subspaces of Hilbert spaces always are), every $f \in \mathcal{H}_K$ has a unique orthogonal decomposition
$$f \;=\; f_\parallel + f_\perp, \quad f_\parallel \in V, \quad f_\perp \in V^\perp,$$where $V^\perp = \{g \in \mathcal{H}_K : \langle g, h\rangle = 0 \text{ for all } h \in V\}$ .
Now compute what each piece contributes to the objective.
Empirical loss term. By the reproducing property,
$$f(x_i) \;=\; \langle f, K(\cdot, x_i)\rangle \;=\; \langle f_\parallel, K(\cdot, x_i)\rangle + \langle f_\perp, K(\cdot, x_i)\rangle.$$Since $K(\cdot, x_i) \in V$ and $f_\perp \in V^\perp$ , the second inner product is exactly zero. Therefore $f(x_i) = f_\parallel(x_i)$ for every $i$ , which means $L(f(x_1), \dots, f(x_n)) = L(f_\parallel(x_1), \dots, f_\parallel(x_n))$ . The loss term depends only on $f_\parallel$ .
Regulariser term. By Pythagoras (which holds in any Hilbert space because of orthogonality),
$$\|f\|^2 \;=\; \|f_\parallel\|^2 + \|f_\perp\|^2 \;\geq\; \|f_\parallel\|^2,$$with equality if and only if $f_\perp = 0$ . Since $\Omega$ is strictly increasing,
$$\Omega(\|f\|) \;\geq\; \Omega(\|f_\parallel\|),$$again with equality iff $f_\perp = 0$ .
Combining. The total objective satisfies
$$L(f(x_1),\dots) + \Omega(\|f\|) \;\geq\; L(f_\parallel(x_1),\dots) + \Omega(\|f_\parallel\|),$$with strict inequality whenever $f_\perp \neq 0$ . So any minimiser must have $f_\perp = 0$ , which is exactly $f^* \in V = \mathrm{span}\{K(\cdot, x_i)\}$ , i.e., $f^*(x) = \sum_i \alpha_i K(x_i, x)$ . $\square$
The proof uses three ingredients only — orthogonal decomposition, the reproducing property, and the monotonicity of $\Omega$ . Everything else about $\mathcal{H}_K$ , including its possibly infinite dimension, is irrelevant.

The figure makes the orthogonality argument visceral. The big dashed ellipse is $\mathcal{H}_K$ . The blue line is the $n$ -dimensional subspace $V$ spanned by the training-data kernels. A candidate $f$ sits somewhere in $\mathcal{H}_K$ . Decompose it into its projection onto $V$ (green) and its perpendicular complement (purple). The perpendicular part contributes nothing to the data fit (by the reproducing property) but pays a positive cost in the norm penalty. The optimiser kills it.
Why the Representer Theorem matters#
A theorem is only as good as what it lets you do. Here the consequences are large.
Infinite-dim search becomes $n$ -dim solve. The naive way to think about kernel-ridge or kernel-SVM is: “we are minimising over $\mathcal{H}_K$ , which could be infinite-dimensional, that sounds impossible.” The representer theorem replies: “no, you are really minimising over the $n$ -dim subspace spanned by your training kernels, that is a finite vector space, you can solve it with linear algebra.” The minimisation reduces from an abstract function-space problem to a concrete $\mathbb{R}^n$ optimisation in the coefficients $\alpha$ .
You never need to materialise $\phi(x)$ . Even if the implicit feature map $\phi(x)$ has $10^6$ or infinitely many coordinates, you never write it down. The representer theorem says the answer can be expressed entirely in terms of kernel evaluations $K(x_i, x)$ , which are $O(1)$ to compute. This is the structural reason kernel methods sidestep the curse of dimensionality.
Application 1: kernel ridge regression, derived end-to-end#
Let’s see the theorem in action. Solve
$$\min_{f \in \mathcal{H}_K} \;\sum_{i=1}^n (y_i - f(x_i))^2 \;+\; \lambda \|f\|^2_{\mathcal{H}_K}.$$By the representer theorem, $f(x) = \sum_i \alpha_i K(x_i, x)$ for some $\alpha \in \mathbb{R}^n$ . Plug it in. Let $K \in \mathbb{R}^{n \times n}$ be the Gram matrix $K_{ij} = K(x_i, x_j)$ and $\mathbf{k}(x)$ the column vector $\mathbf{k}(x)_i = K(x_i, x)$ . Then:
- $f(x_i) = (K\alpha)_i$ , so the data-fit term is $\|y - K\alpha\|^2$ .
- The RKHS norm is $\|f\|^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) = \alpha^\top K \alpha$ .
The problem becomes the finite-dim quadratic
$$\min_{\alpha \in \mathbb{R}^n} \;\|y - K\alpha\|^2 \;+\; \lambda \,\alpha^\top K \alpha.$$Setting the gradient with respect to $\alpha$ to zero:
$$-2 K (y - K\alpha) + 2\lambda K \alpha = 0 \;\;\Longrightarrow\;\; K(K + \lambda I)\alpha = K y \;\;\Longrightarrow\;\; \alpha = (K + \lambda I)^{-1} y.$$(Where we cancel one $K$ assuming it is invertible; the case where $K$ is rank-deficient gives the same predictor by a limiting argument.) The predictor is
$$\hat f(x) \;=\; \mathbf{k}(x)^\top (K + \lambda I)^{-1} y.$$This is the dual form. Every appearance of $\phi(x)$ has been eliminated; the entire pipeline runs on kernel evaluations.

Compare the two formulations side by side. Primal: solve a $D \times D$ system in feature space (where $D$ is the dimension of the implicit feature map, possibly infinite). Dual: solve an $n \times n$ system in coefficient space (where $n$ is your training-set size). When $D > n$ — which is almost always if you are using a non-trivial kernel — the dual is strictly cheaper, and when $D = \infty$ , the dual is the only feasible route.
Application 2: kernel SVM (the hinge-loss case)#
The same machinery applies to classification. The soft-margin kernel SVM solves
$$\min_{f \in \mathcal{H}_K, b \in \mathbb{R}} \;\frac{1}{2}\|f\|^2_{\mathcal{H}_K} + C \sum_{i=1}^n \max(0, 1 - y_i (f(x_i) + b)),$$with $y_i \in \{-1, +1\}$ . The hinge loss $\max(0, 1 - y_i z_i)$ depends on $f$ only through evaluations $f(x_i)$ , and the regulariser is strictly increasing in $\|f\|$ , so the representer theorem applies. The optimiser is again $f^*(x) = \sum_i \alpha_i K(x_i, x)$ .
Substituting and going through Lagrangian duality gives the classical SVM dual
$$\max_{\alpha \in \mathbb{R}^n} \;\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$ . The decision function is $\hat f(x) = \sum_{i \in \text{SV}} \alpha_i y_i K(x_i, x) + b$ , summed only over support vectors (those with $\alpha_i > 0$ ). The KKT conditions are exactly what gives the sparsity: only training points on or inside the margin contribute. The representer theorem is what guarantees that a finite expansion suffices in the first place; KKT is what makes most coefficients zero.
| Method | Primal cost | Dual cost | Notes |
|---|---|---|---|
| Linear ridge | $O(d^2 n + d^3)$ | $O(n^3)$ | use primal if $d \ll n$ |
| Kernel ridge | infeasible if $D = \infty$ | $O(n^3)$ | dual is only option |
| Kernel SVM | infeasible if $D = \infty$ | $O(n^3)$ QP | sparse: support vectors |
| Kernel SVR | infeasible if $D = \infty$ | $O(n^3)$ QP | epsilon-tube |
The finite-data miracle.

The plot on the left shows a kernel-ridge fit to eight noisy training points. The green prediction is exactly a sum of eight scaled kernel slices (the faint blue curves) — one per training point. The plot on the right is the dimensional accounting: $\mathcal{H}_K$ is enormous (drawn schematically as $10^6$ ; for an RBF kernel it is genuinely infinite), but the subspace that contains the optimal $f$ is $n = 8$ -dimensional. You solve an $8 \times 8$ system; you sit inside an infinite-dimensional space; nothing breaks.
This is the structural fact behind every kernel SVM, every Gaussian process, every kernel PCA, every spectral clustering algorithm. It is the reason kernels are not just a cute trick but a deep theoretical framework.
What’s next#
We have now built the conceptual machine: positive-definite kernels exist (Part 2), they generate function spaces with nice evaluation rules (this part), and finite data lets us search those spaces with finite linear algebra (representer theorem).
The next obvious question is which kernel to actually use. There are families with different smoothness, periodicity, and locality properties — RBF, polynomial, Matern, periodic, linear, sigmoid — and each one shapes the implicit RKHS in a different way. Part 4 surveys the common kernel families, explains what each one assumes about your data, and gives you a default for every common setting.
This is Part 3 of Kernel Methods (8 parts). Previous: Part 2 — Mathematical Foundations · Next: Part 4 — Common Kernel Families
Kernel Methods 8 parts
- 01 Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
- 02 Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem
- 03 Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods you are here
- 04 Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
- 05 Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
- 06 Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
- 07 Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features
- 08 Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide