
Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
Gaussian Processes turn kernels into a Bayesian model — posterior with uncertainty, marginal likelihood for hyperparameters, and the kernel as a prior over functions.
Kernel ridge regression gives you a number. You feed it $x_*$ , it returns $\hat{y}_* = 23.7$ . End of story. But you wanted to act on that prediction — maybe schedule a delivery, dose a patient, place a bet — and the single number is not enough. Tomorrow’s temperature being “25°C” is useful; “very likely 25°C, 95% chance between 22 and 28” is actionable. Every decision under uncertainty needs the second one. Gaussian Processes are the cleanest way to upgrade a kernel method from “point predictor” to “distribution predictor”, and they do it without abandoning a single line of the kernel math from the previous five parts.

Part 5 closed the door on the frequentist kernel toolbox: SVM, PCA, ridge — all point estimators, all minimising some loss. This part opens the Bayesian door. The kernel becomes a prior over functions; the data updates that prior into a posterior; the posterior gives you not just a mean prediction but a full distribution at every test point. The magic is that everything is Gaussian, so everything stays closed-form. No MCMC, no variational lower bound, no Monte Carlo error — just one matrix inversion and you have analytic posterior mean, variance, and even the marginal likelihood for hyperparameter tuning.
What is a Gaussian Process: a distribution over functions#
$$f(\cdot) \sim \mathcal{GP}\!\left(m(\cdot),\; K(\cdot, \cdot)\right),$$ $$\mathbf{f} \;\sim\; \mathcal{N}\!\big(m(X),\; K(X, X)\big),$$where $m(X)_i = m(x_i)$ and $K(X, X)_{ij} = K(x_i, x_j).$ That’s it. The infinite-dimensional object “a random function” reduces, whenever you actually evaluate it, to a finite-dimensional Gaussian.
Why this works as a model. A finite-dimensional Gaussian is the most studied probability distribution in mathematics. Marginalising, conditioning, and integrating all stay Gaussian, all in closed form. Saying “$f$ is a GP” is equivalent to promising “every question you can ask me about $f$ at finitely many points is a Gaussian question”. That promise is generous enough to model almost any smooth function, and strict enough to give analytic answers.
$$f \sim \mathcal{GP}(0, K) \quad\Longleftrightarrow\quad \text{everything is in } K.$$The kernel IS the prior. Pick an RBF kernel and you’ve declared that $f$ is smooth, infinitely differentiable, with a characteristic length scale $\ell$ . Pick a Matérn-1/2 and you’ve declared $f$ is continuous but not differentiable. Pick a periodic kernel and you’ve declared $f$ repeats every $p$ units. Each kernel choice puts non-zero probability mass on a different class of functions; the data then picks out the most plausible member of that class. Part 4 was a menu of priors; Part 6 is what you do with them.
GP prior: sampling functions before seeing data#
$$\mathbf{f}_* \sim \mathcal{N}(0, K(X_*, X_*)).$$So sampling reduces to sampling from a multivariate Gaussian: Cholesky-factorise $K(X_*, X_*) = LL^\top$ , draw $\mathbf{z} \sim \mathcal{N}(0, I)$ , return $L\mathbf{z}$ .
| |

Three things to read off the figure. First, all samples pass through the zero mean on average — the prior has no preferred location. Second, the roughness is determined by the kernel: RBF samples are silky smooth, Matérn-5/2 has visible texture, periodic samples actually repeat. Third, the amplitude is controlled by the kernel’s output variance $\sigma_f^2$ (here $\sigma_f^2 = 1$ , so samples mostly stay within $\pm 2$ ).
The numerical jitter. The line K += 1e-6 * np.eye(n) is not optional. Mercer eigenvalues decay rapidly, so $K$
is almost always ill-conditioned or numerically rank-deficient. Adding a tiny diagonal — “jitter” — guarantees Cholesky succeeds without changing the model meaningfully. Standard values are $10^{-6}$
for double precision; if Cholesky still fails, bump it to $10^{-4}.$
This is the most common debugging step in any GP code, including expensive libraries.
Reading the prior as a belief. Drawing prior samples is not just a visualisation trick — it’s a sanity check on your kernel choice. If your prior samples wiggle 50 times across the domain, you’ve picked too small a length scale. If they’re nearly straight lines, you’ve picked too large a one. Always plot a handful of prior samples before fitting; if they look nothing like the kind of function you expect, change the kernel before you change anything else.
GP posterior: conditioning on observations#
$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \;\sim\; \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^\top & K_{**} \end{bmatrix}\right),$$where $K = K(X, X)$ is $n \times n$ , $K_* = K(X, X_*)$ is $n \times m$ , $K_{**} = K(X_*, X_*)$ is $m \times m$ , and the $\sigma_n^2 I$ block accounts for the i.i.d. observation noise.
$$\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}\right),$$ $$\mu_{b|a} = \mu_b + \Sigma_{ba} \Sigma_{aa}^{-1}(\mathbf{a} - \mu_a), \qquad \Sigma_{b|a} = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}.$$ $$\boxed{\;\mu_* = K_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*.\;}$$These two lines are the central equations of GP regression. Everything else is implementation detail.
What each piece means.
- $(K + \sigma_n^2 I)^{-1} \mathbf{y}$ is a length-$n$ vector — call it $\boldsymbol{\alpha}$ . It does not depend on the test point.
- $\mu_*(x) = \sum_i \alpha_i K(x_i, x)$ — a weighted sum of kernel evaluations at the training points, exactly like Kernel Ridge.
- $\Sigma_*$ shrinks from the prior $K_{**}$ by an amount determined by how much information the training set carries about the test points.
- The diagonal of $\Sigma_*$ gives the predictive variance at each test point — this is the thing Kernel Ridge does not give you.

The figure shows the posterior at $n = 0, 2, 4, 7$ observations. With no data, the posterior is just the prior — a wide band centred at zero. As points come in, the mean snaps to them and the variance pinches to near-zero locally. Away from the data, the variance grows back to the prior level — the model honestly tells you “I don’t know”. That growth-of-uncertainty is what makes GPs uniquely suitable for Bayesian optimisation, active learning, and any “do I need more data here?” question.
GP regression in practice: a worked example#
Time for a real fit. We’ll use the closed-form equations directly — no library — so the moving parts are visible.
| |

Two important read-offs from the figure. Left panel uses $\sigma_n = 0.1$ — the model believes the data is nearly noise-free, so it bends the posterior mean to chase every point, including the two outliers. Right panel uses $\sigma_n = 0.4$ — the model accepts that observations are noisy, treats the outliers as such, and recovers a much smoother estimate that matches the true function better. The 95% credible interval ($\mu_* \pm 2\sigma_*$ ) is wider in the right panel, honestly reflecting the larger noise.
The Cholesky pattern. Notice we never call np.linalg.inv on $K + \sigma_n^2 I$
. Instead we Cholesky-factorise it once, then solve two triangular systems. This is $\sim 3\times$
faster and much more numerically stable than explicit inversion. It is also re-usable: $\boldsymbol{\alpha}$
does not depend on $x_*$
, so you compute it once at training time, then $O(n)$
per prediction for the mean, $O(nm + n^2 m)$
for the full posterior covariance over $m$
test points.
Three knobs to turn.
- Length scale $\ell$ . Larger $\ell$ → smoother predictions, longer-range correlations, wider extrapolation. Smaller $\ell$ → more wiggle, faster collapse of uncertainty away from training points.
- Signal variance $\sigma_f^2$ . Sets the prior amplitude. If your data ranges over $\pm 5$ , $\sigma_f^2 \approx 25$ is reasonable; if it ranges over $\pm 0.1$ , $\sigma_f^2 \approx 0.01.$
- Noise variance $\sigma_n^2$ . How much you trust each observation. Tied directly to how aggressively the mean fits individual points.
We could set all three by cross-validation as in Part 5 . But GPs come with a far more elegant tool.
Hyperparameter learning: the marginal likelihood#
$$p(\mathbf{y} \mid X, \theta) \;=\; \int p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid X, \theta) \, d\mathbf{f}.$$ $$\mathbf{y} \mid X, \theta \;\sim\; \mathcal{N}\!\big(\mathbf{0},\; K_\theta + \sigma_n^2 I\big),$$ $$\boxed{\;\log p(\mathbf{y} \mid X, \theta) \;=\; -\tfrac{1}{2} \mathbf{y}^\top (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} \;-\; \tfrac{1}{2} \log |K_\theta + \sigma_n^2 I| \;-\; \tfrac{n}{2} \log 2\pi.\;}$$This single scalar function of $\theta$ is what you maximise to fit hyperparameters. No validation split. No cross-validation folds. Just call an optimiser.
Three-term decomposition. Read the three terms in order:
- Data fit $-\tfrac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ . Larger (closer to zero) when the kernel explains the residuals well. Pushes toward more flexible models.
- Complexity penalty $-\tfrac{1}{2} \log |K + \sigma_n^2 I|$ . The log-determinant is larger for kernels that allow more diverse functions. Pushes toward simpler models.
- Normalisation $-\tfrac{n}{2} \log 2\pi$ . A constant in $\theta$ , irrelevant to optimisation.
The fit term and the complexity term pull in opposite directions. The optimum is where they balance — exactly the Occam’s-razor sweet spot.

The figure plots $\log p(\mathbf{y} \mid \theta)$ over $(\log \ell, \log \sigma_n)$ for an RBF kernel on 25 noisy sine samples. The landscape has a clear ridge along $\ell \in [0.5, 1.5]$ and $\sigma_n \in [0.2, 0.5]$ , with a unique maximum near $(\ell, \sigma_n) \approx (0.9, 0.3)$ — close to the true noise. Any standard optimiser (L-BFGS) finds it in a few dozen iterations.
$$\frac{\partial \log p(\mathbf{y} \mid \theta)}{\partial \theta_k} \;=\; \tfrac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial K}{\partial \theta_k} \boldsymbol{\alpha} \;-\; \tfrac{1}{2} \operatorname{tr}\!\left((K + \sigma_n^2 I)^{-1} \frac{\partial K}{\partial \theta_k}\right),$$with $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1} \mathbf{y}$
. Both terms reuse the Cholesky factorisation, so each gradient evaluation costs the same $O(n^3)$
as one likelihood evaluation. Plug this and the likelihood into scipy.optimize.minimize(method='L-BFGS-B') and you’re done.
| |
Why the parameters live in log-space. All three parameters are positive (a length cannot be negative; neither can a variance). Parametrise with $\log \theta$ and L-BFGS gets to use unconstrained optimisation in $\mathbb{R}^3$ , while the actual hyperparameters automatically stay positive. This is a standard trick that prevents 90% of “GP doesn’t fit” issues.
Occam’s razor, mechanically#
Marginal likelihood implements Occam’s razor automatically. It doesn’t need to be told; it falls out of the math. To see why, look at the same data fit with three RBF length scales:

Top row: three GP regressions on 20 sin-like points using $\ell = 0.1$ (wiggly overfit), $\ell = 0.7$ (just right), $\ell = 4.0$ (almost a straight line). Bottom row: the marginal likelihood decomposition for each.
Read off the three regimes:
- $\ell = 0.1$ (overfit). Data fit term is high — the model passes through every point, residuals are tiny. But complexity penalty is very low (large $\log |K + \sigma_n^2 I|$ ) — the kernel matrix is nearly identity, so its determinant is huge. Net: low total.
- $\ell = 0.7$ (just right). Data fit is still high (the smooth model explains the trend), and the complexity penalty is moderate. Net: highest total.
- $\ell = 4.0$ (underfit). Data fit collapses — the model can’t explain the wiggle. Complexity penalty is high (small log-determinant), but it can’t compensate. Net: low total.
This is not a trade-off you have to design. It is the integration over functions doing the work for you. Models that can represent too many functions get penalised by the log-determinant; models that can represent too few get penalised by the residuals. The optimum is the model whose function class is just expressive enough for the data.
Compared with cross-validation. CV requires holding out data and refitting many times — expensive, especially for small datasets. Marginal likelihood uses all the data, requires no folds, and returns gradients for free. The catch: with many hyperparameters and small noisy data, the marginal likelihood surface can have multiple local optima. Standard mitigations: restart the optimiser from several random initialisations, or impose weak priors (log-normal) on length scales to regularise the search.
GP classification: from Gaussian to non-Gaussian likelihood#
$$p(y_i = 1 \mid f_i) = \sigma(f_i) = \frac{1}{1 + e^{-f_i}}.$$The posterior $p(f \mid \mathbf{y})$ is no longer Gaussian either. Three standard fixes:
- Laplace approximation. Find the posterior mode $\hat{\mathbf{f}}$ by Newton’s method, then approximate the posterior by a Gaussian centred at $\hat{\mathbf{f}}$ with covariance equal to the inverse Hessian of the log-posterior. Fast, deterministic, often surprisingly good.
- Expectation Propagation (EP). Iteratively replace each Bernoulli factor with a Gaussian chosen to match the first two moments of the full posterior. More accurate than Laplace, slower, sometimes unstable.
- MCMC. Hamiltonian Monte Carlo on the full posterior. Most accurate, slowest, what you do when the above two disagree.
scikit-learn’s GaussianProcessClassifier uses Laplace by default, which is good enough for most jobs:
| |

The left panel shows the predicted $P(y=1 \mid x)$ — a smooth probabilistic decision boundary. The right panel shows the predictive entropy: high near the decision boundary, low deep inside each class region, also high in regions of input space far from any training point. That last property is what an SVM cannot give you. An SVM hands a confident label to every point in $\mathbb{R}^2$ , even a point a million units away from the training data; a GP says “I have no idea”. For safety-critical applications — medical diagnosis, autonomous driving, fraud — this distinction matters.
GP regression equals Kernel Ridge — plus a free uncertainty#
$$\mu_*(x) = K_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} \;\;\equiv\;\; \hat{f}_{\mathrm{KRR}}(x).$$The two methods compute the same number. From a point-prediction standpoint, GP regression with Gaussian noise and kernel ridge regression are interchangeable; the matrix inversion is the same, the answer is the same, the cost is the same.
What differs is what else you get back. Kernel ridge stops at $\mu_*$ . The GP also returns $\Sigma_*$ — the predictive covariance — for the same $O(n^3)$ work. The GP additionally gives you a principled way to learn $\lambda$ (via marginal likelihood) instead of cross-validating.
Conceptual takeaway. GP and KRR are the same regressor wearing two costumes. KRR derives the answer from “minimise regularised squared loss in an RKHS”. GP derives it from “compute the posterior mean of a Gaussian process prior with noisy observations”. Same destination, different roads. The Bayesian road carries more luggage — the posterior covariance and the marginal likelihood — at no extra arithmetic cost. That’s why for any application that cares about uncertainty, the GP formulation is strictly preferable.
The $O(n^3)$ wall#
Every GP equation we’ve written involves $(K + \sigma_n^2 I)^{-1}$ , a Cholesky on an $n \times n$ matrix. That costs:
- Compute: $O(n^3)$ for the Cholesky, $O(n^2)$ per prediction.
- Memory: $O(n^2)$ to store $K$ .
For $n = 1{,}000$ , this is comfortable: $10^9$ flops takes seconds, $8\,\mathrm{MB}$ for the Gram matrix fits in cache. For $n = 10{,}000$ , you have $10^{12}$ flops and $800\,\mathrm{MB}$ — still possible, but minutes per iteration of the optimiser, hours for hyperparameter learning. For $n = 100{,}000$ and beyond, the matrix doesn’t fit in RAM and the cubic cost kills you regardless.
This is the only serious objection to GPs as a general-purpose model. Neural networks scale to billions of examples; vanilla GPs choke at tens of thousands. The entire research subfield of sparse and scalable GPs exists to push this wall back:
- Inducing-point methods (FITC, VFE, SVGP) compress $K$ down to a rank-$m$ approximation with $m \ll n$ , dropping costs to $O(nm^2)$ compute and $O(nm)$ memory.
- Random feature approximations (Random Fourier Features) replace the kernel with an explicit finite-dimensional feature map, dropping to linear cost in $n$ for many kernels.
- Kronecker and Toeplitz structure exploits grid-like inputs to factorise the Gram matrix.
- Iterative solvers (conjugate gradients) avoid Cholesky and approximate the matrix-vector products with GPU-friendly kernels.
These are the topic of Part 7 . The framework is identical to everything we built here — only the linear algebra changes.
Practical gotchas#
A short list of things that go wrong in real GP code:
- Not standardising inputs. RBF length scales are in input units. If $x$ ranges over $[0, 10^6],$ the default $\ell = 1$ makes every pair look identical. Always standardise both inputs and outputs to zero mean and unit variance before fitting.
- Forgetting jitter. Cholesky failure with the cryptic “matrix is not positive definite” is almost always solved by adding $10^{-6} I$ to $K$ .
- One length scale per dimension (ARD). For multidimensional inputs, give each dimension its own length scale: $K(x, y) = \exp(-\sum_d (x_d - y_d)^2 / (2 \ell_d^2)).$ This is called Automatic Relevance Determination, and the learned $\ell_d$ tell you which features the model thinks matter.
- Single optimisation run. Marginal likelihood has multiple local optima for small $n$ . Always run the optimiser from $\geq 5$ random restarts and keep the best.
- Predictive mean only. If you’re going to use the mean and throw away the variance, why are you using a GP? You paid $O(n^3)$ for the variance — use it.
- No noise term. $\sigma_n^2 = 0$ makes the kernel matrix singular and the model overconfident. Even if your data is “deterministic” (e.g. computer simulation output), set $\sigma_n^2 \geq 10^{-4}$ as a numerical stabiliser.
When to use GPs (and when not)#
Use GPs when:
- $n \lesssim 10^4$ and you need calibrated uncertainty.
- You’re doing Bayesian optimisation, active learning, or any sequential decision-making.
- You have very little data and need a strong inductive bias (the kernel) plus honest error bars.
- You want to learn hyperparameters from data without cross-validation.
Don’t use GPs when:
- $n \gg 10^4$ and the problem doesn’t have exploitable structure (grids, low-rank kernels, etc.). Use sparse GPs (Part 7 ) or neural networks.
- The function is highly multimodal in input space (deep neural networks model this more naturally).
- The likelihood is something exotic that breaks the conjugacy. Stan/PyMC with a custom model may serve better than wedging it into a GP framework.
- The features are huge (images, text). Use a deep net to learn embeddings, then optionally put a GP on top (deep kernel learning — Part 8 ).
What’s next#
Part 6 turned the kernel into a Bayesian model: a prior over functions, a posterior with uncertainty, a marginal likelihood that learns its own hyperparameters. The arithmetic is identical to Part 5 — one Cholesky on a Gram matrix — and the additional cost is exactly zero. The price you pay is the $O(n^3)$ scaling that bites hard above ten thousand points.
Part 7 breaks that wall. Two algorithms — Nyström approximation and Random Fourier Features — bring kernel methods (including GPs) from $O(n^3)$ down to $O(n)$ or $O(nm^2)$ , opening up datasets of millions of points without abandoning the kernel framework. The trade-off is explicit: a tunable approximation knob $m$ that controls how close you stay to the exact kernel solution. We’ll work through both algorithms, derive when each approximation is tight, and benchmark them on real-scale problems.
This is Part 6 of Kernel Methods (8 parts). Previous: Part 5 — Kernel SVM, PCA, Ridge · Next: Part 7 — Large-Scale Kernels (Nystrom + RFF)
Kernel Methods 8 parts
- 01 Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
- 02 Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem
- 03 Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
- 04 Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
- 05 Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
- 06 Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference you are here
- 07 Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features
- 08 Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide