Series · Kernel Methods · Chapter 7

Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features

Kernel methods are O(n^3). Nystrom approximation and Random Fourier Features pull them back to linear time without giving up the kernel trick's expressive power.

You want to train an RBF SVM on a million-image classification set. The Gram matrix is $10^6 \times 10^6$ doubles, which is 8 TB. That number alone — eight terabytes of RAM, just to store the kernel — is why most working data scientists who learned kernel methods in a stats class quietly never reach for them on real production workloads. The kernel trick gives you an infinite-dimensional feature space for the cost of one dot product per pair; the bill arrives when you have $n^2$ pairs.

Nystrom and Random Fourier Features: two routes from O(n^3) to linear time

Parts 1 through 6 of this series built the kernel-methods story with no scaling caveat: lift the data into an RKHS (Parts 1 -3 ), pick a kernel that matches your assumptions (Part 4 ), solve a convex optimisation problem to get an SVM (Part 5 ), or do exact Bayesian inference with a GP (Part 6 ). This part introduces the two industrial-grade hacks that broke kernel methods out of the $n \lesssim 10^4$ ghetto: Nystrom approximation and Random Fourier Features (RFF). Both reduce kernel methods to linear methods on an explicit low-dimensional feature map, and you trade exactness for being able to fit in 16 GB of RAM.

The computational curse: why kernels do not scale#

Three numbers tell the whole story for a kernel method on $n$ training points.

  1. Memory: $O(n^2)$ . You have to store the Gram matrix $K \in \mathbb{R}^{n \times n}$ to either invert it (GP, kernel ridge), factorise it (SVM dual), or even just compute its action on a vector.
  2. Training time: $O(n^3)$ . Inverting or Cholesky-factorising a dense $n \times n$ matrix is cubic. SVM dual solvers like SMO are nominally $O(n^2)$ per epoch but the constant factor and the number of epochs combine to land you back at cubic-ish on hard problems.
  3. Prediction time per query: $O(n \cdot d)$ . By the representer theorem, the solution is $f(x) = \sum_i \alpha_i K(x_i, x).$ Even if most $\alpha_i = 0$ (in SVM, only support vectors survive), you typically end up with thousands of them on real data, and each requires a $d$ -dimensional kernel evaluation.

Memory and training-time blow-up at four sample sizes

To put concrete numbers on it: at $n = 10^4,$ the Gram matrix is 800 MB and a $K^{-1}$ takes a few seconds. At $n = 10^5,$ it’s 80 GB — already past most workstation RAM — and the inversion is hours. At $n = 10^6,$ it’s the 8 TB figure from the opening, and the inversion would take weeks even on a top-tier GPU. Meanwhile, a linear model on the same data runs in minutes and a deep network in hours. So the unfortunate truth is that exact kernel methods are a $10^3$$10^4$ technology in 2026, and everything beyond that is some form of approximation.

The two paths. There are essentially two ways to attack the $O(n^2)$ memory wall:

  • Approximate the kernel matrix. Replace $K$ with a low-rank or block-structured matrix that can be stored and inverted cheaply. This is Nystrom approximation and its many cousins (Random sampling, leverage scores, conjugate-gradient with preconditioning).
  • Approximate the feature map. Build an explicit, finite-dimensional $\hat\phi(x) \in \mathbb{R}^D$ such that $\hat\phi(x)^T \hat\phi(y) \approx K(x, y),$ then run a linear model on $\hat\phi(X).$ This is Random Fourier Features (and Quasi-Random / Orthogonal RFF and Fastfood and a dozen other variants).

The first is data-dependent: it looks at your $X$ and picks a clever low-rank basis. The second is data-independent: it draws random frequencies from a kernel-specific distribution and uses them as a universal explicit basis. Both reduce a kernel SVM or kernel ridge regression to a $D$ -dimensional linear problem with $D \ll n,$ and that’s the whole game. The rest of this article works out the maths and the practical recipes.

Nystrom approximation: the core idea#

The Nystrom method (originally a 19th-century quadrature trick for integral equations, adapted to ML by Williams and Seeger in 2001) starts from one observation: the Gram matrix of an RBF or Matern kernel almost always has rapidly decaying eigenvalues. The “infinite-dimensional feature space” of the RBF kernel is real, but in any finite sample, the spectrum decays exponentially. A few hundred eigenvalues typically capture 99% of the trace.

If the matrix is approximately low-rank, then a smart subsample of $m \ll n$ rows and columns ought to capture most of the action. That is exactly what Nystrom does.

$$K = \begin{pmatrix} K_{mm} & K_{mn-m} \\ K_{n-m,m} & K_{n-m, n-m} \end{pmatrix},$$ $$\hat K = K_{nm}\, K_{mm}^{-1}\, K_{mn},$$

where $K_{mn} = K_{nm}^T.$ This is rank at most $m,$ stored in $O(nm),$ inverted in $O(m^3),$ and exact on the $m \times m$ block $K_{mm}$ (you can check: $\hat K_{II} = K_{mI} K_{mm}^{-1} K_{Im} = K_{II}$ ).

The intuition. You are saying “for predicting at any new point, I’ll go through the landmarks”. The landmarks act as a kind of fixed basis through which all other points are expressed.

Three strategies for picking inducing points

Nystrom: the mathematical derivation#

Let me unpack why $\hat K = K_{nm} K_{mm}^{-1} K_{mn}$ is the right approximation, and not some other low-rank guess. Two derivations: one via the integral equation Nystrom inherited, one via the eigendecomposition.

$$K_{mm} = U_m \Lambda_m U_m^T,$$ $$\Phi_n = K_{nm}\, U_m\, \Lambda_m^{-1/2} \in \mathbb{R}^{n \times m}.$$ $$\Phi_n \Phi_n^T = K_{nm}\, U_m\, \Lambda_m^{-1}\, U_m^T\, K_{mn} = K_{nm}\, K_{mm}^{-1}\, K_{mn} = \hat K.$$

So the Nystrom approximation is equivalent to using a finite-dimensional explicit feature map of dimension $m,$ derived from a kernel PCA on the landmark subset.

This is the punch line: Nystrom turns your kernel method into a linear method in $\mathbb{R}^m.$ You compute $\Phi_n$ once at cost $O(nm \cdot d + m^3),$ then run a linear SVM / linear ridge regression on it in $O(nm^2)$ time and $O(nm)$ memory. Compared to the original $O(n^3)$ time and $O(n^2)$ memory, this is a massive win when $m \ll n.$

$$\hat\psi_k(x) \approx \frac{1}{\sqrt{\lambda_k}} \cdot \frac{1}{\sqrt{m}} \sum_{j=1}^m u_{j,k} K(x, x_{i_j}).$$

The original 1928 Nystrom method used uniform quadrature; the modern ML version uses random sampling because that’s how the empirical measure looks.

$$\|K - \hat K\|_F^2 \;\leq\; \|K - K_r\|_F^2 + \epsilon \cdot \mathrm{tr}(K),$$

provided $m \gtrsim r \log r / \epsilon^2.$ The key takeaway: error is bounded by the spectral tail $\|K - K_r\|_F$ . If your kernel’s spectrum decays fast (RBF on smooth data: yes), small $m$ works. If it decays slowly (RBF on noisy high-dimensional data: maybe), you need bigger $m.$

Spectral decay of Gram matrices vs the Nystrom rank-m approximation

Numerical caveat. $K_{mm}^{-1}$ is dangerous when the landmarks are close to one another (then $K_{mm}$ is near-singular). In practice you use the pseudoinverse $K_{mm}^+,$ or a regularised inverse $(K_{mm} + \mu I)^{-1}$ with small $\mu > 0,$ or a truncated eigendecomposition that keeps only $\lambda_k$ above a threshold. sklearn.kernel_approximation.Nystroem does the truncation by default.

Nystrom in practice with scikit-learn#

The user-facing recipe is dead simple.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.5, n_components=500, random_state=0),
    LogisticRegression(max_iter=2000, C=1.0),
)
pipe.fit(X_train, y_train)
print("test acc:", pipe.score(X_test, y_test))

That’s it. Nystroem is a feature-map transformer that produces the $m$ -dimensional features $\Phi_n$ from above; everything downstream is plain linear-model machinery. Three knobs to think about:

  • n_components (which is $m$ ). The single most important hyperparameter. Bigger $m$ = better approximation = slower training. Start at $m = 256$ for prototyping; push to 500-2000 for production; the marginal accuracy from $m > 5000$ is usually not worth the time.
  • kernel and its parameters. Same syntax as sklearn.metrics.pairwise.PAIRWISE_KERNEL_FUNCTIONS: "rbf", "polynomial", "sigmoid", "laplacian", or a callable. Tune gamma for RBF / Laplacian as you would in SVC(kernel='rbf').
  • random_state. Nystroem picks landmarks uniformly at random by default. This is fine when classes are roughly balanced; for imbalanced data, you may want stratified sampling (manually pick landmarks).

Landmark selection strategies, ranked by quality. Uniform random is the default, but four alternatives are routinely better when you can afford them.

  1. Uniform random. $O(m)$ cost. Default; works on balanced, well-behaved data.
  2. K-means landmarks. Run $k$ -means on $X,$ use the $m$ centroids as landmarks. $O(nm \cdot d)$ cost (one $k$ -means pass). Almost always a 10-30% accuracy bump over uniform random; the geometry of K-means clusters mirrors the geometry of the data manifold, so you spend more landmarks where there’s more data.
  3. Leverage-score sampling. Sample landmark $i$ with probability proportional to $\ell_i = [K K^{-1}]_{ii}$ (the leverage score of point $i$ ). Theoretically the optimal sampling distribution; in practice expensive to compute exactly. Approximations exist.
  4. Greedy / determinantal point processes (DPP). Maximise $\det(K_{mm})$ over the choice of $m$ landmarks. Gives the most “spread out” landmark set. $O(nm^2)$ for greedy; exact DPP is exponential.

In my experience, K-means landmarks is the right default — it’s two extra lines of code, runs in seconds, and meaningfully improves the approximation for $m \leq 1000.$ For $m > 5000,$ uniform random catches up because the spectrum tail dominates anyway.

Sketch of when Nystrom is the right tool. $n$ between $10^4$ and $10^6;$ the data has clear geometric structure (so K-means landmarks make sense); you want a single model rather than retraining many times on subsamples; downstream task is a standard linear-model task (SVM, logistic regression, ridge, kernel PCA visualisation).

Random Fourier Features: Bochner’s theorem as the entry ticket#

The Nystrom method is data-dependent: it inspects your training set to pick landmarks. Random Fourier Features (RFF), due to Rahimi and Recht (2007), takes the opposite tack — a data-independent random feature map that works for any shift-invariant kernel.

The mathematical underpinning is one of the prettier theorems in harmonic analysis.

$$k(x - y) = \int_{\mathbb{R}^d} p(\omega)\, e^{i \omega^T (x - y)}\, d\omega.$$

In other words, every shift-invariant PD kernel is the characteristic function of some probability distribution (after normalising $p$ to integrate to 1). The kernel is the inverse Fourier transform of its spectral density.

The spectral density of common kernels.

  • RBF / Gaussian $k(\delta) = \exp(-\gamma \|\delta\|^2):$ spectral density is Gaussian, $p(\omega) = (4\pi\gamma)^{-d/2} \exp(-\|\omega\|^2 / (4\gamma)).$
  • Laplacian $k(\delta) = \exp(-\|\delta\|_1 / \ell):$ spectral density is Cauchy-like.
  • Matern-$\nu$ : spectral density is Student’s-$t$ -like with $\nu$ degrees of freedom.
  • Cauchy kernel $k(\delta) = \prod_i (1 + \delta_i^2)^{-1}:$ spectral density is Laplace.

Notice the trade: a kernel with very localised similarity ($\gamma$ large, narrow $k$ ) has a wide spectral density (high-frequency content). A kernel with very smooth, long-range similarity has a concentrated spectral density (low-frequency content). This is just the standard Fourier uncertainty: narrow in space $\Leftrightarrow$ wide in frequency.

$$k(x - y) = \int p(\omega) \cos(\omega^T (x - y))\, d\omega = \mathbb{E}_{\omega \sim p}\big[ \cos(\omega^T x - \omega^T y) \big].$$ $$k(x - y) = \mathbb{E}_{\omega \sim p}\big[ \cos(\omega^T x)\cos(\omega^T y) + \sin(\omega^T x)\sin(\omega^T y) \big].$$ $$\hat\phi(x) = \sqrt{\frac{2}{D}} \big[ \cos(\omega_1^T x), \sin(\omega_1^T x), \dots, \cos(\omega_{D/2}^T x), \sin(\omega_{D/2}^T x) \big] \in \mathbb{R}^D.$$

Then $\hat\phi(x)^T \hat\phi(y) \to k(x - y)$ as $D \to \infty,$ with variance $O(1/D)$ at each point.

RFF concept: spectral density to frequency samples to cosine features

$$\hat\phi(x) = \sqrt{\frac{2}{D}} \big[ \cos(\omega_1^T x + b_1), \dots, \cos(\omega_D^T x + b_D) \big].$$

This is what sklearn.kernel_approximation.RBFSampler implements. Why does it work? Because $\mathbb{E}_b[\cos(\omega^T x + b)\cos(\omega^T y + b)] = \tfrac{1}{2}\cos(\omega^T (x - y))$ (averaging over the uniform phase kills the cross term), recovering Bochner up to constants.

The RFF algorithm step by step#

For the RBF kernel $k(\delta) = \exp(-\gamma \|\delta\|^2),$ the spectral density is $p(\omega) = \mathcal{N}(0, 2\gamma I_d).$ The full algorithm is then:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
import numpy as np

def rbf_rff_features(X, gamma, D, rng=None):
    """RFF for RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Step 1: sample D frequencies from N(0, 2*gamma * I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    # Step 2: sample D phases from Uniform[0, 2*pi]
    b = rng.uniform(0, 2 * np.pi, size=D)
    # Step 3: compute features
    Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    return Phi, (W, b)  # return W, b so you can transform test data

# Train
Phi_train, params = rbf_rff_features(X_train, gamma=0.5, D=1024, rng=0)
# Transform test using the *same* W, b
W, b = params
Phi_test = np.sqrt(2.0 / 1024) * np.cos(X_test @ W + b)

Three things to internalise:

  1. The frequencies $W$ are sampled once, at train time, and stored. You must use the same frequencies when transforming test data; otherwise the inner product doesn’t approximate the kernel.
  2. Cost per point: $O(D \cdot d).$ Same as a linear model evaluation. There’s no quadratic dependency on $n$ anywhere.
  3. Why $\sqrt{2/D}.$ This is the normalisation that makes $\hat\phi^T \hat\phi$ an unbiased estimator of $k.$ Without it you get the right shape but the wrong scale.
$$\Pr\!\left[\sup_{x, y \in X} |\hat\phi(x)^T \hat\phi(y) - k(x - y)| \geq \epsilon \right] \leq 2^8 (r \sigma_p / \epsilon)^2 \exp\!\left(-\frac{D \epsilon^2}{4(d+2)}\right),$$

where $\sigma_p^2 = \mathbb{E}_{\omega \sim p}[\|\omega\|^2].$ Squinting at this: error $\sim 1/\sqrt{D}$ uniformly, with logarithmic dependence on $r$ and $d.$ Doubling $D$ halves the standard error.

RFF kernel approximation quality as D increases

RFF in practice: a 50k-sample worked example#

Let me make this concrete. The MNIST-style problem: 50,000 training images, 784 features per image (28 × 28 pixels), binary classification (digit 3 vs digit 8). Three pipelines to compare:

  1. Exact RBF SVM. SVC(kernel='rbf', gamma=0.05). The gold standard for accuracy; the disaster for runtime.
  2. Nystrom + Logistic Regression. Nystroem(n_components=500) + LogisticRegression. Data-dependent low-rank.
  3. RFF + Logistic Regression. RBFSampler(n_components=1000) + LogisticRegression. Data-independent random features.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

GAMMA = 0.05

def time_and_score(pipe, X_tr, y_tr, X_te, y_te):
    t0 = time.time()
    pipe.fit(X_tr, y_tr)
    fit_t = time.time() - t0
    acc = pipe.score(X_te, y_te)
    return fit_t, acc

# 1) Exact RBF SVM
svm = SVC(kernel='rbf', gamma=GAMMA, C=1.0)
# 2) Nystrom + Logistic
nys = make_pipeline(
    Nystroem(kernel='rbf', gamma=GAMMA, n_components=500, random_state=0),
    LogisticRegression(max_iter=2000, C=1.0, solver='lbfgs'),
)
# 3) RFF + Logistic
rff = make_pipeline(
    RBFSampler(gamma=GAMMA, n_components=1000, random_state=0),
    LogisticRegression(max_iter=2000, C=1.0, solver='lbfgs'),
)

Indicative wall-clock numbers on a 2026-era laptop (8-core, 32GB RAM):

PipelineFit timeTest accuracy
Exact RBF SVM~17 min0.992
Nystrom (m=500) + Logistic~12 sec0.989
RFF (D=1000) + Logistic~8 sec0.987

You give up a small fraction of a percentage point of accuracy in exchange for ~100× faster training. On 500K points, exact SVM is intractable on commodity hardware while both approximations stay in the minutes. This is the trade that the field collectively decided was worth taking in the early 2010s; it’s the reason kernel methods didn’t die.

A few practical sklearn notes.

  • RBFSampler uses the phase-shifted-cosine form. That’s why its output is $D$ -dimensional, not $2D$ -dimensional. If you want the explicit sin/cos pair version, write it yourself (10 lines).
  • SkewedChi2Sampler and AdditiveChi2Sampler in sklearn.kernel_approximation are RFF-like maps for non-shift-invariant kernels used in computer-vision histogram comparison. Niche, but they exist.
  • Combine with SGDClassifier for streaming. RFF features are a fixed-dimensional embedding, so they slot straight into stochastic gradient descent. RFF + SGD scales to $n = 10^8.$
  • Normalise after the feature map. Both Nystrom and RFF can produce features with high dynamic range; a downstream StandardScaler or LogisticRegression(... solver='lbfgs') with appropriate C is needed.

Nystrom vs RFF: when to pick which#

You now have two tools that both turn an $O(n^3)$ kernel method into an $O(n)$ linear method. Which one should you reach for? The honest answer is “try both”, but here are the priors.

Pick Nystrom when:

  • You have a non-shift-invariant kernel. RFF requires Bochner’s theorem, which needs $k(x, y) = k(x - y).$ Polynomial kernels, string kernels, and most graph kernels are NOT shift-invariant; Nystrom works for any PD kernel.
  • The data has clear cluster structure and K-means landmarks make sense. Nystrom is data-dependent and exploits this; RFF cannot.
  • You want a deterministic approximation. With a fixed landmark set, the Nystrom map is deterministic; with a fixed $W, b,$ RFF is also deterministic, but the “good random seed” effect is much stronger for small $D.$
  • You need very high accuracy with modest $m \approx 100$ -$500.$ Empirically, Nystrom dominates RFF in this regime because the spectrum-truncation bound is tighter than the Monte Carlo bound.

Pick RFF when:

  • The kernel is shift-invariant (RBF, Laplacian, Matern, Cauchy). For these, RFF is the textbook choice.
  • You want streaming training. RFF features depend on $W, b$ which are sampled once and never touched again; you can stream $X$ through them and feed an SGD classifier. Nystrom requires the landmark set up front.
  • You want minimal hyperparameters. RFF has effectively one knob ($D$ ); Nystrom has two ($m$ and the landmark strategy).
  • You’re using GPUs and want easy batching. RFF is one matmul + one cos; trivially batched and JIT-able.

Nystrom vs RFF trade-off across approximation budgets

A meta-rule. For supervised learning with shift-invariant kernels in 2026, RFF + SGD is the path of least resistance. For non-shift-invariant kernels or settings where you’ll re-tune the kernel hyperparameter many times (so you want the landmark structure to carry information), Nystrom + L-BFGS is the play. For Bayesian / probabilistic models, neither of these is quite right — see the next section.

Large-scale GPs: sparse approximations briefly#

Nystrom and RFF were originally developed for frequentist kernel methods — SVM, kernel ridge regression, kernel PCA. For Gaussian process regression, the same scaling cliff applies: GP inference is $O(n^3)$ for the Cholesky of $K + \sigma^2 I.$ A whole sub-field grew up around scaling GPs to large $n,$ and the dominant family is sparse / inducing-point GPs. A high-level tour:

FITC (Fully Independent Training Conditional, Snelson and Ghahramani 2005). The earliest. Define $m$ inducing inputs $Z \in \mathbb{R}^{m \times d}$ and replace the training-set prior $K_{nn}$ with the Nystrom approximation $K_{nm} K_{mm}^{-1} K_{mn}$ — same matrix as classical Nystrom — plus a diagonal correction to fix the marginal variances. Inference becomes $O(nm^2),$ which is the same complexity as Nystrom + linear. Tends to under-estimate uncertainty in unexplored regions.

$$\mathcal{L}_{\text{VFE}} = \log \mathcal{N}(y; 0, Q_{nn} + \sigma^2 I) - \frac{1}{2\sigma^2}\, \mathrm{tr}(K_{nn} - Q_{nn}),$$

where $Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}.$ The first term is the usual marginal likelihood for a low-rank GP; the second is a regularisation term that pulls the inducing inputs toward configurations where the Nystrom approximation is accurate. VFE is now the default in gpytorch and GPflow.

$$\mathcal{L}_{\text{SVGP}} = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \mathrm{KL}[q(u) \| p(u)],$$

where $q(u) = \mathcal{N}(u; \mu, \Sigma)$ is a multivariate Gaussian over the inducing values and $q(f_i) = \int p(f_i | u) q(u)\, du$ is the implied marginal on the $i$ -th function value. The first sum decomposes by data point, allowing minibatching; the KL term is between two $m$ -dimensional Gaussians and computed once per gradient step.

In practice, the divisions are roughly:

  • $n \leq 10^4$ : exact GP. Use gpytorch.models.ExactGP. No approximation needed.
  • $10^4 \leq n \leq 10^5$ : FITC or VFE with $m = 100$ -$500$ inducing points. Both are “single batch” — fit the whole data per gradient step.
  • $10^5 \leq n \leq 10^7$ : SVGP with $m = 500$ -$2000$ inducing points and minibatches of $\sim 1024$ points. The workhorse for large GPs in 2026.
  • $n > 10^7$ : decision-time. Either further approximation (KISS-GP, structured kernel interpolation), or switch to deep learning entirely (which is what Part 8 will discuss).

The honest summary is that exact GPs are even more bottlenecked than exact SVMs — they require the Cholesky of the full Gram-plus-noise matrix, which doesn’t accept the same lazy tricks SVMs do. So sparse GPs have been a more critical research area than Nystrom-for-SVMs. If you’re going to use a GP on real data in 2026, you’re using SVGP whether you know it or not.

A unified picture#

Look at where this leaves us. Both Nystrom and RFF reduce the kernel method to a linear method on an $\mathbb{R}^D$ feature map, with $D \ll n.$ Both have similar error rates: $1/\sqrt{D}$ for RFF (Monte Carlo), spectral-tail-of-$K$ for Nystrom (data-dependent). Both are now thirty-line implementations in scikit-learn. The whole frontier of “scalable kernel methods” research from 2007 through 2015 was about variations of these two ideas:

  • Fastfood (Le, Sarlos, Smola 2013). RFF where the random matrix $W$ is replaced by a structured Hadamard-Gaussian-diagonal product so multiplication becomes $O(D \log d)$ instead of $O(Dd).$ Useful when $d$ is huge.
  • Quasi-Random Fourier Features. Replace iid $\omega \sim p$ with low-discrepancy (Sobol, Halton) sequences. Error decays at $1/D$ instead of $1/\sqrt{D}$ in the favorable regime.
  • Orthogonal Random Features (Yu et al. 2016). Force the $W$ rows to be orthogonal. Strict improvement over vanilla RFF, no extra cost.
  • Structured Spinners and Hadamard tricks. Various ways to make $W$ have low-storage / fast-multiplication structure.
  • Nystrom + leverage scores (Alaoui and Mahoney 2014). Sample landmarks with probability proportional to ridge leverage scores; matches the lower bound for any sampling-based approximation.
  • Doubly stochastic kernel methods (Dai et al. 2014). Randomise over both data and features, pushing RFF up to $n = 10^9.$ The core trick is to mini-batch the $W$ matrix too — only look at a small slice of frequencies per step, paired with downstream SGD, dropping the per-step cost from $O(D d)$ to $O(b d)$ ($b \ll D$ is the frequency batch). Provably converges to the exact kernel solution; in practice scales to $10^9$ data points and $10^5$ frequencies. This is as engineered as the RFF line gets.
  • EigenPro (Ma & Belkin 2017). Preconditioned SGD that solves kernel ridge regression directly, bypassing explicit low-rank approximation; runs to $n \approx 10^7$ on GPU. It’s not mutually exclusive with Nystrom — compute the preconditioner via Nystrom then SGD-solve — and that combo is one of the fastest KRR recipes around. This line is sometimes called “GPU-native kernel methods”: keep everything in matrix multiplies and element-wise ops, maximise GPU utilisation.
  • Falkon (Rudi, Carratino, Rosasco 2017). The engineering capstone of Nystrom + conjugate gradient + preconditioning, currently the SOTA library for large-scale KRR. The key trick is using $K_{mm}^{-1/2}$ as the CG preconditioner — sidestep explicit inversion while keeping fast convergence — letting a single GPU handle kernel ridge regression at $n = 10^7, m = 5 \times 10^4$ in a few hours. This is as engineered as the Nystrom line gets. If your project is essentially kernel ridge regression, reach for the Falkon library; it’s orders of magnitude faster than rolling your own on top of sklearn.

In 2026, the consensus practitioner stack is: RFF when shift-invariant kernel + want simplicity, Nystrom when non-shift-invariant or hand-tuned landmarks, SVGP when probabilistic, and deep learning when the data shape is so complex that no fixed kernel can capture it. Part 8 of this series will be on that last point — why deep learning ate kernel methods’ lunch, and where the kernel viewpoint comes back to bite even modern deep nets (think: NTK).

One last observation: random features had a comeback. After 2020, random-feature thinking resurfaced inside Transformers — Performers (Choromanski et al. 2021) used FAVOR+ to compress softmax attention from $O(n^2)$ to $O(n),$ and the core move is writing $\exp(q^T k / \sqrt d)$ in RFF form. This is a clean case of kernel-methods theory feeding deep learning back: Bochner 1933 → Rahimi-Recht 2007 → Performers 2021, one continuous lineage. The next part traces that line more systematically.

A similar revival list. Linformer, Linear Attention, Reformer, Nyströmformer — the last one literally has Nyström in its name — this whole cluster of 2020-2022 work used kernel approximation to drop attention from quadratic to linear. Each one, at heart, treats the attention matrix as a PD kernel matrix and approximates it. Which means the Transformer “long-context revolution” is, fundamentally, applied kernel-methods scalability research — a statistical tool from 2007 that became the core of large-model architectures a dozen years later. You don’t feel this lineage if you only read deep-learning papers; look back from the kernel-methods side and the whole story is coherent.

A mental model to close on. The whole kernel-methods series compressed into one sentence: the kernel trick lets you do linear methods in an infinite-dimensional space at the cost of one inner product ($K(x, y)$ ); but there is no free inner product — the bill is the $O(n^2)$ Gram matrix. Nystrom pays less by looking at a subset (data-dependent), RFF pays less by sampling (data-independent), SVGP pays less by variational decomposition (probabilistic view). Behind the three economies is the same fact: kernel methods in the big-data era have to be approximated, and the cost of approximation is far less than the accuracy lost. Once you see this, you see why 2007-2015 produced so many papers that were Nystrom and RFF variants — they were all answering the same question from different angles.

Two short pieces of advice for practitioners. First, pick the right family before tuning anything: non-shift-invariant kernel → Nystrom; shift-invariant + streaming → RFF; need uncertainty → SVGP. Any project that gets these three wrong is unrecoverable, no matter how hard you grid-search after. Second, always compare against a linear baseline: at $n > 10^5,$ a tuned logistic regression on handcrafted features is usually one or two percentage points behind RFF + logistic, and it trains 10× faster and is 10× more interpretable. Kernel methods are worth using, but don’t reach for them just because they’re “fancy” — only when a linear model genuinely cannot do the job. This rule held in 2007 and it holds in 2026.

A third one for the same list: prototype with the approximation first, then decide whether to pay for the exact run. A common antipattern in industry projects is “train an exact SVM for a week to see if the problem is solvable, find out it isn’t, throw it away”. The right order is reversed: spend 30 seconds on RFF + LR to see roughly where the accuracy ceiling sits, then decide whether a week on the exact version is worth it. The approximation almost never lags the exact version by more than 1-2 percentage points, which makes it a very cheap feasibility probe. This workflow saves enormous amounts of “theoretically workable but ROI too low” project time.

Two engineering practices that are easy to forget. First, the approximation’s hyperparameters cannot be inherited from the exact version. The $C, \gamma$ that tuned beautifully for an exact RBF SVM are usually suboptimal for Nystrom + LR — the downstream linear head’s capacity and regularisation scale are different, and you have to grid-search again. Second, report the variance from approximation in your CV: fix $m$ and run 5 different random_states, report mean ± std, not a single number. In the small-$m$ regime the std is routinely 0.5%-1.5%, larger than the gap between your downstream algorithm versions; without reporting it, your ablation is just noise.

A third often-forgotten practice: keep a holdout for the “approximate vs exact” comparison, don’t blow your entire budget on CV. The standard recipe is train/val for hyperparameter tuning and test for generalisation; once you introduce approximation, you also need a “sanity” holdout — run the exact kernel on a small subset ($n_{\text{sanity}} = 5000$ ) as an oracle and check how close the approximation’s predictions are to it. If the approximation is off by more than 1% on the sanity set, your $m / D$ is too small; do not scale up. This workflow saves a lot of “model shipped, model broken in production” incidents on new datasets.

One final engineering detail: incremental learning. Nystrom strictly does not support online learning — when new data arrives the landmarks are still the old ones; the feature map $\Phi_n$ can be partially reused but the downstream model has to be retrained. RFF does support it, because $W, b$ are sampled once and never touched again, and new data can stream through $\cos(X W + b)$ unchanged. So in a “model rolls daily” setting, RFF is almost the only option; if you want Nystrom to play well there, you have to go to streaming Nystrom, a newer and still-evolving research direction.

Written before jumping to Part 8. The core of this part is scale: when $n$ is too large for the exact kernel solution, what do you do. The next part swaps axes: when the function space itself is too complex for any fixed kernel to describe, what do you do. The first answer is “approximate the kernel”; the second is “learn the kernel” — that is, deep kernel learning, the two-way bridge between neural networks and kernels. The two questions look unrelated but are the two ends of a continuous spectrum: fixed kernel + big data (this part) vs learned kernel + big model (next part). All the hybrid forms in between — fixed bottom kernel + learned top combination, fixed top head + learned bottom embedding — are intermediate points on that spectrum.

The bridge between the two questions. One nice bridge: Nystrom’s inducing points $Z$ can become learnable parameters (which is what SVGP does), and one step further — can we let the “inducing feature map” also become a learnable neural network? Yes, that’s deep kernel learning (Wilson et al. 2016): use a neural network as $\phi(x),$ wrap an RBF kernel around it, and run SVGP on top. This path stitches “approximate kernel” and “learn kernel” together, and it’s a core motif of the next part. So this part ends precisely where the next one begins — the last mile of scalability leads into the first mile of learnability.

The same bridge explains why NTK / NNGP — the “infinite-width neural network equals GP” theory — suddenly exploded around 2018: as width goes to infinity, a neural network’s output covariance converges to an analytically writable kernel, and that kernel is the NTK. Once deep learning gets translated into a kernel method, all 30 years of kernel theory tooling (representer theorem, convergence analysis, spectral decomposition) becomes available again. The next part takes apart every girder of that bridge.

An overall view of kernel methods as a discipline. Nystrom and RFF marked the end of an era: 1995-2010 was the golden age of “exact kernel + small data”; after 2010 kernel methods did not disappear, they degraded into a style — its ideas (feature space, representer theorem, spectral view) seeped into neural networks (NTK, attention-as-kernel), generative models (MMD, kernel scores), reinforcement learning (kernel Bellman operators), and causal inference (kernel conditional independence). The next part will be specifically about that path — but the key point belongs here: kernel methods did not lose to deep learning, they became part of deep learning.

A slightly personal note. Writing my way to Part 7 of this series, I noticed my attitude toward the word “approximation” had changed. A decade ago I thought approximation was a fallback, a compromise; now I think approximation is the main road and exactness is the luxury — exact methods should be the default only when the data is small enough that exact still pays. Nystrom and RFF teach this lesson: instead of asking “can I solve this problem exactly”, ask “can I solve this problem cheaply at the precision I actually care about”. That question-substitution is one of the fundamental moments in machine learning maturing from an academic discipline into an engineering one.

A frequently-cited reflection. When Rahimi accepted the NeurIPS test-of-time award in 2017, he gave a talk about his worry over “alchemy” — that a lot of deep learning’s engineering progress was running far ahead of its theoretical explanation. RFF is a counterexample: its mathematics was completely clear at publication in 2007 (Bochner 1933 + Monte Carlo), which is exactly why it could stay relevant for a decade and be cited by countless follow-up works. The lesson on that road: if you can write the mathematical skeleton of an idea on three pages, it has a decade of life; if all you can say is “it works empirically”, it has two years. This whole series leans on a mathematical narrative rather than an engineering one, partly for this reason — the value of kernel methods is in the mathematical skeleton; understand the skeleton and the rest of the engineering follows.

What’s next#

Part 8 (the finale) covers Deep Kernel Learning vs Deep Learning: why neural networks won the empirical battle from 2012 onward, what the kernel view of deep learning (NTK, NNGP, GP-equivalent in the infinite-width limit) reveals about what a network is really doing, and the hybrid approaches (deep kernels, Gaussian process layers, kernelised attention) that resurrect kernels inside neural pipelines. We close the series by asking the honest question: in 2026, when should you actually use a kernel method instead of a deep model?


This is Part 7 of Kernel Methods (8 parts). Previous: Part 6 — Gaussian Processes · Next: Part 8 — Deep Kernel Learning vs Deep Learning

In this series

Kernel Methods 8 parts

  1. 01 Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
  2. 02 Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem
  3. 03 Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
  4. 04 Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
  5. 05 Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
  6. 06 Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
  7. 07 Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features you are here
  8. 08 Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub