<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kernel-Methods on Chen Kai Blog</title><link>https://www.chenk.top/en/series/kernel-methods/</link><description>Recent content in Kernel-Methods on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 30 Dec 2021 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/series/kernel-methods/index.xml" rel="self" type="application/rss+xml"/><item><title>Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide</title><link>https://www.chenk.top/en/kernel-methods/08-deep-kernels-vs-dl/</link><pubDate>Thu, 30 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/08-deep-kernels-vs-dl/</guid><description>&lt;p>In 2026, why are you still reading about kernel methods? Aren&amp;rsquo;t transformers supposed to have eaten the entire ML stack? Yes and no. Transformers eat the headlines, but kernels still eat the corners — the regimes with 200 samples, the regimes where the model has to publish calibrated error bars, the regimes where a physicist needs to know &lt;em>which&lt;/em> basis function caused the prediction. This final part is the field manual: when kernels actually win, how to debug them when they don&amp;rsquo;t, and how to bolt them on top of a neural network when you want the best of both worlds.&lt;/p></description></item><item><title>Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features</title><link>https://www.chenk.top/en/kernel-methods/07-large-scale-kernels/</link><pubDate>Fri, 24 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/07-large-scale-kernels/</guid><description>&lt;p>You want to train an RBF SVM on a million-image classification set. The Gram matrix is &lt;span class="math-inline">$10^6 \times 10^6$&lt;/span>
 doubles, which is &lt;strong>8 TB&lt;/strong>. That number alone — eight terabytes of RAM, just to &lt;em>store&lt;/em> the kernel — is why most working data scientists who learned kernel methods in a stats class quietly never reach for them on real production workloads. The kernel trick gives you an infinite-dimensional feature space for the cost of one dot product per pair; the bill arrives when you have &lt;span class="math-inline">$n^2$&lt;/span>
 pairs.&lt;/p></description></item><item><title>Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference</title><link>https://www.chenk.top/en/kernel-methods/06-gaussian-processes/</link><pubDate>Sun, 19 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/06-gaussian-processes/</guid><description>&lt;p>Kernel ridge regression gives you a number. You feed it &lt;span class="math-inline">$x_*$&lt;/span>
, it returns &lt;span class="math-inline">$\hat{y}_* = 23.7$&lt;/span>
. End of story. But you wanted to &lt;em>act&lt;/em> on that prediction — maybe schedule a delivery, dose a patient, place a bet — and the single number is not enough. Tomorrow&amp;rsquo;s temperature being &amp;ldquo;25°C&amp;rdquo; is useful; &amp;ldquo;very likely 25°C, 95% chance between 22 and 28&amp;rdquo; is &lt;em>actionable&lt;/em>. Every decision under uncertainty needs the second one. Gaussian Processes are the cleanest way to upgrade a kernel method from &amp;ldquo;point predictor&amp;rdquo; to &amp;ldquo;distribution predictor&amp;rdquo;, and they do it without abandoning a single line of the kernel math from the previous five parts.&lt;/p></description></item><item><title>Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression</title><link>https://www.chenk.top/en/kernel-methods/05-kernel-algorithms/</link><pubDate>Tue, 14 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/05-kernel-algorithms/</guid><description>&lt;p>Your features are two-dimensional, your data is clearly a circle inside a circle, and &lt;code>LinearSVC&lt;/code> is at 50% accuracy with the wide-eyed look of an algorithm that genuinely believes a straight line is the answer. You stare at the scatter plot, you stare at the model, and somewhere in the back of your head the words &lt;em>kernel SVM&lt;/em> surface. 
You type &lt;code>kernel='rbf'&lt;/code>, the accuracy jumps to 98%, and the rest of the afternoon you wonder what exactly just happened — and why the same trick also gives you a Kernel PCA that unfolds a Swiss roll and a Kernel Ridge regressor that fits a sine wave with three lines of code.&lt;/p></description></item><item><title>Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More</title><link>https://www.chenk.top/en/kernel-methods/04-common-kernels/</link><pubDate>Thu, 09 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/04-common-kernels/</guid><description>&lt;p>You type &lt;code>SVC(kernel='rbf')&lt;/code> in scikit-learn for the first time. What did you set &lt;code>gamma&lt;/code> to? &lt;code>'scale'&lt;/code>? &lt;code>'auto'&lt;/code>? You scrolled past those defaults without thinking. Three months later your model is overfitting, your Gram matrix looks like the identity, and you have no idea which knob is wrong. Most &amp;ldquo;kernel tuning&amp;rdquo; debt is really &lt;em>kernel choice&lt;/em> debt — you picked the default kernel for the wrong reason, and now no amount of grid search will save you.&lt;/p></description></item><item><title>Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods</title><link>https://www.chenk.top/en/kernel-methods/03-rkhs/</link><pubDate>Sat, 04 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/03-rkhs/</guid><description>&lt;p>If your eyes glaze over the moment a lecturer writes &amp;ldquo;RKHS&amp;rdquo; on the board, this part of the series is for you. RKHS is not a club of three intimidating letters — it is a function space, and once you see what lives inside it, kernel methods stop feeling like magic and start feeling like linear algebra you already know.&lt;/p>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/kernel-methods/03-rkhs/fig1_hilbert_space_concept.png" alt="A Hilbert-space cover for Part 3 of the kernel-methods series" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p></description></item><item><title>Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem</title><link>https://www.chenk.top/en/kernel-methods/02-kernel-math-foundations/</link><pubDate>Mon, 29 Nov 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/02-kernel-math-foundations/</guid><description>&lt;p>A week into kernel-SVM hacking I wrote what felt like a perfectly reasonable similarity function — &lt;code>tanh(1.5 * x.dot(y) - 2.0)&lt;/code>. It compiled, it ran, the math looked symmetric. Then sklearn coughed up &lt;code>ValueError: kernel matrix is not positive semidefinite&lt;/code> and the optimiser produced a model that was &lt;em>worse&lt;/em> than guessing.&lt;/p>
&lt;p>That error message turned out to hide one of the deepest results in 20th-century analysis. &amp;ldquo;Positive-definite&amp;rdquo; is not a checkbox — it is the entire reason the kernel trick is allowed to exist. If your function is PSD, there exists a Hilbert space where it is a real inner product; if it is not, you are pretending to live in a space that nobody built. This post unpacks that statement, builds the operational tests, derives Mercer&amp;rsquo;s theorem, and works through enough numerical examples that the next time you see the failure message you will know exactly which line of math your kernel violated.&lt;/p></description></item><item><title>Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms</title><link>https://www.chenk.top/en/kernel-methods/01-why-kernels/</link><pubDate>Wed, 24 Nov 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/01-why-kernels/</guid><description>&lt;p>The first time I tried to fit a logistic regression to a dataset of two interlocking spirals, I burned an afternoon tweaking the regularizer, swapping solvers, and rescaling features — convinced that I was doing something wrong. The accuracy hovered around 50%. That is the noise floor of a coin flip; my model was, in a very literal sense, learning nothing.&lt;/p>
&lt;p>The model was not buggy. The data was simply not the kind of object a straight line can describe. No amount of &lt;code>C&lt;/code>, &lt;code>class_weight&lt;/code>, or &lt;code>tol&lt;/code> was going to change that. Once you have seen this failure mode, you start noticing it everywhere — in customer-churn data with non-monotone relationships, in image classification before deep learning, in any regression where the trend bends. A linear algorithm has a hard ceiling, and you only break through that ceiling by changing the kind of object the algorithm operates on.&lt;/p></description></item></channel></rss>