Categories

Algorithm

Jul 30, 2025 Standalone 26 min read

Reparameterization Trick & Gumbel-Softmax: A Deep Dive

Make sense of the reparameterization trick and Gumbel-Softmax: why gradients can flow through sampling, how temperature trades bias for variance, and the practical pitfalls of training discrete latent variables …

Jul 28, 2025 Standalone 20 min read

Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization

From the least-squares view to the Moore-Penrose pseudoinverse, the four Penrose conditions, computation via SVD, truncated SVD, Tikhonov regularization, and modern applications from PCA to LoRA.

Dec 15, 2024 Time Series Forecasting 36 min read

Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting

Informer reduces self-attention complexity from O(L^2) to O(L log L) in the sequence length L via ProbSparse attention, self-attention distilling, and a one-shot generative decoder. Full math, PyTorch code, and ETT/weather benchmarks.

Nov 30, 2024 Time Series Forecasting 38 min read

Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture

N-BEATS combines deep learning expressiveness with classical decomposition interpretability. Basis function expansion, double residual stacking, and M4 competition analysis with full PyTorch code.

Nov 15, 2024 Time Series Forecasting 36 min read

Time Series Forecasting (6): Temporal Convolutional Networks (TCN)

TCNs use causal dilated convolutions to train in parallel while their receptive field grows exponentially with depth. Complete PyTorch implementation with traffic flow and sensor data case studies.

Oct 31, 2024 Time Series Forecasting 28 min read

Time Series Forecasting (5): Transformer Architecture for Time Series

Transformers for time series, end to end: encoder-decoder anatomy, temporal positional encoding, the O(n^2) attention bottleneck, decoder-only forecasting, and patching. With variants (Autoformer, FEDformer, Informer, …

Oct 16, 2024 Time Series Forecasting 28 min read

Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies

Self-attention, multi-head attention, and positional encoding for time series. Step-by-step math, PyTorch implementations, and visualization techniques for interpretable forecasting.

Oct 1, 2024 Time Series Forecasting 32 min read

Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs

GRU distills LSTM into two gates for faster training and 25% fewer parameters. Learn when GRU beats LSTM, with formulas, benchmarks, PyTorch code, and a decision matrix.

Sep 16, 2024 Time Series Forecasting 30 min read

Time Series Forecasting (2): LSTM — Gate Mechanisms and Long-Term Dependencies

How LSTM's forget, input, and output gates mitigate the vanishing gradient problem. Complete PyTorch code for time series forecasting with practical tuning tips.

Sep 1, 2024 Time Series Forecasting 28 min read

Time Series Forecasting (1): Traditional Statistical Models

ARIMA, SARIMA, VAR, GARCH, Prophet, exponential smoothing and the Kalman filter, derived from a single state-space view. With Box-Jenkins workflow, ACF/PACF identification, and runnable Python.

Jun 30, 2023 Standalone 12 min read

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

A practitioner's tour of Transformer position encoding: why attention needs it at all, how sinusoidal/learned/relative/RoPE/ALiBi schemes differ, and which one to pick when long-context extrapolation matters.

Jun 27, 2023 Standalone 24 min read

Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting

Build a VAE from scratch in PyTorch. Covers the ELBO objective, reparameterization trick, posterior collapse fixes, beta-VAE, and a complete training pipeline.

Sep 30, 2022 Optimization Theory 38 min read

Optimization (12): Discrete and Global Optimization

When variables are integer-valued or the problem is non-convex with multiple basins, classical convex methods stop working. This article surveys what does work: integer programming via branch-and-bound, LP relaxation gap …

Sep 29, 2022 Optimization Theory 22 min read

Optimization (11): Non-Convex Optimization and Saddle Escape

Why does SGD work for training neural networks despite the non-convex landscape? We prove perturbed GD escapes strict saddles in polynomial time, derive convergence under the Polyak-Lojasiewicz condition, and survey what …

Sep 27, 2022 Optimization Theory 20 min read

Optimization (10): Stochastic Optimization and Variance Reduction

Why does SGD work? We prove the O(1/sqrt(T)) convex rate and the O(1/(mu T)) strongly convex rate from the gradient noise budget. Then variance reduction: SVRG, SAGA, Katyusha — methods that get to the linear rate of …

Sep 26, 2022 Optimization Theory 22 min read

Optimization (9): Interior-Point Methods and Self-Concordant Barriers

How interior-point methods became the default solver for convex programming: replace inequalities with a logarithmic barrier, parametrize the central path, and apply Newton's method. Includes the self-concordance …

Sep 24, 2022 Optimization Theory 20 min read

Optimization (8): Lagrangian Duality and KKT Conditions

How constraints become prices: the Lagrangian, weak duality, Slater's condition for strong duality, the KKT system as necessary and sufficient optimality conditions, and why the SVM dual is much smaller than the SVM primal. …

Sep 22, 2022 Optimization Theory 24 min read

Optimization (7): Second-Order Methods

Second-order methods break the sqrt(kappa) barrier by using curvature. We prove Newton's quadratic local convergence, derive BFGS from a secant condition + low-rank update, walk through L-BFGS's two-loop recursion that …

Sep 21, 2022 Optimization Theory 32 min read

Optimization (6): Composite Optimization and Proximal Methods

A systematic walk through the proximal operator: convex-analysis basics, the Moreau envelope, closed-form proxes, and how they power ISTA, FISTA, ADMM, LASSO, and SVM in practice.

Sep 20, 2022 Optimization Theory 26 min read

Optimization (5): Acceleration Beyond Nesterov

What does it really mean for a first-order method to be optimal? We prove a tight lower bound matching Nesterov's rate, derive Polyak's Heavy-Ball method from its continuous-time limit, build a unified Lyapunov framework …

Sep 18, 2022 Optimization Theory 40 min read

Optimization (4): Learning Rate and Schedules

A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …

Sep 16, 2022 Optimization Theory 24 min read

Optimization (3): The Gradient Descent Family from SGD to AdamW

One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …

Sep 15, 2022 Optimization Theory 28 min read

Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity sets the convergence rate and uniqueness of the minimizer, and Nesterov acceleration replaces kappa …

Sep 14, 2022 Optimization Theory 26 min read

Optimization (1): Convex Analysis Foundations

The geometric and analytic toolkit that unlocks the rest of the series: convex sets, convex functions, the conjugate (Fenchel) transform, subgradients, and the indicator/support function pair. Includes complete proofs of …

Dec 30, 2021 Kernel Methods 38 min read

Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Deep kernel learning combines neural feature extractors with kernel methods. When to pick kernels over deep nets, hyperparameter tuning playbook, common failure modes, and a final 5-step kernel decision flowchart.

Dec 24, 2021 Kernel Methods 52 min read

Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features

Exact kernel methods cost O(n^3) time in the number of training points n. Nystrom approximation and Random Fourier Features bring that down to time linear in n without giving up the kernel trick's expressive power.

Dec 19, 2021 Kernel Methods 34 min read

Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference

Gaussian Processes turn kernels into a Bayesian model — posterior with uncertainty, marginal likelihood for hyperparameters, and the kernel as a prior over functions.

Dec 14, 2021 Kernel Methods 44 min read

Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression

The classic algorithms, kernelized — SVM's dual form, Kernel PCA's eigendecomposition in feature space, and Kernel Ridge's closed-form solution. With sklearn code and worked examples.

Dec 9, 2021 Kernel Methods 44 min read

Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More

A tour of the kernels you'll actually use: RBF (Gaussian), polynomial, linear, Matern, periodic, sigmoid. When to pick which, hyperparameter intuition, and how kernels combine.

Dec 4, 2021 Kernel Methods 44 min read

Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods

Reproducing Kernel Hilbert Space — the function space where kernel methods live. The reproducing property, the representer theorem, and why finite-data optimization works in infinite dimensions.

Nov 29, 2021 Kernel Methods 76 min read

Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem

What makes a function a valid kernel? Positive-definiteness, the Gram matrix test, and Mercer's theorem — the spectral decomposition that justifies the kernel trick.

Nov 24, 2021 Kernel Methods 66 min read

Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms

Linear algorithms can't capture non-linear patterns. The kernel trick lets you keep the linear algorithm's elegance AND model non-linear relationships — without writing the high-dimensional feature map. Part 1 of an …