Algorithm
Reparameterization Trick & Gumbel-Softmax: A Deep Dive
Make sense of the reparameterization trick and Gumbel-Softmax: why gradients can flow through sampling, how temperature trades bias for variance, and the practical pitfalls of training discrete latent variables …
Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization
From the least-squares view to the Moore-Penrose pseudoinverse, the four Penrose conditions, computation via SVD, truncated SVD, Tikhonov regularization, and modern applications from PCA to LoRA.
Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting
Informer reduces Transformer complexity from O(L^2) to O(L log L) via ProbSparse attention, distilling, and a one-shot generative decoder. Full math, PyTorch code, and ETT/weather benchmarks.
Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture
N-BEATS combines deep learning expressiveness with classical decomposition interpretability. Basis function expansion, double residual stacking, and M4 competition analysis with full PyTorch code.
Time Series Forecasting (6): Temporal Convolutional Networks (TCN)
TCNs use causal dilated convolutions for parallel training and exponential receptive fields. Complete PyTorch implementation with traffic flow and sensor data case studies.
Time Series Forecasting (5): Transformer Architecture for Time Series
Transformers for time series, end to end: encoder-decoder anatomy, temporal positional encoding, the O(n^2) attention bottleneck, decoder-only forecasting, and patching. With variants (Autoformer, FEDformer, Informer, …
Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies
Self-attention, multi-head attention, and positional encoding for time series. Step-by-step math, PyTorch implementations, and visualization techniques for interpretable forecasting.
Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs
GRU distills LSTM into two gates for faster training and 25% fewer parameters. Learn when GRU beats LSTM, with formulas, benchmarks, PyTorch code, and a decision matrix.
Time Series Forecasting (2): LSTM — Gate Mechanisms and Long-Term Dependencies
How LSTM's forget, input, and output gates mitigate the vanishing gradient problem. Complete PyTorch code for time series forecasting with practical tuning tips.
Time Series Forecasting (1): Traditional Statistical Models
ARIMA, SARIMA, VAR, GARCH, Prophet, exponential smoothing and the Kalman filter, derived from a single state-space view. With Box-Jenkins workflow, ACF/PACF identification, and runnable Python.
Position Encoding Brief: From Sinusoidal to RoPE and ALiBi
A practitioner's tour of Transformer position encoding: why attention needs it at all, how sinusoidal/learned/relative/RoPE/ALiBi schemes differ, and which one to pick when long-context extrapolation matters.
Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting
Build a VAE from scratch in PyTorch. Covers the ELBO objective, reparameterization trick, posterior collapse fixes, beta-VAE, and a complete training pipeline.
Optimization (12): Discrete and Global Optimization
When variables are integer-valued or the problem is non-convex with multiple basins, classical convex methods stop working. This article surveys what does work: integer programming via branch-and-bound, LP relaxation gap …
Optimization (11): Non-Convex Optimization and Saddle Escape
Why does SGD work for training neural networks despite the non-convex landscape? We prove perturbed GD escapes strict saddles in polynomial time, derive convergence under the Polyak-Lojasiewicz condition, and survey what …
Optimization (10): Stochastic Optimization and Variance Reduction
Why does SGD work? We prove the O(1/sqrt(T)) convex rate and the O(1/(mu T)) strongly convex rate from the gradient noise budget. Then variance reduction: SVRG, SAGA, Katyusha — methods that get to the linear rate of …
Optimization (9): Interior-Point Methods and Self-Concordant Barriers
How interior-point methods became the default solver for convex programming: replace inequalities with a logarithmic barrier, parametrize the central path, and apply Newton's method. Includes the self-concordance …
Optimization (8): Lagrangian Duality and KKT Conditions
How constraints become prices: the Lagrangian, weak duality, Slater's condition for strong duality, the KKT system as necessary and sufficient optimality conditions, and why the SVM dual is much smaller than the SVM primal. …
Optimization (7): Second-Order Methods
Second-order methods break the sqrt(kappa) barrier by using curvature. We prove Newton's quadratic local convergence, derive BFGS from a secant condition + low-rank update, walk through L-BFGS's two-loop recursion that …
Optimization (6): Composite Optimization and Proximal Methods
A systematic walk through the proximal operator: convex-analysis basics, the Moreau envelope, closed-form proxes, and how they power ISTA, FISTA, ADMM, LASSO, and SVM in practice.
Optimization (5): Acceleration Beyond Nesterov
What does it really mean for a first-order method to be optimal? We prove a tight lower bound matching Nesterov's rate, derive Polyak's Heavy-Ball method as the continuous-time limit, build a unified Lyapunov framework …
Optimization (4): Learning Rate and Schedules
A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …
Optimization (3): The Gradient Descent Family from SGD to AdamW
One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …
Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration
Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity sets the convergence rate and uniqueness of the minimizer, and Nesterov acceleration replaces kappa …
Optimization (1): Convex Analysis Foundations
The geometric and analytic toolkit that unlocks the rest of the series: convex sets, convex functions, the conjugate (Fenchel) transform, subgradients, and the indicator/support function pair. Includes complete proofs of …
Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide
Deep kernel learning combines neural feature extractors with kernel methods. When to pick kernels over deep nets, hyperparameter tuning playbook, common failure modes, and a final 5-step kernel decision flowchart.
Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features
Exact kernel methods cost O(n^3) time in the number of training points n. Nystrom approximation and Random Fourier Features bring that back to time linear in n without giving up the kernel trick's expressive power.
Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
Gaussian Processes turn kernels into a Bayesian model — posterior with uncertainty, marginal likelihood for hyperparameters, and the kernel as a prior over functions.
Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
The classic algorithms, kernelized — SVM's dual form, Kernel PCA's eigendecomposition in feature space, and Kernel Ridge's closed-form solution. With sklearn code and worked examples.
Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
A tour of the kernels you'll actually use: RBF (Gaussian), polynomial, linear, Matern, periodic, sigmoid. When to pick which, hyperparameter intuition, and how kernels combine.
Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
Reproducing Kernel Hilbert Space — the function space where kernel methods live. The reproducing property, the representer theorem, and why finite-data optimization works in infinite dimensions.
Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem
What makes a function a valid kernel? Positive-definiteness, the Gram matrix test, and Mercer's theorem — the spectral decomposition that justifies the kernel trick.
Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
Linear algorithms can't capture non-linear patterns. The kernel trick lets you keep the linear algorithm's elegance AND model non-linear relationships — without writing the high-dimensional feature map. Part 1 of an …