Deep Learning

Feb 7, 2026 ML Math Derivations 32 min read

ML Math Derivations (19): Neural Networks and Backpropagation

How does a neural network learn? This article derives forward propagation, the chain-rule mechanics of backpropagation, vanishing/exploding gradients, and initialization strategies (Xavier, He).

Dec 10, 2025 Recommendation Systems 56 min read

Recommendation Systems (4): CTR Prediction and Click-Through Rate Modeling

A practical guide to CTR prediction models -- from Logistic Regression and Factorization Machines to DeepFM, xDeepFM, DCN, AutoInt, and FiBiNet -- with intuitive explanations and PyTorch implementations.

Dec 7, 2025 Recommendation Systems 40 min read

Recommendation Systems (3): Deep Learning Foundations

From MLPs to embeddings to NeuMF, YouTube DNN, and Wide & Deep -- a progressive walkthrough of the deep learning building blocks that power modern recommenders, with verified architectures and runnable PyTorch code.

Oct 26, 2025 NLP 32 min read

NLP (6): GPT and Generative Language Models

From GPT-1 to GPT-4: understand autoregressive language modeling, decoding strategies (greedy, beam search, top-k, top-p), in-context learning, and build a chatbot with HuggingFace.

Oct 21, 2025 NLP 32 min read

NLP (5): BERT and Pretrained Models

How BERT made bidirectional pretraining the default in NLP. We unpack the architecture, the 80/10/10 masking rule, fine-tuning recipes, and the RoBERTa/ALBERT/ELECTRA family with HuggingFace code.

Oct 16, 2025 NLP 34 min read

NLP (4): Attention Mechanism and Transformer

From the bottleneck of Seq2Seq to Attention Is All You Need. Build intuition for scaled dot-product attention, multi-head attention, positional encoding, masking, and assemble a complete Transformer in PyTorch.

Oct 11, 2025 NLP 30 min read

NLP (3): RNN and Sequence Modeling

How RNNs, LSTMs, and GRUs process sequences with memory. We derive vanishing gradients from first principles, build a character-level text generator, and implement a Seq2Seq translator in PyTorch.

Oct 6, 2025 NLP 16 min read

NLP (2): Word Embeddings and Language Models

Understand how Word2Vec, GloVe, and FastText turn words into vectors that capture meaning. Learn the math, train your own embeddings with Gensim, and connect embeddings to language models.

Oct 1, 2025 NLP 34 min read

NLP (1): Introduction and Text Preprocessing

A first-principles introduction to NLP and text preprocessing. We trace the four eras of the field, build the cleaning-to-vectorization pipeline by hand, and unpack the math behind tokenization, TF-IDF, n-grams, and …

Jul 30, 2025 Standalone 26 min read

Reparameterization Trick & Gumbel-Softmax: A Deep Dive

Make sense of the reparameterization trick and Gumbel-Softmax: why gradients can flow through sampling, how temperature trades bias for variance, and the practical pitfalls of training discrete latent variables …

May 31, 2025 Transfer Learning 50 min read

Transfer Learning (6): Multi-Task Learning

Train one model on multiple tasks simultaneously. Covers hard vs. soft parameter sharing, gradient conflicts (PCGrad, GradNorm, CAGrad), auxiliary task design, and a complete multi-task framework with dynamic weight …

May 25, 2025 Transfer Learning 44 min read

Transfer Learning (5): Knowledge Distillation

Compress large teacher models into small student models without losing much accuracy. Covers dark knowledge, temperature scaling, response-based / feature-based / relation-based distillation, self-distillation, and a …

May 19, 2025 Transfer Learning 50 min read

Transfer Learning (4): Few-Shot Learning

Learn new concepts from a handful of examples. Covers the N-way K-shot protocol, metric learning (Siamese, Prototypical, Matching, Relation networks), meta-learning (MAML, Reptile), episodic training, miniImageNet …

May 13, 2025 Transfer Learning 50 min read

Transfer Learning (3): Domain Adaptation

A practical guide to domain adaptation: covariate shift, label shift, DANN with gradient reversal, MMD alignment, CORAL, self-training, AdaBN, and a complete DANN implementation.

May 7, 2025 Transfer Learning 54 min read

Transfer Learning (2): Pre-training and Fine-tuning

Why pre-training learns a powerful prior from unlabeled data and how fine-tuning adapts it to your task. Covers contrastive learning, masked language models, discriminative learning rates, layer freezing, catastrophic …

May 1, 2025 Transfer Learning 46 min read

Transfer Learning (1): Fundamentals and Core Concepts

A beginner-friendly guide to transfer learning fundamentals: why it works, formal definitions, taxonomy, negative transfer, and a complete feature-transfer implementation with MMD domain adaptation.

Apr 16, 2025 Linear Algebra 34 min read

Essence of Linear Algebra (16): Linear Algebra in Deep Learning

Deep learning is large-scale matrix computation. From backpropagation as the chain rule in matrix form, to im2col turning convolutions into GEMM, to attention as soft retrieval via dot products -- see every core DL …

Mar 26, 2025 Linear Algebra 40 min read

Essence of Linear Algebra (13): Tensors and Multilinear Algebra

From scalars to high-dimensional data cubes -- tensors generalize vectors and matrices to arbitrary dimensions. Learn CP and Tucker decomposition, see how tensors compress neural networks and power recommender systems, …

Dec 15, 2024 Time Series Forecasting 36 min read

Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting

Informer reduces Transformer self-attention complexity from O(L^2) to O(L log L) via ProbSparse attention, distilling, and a one-shot generative decoder. Full math, PyTorch code, and ETT/weather benchmarks.

Nov 30, 2024 Time Series Forecasting 38 min read

Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture

N-BEATS combines deep learning expressiveness with classical decomposition interpretability. Basis function expansion, double residual stacking, and M4 competition analysis with full PyTorch code.

Nov 15, 2024 Time Series Forecasting 36 min read

Time Series Forecasting (6): Temporal Convolutional Networks (TCN)

TCNs use causal dilated convolutions for parallel training and exponential receptive fields. Complete PyTorch implementation with traffic flow and sensor data case studies.

Oct 31, 2024 Time Series Forecasting 28 min read

Time Series Forecasting (5): Transformer Architecture for Time Series

Transformers for time series, end to end: encoder-decoder anatomy, temporal positional encoding, the O(n^2) attention bottleneck, decoder-only forecasting, and patching. With variants (Autoformer, FEDformer, Informer, …

Oct 16, 2024 Time Series Forecasting 28 min read

Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies

Self-attention, multi-head attention, and positional encoding for time series. Step-by-step math, PyTorch implementations, and visualization techniques for interpretable forecasting.

Oct 1, 2024 Time Series Forecasting 32 min read

Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs

GRU distills LSTM into two gates for faster training and 25% fewer parameters. Learn when GRU beats LSTM, with formulas, benchmarks, PyTorch code, and a decision matrix.

Sep 16, 2024 Time Series Forecasting 30 min read

Time Series Forecasting (2): LSTM — Gate Mechanisms and Long-Term Dependencies

How LSTM's forget, input, and output gates mitigate the vanishing gradient problem. Complete PyTorch code for time series forecasting with practical tuning tips.

Jun 30, 2023 Standalone 12 min read

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

A practitioner's tour of Transformer position encoding: why attention needs it at all, how sinusoidal/learned/relative/RoPE/ALiBi schemes differ, and which one to pick when long-context extrapolation matters.

Jun 27, 2023 Standalone 24 min read

Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting

Build a VAE from scratch in PyTorch. Covers the ELBO objective, reparameterization trick, posterior collapse fixes, beta-VAE, and a complete training pipeline.

Sep 18, 2022 Optimization Theory 40 min read

Optimization (4): Learning Rate and Schedules

A practitioner's guide to the single most important hyperparameter: why a too-large LR makes training diverge, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …

Sep 16, 2022 Optimization Theory 24 min read

Optimization (3): The Gradient Descent Family from SGD to AdamW

One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …

Dec 30, 2021 Kernel Methods 38 min read

Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Deep kernel learning combines neural feature extractors with kernel methods. When to pick kernels over deep nets, hyperparameter tuning playbook, common failure modes, and a final 5-step kernel decision flowchart.