<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Chen Kai Blog</title><link>https://www.chenk.top/en/</link><description>Recent content on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 26 Mar 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/index.xml" rel="self" type="application/rss+xml"/><item><title>Terraform for AI Agents (8): End-to-End — research-agent-stack in One Apply</title><link>https://www.chenk.top/en/terraform-agents/08-end-to-end-walkthrough/</link><pubDate>Thu, 26 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/08-end-to-end-walkthrough/</guid><description>&lt;p>This is the article where everything from articles 2 through 7 lands in one place. By the end you&amp;rsquo;ll have run &lt;code>terraform apply&lt;/code> once and produced a complete, observable, budgeted agent runtime stack on Alibaba Cloud. About 31 resources, ~7 minutes of wall clock.&lt;/p>
&lt;p>The stack we&amp;rsquo;re building:&lt;/p>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/terraform-agents/08-end-to-end-walkthrough/fig1_full_stack.png" alt="research-agent-stack: every box, one terraform apply" loading="lazy" decoding="async">
 
&lt;/figure>
&lt;/p>
&lt;p>Five layers — edge, compute, memory, platform, ops — composed from the modules we built across this series.&lt;/p></description></item><item><title>Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms</title><link>https://www.chenk.top/en/terraform-agents/07-observability-and-cost-control/</link><pubDate>Tue, 24 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/07-observability-and-cost-control/</guid><description>&lt;p>Agents are non-deterministic, multi-step, and call expensive APIs. The combination means you cannot debug them after the fact unless you instrumented them on day one. This article wires three pipelines through Terraform — logs, traces, metrics — into a unified dashboard, then layers four alarms that have actually fired and saved my projects in production.&lt;/p>
&lt;p>By the end you have one DingTalk channel that pings before the bill explodes, the latency climbs, the error rate spikes, or some agent starts looping on itself.&lt;/p></description></item><item><title>Terraform for AI Agents (6): LLM Gateway and Secrets Management</title><link>https://www.chenk.top/en/terraform-agents/06-llm-gateway-and-secrets/</link><pubDate>Sun, 22 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/06-llm-gateway-and-secrets/</guid><description>&lt;p>A pattern I see repeatedly in immature agent stacks: each agent has its own copy of &lt;code>OPENAI_API_KEY&lt;/code> in its own &lt;code>.env&lt;/code> file. Sometimes the same key, sometimes different ones, sometimes a colleague&amp;rsquo;s personal key from when they prototyped. When the bill arrives nobody can tell which agent caused which token spend, and when a key leaks (it always does) you&amp;rsquo;re playing whack-a-mole across a dozen &lt;code>.env&lt;/code> files.&lt;/p>
&lt;p>This article ends that. We build one &lt;strong>LLM gateway&lt;/strong> that:&lt;/p></description></item><item><title>Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory</title><link>https://www.chenk.top/en/terraform-agents/05-storage-for-agent-memory/</link><pubDate>Fri, 20 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/05-storage-for-agent-memory/</guid><description>&lt;p>An agent&amp;rsquo;s memory is the part most tutorials hand-wave. &amp;ldquo;Just put the embeddings in Pinecone, the sessions in Postgres, the screenshots in S3.&amp;rdquo; On Aliyun, all three exist as managed services, and Terraform-provisioning them right is the difference between &amp;ldquo;memory works&amp;rdquo; and &amp;ldquo;we lost three weeks of conversation history because the disk filled up at 4am&amp;rdquo;.&lt;/p>
&lt;p>This article covers all three layers, the Terraform for each, and the boring-but-critical lifecycle and backup rules.&lt;/p></description></item><item><title>Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?</title><link>https://www.chenk.top/en/terraform-agents/04-compute-for-agent-runtime/</link><pubDate>Wed, 18 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/04-compute-for-agent-runtime/</guid><description>&lt;p>The single most important architecture decision in an agent system is &lt;em>where the agent loop process actually runs&lt;/em>. There are exactly three good answers on Aliyun. Picking the wrong one isn&amp;rsquo;t catastrophic — you can migrate later — but it costs you weeks of unnecessary scaffolding.&lt;/p>
&lt;p>This article walks through all three with working Terraform, the cost crossover, and the operational gotchas.&lt;/p>
&lt;h2 id="the-three-patterns">The three patterns&lt;/h2>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/terraform-agents/04-compute-for-agent-runtime/fig1_three_compute_patterns.png" alt="Three places to run an agent: ECS, ACK, FC" loading="lazy" decoding="async">
 
&lt;/figure>
&lt;/p></description></item><item><title>Terraform for AI Agents (3): A Reusable VPC and Security Baseline</title><link>https://www.chenk.top/en/terraform-agents/03-vpc-and-security-baseline/</link><pubDate>Mon, 16 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/03-vpc-and-security-baseline/</guid><description>&lt;p>This article builds the single most copied piece of Terraform in my agent projects: a &lt;code>vpc-baseline&lt;/code> module that gives every later component (ECS, RDS, OpenSearch, ACK) a sane place to land.&lt;/p>
&lt;p>By the end you&amp;rsquo;ll have:&lt;/p>
&lt;ul>
&lt;li>A VPC across three availability zones in one region&lt;/li>
&lt;li>Six subnets (one public + one private per zone) with non-overlapping CIDRs&lt;/li>
&lt;li>A NAT gateway with EIP for private-subnet outbound to LLM APIs&lt;/li>
&lt;li>Three security groups stacked by tier (ALB → agent runtime → memory)&lt;/li>
&lt;li>Three KMS customer master keys, one per data domain (memory, secrets, logs)&lt;/li>
&lt;li>A clean module interface: name + CIDR + zones in, IDs out&lt;/li>
&lt;/ul>
&lt;p>It&amp;rsquo;s about 200 lines of HCL all-in. Type it once, refer to it forever.&lt;/p></description></item><item><title>Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS</title><link>https://www.chenk.top/en/terraform-agents/02-provider-and-state-setup/</link><pubDate>Sat, 14 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/02-provider-and-state-setup/</guid><description>&lt;p>This is the article where you stop reading and start typing. By the end you will have:&lt;/p>
&lt;ol>
&lt;li>The &lt;code>alicloud&lt;/code> Terraform provider installed and version-pinned&lt;/li>
&lt;li>Authentication wired up — through the right method, not the convenient one&lt;/li>
&lt;li>Remote state on an OSS bucket with Tablestore-based locking&lt;/li>
&lt;li>Three workspaces (&lt;code>dev&lt;/code>, &lt;code>staging&lt;/code>, &lt;code>prod&lt;/code>) that share a backend but isolate state&lt;/li>
&lt;li>A working &lt;code>terraform plan&lt;/code> against an empty config&lt;/li>
&lt;/ol>
&lt;p>Nothing here provisions an agent yet. We&amp;rsquo;re laying the foundation that every later article assumes.&lt;/p></description></item><item><title>Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents</title><link>https://www.chenk.top/en/terraform-agents/01-why-terraform-for-agents/</link><pubDate>Thu, 12 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/01-why-terraform-for-agents/</guid><description>&lt;p>I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a &lt;code>tmux&lt;/code> session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.&lt;/p>
&lt;p>The fourth started life as &lt;code>terraform apply&lt;/code>. It was the only one I haven&amp;rsquo;t lost a weekend to.&lt;/p></description></item><item><title>Aliyun PAI (5): Designer vs Model Gallery — When the GUIs Actually Earn Their Keep</title><link>https://www.chenk.top/en/aliyun-pai/05-pai-designer-vs-quickstart/</link><pubDate>Mon, 09 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/05-pai-designer-vs-quickstart/</guid><description>&lt;p>The first four articles were about the underlying primitives — DSW, DLC, EAS — that you orchestrate with Python. This one is about the two GUI products that wrap those primitives and ship a runnable thing for users who do not want to write Python: &lt;strong>PAI-Designer&lt;/strong> for drag-and-drop tabular pipelines, and &lt;strong>Model Gallery&lt;/strong> for zero-code open-source model deployment and fine-tuning. They are not what serious engineers reach for first, but in two specific situations they are obviously the right answer.&lt;/p></description></item><item><title>Aliyun PAI (4): PAI-EAS — Model Serving, Cold Starts, and the TPS Lie</title><link>https://www.chenk.top/en/aliyun-pai/04-pai-eas-model-serving/</link><pubDate>Sun, 08 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/04-pai-eas-model-serving/</guid><description>&lt;p>EAS is where the money goes. DSW costs you a few hundred RMB a month for dev. DLC costs you in spikes. EAS bills 24/7 because someone might call your endpoint, and that &amp;ldquo;minimum replica count&amp;rdquo; line in the autoscaler config is the single highest-leverage knob in the whole platform. This article is what I wish I&amp;rsquo;d known the day before we shipped our first production endpoint.&lt;/p>
&lt;h2 id="what-eas-is-per-the-docs">What EAS is, per the docs&lt;/h2>
&lt;p>The official &amp;ldquo;EAS overview&amp;rdquo; frames it as: &amp;ldquo;deploy trained models as online inference services or AI web applications, with heterogeneous resources, automatic scaling, one-click stress testing, canary releases, and real-time monitoring&amp;rdquo;. The two things to underline:&lt;/p></description></item><item><title>Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain</title><link>https://www.chenk.top/en/aliyun-pai/03-pai-dlc-distributed-training/</link><pubDate>Sat, 07 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/03-pai-dlc-distributed-training/</guid><description>&lt;p>A DSW notebook is for one engineer on one GPU. The moment you need eight GPUs across two nodes, or the moment training runs longer than the eight hours you&amp;rsquo;ll keep the tab open, you switch to &lt;strong>DLC&lt;/strong>. DLC is PAI&amp;rsquo;s job-submission front-end for a managed Kubernetes cluster: you describe what you want (image, command, resources, data mounts), DLC schedules pods, runs them to completion, persists logs, and tells you what happened. The docs call this &lt;em>Deep Learning Containers&lt;/em>; we just say &amp;ldquo;DLC job&amp;rdquo;.&lt;/p></description></item><item><title>Aliyun PAI (2): PAI-DSW — Notebooks That Don't Eat Your Weights</title><link>https://www.chenk.top/en/aliyun-pai/02-pai-dsw-notebook/</link><pubDate>Fri, 06 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/02-pai-dsw-notebook/</guid><description>&lt;p>Every time I onboard a new ML engineer to PAI the first day looks the same. They start a DSW instance, &lt;code>pip install&lt;/code> their world, train for an hour, restart the kernel for some reason, and then ask me where their model file went. The honest answer — &amp;ldquo;in &lt;code>/root&lt;/code> on a node that no longer exists&amp;rdquo; — is the kind of lesson you only need to learn once. 
This article is the version of that lesson you read in advance.&lt;/p></description></item><item><title>Aliyun PAI (1): Platform Overview and the Product Family Map</title><link>https://www.chenk.top/en/aliyun-pai/01-platform-overview/</link><pubDate>Thu, 05 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/01-platform-overview/</guid><description>&lt;p>If your team trains or serves any model on Alibaba Cloud, sooner or later you will end up in the PAI console. PAI is the umbrella; underneath it sit the actual workhorses — a notebook product, a distributed training service, a model-serving service, plus a couple of GUI/quick-deploy layers on top. After about eighteen months of running real LLM workloads on it for an AI marketing platform, this series is the field guide I wish someone had handed me before I shipped my first endpoint.&lt;/p></description></item><item><title>Aliyun Bailian (5): Qwen-TTS for Multilingual Voice</title><link>https://www.chenk.top/en/aliyun-bailian/05-qwen-tts-voice/</link><pubDate>Sun, 01 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-bailian/05-qwen-tts-voice/</guid><description>&lt;p>The reason every Chinese-language product I&amp;rsquo;ve worked on ends up calling Qwen-TTS-Flash isn&amp;rsquo;t price — there are cheaper TTS APIs. It&amp;rsquo;s that Qwen-TTS is the only one that handles &lt;strong>mainland Chinese dialects&lt;/strong> (Cantonese, Sichuanese, Wu) and English in the same SDK, with voices that don&amp;rsquo;t sound like a 2019 customs announcement. 
After about six months of using it for a marketing-video voice-over pipeline, this is what I wish someone had told me on day one.&lt;/p></description></item><item><title>Aliyun Bailian (4): Wanxiang Video Generation End-to-End</title><link>https://www.chenk.top/en/aliyun-bailian/04-wanxiang-video-generation/</link><pubDate>Sat, 28 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-bailian/04-wanxiang-video-generation/</guid><description>&lt;p>Wanxiang is the API that has done the most for our marketing pipeline and caused the most production surprises. The model is genuinely good — &lt;code>wan2.5-t2v-plus&lt;/code> produces 720p clips that pass for an actual video team&amp;rsquo;s output most of the time — but the surface around it is async, native-protocol, has expiring URLs, and rate-limits in non-obvious ways. This article is the version of the docs that has been through six months of &amp;ldquo;why is this happening at 2am&amp;rdquo; tickets.&lt;/p></description></item><item><title>Aliyun Bailian (3): Qwen-Omni for Video, Audio, and Image Understanding</title><link>https://www.chenk.top/en/aliyun-bailian/03-qwen-omni-multimodal/</link><pubDate>Fri, 27 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-bailian/03-qwen-omni-multimodal/</guid><description>&lt;p>Of all the Bailian models, Qwen-Omni is the one that has pulled me out of the most product-roadmap holes. &amp;ldquo;Can you tell me what&amp;rsquo;s happening in this 2-minute promo video?&amp;rdquo; used to be a 3-week project involving frame extraction, captioning per frame, and a stitch step. With Qwen-Omni it is one HTTP request. But the docs are sparse on the gotchas, and there is one (streaming is mandatory) that has cost more than one team a half-day. 
Let&amp;rsquo;s not have that be you.&lt;/p></description></item><item><title>Aliyun Bailian (2): The Qwen LLM API in Production</title><link>https://www.chenk.top/en/aliyun-bailian/02-qwen-llm-api/</link><pubDate>Thu, 26 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-bailian/02-qwen-llm-api/</guid><description>&lt;p>This is the article in the series where most of the production wins live. The other models are interesting; the LLMs are what every product I have shipped on Bailian has called every minute of every day. The official Qwen API reference is dense and complete; this article is the readable companion that picks one path through it.&lt;/p>
&lt;h2 id="pick-the-right-qwen-variant-for-the-workload">Pick the right Qwen variant for the workload&lt;/h2>
&lt;p>The Qwen family is large. Most teams overspend by defaulting to &lt;code>qwen-max&lt;/code> everywhere. Most teams underspend on quality by defaulting to &lt;code>qwen-turbo&lt;/code>. The right answer is &amp;ldquo;match variant to job&amp;rdquo;:&lt;/p></description></item><item><title>Aliyun Bailian (1): Platform Overview and First Request</title><link>https://www.chenk.top/en/aliyun-bailian/01-platform-overview/</link><pubDate>Wed, 25 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-bailian/01-platform-overview/</guid><description>&lt;p>If you ship anything that touches Chinese-language users, sooner or later you will end up calling a Bailian model. Qwen-Max is the cheapest sane way to get GPT-4-class Chinese understanding, the Wanxiang video models are the only production-grade text-to-video API I can buy with a Chinese invoice, and Qwen-TTS-Flash is the only TTS that handles Cantonese and Sichuanese without sounding like a customs announcement. After about a year of running these in production for an AI-marketing platform, this series is what I wish someone had handed me on day one.&lt;/p></description></item><item><title>ML Math Derivations (20): Regularization and Model Selection</title><link>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</link><pubDate>Sun, 08 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A 100-million-parameter network trained on 50,000 images &lt;em>should&lt;/em> overfit catastrophically. Modern deep networks generalise anyway. &lt;strong>Why?&lt;/strong> Two ingredients: &lt;em>regularisation&lt;/em> (techniques that constrain capacity) and &lt;em>generalisation theory&lt;/em> (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: &lt;em>why does learning generalise?&lt;/em>&lt;/p></description></item><item><title>ML Math Derivations (19): Neural Networks and Backpropagation</title><link>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</link><pubDate>Sat, 07 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A single perceptron cannot solve XOR. Stack enough of them with nonlinear activations and you obtain a &lt;em>universal function approximator&lt;/em>. The remaining question is how such a network learns from data. The answer — &lt;strong>backpropagation&lt;/strong>, an efficient application of the chain rule that recycles intermediate results during a single backward sweep — is the engine behind every deep learning library written in the last forty years. Understanding it mathematically reveals two further truths: why deep networks suffer from vanishing or exploding gradients, and why the choice of weight initialization is much less arbitrary than it first appears.&lt;/p></description></item><item><title>ML Math Derivations (18): Clustering Algorithms</title><link>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</link><pubDate>Fri, 06 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is &lt;strong>clustering&lt;/strong>, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: &lt;em>what does &amp;ldquo;similar&amp;rdquo; even mean?&lt;/em> Every clustering algorithm is, at heart, a different answer to that question &amp;ndash; a different geometric, probabilistic, or graph-theoretic prior on what a &amp;ldquo;group&amp;rdquo; is.&lt;/p>
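&lt;p>A hedged sketch of the simplest such answer: &amp;ldquo;similar&amp;rdquo; means close in squared Euclidean distance, which is exactly the geometric prior K-means commits to. The toy points and starting centers below are invented for illustration.&lt;/p>

```python
# Minimal K-means in pure Python: the "similar means close in
# squared Euclidean distance" answer to the clustering question.
# Toy data and initial centers are made up for illustration.

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            clusters[j].append(p)
        # update step: each center moves to its cluster mean
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, clusters = kmeans(pts, [(0.0, 0.0), (5.0, 5.0)])
```

&lt;p>Swap &lt;code>dist2&lt;/code> for a different metric and you have changed the algorithm&amp;rsquo;s answer to the &amp;ldquo;what is a group&amp;rdquo; question.&lt;/p>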
&lt;p>&lt;strong>What you will learn:&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (17): Dimensionality Reduction and PCA</title><link>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</link><pubDate>Thu, 05 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Feed a clustering algorithm $10{,}000$-dimensional data and it will most likely fail &amp;ndash; not because the algorithm is broken, but because &lt;strong>high-dimensional space is a hostile environment for distance-based learning&lt;/strong>. Volumes evaporate into thin shells, the ratio of nearest- to farthest-neighbour distances tends to $1$, and &amp;ldquo;closeness&amp;rdquo; stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.&lt;/p></description></item><item><title>ML Math Derivations (16): Conditional Random Fields</title><link>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</link><pubDate>Wed, 04 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Named entity recognition, POS tagging, information extraction &amp;ndash; every one of these tasks asks you to label each element of a sequence. HMMs (&lt;a href="https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/">Part 15&lt;/a>) attack this problem &lt;strong>generatively&lt;/strong> by modelling the joint distribution $P(\mathbf{X},\mathbf{Y})$, but to make the joint factorise they pay a steep price: each observation is assumed independent of everything except its own hidden label. In real text, whether &lt;em>bank&lt;/em> is a noun or a verb depends on the preceding word, the following word, the suffix, capitalisation, dictionary lookups &amp;ndash; all of these features at once.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (15): Hidden Markov Models</title><link>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</link><pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</guid><description>&lt;p>You hear footsteps behind you in a fog. You cannot see the walker, only the sounds. From the rhythm and pitch &amp;ndash; short, soft, hurried &amp;ndash; can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is &lt;em>any&lt;/em> sequence of sounds under your model of how walking works?&lt;/p>
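&lt;p>The fog scenario can be written down directly: three hidden gaits, two audible sound types, and a forward pass that answers the &amp;ldquo;how likely is any sequence of sounds&amp;rdquo; question. A minimal sketch, with every probability invented for illustration:&lt;/p>

```python
# The footsteps scenario as a tiny HMM: three hidden gaits, two
# observable sound types. All numbers are made up for illustration.

states = ["walking", "running", "limping"]
obs_symbols = ["soft", "loud"]

pi = [0.6, 0.3, 0.1]                  # initial gait probabilities
A = [[0.8, 0.1, 0.1],                 # transition: row = current gait
     [0.2, 0.7, 0.1],
     [0.3, 0.1, 0.6]]
B = [[0.7, 0.3],                      # emission: P(sound | gait)
     [0.2, 0.8],
     [0.9, 0.1]]

def forward(obs):
    """P(observation sequence) via the forward algorithm, O(N^2 T)."""
    o0 = obs_symbols.index(obs[0])
    alpha = [pi[i] * B[i][o0] for i in range(3)]
    for t in range(1, len(obs)):
        o = obs_symbols.index(obs[t])
        alpha = [
            B[j][o] * sum(alpha[i] * A[i][j] for i in range(3))
            for j in range(3)
        ]
    return sum(alpha)

p = forward(["soft", "soft", "loud"])
```

&lt;p>The brute-force alternative sums over every gait sequence; the forward recursion shares those sub-sums across time, which is where the exponent collapses.&lt;/p>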
&lt;p>These are the &lt;strong>three problems of HMMs&lt;/strong>, and the surprise is that all three reduce to one trick: write the joint $P(\mathbf{O}, \mathbf{I})$ as a product of local factors along time, then &lt;strong>share sub-computations across time&lt;/strong> with dynamic programming. Brute force costs $O(N^T)$. Forward-Backward, Viterbi, and Baum-Welch all cost $O(N^2 T)$. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (14): Variational Inference and Variational EM</title><link>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</link><pubDate>Mon, 02 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</guid><description>&lt;p>When the posterior $p(\mathbf{z}\mid\mathbf{x})$ is intractable, you have two roads. &lt;strong>Sampling&lt;/strong> (MCMC) walks a Markov chain whose stationary distribution is the posterior — eventually exact, but slow and hard to diagnose. &lt;strong>Variational inference&lt;/strong> (VI) instead picks a simple family $\mathcal{Q}$ of distributions and finds the member $q^\star\in\mathcal{Q}$ that lies closest to the true posterior. 
Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (13): EM Algorithm and GMM</title><link>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</link><pubDate>Sun, 01 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</guid><description>&lt;p>When data carries hidden structure &amp;ndash; a cluster label you never observed, a missing feature, a topic you cannot directly see &amp;ndash; maximum likelihood becomes painful. The log of a sum has no closed form, and gradient methods get tangled in the latent variables. The &lt;strong>EM algorithm&lt;/strong> sidesteps the difficulty with a deceptively simple idea: alternate between &lt;em>guessing&lt;/em> the hidden variables under a posterior (E-step) and &lt;em>fitting&lt;/em> the parameters as if those guesses were true (M-step). Each iteration is mathematically guaranteed to push the likelihood up. This post derives EM from first principles, proves the monotone-ascent property via Jensen&amp;rsquo;s inequality, and works through its most famous application: &lt;strong>Gaussian Mixture Models (GMM)&lt;/strong> &amp;ndash; the soft, elliptical generalisation of K-means.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (12): XGBoost and LightGBM</title><link>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</link><pubDate>Sat, 31 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</guid><description>&lt;p>XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles &amp;mdash; on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. They share the same backbone (gradient-boosted trees, Part 11) but make very different engineering bets:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>XGBoost&lt;/strong> sharpens the &lt;em>math&lt;/em>: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.&lt;/li>
&lt;li>&lt;strong>LightGBM&lt;/strong> sharpens the &lt;em>systems&lt;/em>: it bins features into a small histogram, grows trees leaf-by-leaf, throws away uninformative samples (GOSS) and bundles mutually exclusive sparse features (EFB).&lt;/li>
&lt;/ul>
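&lt;p>The &amp;ldquo;closed-form score&amp;rdquo; in the XGBoost bullet fits in a few lines. &lt;code>g&lt;/code> and &lt;code>h&lt;/code> are sums of the first and second derivatives of the loss over the samples in a node; &lt;code>lam&lt;/code> and &lt;code>gamma&lt;/code> are the regularisation knobs. A sketch with invented numbers, not library code:&lt;/p>

```python
# XGBoost's closed-form leaf weight and split gain, as a sketch.
# g_sum, h_sum: summed first / second derivatives of the loss over a node.
# lam, gamma: the L2 and per-leaf regularisation knobs. Toy numbers only.

def leaf_weight(g_sum, h_sum, lam):
    # optimal leaf value: w* = -G / (H + lambda)
    return -g_sum / (h_sum + lam)

def split_gain(gl, hl, gr, hr, lam, gamma):
    # structure-score improvement from splitting a node into (L, R)
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

gain = split_gain(-4.0, 2.0, 3.0, 2.5, 1.0, 0.0)
```

&lt;p>Raising &lt;code>gamma&lt;/code> makes marginal splits score negative, which is the formula-level reason that knob prunes trees.&lt;/p>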
&lt;p>The result is two tools that look interchangeable from the API but behave very differently when $N$ or $d$ becomes large. This post derives every formula behind those choices so you can read a tuning guide and know &lt;em>why&lt;/em> each knob exists.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (11): Ensemble Learning</title><link>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</link><pubDate>Fri, 30 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</guid><description>&lt;p>Why does a committee of mediocre classifiers outperform a single brilliant one? The answer is unromantic but precise: averaging cuts variance, sequential reweighting cuts bias, and a little randomisation breaks the correlation that would otherwise destroy both effects. This post derives the mathematics behind that picture &amp;mdash; bias&amp;ndash;variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimisation of exponential loss, and gradient boosting as gradient descent in function space.&lt;/p>
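&lt;p>The &amp;ldquo;averaging cuts variance&amp;rdquo; claim is easy to check numerically. A minimal sketch, with an idealised estimator standing in for a real classifier; note it assumes fully independent committee members, which bagging only approximates:&lt;/p>

```python
# Averaging cuts variance: the mean of k independent noisy estimators
# has roughly 1/k the variance of a single one. Illustrative numbers.
import random

random.seed(0)

def estimator():
    # one "mediocre classifier": true value 0, noise variance 1
    return random.gauss(0.0, 1.0)

def variance(samples):
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)

single = [estimator() for _ in range(20000)]
committee = [
    sum(estimator() for _ in range(16)) / 16.0
    for _ in range(20000)
]
# variance(single) comes out near 1.0, variance(committee) near 1/16
```

&lt;p>Correlated members shrink that 1/16 back toward 1, which is exactly why random forests inject randomisation.&lt;/p>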
&lt;p>By the end you should be able to look at any ensemble method and say &lt;em>what it is reducing, why it works, and when it will fail.&lt;/em>&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (10): Semi-Naive Bayes and Bayesian Networks</title><link>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</link><pubDate>Thu, 29 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie &amp;ndash; it buys training in a single pass over the data, yet classifiers that admit a few dependencies (tree structures, small graphs) systematically beat Naive Bayes by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from &amp;ldquo;no dependencies&amp;rdquo; (Naive Bayes) to &amp;ldquo;all dependencies&amp;rdquo; (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (9): Naive Bayes</title><link>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</link><pubDate>Wed, 28 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook:&lt;/strong> A spam filter that trains in milliseconds, scales to a million features, has &lt;em>no hyperparameters worth tuning&lt;/em>, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption — every feature is independent given the class — and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding &lt;em>why&lt;/em> is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (8): Support Vector Machines</title><link>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</link><pubDate>Tue, 27 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> You have two clouds of points and infinitely many lines that separate them. Which line is &amp;ldquo;best&amp;rdquo;? SVM gives a startlingly geometric answer: the line that sits in the middle of the &lt;em>widest empty corridor&lt;/em> between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and &amp;ndash; almost as a free gift &amp;ndash; the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (7): Decision Trees</title><link>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</link><pubDate>Mon, 26 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us &lt;em>which&lt;/em> question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (6): Logistic Regression and Classification</title><link>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</link><pubDate>Sun, 25 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Linear regression maps inputs to any real number — but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a &lt;em>classification&lt;/em> algorithm, and its math underpins every neuron in every modern neural network.&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.&lt;/li>
&lt;li>How cross-entropy loss falls out of maximum likelihood estimation in two lines.&lt;/li>
&lt;li>Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.&lt;/li>
&lt;li>The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.&lt;/li>
&lt;li>L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.&lt;/li>
&lt;li>Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.&lt;/li>
&lt;/ul>
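&lt;p>As a preview of the bullets above, here is a minimal NumPy sketch (the function names are mine, not the article&amp;rsquo;s) of the sigmoid, its clean derivative, and the cross-entropy gradient that falls out of maximum likelihood:&lt;/p>

```python
import numpy as np

def sigmoid(z):
    # numerically stable sigmoid: maps any real score into (0, 1)
    return np.where(z >= 0,
                    1.0 / (1.0 + np.exp(-z)),
                    np.exp(z) / (1.0 + np.exp(z)))

def sigmoid_grad(z):
    # the "clean" derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def cross_entropy_grad(X, y, w):
    # gradient of the mean negative log-likelihood: X^T (sigma(Xw) - y) / n
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)
```

&lt;p>Note the residual form of the gradient &amp;ndash; prediction minus label, pushed back through the features &amp;ndash; which mirrors linear regression&amp;rsquo;s gradient with the sigmoid inserted.&lt;/p>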
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Calculus: chain rule, partial derivatives.&lt;/li>
&lt;li>Linear algebra: matrix multiplication, transpose.&lt;/li>
&lt;li>Probability: Bernoulli and categorical distributions, likelihood.&lt;/li>
&lt;li>Familiarity with &lt;a href="https://www.chenk.top/en/Machine-Learning-Mathematical-Derivations-5-Linear-Regression/">Part 5: Linear Regression&lt;/a>.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-from-linear-models-to-probabilistic-classification">1. From Linear Models to Probabilistic Classification&lt;/h2>
&lt;h3 id="11-the-problem-with-raw-linear-output">1.1 The Problem with Raw Linear Output&lt;/h3>
&lt;p>Linear regression gives us $\hat y = \mathbf{w}^\top \mathbf{x}$, which is unbounded. For classification, two things go wrong:&lt;/p></description></item><item><title>Mathematical Derivation of Machine Learning (5): Linear Regression</title><link>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</link><pubDate>Sat, 24 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean &lt;em>regression&lt;/em>, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning &amp;ndash; not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: &lt;strong>fit a line, but in the right space.&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (4): Convex Optimization Theory</title><link>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</link><pubDate>Fri, 23 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to &lt;em>some&lt;/em> optimization problem.&lt;/p>
&lt;p>Among all such problems, &lt;strong>convex optimization holds a privileged place&lt;/strong>. The defining property is so strong it almost feels like cheating: every local minimum is automatically a global minimum, and a handful of well-understood algorithms come with airtight convergence guarantees. The whole reason we treat &amp;ldquo;convex&amp;rdquo; as a green flag and &amp;ldquo;non-convex&amp;rdquo; as a yellow one comes down to this single fact.&lt;/p></description></item><item><title>ML Math Derivations (3): Probability Theory and Statistical Inference</title><link>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</link><pubDate>Thu, 22 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>In 1912, Ronald Fisher introduced &lt;strong>maximum likelihood estimation&lt;/strong> in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: &lt;em>if a parameter setting makes the observed data extremely likely, that parameter setting is probably right&lt;/em>. Almost every modern learning algorithm — from logistic regression to large language models — is a descendant of this idea.&lt;/p>
&lt;p>But likelihood alone is not enough. To use it we need a vocabulary for uncertainty (probability spaces, distributions), guarantees that empirical quantities track population ones (laws of large numbers, central limit theorem), and tools for incorporating prior knowledge (Bayesian inference). This article assembles those pieces into a coherent foundation for everything that follows.&lt;/p></description></item><item><title>ML Math Derivations (2): Linear Algebra and Matrix Theory</title><link>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</guid><description>&lt;h2 id="why-this-chapter-and-whats-different">Why this chapter, and what&amp;rsquo;s different&lt;/h2>
&lt;p>If you have already worked through a standard linear-algebra course you have seen most of these objects. &lt;strong>This chapter is not that course.&lt;/strong> It is the &lt;em>ML practitioner&amp;rsquo;s slice&lt;/em> of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.&lt;/p>
&lt;p>Concretely the goals are:&lt;/p>
&lt;ol>
&lt;li>Build a &lt;strong>geometric intuition&lt;/strong> for what matrices &lt;em>do&lt;/em> (rotate, stretch, project, kill).&lt;/li>
&lt;li>Learn the four decompositions that show up everywhere &amp;ndash; spectral, &lt;strong>SVD&lt;/strong>, QR, Cholesky &amp;ndash; and &lt;em>which one to reach for&lt;/em>.&lt;/li>
&lt;li>Master enough &lt;strong>matrix calculus&lt;/strong> to derive any neural-net gradient on the back of an envelope.&lt;/li>
&lt;/ol>
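&lt;p>In the spirit of goal 2, here is a hedged one-screen sketch (the matrix is a toy example of mine) of the SVD as rotate&amp;ndash;stretch&amp;ndash;rotate, plus the rank-1 truncation that underlies PCA:&lt;/p>

```python
import numpy as np

# any matrix factors as U @ diag(s) @ Vt: rotate, stretch along axes, rotate again
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
U, s, Vt = np.linalg.svd(A)

# the singular values s are the stretch factors; reconstruction is exact
A_rebuilt = U @ np.diag(s) @ Vt
assert np.allclose(A, A_rebuilt)

# keeping only the top singular direction gives the best rank-1
# approximation in the least-squares sense (Eckart-Young)
A1 = s[0] * np.outer(U[:, 0], Vt[0])
```

&lt;p>For this symmetric positive-definite toy matrix the singular values coincide with the eigenvalues (4 and 2), which is exactly the spectral&amp;ndash;SVD connection the chapter develops.&lt;/p>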
&lt;p>We skim the algebra of row reduction, determinants by cofactor, and abstract vector-space proofs. If you need those, the references at the bottom give the standard treatments. Here, every concept comes back to a picture or a line of NumPy.&lt;/p></description></item><item><title>Solving Constrained Mean-Variance Portfolio Optimization Using Spiral Optimization</title><link>https://www.chenk.top/en/standalone/solving-constrained-mean-variance-portfolio-optimization-pro/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/solving-constrained-mean-variance-portfolio-optimization-pro/</guid><description>&lt;p>Markowitz&amp;rsquo;s mean-variance model is elegant until you add real trading constraints: &amp;ldquo;if you buy a stock at all, hold at least 5% of it&amp;rdquo; and &amp;ldquo;pick exactly 10 names from the S&amp;amp;P 500.&amp;rdquo; The closed-form quadratic program quietly mutates into a &lt;em>mixed-integer nonlinear program&lt;/em> (MINLP), and the standard solver chain (Lagrange multipliers, KKT conditions, interior-point methods) stops working. The paper reviewed here applies the &lt;strong>Spiral Optimization Algorithm&lt;/strong> (SOA), a population-based metaheuristic, to this problem and shows it can find competitive feasible solutions where gradient methods fail outright.&lt;/p></description></item><item><title>ML Math Derivations (1): Introduction and Mathematical Foundations</title><link>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</link><pubDate>Tue, 20 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</guid><description>&lt;h2 id="what-this-chapter-does">What this chapter does&lt;/h2>
&lt;p>In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: &lt;strong>a system that has never been told the rules of a language can still recover them, given enough examples.&lt;/strong> Why?&lt;/p>
&lt;p>The answer is not a trick of engineering &amp;ndash; it is a theorem. In this chapter we build, from first principles, the part of mathematics that explains &lt;em>when&lt;/em> learning from data is possible, &lt;em>how much data&lt;/em> is required, and &lt;em>what fundamentally limits&lt;/em> what any algorithm can do.&lt;/p></description></item><item><title>Recommendation Systems (16): Industrial Architecture and Best Practices</title><link>https://www.chenk.top/en/recommendation-systems/16-industrial-practice/</link><pubDate>Thu, 15 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/16-industrial-practice/</guid><description>&lt;blockquote>
&lt;p>The hardest part of a production recommendation system is not the model. It is the &lt;strong>system around the model&lt;/strong>: the feature store that prevents training/serving skew, the canary deployment that catches a regression before it hits 100M users, the orchestration that meets a 100ms p95 latency budget while running four ML models in sequence. This final article describes the architecture that every major tech company has converged on &amp;ndash; and the trade-offs hiding inside each layer.&lt;/p></description></item><item><title>Recommendation Systems (15): Real-Time Recommendation and Online Learning</title><link>https://www.chenk.top/en/recommendation-systems/15-real-time-online/</link><pubDate>Mon, 12 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/15-real-time-online/</guid><description>&lt;blockquote>
&lt;p>A user opens your app at 14:02 and searches for &amp;ldquo;trail running shoes&amp;rdquo;. By 15:30 they have moved on and are reading kitchen reviews. A model that retrains nightly is still showing them Salomon ads at 16:00 — and that gap is exactly the bug a real-time system fixes. The interesting part is not &amp;ldquo;make it faster&amp;rdquo; but &amp;ldquo;what &lt;em>should&lt;/em> be fast&amp;rdquo; — most features add nothing to AUC even when made real-time, and the wrong design point burns money for no lift.&lt;/p></description></item><item><title>Recommendation Systems (14): Cross-Domain Recommendation and Cold-Start Solutions</title><link>https://www.chenk.top/en/recommendation-systems/14-cross-domain-cold-start/</link><pubDate>Fri, 09 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/14-cross-domain-cold-start/</guid><description>&lt;blockquote>
&lt;p>When Netflix launches in a new country, it inherits millions of users with zero history and a catalog with no local ratings. Amazon faces the same problem each time it opens a new product category. Pure collaborative filtering — the workhorse of warm-state recommendation — has nothing to compute on. The discipline that makes recommendations work in this regime is a stack of techniques: bootstrap heuristics for the first request, meta-learning after a handful of interactions, cross-domain transfer when a related domain is rich, and bandits to keep exploring once the model is confident. This post walks through that stack, anchored to the papers it descends from.&lt;/p></description></item><item><title>Recommendation Systems (13): Fairness, Debiasing, and Explainability</title><link>https://www.chenk.top/en/recommendation-systems/13-fairness-explainability/</link><pubDate>Tue, 06 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/13-fairness-explainability/</guid><description>&lt;blockquote>
&lt;p>A user opens Spotify and the same fifty songs keep appearing. They open Amazon and the top results are always the items they have already considered. They open YouTube and every recommendation is one click away from a rabbit hole they cannot remember asking for. Each of these symptoms has a name, a cause, and a fix. This article is about all three.&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The &lt;strong>seven biases&lt;/strong> that systematically distort what users see, where each one comes from, and how to measure it&lt;/li>
&lt;li>&lt;strong>Causal inference for recommenders&lt;/strong> — why correlations from logged data lie, and how IPS, doubly robust estimators, and propensity scoring give you unbiased signal&lt;/li>
&lt;li>&lt;strong>Production-grade debiasing&lt;/strong>: MACR for popularity bias, DICE for conformity bias, FairCo for amortized exposure fairness&lt;/li>
&lt;li>&lt;strong>Counterfactual fairness&lt;/strong> and adversarial training to keep protected attributes out of embeddings&lt;/li>
&lt;li>&lt;strong>Explainability that holds up under audit&lt;/strong>: LIME, SHAP, and counterfactual explanations&lt;/li>
&lt;li>A working &lt;strong>trade-off framework&lt;/strong> so you can pick where to operate on the accuracy–fairness Pareto frontier&lt;/li>
&lt;/ul>
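&lt;p>To make the IPS bullet concrete, here is a minimal inverse-propensity estimator sketch (toy data and names are mine, not the article&amp;rsquo;s): reweight each logged reward by how the target policy&amp;rsquo;s exposure probability compares to the logging policy&amp;rsquo;s, so over-exposed popular items stop dominating the estimate.&lt;/p>

```python
import numpy as np

def ips_estimate(rewards, p_log, p_target):
    # inverse propensity scoring: each logged outcome is reweighted by
    # p_target / p_log, debiasing the logging policy's exposure skew
    w = p_target / p_log
    return float(np.mean(w * rewards))

# toy log: a popular item shown 90% of the time, a tail item 10%
rewards  = np.array([1.0, 0.0, 1.0, 1.0])   # observed clicks
p_log    = np.array([0.9, 0.9, 0.1, 0.9])   # logging policy exposure probs
p_target = np.full(4, 0.5)                  # evaluate a uniform-exposure policy
estimate = ips_estimate(rewards, p_log, p_target)
```

&lt;p>When the target and logging policies coincide, the weights are all 1 and the estimator reduces to the plain empirical mean &amp;ndash; a quick sanity check on any IPS implementation.&lt;/p>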
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Embedding-based recommenders (&lt;a href="https://www.chenk.top/en/recommendation-systems-04-ctr-prediction/">Part 4&lt;/a> and &lt;a href="https://www.chenk.top/en/recommendation-systems-05-embedding-techniques/">Part 5&lt;/a>)&lt;/li>
&lt;li>Basic causal inference vocabulary helps but is not required — we build it from scratch&lt;/li>
&lt;li>Comfortable reading PyTorch-style pseudocode&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="part-1--the-seven-biases">Part 1 — The Seven Biases&lt;/h2>
&lt;p>Bias in a recommender is not one problem. It is at least seven, and they compound. Below is the working taxonomy used in the survey of Chen et al. (2023, &lt;em>Bias and Debias in Recommender System&lt;/em>) — the cleanest reference if you want the full literature map.&lt;/p></description></item><item><title>Recommendation Systems (12): Large Language Models and Recommendation</title><link>https://www.chenk.top/en/recommendation-systems/12-llm-recommendation/</link><pubDate>Sat, 03 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/12-llm-recommendation/</guid><description>&lt;p>A user opens a movie app and types: &lt;em>&amp;ldquo;Something like Inception, but less depressing.&amp;rdquo;&lt;/em> A traditional recommender — collaborative filtering, two-tower DNN, even DIN — sees zero useful tokens here. It has no &lt;code>like&lt;/code> button to count, no co-watch graph to traverse, no user ID with history. The query has to be turned into IDs before the system can do anything.&lt;/p>
&lt;p>A Large Language Model has the opposite problem: it has &lt;em>too much&lt;/em> world knowledge but doesn&amp;rsquo;t know who this user is. It knows Inception is a Christopher Nolan film with non-linear narrative and a hopeful-but-ambiguous ending; it knows what &amp;ldquo;depressing&amp;rdquo; means in cinema; it can name twenty films that fit. But it can&amp;rsquo;t tell you which of those twenty the &lt;em>current&lt;/em> user has already seen, rated badly, or left half-watched.&lt;/p></description></item><item><title>AI Agents Complete Guide: From Theory to Industrial Practice</title><link>https://www.chenk.top/en/standalone/ai-agents-complete-guide/</link><pubDate>Wed, 31 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/ai-agents-complete-guide/</guid><description>&lt;p>A chatbot answers questions. An &lt;em>agent&lt;/em> gets things done &amp;ndash; it browses, runs code, calls APIs, queries databases, and iterates until the job is finished. The same LLM sits behind both, but the wrapper is different: an agent runs inside a loop with tools, memory, and the ability to inspect its own work.&lt;/p>
&lt;p>This guide is the long-form version of that idea. It covers the four core capabilities (planning, memory, tool use, reflection), the major framework families, multi-agent collaboration, evaluation, and the production concerns that decide whether an agent ships or quietly fails on a Tuesday afternoon.&lt;/p></description></item><item><title>Recommendation Systems (11): Contrastive Learning and Self-Supervised Learning</title><link>https://www.chenk.top/en/recommendation-systems/11-contrastive-learning/</link><pubDate>Wed, 31 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/11-contrastive-learning/</guid><description>&lt;p>Classical recommenders learn from one signal: did a user click, watch, or buy? That signal is precious, but it is also brutally sparse. Most users touch fewer than 1% of the catalogue, most items are touched by fewer than 0.1% of users, and a brand-new item or user has nothing at all. Optimising a model directly against such sparse labels almost guarantees overfitting on the head and silence on the tail.&lt;/p></description></item><item><title>Recommendation Systems (10): Deep Interest Networks and Attention Mechanisms</title><link>https://www.chenk.top/en/recommendation-systems/10-deep-interest-networks/</link><pubDate>Sun, 28 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/10-deep-interest-networks/</guid><description>&lt;p>A good chef doesn&amp;rsquo;t cook the same dish for every guest. She watches you walk in, notes the wine you order, glances at how you eyed the chalkboard — and only then decides whether tonight&amp;rsquo;s special should be the steak or the risotto. Your past visits matter, but only the parts that fit &lt;em>this&lt;/em> mood.&lt;/p>
&lt;p>A recommendation model used to be a worse chef. It would take everything the user had ever clicked, average it into a single vector, and serve the same dish to everyone in the room. That vintage leather jacket you viewed last week and the random phone charger you clicked six months ago carried equal weight, regardless of what you were looking at right now.&lt;/p></description></item><item><title>Recommendation Systems (9): Multi-Task Learning and Multi-Objective Optimization</title><link>https://www.chenk.top/en/recommendation-systems/09-multi-task-learning/</link><pubDate>Thu, 25 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/09-multi-task-learning/</guid><description>&lt;p>A live e-commerce ranker is never optimizing one number. The same model that decides which product to show you is, in the same forward pass, predicting whether you will click, whether you will add it to cart, whether you will pay, whether you will return it, and whether you will leave a positive review. Each prediction is a different &lt;em>task&lt;/em> with its own data distribution, its own scarcity, and its own incentives. They are also tightly coupled: a clicker is more likely to convert, a converter is more likely to write a review, and a high-CTR thumbnail can buy clicks that depress watch time.&lt;/p></description></item><item><title>Recommendation Systems (8): Knowledge Graph-Enhanced Recommendation</title><link>https://www.chenk.top/en/recommendation-systems/08-knowledge-graph/</link><pubDate>Mon, 22 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/08-knowledge-graph/</guid><description>&lt;p>When you search for &lt;em>The Dark Knight&lt;/em> on a streaming platform, the system does not merely log that you watched it. It knows Christian Bale played Batman, Christopher Nolan directed it, it belongs to the Batman trilogy, and it shares cinematic DNA with other cerebral action films. 
This rich semantic web is a &lt;strong>knowledge graph (KG)&lt;/strong> &amp;ndash; a structured network of entities (movies, actors, directors, genres) connected by typed relations (&lt;code>acted_in&lt;/code>, &lt;code>directed_by&lt;/code>, &lt;code>part_of&lt;/code>).&lt;/p></description></item><item><title>Recommendation Systems (7): Graph Neural Networks and Social Recommendation</title><link>https://www.chenk.top/en/recommendation-systems/07-graph-neural-networks/</link><pubDate>Fri, 19 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/07-graph-neural-networks/</guid><description>&lt;p>When Netflix decides what to recommend next, it does not look at your watch history in isolation. Behind the scenes there is a web of relationships: movies that share actors, users with overlapping taste, ratings that ripple through the catalogue. The &amp;ldquo;graph&amp;rdquo; view is not a metaphor — every interaction matrix &lt;em>is&lt;/em> a graph, and treating it as one unlocks ideas that flat user/item embeddings cannot express.&lt;/p>
&lt;p>&lt;strong>Graph neural networks&lt;/strong> (GNNs) are the tool that lets us reason over that graph. Instead of learning each user and each item in isolation, a GNN says: &lt;em>your representation is shaped by the company you keep.&lt;/em> That single shift powers Pinterest&amp;rsquo;s billion-node PinSage, the strikingly simple LightGCN that beats heavier baselines on collaborative filtering, and the social-recommendation systems that fuse &amp;ldquo;what you watched&amp;rdquo; with &amp;ldquo;what your friends watched.&amp;rdquo;&lt;/p></description></item><item><title>Recommendation Systems (6): Sequential Recommendation and Session-based Modeling</title><link>https://www.chenk.top/en/recommendation-systems/06-sequential-recommendation/</link><pubDate>Tue, 16 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/06-sequential-recommendation/</guid><description>&lt;p>When you scroll TikTok, every recommendation feels eerily on-point — not because the system reads your mind, but because it reads the &lt;strong>order&lt;/strong> of what you just watched. A cooking video followed by a travel vlog tells a different story than the same two clips in reverse. That ordering is exactly the signal that sequential recommenders are built to exploit.&lt;/p>
&lt;p>Compare two friends recommending shows. The first knows your favourite genres but never asks what you watched last week. The second says, &lt;em>&amp;ldquo;You just finished three sci-fi thrillers in a row — try this one.&amp;rdquo;&lt;/em> Traditional collaborative filtering is friend one. Sequential recommendation is friend two.&lt;/p></description></item><item><title>Recommendation Systems (5): Embedding and Representation Learning</title><link>https://www.chenk.top/en/recommendation-systems/05-embedding-techniques/</link><pubDate>Sat, 13 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/05-embedding-techniques/</guid><description>&lt;p>When Netflix suggests &lt;em>Inception&lt;/em> to someone who just finished &lt;em>The Dark Knight&lt;/em>, the magic is not a hand-crafted &amp;ldquo;if-watched-Nolan-then&amp;rdquo; rule. It is geometry. Both films sit close together in a 128-dimensional &lt;strong>embedding space&lt;/strong> that the model has learned from billions of viewing events. Geometry replaces enumeration: instead of comparing a movie to fifteen thousand others through brittle similarity rules, the system asks a single question — &lt;strong>how far apart are these two vectors?&lt;/strong>&lt;/p></description></item><item><title>Recommendation Systems (4): CTR Prediction and Click-Through Rate Modeling</title><link>https://www.chenk.top/en/recommendation-systems/04-ctr-prediction/</link><pubDate>Wed, 10 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/04-ctr-prediction/</guid><description>&lt;p>Every time you scroll through a social-media feed, click a product recommendation, or watch a suggested video, a CTR (click-through rate) model decided what to show you. These models answer one deceptively small question:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>&amp;ldquo;What is the probability that this specific user will click on this specific item, right now?&amp;rdquo;&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;p>Behind that question is one of the most economically valuable problems in machine learning. A 1% lift in CTR translates into millions of dollars at Google, Amazon, or Alibaba scale &amp;ndash; and the same models also drive video feeds, app stores, news apps, and dating apps. CTR prediction sits at the heart of the &lt;strong>ranking&lt;/strong> stage: candidate generation gives you a few thousand items, and the CTR model decides which dozen actually reach the user.&lt;/p></description></item><item><title>Recommendation Systems (3): Deep Learning Foundations</title><link>https://www.chenk.top/en/recommendation-systems/03-deep-learning-basics/</link><pubDate>Sun, 07 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/03-deep-learning-basics/</guid><description>&lt;p>In June 2016, Google published a one-page paper that quietly redrew the map of recommendation systems. The paper described &lt;strong>Wide &amp;amp; Deep Learning&lt;/strong>, the model then powering app recommendations inside Google Play &amp;ndash; a billion-user product. Within a year, every major tech company had a deep model in production. By 2019, the industry standard had shifted: matrix factorization was a baseline, not a system.&lt;/p>
&lt;p>What changed? Multi-layer neural networks brought four capabilities classical methods could not deliver:&lt;/p></description></item><item><title>Recommendation Systems (2): Collaborative Filtering and Matrix Factorization</title><link>https://www.chenk.top/en/recommendation-systems/02-collaborative-filtering/</link><pubDate>Thu, 04 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/02-collaborative-filtering/</guid><description>&lt;p>You finish &lt;em>The Shawshank Redemption&lt;/em> and want something with the same feeling. A genre filter would surface every prison drama ever made, most of them awful. Collaborative filtering takes a different route: it never looks at the movie itself. It looks at &lt;em>people who watched what you watched&lt;/em> and asks what else they loved.&lt;/p>
&lt;p>That single idea — let the crowd&amp;rsquo;s behaviour speak — powers Amazon, YouTube, Spotify and every modern feed. This article unpacks the algorithms behind it, from the neighbourhood methods of the 1990s to the matrix-factorization models that won the Netflix Prize.&lt;/p></description></item><item><title>Recommendation Systems (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/recommendation-systems/01-fundamentals/</link><pubDate>Mon, 01 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/recommendation-systems/01-fundamentals/</guid><description>&lt;p>Open Netflix and the homepage somehow knows you. Scroll TikTok and the next video is the one you didn&amp;rsquo;t realise you wanted. Drop into Spotify on a Monday morning and &lt;em>Discover Weekly&lt;/em> serves up thirty songs you&amp;rsquo;ve never heard of, and you save half of them.&lt;/p>
&lt;p>None of this is magic. It is one of the most commercially successful applications of machine learning, quietly running behind almost every consumer product you use: the &lt;strong>recommendation system&lt;/strong>.&lt;/p></description></item><item><title>NLP (12): Frontiers and Practical Applications</title><link>https://www.chenk.top/en/nlp/frontiers-applications/</link><pubDate>Tue, 25 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/frontiers-applications/</guid><description>&lt;p>We have spent eleven chapters climbing from raw text to multimodal foundation models. This twelfth and final chapter sits at the frontier and at the runway. It is where research stops being a paper and starts being a service: an LLM that calls tools, writes and debugs code, reasons through hundred-step problems, ingests a 200K-token contract, and serves a thousand concurrent users behind a FastAPI endpoint with p95 latency under 300 ms.&lt;/p></description></item><item><title>NLP (11): Multimodal Large Language Models</title><link>https://www.chenk.top/en/nlp/multimodal-nlp/</link><pubDate>Thu, 20 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/multimodal-nlp/</guid><description>&lt;p>Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. 
&lt;strong>Multimodal Large Language Models (MLLMs)&lt;/strong> close the gap by aligning images, audio, and video into the same representation space the language model already speaks.&lt;/p></description></item><item><title>NLP (10): RAG and Knowledge Enhancement Systems</title><link>https://www.chenk.top/en/nlp/rag-knowledge-enhancement/</link><pubDate>Sat, 15 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/rag-knowledge-enhancement/</guid><description>&lt;p>A frozen language model is a confident liar. It cannot read yesterday&amp;rsquo;s incident report, your company wiki, or the patch notes that shipped this morning, so when you ask, it confabulates an answer that is grammatically perfect and factually wrong. &lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> breaks the deadlock by separating &lt;em>memory&lt;/em> from &lt;em>reasoning&lt;/em>: keep the LLM small and stable, and put the volatile knowledge in an external store that you can update at any time. Before generating, retrieve the relevant evidence and condition the model on it.&lt;/p></description></item><item><title>NLP (9): Deep Dive into LLM Architecture</title><link>https://www.chenk.top/en/nlp/llm-architecture-deep-dive/</link><pubDate>Mon, 10 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/llm-architecture-deep-dive/</guid><description>&lt;p>The 2017 Transformer paper drew one block. Every production LLM today still uses that diagram as a silhouette, but almost every internal piece has been replaced. Pre-norm replaced post-norm. RMSNorm replaced LayerNorm. SwiGLU replaced GELU. Rotary embeddings replaced sinusoids. Multi-head attention became grouped-query attention. The dense FFN sometimes became a sparse mixture of experts. 
And the inference loop is dominated by a data structure that doesn&amp;rsquo;t appear in the original paper at all: the KV cache.&lt;/p></description></item><item><title>NLP (8): Model Fine-tuning and PEFT</title><link>https://www.chenk.top/en/nlp/fine-tuning-peft/</link><pubDate>Wed, 05 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/fine-tuning-peft/</guid><description>&lt;p>In 2020, fine-tuning a 7-billion-parameter language model was a project budget item: eight A100s, several days, and an engineer who knew how to babysit gradient checkpointing. In 2024, a graduate student does it on a laptop. The distance between those two worlds is almost entirely covered by one paper — Hu et al.&amp;rsquo;s LoRA (ICLR 2022) — and one follow-up — Dettmers et al.&amp;rsquo;s QLoRA (NeurIPS 2023).&lt;/p>
&lt;p>The shift is not just engineering. Parameter-Efficient Fine-Tuning (PEFT) reframes what it means to &amp;ldquo;have a model.&amp;rdquo; Instead of one binary blob per task, you keep a single frozen base model and a directory of small adapter files, each a few tens of megabytes. Switching tasks becomes loading a new adapter; serving N domains becomes O(1) base + N · ε.&lt;/p></description></item><item><title>NLP (7): Prompt Engineering and In-Context Learning</title><link>https://www.chenk.top/en/nlp/prompt-engineering-icl/</link><pubDate>Fri, 31 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/prompt-engineering-icl/</guid><description>&lt;p>The same model can produce a sharp answer or a confident hallucination. The difference is rarely the weights &amp;ndash; it is the framing. A vague request like &lt;em>&amp;ldquo;analyze this text&amp;rdquo;&lt;/em> gets you a generic summary; a prompt with a role, two clean examples, and a strict output schema gets you something a parser can consume. &lt;strong>Prompt engineering is the discipline of turning that gap into a repeatable system instead of a lucky shot.&lt;/strong>&lt;/p></description></item><item><title>NLP Part 6: GPT and Generative Language Models</title><link>https://www.chenk.top/en/nlp/gpt-generative-models/</link><pubDate>Sun, 26 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/gpt-generative-models/</guid><description>&lt;p>When you ask ChatGPT a question and a fluent multi-paragraph answer streams back token by token, you are watching a single deceptively simple loop: feed everything-so-far into a Transformer decoder, look at the probability distribution it produces over the vocabulary, pick one token, append it, repeat. That is &lt;em>all&lt;/em> an autoregressive language model does. 
The miracle is not the loop &amp;ndash; it is what happens when you scale the network behind the loop to hundreds of billions of parameters and train it on most of the internet.&lt;/p></description></item><item><title>NLP Part 5: BERT and Pretrained Models</title><link>https://www.chenk.top/en/nlp/bert-pretrained-models/</link><pubDate>Tue, 21 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/bert-pretrained-models/</guid><description>&lt;p>In October 2018, Google released BERT and broke eleven NLP benchmarks at once. The recipe is almost embarrassingly simple: take a Transformer encoder, train it to predict words that have been randomly hidden using both left and right context, and then fine-tune the same pretrained model for whatever downstream task you have. Before BERT, every task came with its own from-scratch model. After BERT, &amp;ldquo;pretrain once, fine-tune everywhere&amp;rdquo; became the default mental model for the entire field.&lt;/p></description></item><item><title>NLP Part 4: Attention Mechanism and Transformer</title><link>https://www.chenk.top/en/nlp/attention-transformer/</link><pubDate>Thu, 16 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/attention-transformer/</guid><description>&lt;p>In June 2017, eight researchers at Google Brain and Google Research published a paper with a deliberately bold title: &lt;em>Attention Is All You Need&lt;/em>. The architecture it introduced, the &lt;strong>Transformer&lt;/strong>, threw away recurrence entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.&lt;/p>
&lt;p>That one design decision unlocked massive parallelism on GPUs, eliminated the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.&lt;/p></description></item><item><title>Prompt Engineering Complete Guide: From Zero to Advanced Optimization</title><link>https://www.chenk.top/en/standalone/prompt-engineering-complete-guide/</link><pubDate>Wed, 15 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/prompt-engineering-complete-guide/</guid><description>&lt;p>The same model, two prompts: one gets 17% accuracy on grade-school math, the other gets 78%. The difference is not magic — it is prompt engineering. This guide shows you the techniques that work, the research behind them, and how to systematically optimize prompts for production.&lt;/p>
&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Foundations&lt;/strong> — zero-shot, few-shot, many-shot, task decomposition, and the five-block prompt skeleton.&lt;/li>
&lt;li>&lt;strong>Reasoning techniques&lt;/strong> — Chain-of-Thought, Self-Consistency, Tree of Thoughts, Graph of Thoughts, ReAct.&lt;/li>
&lt;li>&lt;strong>Automation&lt;/strong> — Automatic Prompt Engineering (APE), DSPy, LLMLingua compression.&lt;/li>
&lt;li>&lt;strong>Practical templates&lt;/strong> — structured output, code generation, data extraction, multi-turn chat.&lt;/li>
&lt;li>&lt;strong>Evaluation and debugging&lt;/strong> — metrics, A/B testing, error analysis, the failure-mode toolkit.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Prerequisites.&lt;/strong> Basic Python; experience calling any LLM API. No math background required.&lt;/p></description></item><item><title>NLP Part 3: RNN and Sequence Modeling</title><link>https://www.chenk.top/en/nlp/rnn-sequence-modeling/</link><pubDate>Sat, 11 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/rnn-sequence-modeling/</guid><description>&lt;p>Open Google Translate, swipe-type a message, dictate a memo to your phone — every one of these systems must consume an ordered stream of tokens and produce another. A feed-forward network treats each input independently, but language is fundamentally &lt;strong>sequential&lt;/strong>: the meaning of &amp;ldquo;mat&amp;rdquo; in &lt;em>the cat sat on the mat&lt;/em> depends on every word that came before. Recurrent Neural Networks (RNNs) handle this by maintaining a &lt;strong>hidden state&lt;/strong> that evolves as they consume each token. The hidden state is the network&amp;rsquo;s running summary of the past — its memory.&lt;/p></description></item><item><title>NLP Part 2: Word Embeddings and Language Models</title><link>https://www.chenk.top/en/nlp/word-embeddings-lm/</link><pubDate>Mon, 06 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/word-embeddings-lm/</guid><description>&lt;p>For decades, machines treated &amp;ldquo;king&amp;rdquo; and &amp;ldquo;queen&amp;rdquo; as unrelated symbols &amp;ndash; nothing more than two distinct slots in a vocabulary list. Then a single idea changed everything: what if every word lived in a continuous space, and meaning was just a &lt;em>direction&lt;/em>? Once that idea took hold, models could compute&lt;/p>
$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$&lt;p>and the entire trajectory of NLP turned toward representation learning. This article walks through that turn &amp;ndash; from the failure of one-hot vectors, to Word2Vec&amp;rsquo;s shallow networks, to the global statistics that GloVe exploits, to the subword n-grams that let FastText handle words it has never seen &amp;ndash; and finally connects embeddings to the language models that gave rise to them.&lt;/p></description></item><item><title>NLP Part 1: Introduction and Text Preprocessing</title><link>https://www.chenk.top/en/nlp/introduction-and-preprocessing/</link><pubDate>Wed, 01 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/introduction-and-preprocessing/</guid><description>&lt;p>Every time you ask Claude a question, autocomplete a sentence in Gmail, or read a Google Translate page, you are touching a stack that took seventy years to assemble. Natural Language Processing is the discipline that taught machines to read, score, transform, and write human language &amp;ndash; and the surprising thing is how much of the modern stack still rests on a small set of preprocessing primitives invented decades ago.&lt;/p></description></item><item><title>Reinforcement Learning (12): RLHF and LLM Applications</title><link>https://www.chenk.top/en/reinforcement-learning/12-rlhf-and-llm-applications/</link><pubDate>Thu, 25 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/12-rlhf-and-llm-applications/</guid><description>&lt;p>GPT-3 (June 2020) and ChatGPT (November 2022) share most of their weights. The base model could write fluent prose, complete code, and continue any pattern you gave it — and yet, asked a plain question, it would happily ramble, refuse for the wrong reasons, hallucinate citations, or produce a paragraph of toxicity. The two and a half years between them were not spent on bigger transformers. 
They were spent learning &lt;strong>how to ask the model to be useful&lt;/strong> — and that turned out to be a reinforcement-learning problem.&lt;/p></description></item><item><title>Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization</title><link>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</link><pubDate>Mon, 22 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</guid><description>&lt;p>Real data matrices are almost never both square and full rank: correlated features, too few samples, and noise-induced ill-conditioning all make &amp;ldquo;matrix inverse&amp;rdquo; either undefined or numerically useless. The &lt;strong>pseudoinverse&lt;/strong> (Moore-Penrose inverse) preserves the &lt;em>spirit&lt;/em> of an inverse while dropping the impossible-to-meet requirements: it redefines the &amp;ldquo;solution&amp;rdquo; of a linear system as the &lt;strong>least-squares solution&lt;/strong>, breaking ties by picking the one with &lt;strong>minimum norm&lt;/strong>. This post derives the pseudoinverse from that least-squares viewpoint, gives the four Penrose conditions, builds it from the SVD, and connects this single object to &lt;strong>the Eckart-Young low-rank approximation theorem&lt;/strong>, &lt;strong>PCA&lt;/strong>, &lt;strong>recommender-system matrix factorization&lt;/strong>, and &lt;strong>LoRA fine-tuning&lt;/strong>.&lt;/p></description></item><item><title>Reinforcement Learning (11): Hierarchical RL and Meta-Learning</title><link>https://www.chenk.top/en/reinforcement-learning/11-hierarchical-and-meta-rl/</link><pubDate>Sat, 20 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/11-hierarchical-and-meta-rl/</guid><description>&lt;p>Standard RL treats every problem as a flat sequence of atomic decisions: observe state, pick an action, receive a reward, repeat. 
That works when the horizon is short and rewards are dense, but it breaks down on the kind of tasks humans solve effortlessly. &amp;ldquo;Make breakfast&amp;rdquo; is not one decision; it is a tree of subtasks &amp;mdash; &lt;em>brew coffee&lt;/em>, &lt;em>fry eggs&lt;/em>, &lt;em>toast bread&lt;/em>, &lt;em>plate it up&lt;/em> &amp;mdash; each of which is itself a small policy. &lt;strong>Hierarchical RL (HRL)&lt;/strong> lets agents reason and act at multiple timescales by treating macro-actions as first-class citizens.&lt;/p></description></item><item><title>Reinforcement Learning (10): Offline Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</link><pubDate>Mon, 15 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</guid><description>&lt;p>Every algorithm we have studied so far has the same loop at its core: act, observe, update. That loop is what makes RL work, but it is also what stops RL from being deployed. A self-driving stack cannot rehearse intersections by crashing into them. A clinical decision-support model cannot run a randomized policy on actual patients. A factory robot cannot try ten thousand grasp variants on a production line.&lt;/p>
&lt;p>What these settings &lt;em>do&lt;/em> have is logs &amp;ndash; millions of hours of human driving, decades of de-identified patient records, terabytes of behavior-cloning data. &lt;strong>Offline RL&lt;/strong> (also called &lt;em>batch RL&lt;/em>) is the subfield that asks: can we squeeze a strong policy out of a fixed dataset, with &lt;strong>zero new interaction&lt;/strong> with the environment?&lt;/p></description></item><item><title>Reinforcement Learning (9): Multi-Agent Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/09-multi-agent-rl/</link><pubDate>Wed, 10 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/09-multi-agent-rl/</guid><description>&lt;p>Single-agent RL rests on one quiet but enormous assumption: the environment is stationary. The transition kernel does not change while the agent learns. The moment a second learner shares the world, that assumption collapses. Each agent now sees an environment whose dynamics shift as its peers update, rewards become entangled across agents, and the joint action space explodes combinatorially. These are not engineering nuisances. They are the reason multi-agent RL needs its own algorithms instead of just &lt;em>running DQN n times in parallel&lt;/em>.&lt;/p>
Go has roughly $10^{170}$ legal positions, more than the number of atoms in the observable universe. No amount of brute-force search will ever crack it. AlphaGo&amp;rsquo;s victory came from a different idea: let a deep network supply the &lt;em>intuition&lt;/em> about which moves look promising, and let Monte Carlo Tree Search (MCTS) supply the &lt;em>deliberation&lt;/em> that verifies and sharpens that intuition.&lt;/p></description></item><item><title>Reinforcement Learning (7): Imitation Learning and Inverse RL</title><link>https://www.chenk.top/en/reinforcement-learning/07-imitation-learning/</link><pubDate>Sun, 31 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/07-imitation-learning/</guid><description>&lt;p>Every algorithm in the previous chapters assumed access to a reward function. In practice, &lt;em>designing&lt;/em> that reward is often the hardest part of an RL project. Try writing one paragraph that captures &amp;ldquo;drive like a careful human&amp;rdquo;, &amp;ldquo;fold a shirt the way a tailor would&amp;rdquo;, or &amp;ldquo;summarise this document the way an expert editor would&amp;rdquo;. You can &lt;em>show&lt;/em> those behaviours far more easily than you can &lt;em>specify&lt;/em> them.&lt;/p>
&lt;p>Imitation learning takes that intuition seriously: instead of optimising a hand-engineered scalar, it learns from expert demonstrations $\mathcal{D} = \{(s_t, a_t)\}$. This chapter walks through the four canonical methods &amp;ndash; behavioral cloning, DAgger, maximum-entropy IRL, and GAIL/AIRL &amp;ndash; not as isolated tricks but as a single ladder where each rung relaxes one assumption and pays for it with new structure.&lt;/p></description></item><item><title>Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization</title><link>https://www.chenk.top/en/reinforcement-learning/06-ppo-and-trpo/</link><pubDate>Tue, 26 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/06-ppo-and-trpo/</guid><description>&lt;p>Policy gradients (Part 3) optimise the policy directly, sidestepping discrete &lt;code>argmax&lt;/code> operators and naturally handling stochastic strategies. They have one fatal flaw: &lt;strong>a single overlong step can destroy the policy&lt;/strong>, and because the data distribution is &lt;em>coupled&lt;/em> to the policy, recovery is nearly impossible.&lt;/p>
&lt;p>&lt;strong>Trust-region methods&lt;/strong> make this concrete: bound the change in &lt;em>behaviour&lt;/em>, not in parameters, at every update. TRPO does it through a hard KL constraint and a second-order solver. PPO mimics the same effect with one line of clipped arithmetic. The cheaper trick won: PPO trains OpenAI Five, ChatGPT&amp;rsquo;s RLHF stage, almost every modern robotics policy, and remains the workhorse of applied deep RL.&lt;/p></description></item><item><title>Reinforcement Learning (5): Model-Based RL and World Models</title><link>https://www.chenk.top/en/reinforcement-learning/05-model-based-rl-and-world-models/</link><pubDate>Thu, 21 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/05-model-based-rl-and-world-models/</guid><description>&lt;p>Every algorithm we have covered so far &amp;ndash; DQN, REINFORCE, A2C, PPO, SAC &amp;ndash; is &lt;strong>model-free&lt;/strong>: the agent treats the environment as a black box, throws actions at it, and updates its policy from the rewards that come back. The approach works, but it is profligate. DQN needs roughly &lt;strong>10 million frames&lt;/strong> to master Atari Pong. OpenAI Five trained on Dota 2 for the equivalent of &lt;strong>~45,000 years&lt;/strong> of self-play. AlphaStar consumed years of StarCraft for a single agent.&lt;/p></description></item><item><title>Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning</title><link>https://www.chenk.top/en/reinforcement-learning/04-exploration-and-curiosity-driven-learning/</link><pubDate>Sat, 16 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/04-exploration-and-curiosity-driven-learning/</guid><description>&lt;p>Drop a fresh agent into Montezuma&amp;rsquo;s Revenge. 
To score a single point it must walk to the right, jump a skull, climb a rope, leap to a platform, and grab a key &amp;ndash; roughly &lt;strong>a hundred precise actions in a row&lt;/strong>. Until that key is collected, every reward signal is exactly zero.&lt;/p>
&lt;p>A textbook DQN with $\varepsilon=0.1$ exploration has, by a generous estimate, a $0.1^{100} \approx 10^{-100}$ chance of stumbling onto that key by accident. Unsurprisingly, vanilla DQN scores &lt;strong>0&lt;/strong> on this game. Not &amp;ldquo;low&amp;rdquo; &amp;ndash; literally zero, every episode, for the entire training run.&lt;/p></description></item><item><title>Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods</title><link>https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/</link><pubDate>Mon, 11 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/</guid><description>&lt;p>DQN proved that deep RL can master Atari, but it has a hard ceiling: it only works in &lt;strong>discrete action spaces&lt;/strong>. Ask it to control a robot arm with seven continuous joint angles and it falls apart &amp;ndash; you would have to solve an inner optimisation problem every time you choose an action.&lt;/p>
&lt;p>&lt;strong>Policy gradient methods&lt;/strong> take a fundamentally different route. Instead of learning a value function and &lt;em>deriving&lt;/em> a policy from it, they &lt;strong>directly optimise the policy&lt;/strong>. That single change opens the door to continuous actions, stochastic strategies, and problems where the optimal play is itself random (think rock-paper-scissors).&lt;/p></description></item><item><title>Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)</title><link>https://www.chenk.top/en/reinforcement-learning/02-q-learning-and-dqn/</link><pubDate>Wed, 06 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/02-q-learning-and-dqn/</guid><description>&lt;p>In December 2013, a small DeepMind team uploaded a paper to arXiv with a striking claim: a single neural network, trained from raw pixels and the score, learned to play seven Atari games &amp;ndash; and beat the previous best on six of them. No game-specific features. No hand-coded heuristics. The same architecture for Pong, Breakout, and Space Invaders. The algorithm was &lt;strong>Deep Q-Network (DQN)&lt;/strong>, and it kicked off the deep reinforcement learning era.&lt;/p></description></item><item><title>Reinforcement Learning (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/reinforcement-learning/01-fundamentals-and-core-concepts/</link><pubDate>Fri, 01 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/01-fundamentals-and-core-concepts/</guid><description>&lt;p>The first time you sat on a bicycle, nobody handed you a manual that said &lt;em>&amp;ldquo;if your tilt angle exceeds 7.4 degrees, apply 12% counter-steer.&amp;rdquo;&lt;/em> You wobbled, you over-corrected, you fell, you got back on. After a few hundred attempts your body simply &lt;em>knew&lt;/em> what to do, even though you could not put it into words.&lt;/p>
&lt;p>That trial-feedback-improvement loop is not just how we learn to ride bikes. It is how AlphaGo learned to defeat the world Go champion, how Boston Dynamics robots learn to walk, and how recommendation systems quietly improve every time you click. They all share one mathematical framework called &lt;strong>reinforcement learning&lt;/strong> (RL).&lt;/p></description></item><item><title>Reparameterization Trick &amp; Gumbel-Softmax: A Deep Dive</title><link>https://www.chenk.top/en/standalone/reparameterization-gumbel-softmax/</link><pubDate>Thu, 24 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/reparameterization-gumbel-softmax/</guid><description>&lt;p>The moment your model contains a sampling step, training hits a hard wall: &lt;strong>how do gradients flow through a random node?&lt;/strong>&lt;/p>
&lt;p>The reparameterization trick has a clean answer — rewrite $z\sim p_\theta(z)$ as $z=g_\theta(\epsilon)$, isolating the randomness in a parameter-free noise variable $\epsilon$, so backprop can flow through $g_\theta$. The trouble starts with discrete variables: operations like $\arg\max$ are not differentiable. &lt;strong>Gumbel-Softmax&lt;/strong> (a.k.a. the Concrete distribution) replaces the discrete sample with a tempered softmax over Gumbel-perturbed logits, giving you a smooth, differentiable surrogate that you can train end-to-end.&lt;/p></description></item><item><title>Transfer Learning (12): Industrial Applications and Best Practices</title><link>https://www.chenk.top/en/transfer-learning/12-industrial-applications-and-best-practices/</link><pubDate>Sun, 06 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/12-industrial-applications-and-best-practices/</guid><description>&lt;p>This is the final part of the series. The previous eleven parts gave you the mechanics &amp;ndash; pretraining, fine-tuning, domain adaptation, few-shot and zero-shot learning, distillation, multi-task learning, multimodality, parameter-efficient methods, continual learning, and cross-lingual transfer. This part is about the work that happens once the notebook closes: deciding &lt;strong>whether&lt;/strong> to use transfer learning, &lt;strong>how&lt;/strong> to thread it into a production pipeline, and &lt;strong>how&lt;/strong> to know it is still working six months later.&lt;/p></description></item><item><title>Transfer Learning (11): Cross-Lingual Transfer</title><link>https://www.chenk.top/en/transfer-learning/11-cross-lingual-transfer/</link><pubDate>Mon, 30 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/11-cross-lingual-transfer/</guid><description>&lt;p>English has the labels. The world has 7,000+ languages. 
Cross-lingual transfer is what lets a sentiment classifier trained only on English IMDB reviews score Spanish tweets, what makes a question-answering model fine-tuned on SQuAD answer Hindi questions, and what allows a model that has never seen a single labeled Swahili sentence to do passable Swahili NER.&lt;/p>
&lt;p>This post derives why that is even possible. We start from the bilingual-embedding alignment that motivated the field, walk through the multilingual pretraining recipe (mBERT, XLM-R) that made parallel data optional, and end with the practical playbook &amp;ndash; zero-shot vs translate-train vs translate-test, when to pick which, and where the wheels come off.&lt;/p></description></item><item><title>Transfer Learning (10): Continual Learning</title><link>https://www.chenk.top/en/transfer-learning/10-continual-learning/</link><pubDate>Tue, 24 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/10-continual-learning/</guid><description>&lt;p>You can teach yourself to play guitar this year and you will still remember how to ride a bike. A neural network cannot. Fine-tune a vision model on CIFAR-10 then on SVHN, evaluate it on CIFAR-10 again, and accuracy collapses to barely above chance. The phenomenon is called &lt;strong>catastrophic forgetting&lt;/strong>, and overcoming it is the central problem of &lt;strong>continual learning (CL)&lt;/strong>: a learner that absorbs a stream of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots$ without re-accessing past data and without losing what it already knew.&lt;/p></description></item><item><title>LLM Workflows and Application Architecture: Enterprise Implementation Guide</title><link>https://www.chenk.top/en/standalone/llm-workflows-architecture/</link><pubDate>Sat, 21 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/llm-workflows-architecture/</guid><description>&lt;p>Most LLM tutorials end where the interesting work begins. They show you how to call a chat completion endpoint, attach a vector store, and wrap the whole thing in a Streamlit demo. None of that is wrong, but none of it is what breaks at 3 a.m. when 10,000 users hit your service at once and every other answer is a hallucination.&lt;/p>
&lt;p>This article is about everything that comes after the demo. It is opinionated on purpose: production LLM systems are mostly plain distributed systems with one non-deterministic component bolted on, and most of the engineering effort goes into containing that non-determinism. We will work through seven dimensions — application architecture, workflow patterns, the RAG-vs-fine-tune decision, deployment topology, cost, observability, and enterprise integration — keeping each one short, concrete, and grounded in the levers that actually move the needle.&lt;/p></description></item><item><title>Symplectic Geometry and Structure-Preserving Neural Networks</title><link>https://www.chenk.top/en/standalone/symplectic-geometry-and-structure-preserving-neural-networks/</link><pubDate>Sat, 21 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/symplectic-geometry-and-structure-preserving-neural-networks/</guid><description>&lt;p>Train a vanilla feedforward network to predict a one-dimensional harmonic oscillator. Validate it on the next ten time steps &amp;ndash; the error is fine. Now roll it out for a thousand steps. The orbit no longer closes, the energy creeps upward, and what should be a periodic motion turns into a slow spiral. The network learned to fit data points; it never learned the &lt;em>physics&lt;/em>. 
Structure-preserving networks fix this by baking geometric invariants &amp;ndash; energy conservation, the symplectic 2-form, the Euler-Lagrange equations &amp;ndash; directly into the architecture, so the learned model cannot violate them no matter how long you integrate.&lt;/p></description></item><item><title>Transfer Learning (9): Parameter-Efficient Fine-Tuning</title><link>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</link><pubDate>Wed, 18 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</guid><description>&lt;p>How do you fine-tune a 175B-parameter model on a single GPU? Update only 0.1% of the parameters. Parameter-Efficient Fine-Tuning (PEFT) makes this possible &amp;ndash; and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.&lt;/p>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why the low-rank assumption holds for weight updates&lt;/li>
&lt;li>LoRA: derivation, initialization, scaling, and weight merging&lt;/li>
&lt;li>Adapter: bottleneck architecture and where to insert it&lt;/li>
&lt;li>Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2&lt;/li>
&lt;li>QLoRA: how 4-bit quantisation gets a 65B model on one GPU&lt;/li>
&lt;li>Method comparison and a selection guide grounded in GLUE numbers&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Transformer architecture (attention, FFN, residual + LayerNorm)&lt;/li>
&lt;li>Matrix decomposition basics (rank, SVD)&lt;/li>
&lt;li>Transfer learning fundamentals (Parts 1-6)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="the-full-fine-tuning-problem">The Full Fine-Tuning Problem&lt;/h2>
&lt;p>Full fine-tuning updates every parameter $\boldsymbol{\theta}$:&lt;/p></description></item><item><title>Transfer Learning (8): Multimodal Transfer</title><link>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</link><pubDate>Thu, 12 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</guid><description>&lt;p>How can a model classify an image of a Burmese cat correctly without ever having seen a label &amp;ldquo;Burmese cat&amp;rdquo;? Traditional supervised learning needs millions of labeled examples per class. CLIP, released by OpenAI in 2021, sidesteps that constraint entirely: it learns to put images and natural-language descriptions into the same vector space, and then &amp;ldquo;classification&amp;rdquo; reduces to picking which sentence — out of any candidate sentences you write down — sits closest to the image.&lt;/p></description></item><item><title>Transfer Learning (7): Zero-Shot Learning</title><link>https://www.chenk.top/en/transfer-learning/07-zero-shot-learning/</link><pubDate>Fri, 06 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/07-zero-shot-learning/</guid><description>&lt;p>You have never seen a zebra. I tell you it looks like a horse painted with black and white stripes, and the next time one walks into the zoo you recognise it instantly. No labelled examples, no fine-tuning — only a &lt;em>semantic bridge&lt;/em> between what you know (horses, stripes) and what you don&amp;rsquo;t (this new species).&lt;/p>
&lt;p>&lt;strong>Zero-shot learning (ZSL)&lt;/strong> is the machine-learning version of that trick. Train on a set of &lt;em>seen&lt;/em> classes for which you have labelled images. At test time, classify into a &lt;em>disjoint&lt;/em> set of &lt;em>unseen&lt;/em> classes that you have &lt;em>never&lt;/em> shown the model — using only a description of what those classes are: a list of attributes, a word embedding of the class name, a sentence, or an image-text contrastive prompt. The model&amp;rsquo;s only handle on the unseen classes is the geometry it has learned in a shared visual–semantic space.&lt;/p></description></item><item><title>Transfer Learning (6): Multi-Task Learning</title><link>https://www.chenk.top/en/transfer-learning/06-multi-task-learning/</link><pubDate>Sat, 31 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/06-multi-task-learning/</guid><description>&lt;p>A self-driving car looking through a single camera needs to do three things at once: detect cars and pedestrians, segment lanes and free space, and estimate how far away each pixel is. You could train three separate networks. You would burn 3x the parameters, run 3x the forward passes at inference, and ignore the obvious fact that all three tasks need the same kind of low-level features (edges, surfaces, occlusion cues).&lt;/p></description></item><item><title>Transfer Learning (5): Knowledge Distillation</title><link>https://www.chenk.top/en/transfer-learning/05-knowledge-distillation/</link><pubDate>Sun, 25 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/05-knowledge-distillation/</guid><description>&lt;p>You have a 340M-parameter BERT model that hits 95% accuracy. The product team wants it on a phone that can barely fit 10M parameters. Training a 10M model from scratch lands at 85%. 
Knowledge distillation closes most of the gap: train the small model on the &lt;em>output distribution&lt;/em> of the large one, not just on the labels, and you can reach 92%.&lt;/p>
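&lt;p>The softened targets that make this work can be sketched in a few lines of plain Python (the logits are invented for illustration; this is a toy, not the article&amp;rsquo;s implementation). Raising the softmax temperature is the standard trick from Hinton&amp;rsquo;s original paper:&lt;/p>

```python
import math

def softmax(logits, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the
    # teacher's similarity structure ("dark knowledge").
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical teacher logits for classes [cat, tiger, dog, plane]
teacher_logits = [9.0, 5.0, 4.0, 0.5]

hard = softmax(teacher_logits, T=1.0)  # near one-hot: barely more than the label
soft = softmax(teacher_logits, T=4.0)  # softened targets the student trains on

# At T=1 the non-cat classes are almost invisible; at T=4 the
# cat/tiger/dog/plane similarity structure becomes a usable signal.
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```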
&lt;p>The key insight, due to Hinton, is that a teacher&amp;rsquo;s &amp;ldquo;wrong&amp;rdquo; predictions are not noise &amp;ndash; they are information. When the teacher classifies a cat image and assigns 0.14 to &amp;ldquo;tiger&amp;rdquo;, 0.07 to &amp;ldquo;dog&amp;rdquo;, and 0.008 to &amp;ldquo;plane&amp;rdquo;, it is telling you that cats look a lot like tigers, somewhat like dogs, and nothing like aeroplanes. That structure &amp;ndash; &lt;strong>dark knowledge&lt;/strong> &amp;ndash; is invisible in a one-hot label, and learning it is what lets the student punch above its weight.&lt;/p></description></item><item><title>Transfer Learning (4): Few-Shot Learning</title><link>https://www.chenk.top/en/transfer-learning/04-few-shot-learning/</link><pubDate>Mon, 19 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/04-few-shot-learning/</guid><description>&lt;p>Show a child one photograph of a pangolin and they will spot pangolins for life. Show a deep learning model one photograph and it will give you a uniformly random guess. Few-shot learning is the field that closes that gap: building classifiers that work with only one to ten labeled examples per class.&lt;/p>
&lt;p>The trick is not to memorize individual classes harder. It is to learn &lt;em>how to learn&lt;/em> from very few examples, then carry that ability over to brand-new classes at test time. This article covers the two families that dominate the field today: &lt;strong>metric learning&lt;/strong>, which learns a good distance function, and &lt;strong>meta-learning&lt;/strong>, which learns a good initialization.&lt;/p></description></item><item><title>Transfer Learning (3): Domain Adaptation</title><link>https://www.chenk.top/en/transfer-learning/03-domain-adaptation/</link><pubDate>Tue, 13 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/03-domain-adaptation/</guid><description>&lt;p>Your autonomous-driving stack works perfectly on sunny California freeways. Then it rains in Seattle. Top-1 accuracy drops from 95% to 70%. The model did not get worse — the &lt;em>data distribution shifted&lt;/em>, and your training set never told it what wet asphalt looks like at dusk.&lt;/p>
&lt;p>This is the everyday problem of &lt;strong>domain adaptation&lt;/strong>: you have abundant labelled data in one distribution (the &lt;em>source&lt;/em>) and unlabelled data in another (the &lt;em>target&lt;/em>), and you need the model to perform on the target. This article shows you how, from first-principles theory to a working DANN implementation.&lt;/p></description></item><item><title>Transfer Learning (2): Pre-training and Fine-tuning</title><link>https://www.chenk.top/en/transfer-learning/02-pre-training-and-fine-tuning/</link><pubDate>Wed, 07 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/02-pre-training-and-fine-tuning/</guid><description>&lt;p>BERT changed NLP overnight. A model pre-trained on Wikipedia and BookCorpus could be fine-tuned on a few thousand labelled examples and beat task-specific architectures that researchers had spent years hand-crafting. The same pattern repeated in vision (ImageNet pre-training, then SimCLR, MAE), in speech (wav2vec 2.0), and in code (Codex). Today, &amp;ldquo;pre-train once, fine-tune everywhere&amp;rdquo; is the default recipe of modern deep learning.&lt;/p>
&lt;p>But &lt;em>why&lt;/em> does pre-training work? When should you freeze layers, when should you LoRA, and how small does your learning rate need to be? This article unpacks both the theory and the engineering practice behind the most successful transfer paradigm we have.&lt;/p></description></item><item><title>Transfer Learning (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/transfer-learning/01-fundamentals-and-core-concepts/</link><pubDate>Thu, 01 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/01-fundamentals-and-core-concepts/</guid><description>&lt;p>You spent two weeks training an ImageNet classifier on a rack of GPUs. On Monday morning your team lead asks for a chest-X-ray pneumonia model &amp;ndash; and the entire labelled dataset is &lt;strong>two hundred images&lt;/strong>. Do you book another two weeks of GPU time and start from scratch?&lt;/p>
&lt;p>Of course not. You take what the ImageNet model already knows about edges, textures and shapes, swap out the last layer, and fine-tune on the X-rays. Two hours later you have a model that beats anything you could have trained from random weights with so little data. That is &lt;strong>transfer learning&lt;/strong>, and it is the reason most real-world deep-learning projects ship in days instead of months.&lt;/p></description></item><item><title>Essence of Linear Algebra (18): Frontiers and Summary</title><link>https://www.chenk.top/en/linear-algebra/18-frontiers-and-summary/</link><pubDate>Wed, 30 Apr 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/18-frontiers-and-summary/</guid><description>&lt;p>We have walked the long road of linear algebra together. We started with arrows in the plane and ended at the gates of quantum computers, the inner workings of large language models, and the topology of data clouds. The remarkable thing &amp;ndash; the thing this series has tried to make visible &amp;ndash; is that the same handful of ideas keeps coming back. A vector is a state. A matrix is a transformation. A decomposition is the structure hiding inside the transformation. A norm tells you when you can trust your computation. Once you internalise that loop, every &amp;ldquo;frontier&amp;rdquo; looks less like a foreign country and more like another dialect of a language you already speak.&lt;/p></description></item><item><title>Essence of Linear Algebra (17): Linear Algebra in Computer Vision</title><link>https://www.chenk.top/en/linear-algebra/17-linear-algebra-in-computer-vision/</link><pubDate>Wed, 23 Apr 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/17-linear-algebra-in-computer-vision/</guid><description>&lt;p>Computer vision is the science of teaching machines to see. 
What is striking is how thoroughly the whole field reduces to linear algebra: an image is a matrix, a geometric transformation is a matrix product, a camera is a $3 \times 4$ projection matrix, two-view geometry is the equation $\mathbf{x}_2^\top \mathbf{F}\, \mathbf{x}_1 = 0$, and 3D reconstruction is a sparse linear least-squares problem. Once you see the field through that lens, what once looked like a zoo of algorithms turns out to be a small set of linear-algebraic ideas applied repeatedly.&lt;/p></description></item><item><title>Essence of Linear Algebra (16): Linear Algebra in Deep Learning</title><link>https://www.chenk.top/en/linear-algebra/16-linear-algebra-in-deep-learning/</link><pubDate>Wed, 16 Apr 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/16-linear-algebra-in-deep-learning/</guid><description>&lt;p>Strip away the marketing and a deep network is one thing: a long pipeline of matrix multiplications glued together by elementwise nonlinearities. Forward pass, backward pass, convolution, attention, normalization, fine-tuning &amp;ndash; every &amp;ldquo;trick&amp;rdquo; is a small twist on the same algebraic theme. Once you see the matrices, the field stops looking like a bag of recipes and starts looking like a single language.&lt;/p>
&lt;p>This chapter rebuilds the modern stack from that single language. We follow one signal &amp;ndash; a vector $\mathbf{x}$ &amp;ndash; as it flows through linear layers, gets convolved, gets attended to, gets normalized, and gets adapted by a low-rank update. At each step we name the matrix that does the work and the property of that matrix (rank, conditioning, transpose) that makes the trick succeed.&lt;/p></description></item><item><title>Essence of Linear Algebra (15): Linear Algebra in Machine Learning</title><link>https://www.chenk.top/en/linear-algebra/15-linear-algebra-in-machine-learning/</link><pubDate>Wed, 09 Apr 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/15-linear-algebra-in-machine-learning/</guid><description>&lt;p>Ask any senior ML engineer &amp;ldquo;what math do you actually use day to day?&amp;rdquo; and the answer is almost always &lt;strong>linear algebra&lt;/strong>. Calculus shows up in derivations; probability shows up in modeling; but the runtime of a real ML system is dominated by matrix-vector multiplies, decompositions, and projections. PyTorch&amp;rsquo;s &lt;code>Linear&lt;/code>, scikit-learn&amp;rsquo;s &lt;code>PCA&lt;/code>, Spark MLlib&amp;rsquo;s &lt;code>ALS&lt;/code>, and a Transformer&amp;rsquo;s attention head are all the same primitive in different costumes.&lt;/p>
&lt;p>This chapter walks through the algorithms that production ML systems actually run &amp;ndash; PCA, LDA, SVM with kernels, matrix factorization for recommenders, regularized linear regression, neural network layers, attention &amp;ndash; and shows the linear algebra that makes each of them tick. We focus on intuition first, geometry second, formulas third.&lt;/p></description></item><item><title>Essence of Linear Algebra (14): Random Matrix Theory</title><link>https://www.chenk.top/en/linear-algebra/14-random-matrix-theory/</link><pubDate>Wed, 02 Apr 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/14-random-matrix-theory/</guid><description>&lt;p>A million i.i.d. coin flips, arranged into a thousand-by-thousand symmetric matrix, somehow produce eigenvalues that fill a perfect semicircle. A noisy sample covariance matrix that should be the identity instead spreads its eigenvalues across an interval whose width you can predict before seeing a single number. The largest eigenvalue of a Wigner matrix has a tail distribution that turns up everywhere &amp;ndash; in growing crystals, in the longest increasing subsequence of a random permutation, in the energy levels of heavy nuclei. &lt;strong>Random matrix theory&lt;/strong> (RMT) is the study of why these regularities appear, and how to use them.&lt;/p></description></item><item><title>Prefix-Tuning: Optimizing Continuous Prompts for Generation</title><link>https://www.chenk.top/en/standalone/prefix-tuning/</link><pubDate>Mon, 31 Mar 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/prefix-tuning/</guid><description>&lt;p>Fine-tuning a 1.5B-parameter GPT-2 model for each downstream task means saving a fresh 1.5B-parameter checkpoint every time. Across a dozen tasks that is a substantial storage and serving headache, and it makes sharing a single base model essentially impossible. 
&lt;em>Prefix-Tuning&lt;/em> (Li &amp;amp; Liang, 2021) takes the opposite stance: freeze every weight of the language model, and learn a tiny block of continuous vectors — the &lt;em>prefix&lt;/em> — that is fed into the attention layers as if it were context the model already attended to. The model never changes; only the prefix does, and a different prefix produces a different &amp;ldquo;personality&amp;rdquo; on demand.&lt;/p></description></item><item><title>Essence of Linear Algebra (13): Tensors and Multilinear Algebra</title><link>https://www.chenk.top/en/linear-algebra/13-tensors-and-multilinear-algebra/</link><pubDate>Wed, 26 Mar 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/13-tensors-and-multilinear-algebra/</guid><description>&lt;p>If you&amp;rsquo;ve used PyTorch or TensorFlow, you&amp;rsquo;ve met the word &amp;ldquo;tensor&amp;rdquo; hundreds of times. PyTorch calls every array &lt;code>torch.Tensor&lt;/code>; TensorFlow puts it in the product name. But what &lt;em>is&lt;/em> a tensor, and why did frameworks borrow this physics-flavored word for what looks like a multi-dimensional array?&lt;/p>
&lt;p>The short answer of this chapter:&lt;/p>
&lt;blockquote>
&lt;p>A tensor is the natural generalization of a scalar, vector, and matrix to &lt;strong>arbitrary&lt;/strong> dimensions. Everything you know about matrices either lifts cleanly to tensors, or breaks in instructive ways.&lt;/p>&lt;/blockquote></description></item><item><title>Sparse Matrices and Compressed Sensing -- Less Is More</title><link>https://www.chenk.top/en/linear-algebra/12-sparse-matrices-and-compressed-sensing/</link><pubDate>Wed, 19 Mar 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/12-sparse-matrices-and-compressed-sensing/</guid><description>&lt;h2 id="the-less-is-more-miracle">The &amp;ldquo;Less Is More&amp;rdquo; Miracle&lt;/h2>
&lt;p>A raw 24-megapixel photograph weighs in at roughly 70 MB. JPEG compresses it to a few hundred kilobytes &amp;ndash; a 100$\times$ reduction &amp;ndash; and you cannot tell the difference. A traditional MRI scan takes thirty minutes; a modern compressed sensing MRI gets the same image in five.&lt;/p>
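&lt;p>The common mechanism is easy to see in miniature (the coefficient values below are invented): sort the transform coefficients by magnitude, keep a handful, and measure how much signal energy survives.&lt;/p>

```python
# Toy "compressible" signal: coefficient magnitudes that decay fast,
# the way DCT/wavelet coefficients of natural images do.
coeffs = [100.0, 40.0, 15.0, 6.0, 2.0, 0.9, 0.4, 0.15, 0.06, 0.02]

def energy(v):
    return sum(c * c for c in v)

k = 3
kept = sorted(coeffs, key=abs, reverse=True)[:k]

# Keeping just 3 of 10 coefficients retains over 99% of the energy,
# which is why discarding the rest is visually lossless.
ratio = energy(kept) / energy(coeffs)
print(round(ratio, 4))
```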
&lt;p>Both miracles run on the same engine: &lt;strong>sparsity&lt;/strong>. Most natural signals, written in the right basis, have only a handful of meaningful coefficients. Everything else is essentially zero.&lt;/p></description></item><item><title>Matrix Calculus and Optimization -- The Engine Behind Machine Learning</title><link>https://www.chenk.top/en/linear-algebra/11-matrix-calculus-and-optimization/</link><pubDate>Wed, 12 Mar 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/11-matrix-calculus-and-optimization/</guid><description>&lt;h2 id="from-shower-knobs-to-neural-networks">From Shower Knobs to Neural Networks&lt;/h2>
&lt;p>Every morning you train a tiny neural network. The water comes out too cold, so you nudge the knob &amp;ndash; a &lt;em>parameter&lt;/em> &amp;ndash; in some direction. A second later you observe a new temperature &amp;ndash; the &lt;em>error signal&lt;/em> &amp;ndash; and nudge again. After three or four iterations you have converged.&lt;/p>
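&lt;p>That morning routine is one-dimensional gradient descent. A toy sketch (the faucet model and step size are invented for illustration):&lt;/p>

```python
def temperature(knob):
    # Hypothetical linear faucet: each knob unit adds 8 degrees to a 20-degree base.
    return 20.0 + 8.0 * knob

target = 38.0
knob = 0.0
lr = 0.01  # step size

for _ in range(50):
    error = temperature(knob) - target  # the "error signal"
    grad = 2 * error * 8.0              # d/dknob of (temperature(knob) - target)^2
    knob -= lr * grad                   # nudge against the gradient

print(round(temperature(knob), 2))  # converges near 38
```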
&lt;p>Modern deep learning is the same loop, scaled up by seven orders of magnitude. The &amp;ldquo;knob&amp;rdquo; is a matrix $W$ with hundreds of millions of entries. The &amp;ldquo;error&amp;rdquo; is a scalar loss $L$. And the question is the same: &lt;strong>for each parameter, in which direction should I push, and by how much?&lt;/strong> The answer lives in a single object: the gradient $\partial L / \partial W$.&lt;/p>
&lt;p>The equations are right. The algorithm is right. So why is the computed answer completely wrong?&lt;/p>
&lt;p>The culprit is usually a single number called the &lt;strong>condition number&lt;/strong>. It measures how &lt;em>sensitive&lt;/em> a linear system is — whether a tiny wobble in the input gets amplified into a catastrophic error in the output. To talk about condition numbers we first need a way to measure the &amp;ldquo;size&amp;rdquo; of vectors and matrices. That is what norms do.&lt;/p></description></item><item><title>Singular Value Decomposition -- The Crown Jewel of Linear Algebra</title><link>https://www.chenk.top/en/linear-algebra/09-singular-value-decomposition/</link><pubDate>Wed, 26 Feb 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/09-singular-value-decomposition/</guid><description>&lt;h2 id="why-svd-earns-the-crown">Why SVD Earns the Crown&lt;/h2>
&lt;p>The spectral theorem of &lt;a href="https://www.chenk.top/en/linear-algebra/08-symmetric-matrices-and-quadratic-forms/">Chapter 8&lt;/a>
 gave us $A = Q\Lambda Q^T$ &amp;ndash; a beautifully clean factorisation, but &lt;strong>only for symmetric matrices&lt;/strong>. Most matrices that show up in practice are not symmetric, and many are not even square:&lt;/p>
&lt;ul>
&lt;li>a photograph stored as a $1920 \times 1080$ pixel matrix,&lt;/li>
&lt;li>a Netflix-style user&amp;ndash;movie rating matrix (millions of rows, thousands of columns),&lt;/li>
&lt;li>a document&amp;ndash;term matrix in NLP (documents by vocabulary),&lt;/li>
&lt;li>a gene-expression matrix in bioinformatics.&lt;/li>
&lt;/ul>
&lt;p>The singular value decomposition handles every one of them:&lt;/p>
$$A = U\,\Sigma\,V^{\!\top}.$$&lt;p>
This is the most powerful, most universally applicable decomposition in all of linear algebra.&lt;/p></description></item><item><title>Symmetric Matrices and Quadratic Forms -- The Best Matrices in Town</title><link>https://www.chenk.top/en/linear-algebra/08-symmetric-matrices-and-quadratic-forms/</link><pubDate>Wed, 19 Feb 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/08-symmetric-matrices-and-quadratic-forms/</guid><description>&lt;h2 id="why-symmetric-matrices-are-the-best">Why Symmetric Matrices Are the &amp;ldquo;Best&amp;rdquo;&lt;/h2>
&lt;p>Of all the matrices you will ever meet, &lt;strong>symmetric matrices&lt;/strong> are the most well-behaved. They have:&lt;/p>
&lt;ul>
&lt;li>only &lt;strong>real&lt;/strong> eigenvalues,&lt;/li>
&lt;li>a complete set of &lt;strong>orthogonal&lt;/strong> eigenvectors,&lt;/li>
&lt;li>and a &lt;strong>perfect diagonalization&lt;/strong> $A = Q\Lambda Q^T$ that costs nothing to invert.&lt;/li>
&lt;/ul>
&lt;p>This is not a curiosity. Almost every important matrix you actually compute with in physics, optimization, statistics, or machine learning is symmetric:&lt;/p>
&lt;ul>
&lt;li>A &lt;strong>covariance matrix&lt;/strong> $\Sigma = \tfrac{1}{n}X^TX$ records how features vary together. It is symmetric by construction.&lt;/li>
&lt;li>A &lt;strong>Hessian matrix&lt;/strong> $H_{ij} = \partial^2 f / \partial x_i \partial x_j$ records second derivatives. By Clairaut&amp;rsquo;s theorem, mixed partials commute, so $H$ is symmetric.&lt;/li>
&lt;li>A &lt;strong>stiffness matrix&lt;/strong> $K$ encodes how connected springs push on each other. Newton&amp;rsquo;s third law forces $K = K^T$.&lt;/li>
&lt;li>A &lt;strong>kernel&lt;/strong> or &lt;strong>Gram matrix&lt;/strong> $G_{ij} = \langle x_i, x_j \rangle$ measures pairwise similarity. Inner products are symmetric, so $G$ is too.&lt;/li>
&lt;/ul>
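&lt;p>The Gram-matrix claim is easy to verify directly (the vectors below are toy values invented for illustration):&lt;/p>

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Three arbitrary feature vectors
xs = [[1.0, 2.0, 0.0],
      [0.5, -1.0, 3.0],
      [2.0, 0.0, 1.0]]

# Gram matrix G[i][j] = dot(x_i, x_j)
G = [[dot(xi, xj) for xj in xs] for xi in xs]

# Symmetry of G is inherited from symmetry of the inner product,
# and the diagonal holds squared lengths, so it is non-negative.
assert all(G[i][j] == G[j][i] for i in range(3) for j in range(3))
print(G)
```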
&lt;p>This chapter explains why symmetry buys you so much, and how the geometry of &lt;strong>quadratic forms&lt;/strong> lets you read off the behaviour of a symmetric matrix at a glance.&lt;/p></description></item><item><title>Orthogonality and Projections -- When Vectors Mind Their Own Business</title><link>https://www.chenk.top/en/linear-algebra/07-orthogonality-and-projections/</link><pubDate>Wed, 12 Feb 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/07-orthogonality-and-projections/</guid><description>&lt;h2 id="why-orthogonality-matters">Why Orthogonality Matters&lt;/h2>
&lt;p>Two vectors are &lt;strong>orthogonal&lt;/strong> when they &amp;ldquo;do not interfere&amp;rdquo; with one another. That single idea &amp;ndash; one direction tells you nothing about the other &amp;ndash; powers GPS positioning, noise-canceling headphones, JPEG compression, recommendation systems, and most of numerical linear algebra.&lt;/p>
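&lt;p>&amp;ldquo;Do not interfere&amp;rdquo; has a concrete numerical payoff, previewed here with a toy orthonormal basis of the plane (numbers invented): each coordinate is recovered by its own dot product, untouched by the other axis.&lt;/p>

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# An orthonormal basis of the plane: the standard axes, rotated.
theta = 0.7
q1 = [math.cos(theta), math.sin(theta)]
q2 = [-math.sin(theta), math.cos(theta)]
assert math.isclose(dot(q1, q2), 0.0, abs_tol=1e-12)  # orthogonal: no interference

v = [3.0, -2.0]

# Coordinates come from independent dot products -- no linear system to solve --
# and they reconstruct v exactly.
c1, c2 = dot(v, q1), dot(v, q2)
recon = [c1 * q1[i] + c2 * q2[i] for i in range(2)]
assert all(math.isclose(recon[i], v[i], abs_tol=1e-12) for i in range(2))
print(c1, c2)
```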
&lt;p>Orthogonality is the single biggest computational shortcut in linear algebra. With a generic basis, finding coordinates is solving a linear system. With an &lt;strong>orthogonal&lt;/strong> basis, finding coordinates is one dot product per axis. Hard problem, easy problem, same problem &amp;ndash; just a better basis.&lt;/p></description></item><item><title>Eigenvalues and Eigenvectors</title><link>https://www.chenk.top/en/linear-algebra/06-eigenvalues-and-eigenvectors/</link><pubDate>Wed, 05 Feb 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/06-eigenvalues-and-eigenvectors/</guid><description>&lt;h2 id="the-big-question">The Big Question&lt;/h2>
&lt;p>Apply a matrix to a vector and almost anything can happen. Most vectors get rotated &lt;em>and&lt;/em> stretched, landing in a brand new direction. But scattered among them are a few special vectors that refuse to leave their span. They come out of the transformation pointing exactly the way they went in &amp;ndash; only longer, shorter, or flipped.&lt;/p>
&lt;p>These survivors are &lt;strong>eigenvectors&lt;/strong>. The factor by which they get scaled is the &lt;strong>eigenvalue&lt;/strong>.&lt;/p></description></item><item><title>Linear Systems and Column Space</title><link>https://www.chenk.top/en/linear-algebra/05-linear-systems-and-column-space/</link><pubDate>Wed, 29 Jan 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/05-linear-systems-and-column-space/</guid><description>&lt;h2 id="the-central-question">The Central Question&lt;/h2>
&lt;p>Almost everything in applied mathematics eventually lands on the same question:&lt;/p>
&lt;blockquote>
&lt;p>Given a matrix $A$ and a vector $\vec{b}$, does the equation $A\vec{x} = \vec{b}$ have a solution? If so, how many?&lt;/p>
&lt;/blockquote>
&lt;p>The mechanical answer is &amp;ldquo;row-reduce and look.&amp;rdquo; The &lt;em>structural&lt;/em> answer is far more interesting &amp;ndash; and it is the goal of this chapter. Three geometric objects tell you everything:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Column space&lt;/strong> $C(A)$ &amp;ndash; the set of vectors $A$ can reach. It decides &lt;strong>whether&lt;/strong> a solution exists.&lt;/li>
&lt;li>&lt;strong>Null space&lt;/strong> $N(A)$ &amp;ndash; the set of vectors $A$ crushes to zero. It decides &lt;strong>how many&lt;/strong> solutions exist.&lt;/li>
&lt;li>&lt;strong>Rank&lt;/strong> $r$ &amp;ndash; the dimension of the column space. It quantifies how much information $A$ preserves.&lt;/li>
&lt;/ul>
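&lt;p>The existence test can be made concrete: $A\vec{x} = \vec{b}$ is solvable exactly when appending $\vec{b}$ to $A$ does not raise the rank. A small sketch using textbook Gaussian elimination (the matrices are invented examples):&lt;/p>

```python
def rank(M, eps=1e-9):
    # Rank via Gaussian elimination (fine for small dense matrices).
    M = [row[:] for row in M]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if abs(M[i][col]) > eps), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and abs(M[i][col]) > eps:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

# A is rank-deficient: column 2 is twice column 1.
A = [[1.0, 2.0], [2.0, 4.0]]
b_in = [3.0, 6.0]   # lies in C(A): solvable (infinitely many solutions)
b_out = [3.0, 5.0]  # outside C(A): no solution

aug = lambda b: [row + [bi] for row, bi in zip(A, b)]
print(rank(A), rank(aug(b_in)), rank(aug(b_out)))  # 1 1 2
```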
&lt;p>Once these three are clear, every linear-systems result &amp;ndash; existence, uniqueness, least squares, the four fundamental subspaces &amp;ndash; becomes the same story told from different angles.&lt;/p></description></item><item><title>The Secrets of Determinants</title><link>https://www.chenk.top/en/linear-algebra/04-the-secrets-of-determinants/</link><pubDate>Wed, 22 Jan 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/04-the-secrets-of-determinants/</guid><description>&lt;h2 id="beyond-the-formula">Beyond the Formula&lt;/h2>
&lt;p>In most classrooms, determinants are introduced as a formula to memorize:&lt;/p>
$$\det\begin{pmatrix}a &amp; b\\ c &amp; d\end{pmatrix} = ad - bc$$&lt;p>You plug in numbers, compute, and move on. That misses the point entirely.&lt;/p>
&lt;p>Here is the real meaning, in one sentence:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>The determinant of $A$ is the factor by which $A$ scales area (in 2D) or volume (in 3D).&lt;/strong>&lt;/p>
&lt;/blockquote>
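&lt;p>That one sentence can be checked numerically (the matrix below is an arbitrary example): transform the corners of the unit square and compare the image&amp;rsquo;s area to the determinant.&lt;/p>

```python
import math

def det2(a, b, c, d):
    return a * d - b * c

def shoelace(pts):
    # Signed polygon area; the sign records orientation.
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return s / 2.0

# An arbitrary 2x2 matrix, applied to the unit square's corners.
a, b, c, d = 3.0, 1.0, 1.0, 2.0
apply = lambda x, y: (a * x + b * y, c * x + d * y)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
image = [apply(x, y) for x, y in square]

# The image parallelogram's signed area equals det(A) times the
# unit square's area (which is 1).
assert math.isclose(shoelace(image), det2(a, b, c, d))
print(det2(a, b, c, d))  # 5.0
```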
&lt;p>Once you internalize this, every property of determinants stops being a rule to memorize and starts being something you can &lt;em>see&lt;/em>. The product rule $\det(AB) = \det(A)\det(B)$ becomes obvious &amp;ndash; two scalings compose multiplicatively. $\det(A) = 0$ means space gets crushed flat. $\det(A^{-1}) = 1/\det(A)$ says the inverse must undo the scaling. The sign of the determinant tells you whether orientation was preserved or flipped.&lt;/p></description></item><item><title>Matrices as Linear Transformations</title><link>https://www.chenk.top/en/linear-algebra/03-matrices-as-linear-transformations/</link><pubDate>Wed, 15 Jan 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/03-matrices-as-linear-transformations/</guid><description>&lt;h2 id="the-big-idea">The Big Idea&lt;/h2>
&lt;p>Open a traditional textbook and matrices show up as &amp;ldquo;rectangular arrays of numbers.&amp;rdquo; You learn rules for adding and multiplying them, but no one explains &lt;em>why&lt;/em> the multiplication rule looks the way it does, or why $AB \neq BA$ in general.&lt;/p>
&lt;p>Here is the secret the symbol-pushing version hides: &lt;strong>a matrix is a function that transforms space.&lt;/strong> Every $m \times n$ matrix is a machine that eats an $n$-dimensional vector and spits out an $m$-dimensional one. Once you can &lt;em>see&lt;/em> that, the strange rules stop being strange. They are simply the bookkeeping for what happens to the basis vectors.&lt;/p></description></item><item><title>Linear Combinations and Vector Spaces</title><link>https://www.chenk.top/en/linear-algebra/02-linear-combinations-and-vector-spaces/</link><pubDate>Wed, 08 Jan 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/02-linear-combinations-and-vector-spaces/</guid><description>&lt;h2 id="why-this-chapter-matters">Why This Chapter Matters&lt;/h2>
&lt;p>Open a box of crayons that contains only &lt;strong>red, green, and blue&lt;/strong>. How many colors can you draw? The honest answer is &lt;strong>infinitely many&lt;/strong> — every shade you have ever seen on a screen is just a different mix of those three. Three &amp;ldquo;ingredients&amp;rdquo; produce an entire universe.&lt;/p>
&lt;p>That recipe — &lt;em>take a few vectors, scale them, add them up&lt;/em> — is called a &lt;strong>linear combination&lt;/strong>. The whole of linear algebra is built on this one move. Once you understand it deeply, you also understand:&lt;/p></description></item><item><title>The Essence of Vectors -- More Than Just Arrows</title><link>https://www.chenk.top/en/linear-algebra/01-the-essence-of-vectors/</link><pubDate>Wed, 01 Jan 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/01-the-essence-of-vectors/</guid><description>&lt;h2 id="why-vectors-and-why-care">Why Vectors, and Why Care?&lt;/h2>
&lt;p>A physicist talks about a &lt;em>force&lt;/em>. A data scientist talks about a &lt;em>feature&lt;/em>. A game programmer talks about a &lt;em>velocity&lt;/em>. A quantum theorist talks about a &lt;em>state&lt;/em>. Different worlds, different languages &amp;ndash; but the same underlying object: &lt;strong>a vector&lt;/strong>.&lt;/p>
&lt;p>That is not a coincidence. A vector is the smallest piece of mathematics flexible enough to describe &lt;strong>anything you can add together and scale&lt;/strong>. Once you spot that pattern, you spot it everywhere.&lt;/p></description></item><item><title>Time Series Forecasting (8): Informer -- Efficient Long-Sequence Forecasting</title><link>https://www.chenk.top/en/time-series/informer-long-sequence/</link><pubDate>Sun, 15 Dec 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/informer-long-sequence/</guid><description>&lt;p>The Transformer is wonderful at sequence modeling &amp;ndash; right up to the moment your sequence gets long. Vanilla self-attention costs $\mathcal{O}(L^2)$ in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.&lt;/p>
&lt;p>&lt;strong>Informer&lt;/strong> (Zhou et al., AAAI 2021 best paper) is the architecture that finally made Transformers practical for these settings. It does three things, each of which would be a contribution on its own:&lt;/p></description></item><item><title>Vim Essentials: Modal Editing, Motions, and a Repeatable Workflow</title><link>https://www.chenk.top/en/standalone/vim-essentials/</link><pubDate>Fri, 06 Dec 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/vim-essentials/</guid><description>&lt;p>Most people quit Vim because they try to memorize shortcuts. That is the wrong frame. Vim is a &lt;em>small language&lt;/em>: learn the grammar &amp;ndash; &lt;strong>operator + motion&lt;/strong> &amp;ndash; and you can express any edit without ever opening a cheat sheet again. This guide walks you through the 80% of Vim you will use daily, then shows how the remaining 20% composes naturally from the same handful of rules.&lt;/p>
&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>The single core idea: &lt;strong>modes&lt;/strong> plus &lt;strong>composable operations&lt;/strong> (operator + motion)&lt;/li>
&lt;li>The handful of motions, text objects, and operators that cover almost everything&lt;/li>
&lt;li>File operations, search &amp;amp; replace, macros, marks, registers&lt;/li>
&lt;li>Buffers vs windows vs tabs &amp;ndash; the mental model people most often get wrong&lt;/li>
&lt;li>A minimal &lt;code>.vimrc&lt;/code> and a one-week deliberate-practice plan to build muscle memory&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Any terminal (Vim ships with virtually every Unix-like system)&lt;/li>
&lt;li>A willingness to feel slow for about a week&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-the-core-idea----modes-plus-a-tiny-grammar">1. The core idea &amp;ndash; modes plus a tiny grammar&lt;/h2>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/standalone/vim-essentials/fig1_mode_state_diagram.png" alt="The Four Modes of Vim" loading="lazy" decoding="async">
 
&lt;/figure>
&lt;/p></description></item><item><title>Time Series Forecasting (7): N-BEATS -- Interpretable Deep Architecture</title><link>https://www.chenk.top/en/time-series/n-beats/</link><pubDate>Sat, 30 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/n-beats/</guid><description>&lt;p>The 2018 M4 forecasting competition served 100,000 series across six frequencies as a single benchmark. The leaderboard was dominated by hand-tuned ensembles built from decades of statistical-forecasting craft. Then a &lt;strong>pure neural network&lt;/strong> with no statistical preprocessing, no feature engineering, and no recurrence won outright. That network was &lt;strong>N-BEATS&lt;/strong> by Oreshkin et al. &amp;ndash; a stack of fully-connected blocks with two residual paths. Its interpretable variant additionally split the forecast into a polynomial trend and a Fourier seasonality, so the very thing classical statisticians wanted (a readable decomposition) came for free.&lt;/p></description></item><item><title>Time Series Forecasting (6): Temporal Convolutional Networks (TCN)</title><link>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</link><pubDate>Fri, 15 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</guid><description>&lt;p>For most of the 2010s, anyone who said &amp;ldquo;deep learning for time series&amp;rdquo; meant LSTM. The story changed in 2018 when Bai, Kolter, and Koltun published &lt;em>An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling&lt;/em>. Their result was annoyingly simple: take a stack of 1-D convolutions, make them causal (no peeking at the future), space the filter taps out exponentially (dilation), wrap the whole thing in residual connections, and train. 
On task after task, the resulting &lt;strong>Temporal Convolutional Network&lt;/strong> (TCN) matched or beat LSTM/GRU &amp;ndash; while training several times faster because every time step in the forward pass runs in parallel.&lt;/p></description></item><item><title>Time Series Forecasting (5): Transformer Architecture for Time Series</title><link>https://www.chenk.top/en/time-series/transformer/</link><pubDate>Thu, 31 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/transformer/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The full encoder-decoder Transformer, redrawn for time series&lt;/li>
&lt;li>Why position must be injected, and how sinusoidal / learned / time-aware encodings differ&lt;/li>
&lt;li>What multi-head attention actually learns over a temporal sequence&lt;/li>
&lt;li>Where vanilla attention breaks down ($O(n^2)$) and the four families of fixes: sparse, linear, patched, decoder-only&lt;/li>
&lt;li>A clean PyTorch reference implementation, plus when to reach for Autoformer / FEDformer / Informer / PatchTST&lt;/li>
&lt;/ul>
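&lt;p>As a concrete taste of the positional-encoding material, here is a minimal NumPy sketch of the sinusoidal variant (assuming an even &lt;code>d_model&lt;/code>; the function name is mine, not the article&amp;rsquo;s):&lt;/p>

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    # Each position gets sin/cos waves at geometrically spaced frequencies;
    # even dims carry the sines, odd dims the cosines. Assumes d_model is even.
    positions = np.arange(n_positions)[:, None]       # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(50, 16)   # one encoding row per time step
```

&lt;p>Because the frequencies are geometrically spaced, shifting the position by a fixed offset acts as a linear transform on the encoding &amp;ndash; the property that lets attention reason about relative lags.&lt;/p>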
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Self-attention and multi-head attention (Part 4)&lt;/li>
&lt;li>Encoder-decoder architectures and teacher forcing&lt;/li>
&lt;li>PyTorch fundamentals (&lt;code>nn.Module&lt;/code>, training loops)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-why-transformers-for-time-series">1. Why Transformers for Time Series&lt;/h2>
&lt;p>LSTM and GRU process a sequence step by step. Three things follow from
that:&lt;/p></description></item><item><title>Time Series Forecasting (4): Attention Mechanisms -- Direct Long-Range Dependencies</title><link>https://www.chenk.top/en/time-series/attention-mechanism/</link><pubDate>Wed, 16 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/attention-mechanism/</guid><description>&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>Why recurrent models hit a wall on long-range dependencies, and how attention removes it.&lt;/li>
&lt;li>The Query / Key / Value mechanism, scaled dot-product attention, and the role of $1/\sqrt{d_k}$.&lt;/li>
&lt;li>Two classic scoring functions &amp;ndash; &lt;strong>Bahdanau&lt;/strong> (additive) and &lt;strong>Luong&lt;/strong> (multiplicative).&lt;/li>
&lt;li>How to wire &lt;strong>attention into an LSTM encoder/decoder&lt;/strong> for time series.&lt;/li>
&lt;li>&lt;strong>Multi-head attention&lt;/strong> specialised for time &amp;ndash; different heads for recency, period, anomaly.&lt;/li>
&lt;li>The $O(n^2)$ memory wall and how sparse / linear attention bypass it.&lt;/li>
&lt;li>A worked &lt;strong>stock-prediction case&lt;/strong> with attention-weight overlays.&lt;/li>
&lt;/ul>
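&lt;p>The Query / Key / Value mechanism and the $1/\sqrt{d_k}$ factor fit in a few lines; a single-head NumPy sketch (illustrative, not the article&amp;rsquo;s code):&lt;/p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Scores are query-key dot products, shrunk by 1/sqrt(d_k) so the softmax
    # is not pushed into its saturated, vanishing-gradient regime.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))      # 6 time steps, d_k = 8
out, w = scaled_dot_product_attention(Q, K, V)
```

&lt;p>Each row of &lt;code>w&lt;/code> is a probability distribution over time steps &lt;ndash;&gt; exactly the attention-weight overlays plotted in the stock-prediction case.&lt;/p>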
&lt;p>&lt;strong>Prerequisites&lt;/strong>: RNN/LSTM/GRU intuition (Parts 2-3), basic linear algebra, PyTorch.&lt;/p></description></item><item><title>MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation</title><link>https://www.chenk.top/en/standalone/moslora/</link><pubDate>Sat, 12 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/moslora/</guid><description>&lt;p>LoRA is the default tool for adapting a frozen base model: cheap, stable, mergeable, and good enough for most single-task settings. But the moment your fine-tuning data is genuinely heterogeneous &amp;ndash; code mixed with math, instruction following mixed with creative writing, several domains in one adapter &amp;ndash; a single low-rank subspace starts to feel cramped. You can grow $r$, but cost grows with it and you still get &lt;em>one&lt;/em> subspace, just a fatter one.&lt;/p></description></item><item><title>Tennis-Scene Computer Vision: From Paper Survey to Production</title><link>https://www.chenk.top/en/standalone/tennis-cv-system-design/</link><pubDate>Mon, 07 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/tennis-cv-system-design/</guid><description>&lt;p>A 6.7 cm tennis ball travels at over 200 km/h. Reconstructing its 3D trajectory from eight 4K cameras in real time, while simultaneously classifying what stroke each player is making, is a system problem that touches &lt;strong>small-object detection, multi-view geometry, Kalman filtering, physics modelling, and human-pose estimation&lt;/strong> — all at once. 
This post walks the same path you&amp;rsquo;d walk at deployment time: state the constraints, survey the literature, choose, then build, and finally lay out a millisecond-by-millisecond budget for what runs in production.&lt;/p></description></item><item><title>Time Series Forecasting (3): GRU -- Lightweight Gates and Efficiency Trade-offs</title><link>https://www.chenk.top/en/time-series/gru/</link><pubDate>Tue, 01 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/gru/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>How GRU&amp;rsquo;s &lt;strong>update gate&lt;/strong> $z_t$ and &lt;strong>reset gate&lt;/strong> $r_t$ achieve LSTM-quality memory with one fewer gate and one fewer state.&lt;/li>
&lt;li>Why GRU has exactly &lt;strong>25% fewer parameters&lt;/strong> than LSTM, and what that buys you in practice.&lt;/li>
&lt;li>How to read GRU &lt;strong>gate activations&lt;/strong> to debug what the model is paying attention to.&lt;/li>
&lt;li>A practical &lt;strong>decision matrix&lt;/strong> for picking GRU vs LSTM, backed by parameter, speed, and forecast-quality benchmarks.&lt;/li>
&lt;li>A clean PyTorch reference implementation with the regularisation and stability tricks that actually matter.&lt;/li>
&lt;/ul>
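&lt;p>The 25% figure is pure gate counting &amp;ndash; a quick sketch (one bias per gate, as in the classic formulation; PyTorch keeps two bias vectors per gate, which leaves the ratio unchanged):&lt;/p>

```python
def rnn_params(n_gates, input_size, hidden_size):
    # Each gate (or candidate) owns one input-to-hidden matrix,
    # one hidden-to-hidden matrix, and one bias vector.
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return n_gates * per_gate

lstm = rnn_params(4, 64, 128)   # forget, input, output gates + cell candidate
gru  = rnn_params(3, 64, 128)   # update, reset gates + candidate state
print(gru / lstm)               # 0.75, regardless of sizes
```

&lt;p>The per-gate cost is identical, so the ratio is exactly $3/4$ for every input and hidden size.&lt;/p>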
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Comfort with the LSTM gates from &lt;a href="https://www.chenk.top/en/time-series-lstm/">Part 2&lt;/a>.&lt;/li>
&lt;li>Basic PyTorch (&lt;code>nn.Module&lt;/code>, autograd, optimizers).&lt;/li>
&lt;li>Recall that gradient flow through tanh nonlinearities is what kills vanilla RNNs.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/time-series/gru/fig1_gru_cell_architecture.png" alt="GRU cell with reset and update gates and the (1 - z) gradient highway from h_{t-1} to h_t." loading="lazy" decoding="async">
 
&lt;/figure>

&lt;em>Figure 1. The GRU cell. Two gates (&lt;code>r&lt;/code>, &lt;code>z&lt;/code>) and one state (&lt;code>h&lt;/code>) replace LSTM&amp;rsquo;s three gates and separate cell state. The orange &lt;code>(1 - z) ⊙ h_{t-1}&lt;/code> skip path is the linear gradient highway that makes long-range learning tractable.&lt;/em>&lt;/p></description></item><item><title>Time Series Forecasting (2): LSTM -- Gate Mechanisms and Long-Term Dependencies</title><link>https://www.chenk.top/en/time-series/lstm/</link><pubDate>Mon, 16 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/lstm/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why vanilla RNNs fail on long sequences and how LSTM fixes the gradient problem&lt;/li>
&lt;li>The intuition behind each gate (forget, input, output) and the cell-state &amp;ldquo;highway&amp;rdquo;&lt;/li>
&lt;li>How to structure inputs/outputs for one-step and multi-step time series forecasting&lt;/li>
&lt;li>Practical recipes: regularization, sequence length, bidirectional vs stacked LSTM, when to choose LSTM vs GRU&lt;/li>
&lt;/ul>
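&lt;p>The input/output structure for one-step forecasting reduces to a sliding window; a NumPy sketch (names are illustrative, not from the article):&lt;/p>

```python
import numpy as np

def make_windows(series, lookback):
    # One-step-ahead supervision: each input is `lookback` consecutive values,
    # the target is the value immediately after the window.
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., None], y   # (n, lookback, 1) for an LSTM; targets (n,)

series = np.arange(10.0)
X, y = make_windows(series, lookback=3)
```

&lt;p>Multi-step forecasting reuses the same construction with a vector of future values as the target, or feeds one-step predictions back in autoregressively.&lt;/p>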
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Basic understanding of neural networks (forward pass, backpropagation)&lt;/li>
&lt;li>Familiarity with PyTorch (&lt;code>nn.Module&lt;/code>, tensors, optimizers)&lt;/li>
&lt;li>Part 1 of this series (helpful but not required)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-the-problem-lstm-solves">1. The Problem LSTM Solves&lt;/h2>
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b).$$$$\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \mathrm{diag}\!\left(1 - h_t^2\right) W_h.$$&lt;p>Two regimes appear:&lt;/p></description></item><item><title>Time Series Forecasting (1): Traditional Statistical Models</title><link>https://www.chenk.top/en/time-series/01-traditional-models/</link><pubDate>Sun, 01 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/01-traditional-models/</guid><description>&lt;blockquote>
&lt;p>&lt;a href="https://www.chenk.top/en/time-series-lstm/">Next: LSTM Deep Dive &amp;ndash;&amp;gt;&lt;/a>
&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why &lt;strong>stationarity&lt;/strong> is the entry ticket for the whole ARIMA family, and how differencing buys it.&lt;/li>
&lt;li>How to read &lt;strong>ACF and PACF&lt;/strong> plots like a Box-Jenkins practitioner: cut-off vs. tail-off as the rule for identifying $p$ and $q$.&lt;/li>
&lt;li>The full &lt;strong>ARIMA / SARIMA&lt;/strong> machinery, including how seasonality is folded in via lag-$s$ operators.&lt;/li>
&lt;li>Where &lt;strong>VAR, GARCH, exponential smoothing, Prophet and the Kalman filter&lt;/strong> sit on the same map &amp;ndash; mean dynamics vs. variance dynamics vs. state-space recursion.&lt;/li>
&lt;li>A decision rule for when a traditional model is the right answer and when to graduate to the deep models in the rest of this series.&lt;/li>
&lt;/ul>
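&lt;p>How differencing buys stationarity shows up directly in the ACF; a NumPy sketch on a synthetic trend (illustrative, not the article&amp;rsquo;s code):&lt;/p>

```python
import numpy as np

def acf(x, max_lag):
    # Sample autocorrelation: lag-k covariance over the lag-0 variance.
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
noisy = np.arange(200.0) + rng.normal(0.0, 1.0, 200)  # linear trend + noise
r_raw = acf(noisy, 10)            # decays very slowly: the trend dominates
r_diff = acf(np.diff(noisy), 10)  # after one difference: short-memory noise
```

&lt;p>The raw series shows the classic slow ACF tail of a non-stationary process; after first differencing only a short-lag correlation survives (here near $-0.5$ at lag 1, the MA(1) signature of differenced white noise).&lt;/p>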
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Basic probability and statistics (mean, variance, covariance, correlation).&lt;/li>
&lt;li>Familiarity with NumPy and &lt;code>pandas&lt;/code> time indexes.&lt;/li>
&lt;li>A little linear algebra for the VAR / Kalman sections (matrix multiplication, eigenvalues).&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-why-traditional-models-still-matter">1. Why traditional models still matter&lt;/h2>
&lt;p>Before the deep-learning era, the time-series toolbox was already remarkably complete. ARIMA captures linear autocorrelation, SARIMA adds calendar effects, VAR generalises to vectors, GARCH models the variance, and the Kalman filter unifies the lot inside a state-space recursion. They share three properties that deep models do not give for free:&lt;/p></description></item><item><title>PDE and Machine Learning (8): Reaction-Diffusion Systems and Graph Neural Networks</title><link>https://www.chenk.top/en/pde-ml/08-reaction-diffusion-systems/</link><pubDate>Wed, 14 Aug 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/08-reaction-diffusion-systems/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Stack 32 layers of GCN on a citation graph and accuracy collapses from 81 % to 20 %. Every node converges to the same vector. This is &lt;strong>over-smoothing&lt;/strong>, the GNN equivalent of heat death — and the diagnosis comes straight from PDE theory. &lt;strong>A GCN layer is one explicit-Euler step of the heat equation on a graph&lt;/strong>, and the heat equation has exactly one fixed point: the constant. The cure was published in 1952. Alan Turing showed that adding a &lt;em>reaction&lt;/em> term to a diffusion equation can make a uniform state spontaneously break apart into stripes, spots, or labyrinths. The same trick — a learned reaction term — keeps deep GNNs alive.&lt;/p></description></item><item><title>PDE and Machine Learning (7): Diffusion Models and Score Matching</title><link>https://www.chenk.top/en/pde-ml/07-diffusion-models/</link><pubDate>Tue, 30 Jul 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/07-diffusion-models/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Since 2020, &lt;strong>diffusion models&lt;/strong> have become the dominant paradigm in generative AI. From DALL·E 2 to Stable Diffusion to Sora, their generation quality and training stability are unmatched by GANs and VAEs. Beneath this success lies a remarkably clean mathematical structure: &lt;strong>diffusion models are numerical solvers for partial differential equations&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>Adding Gaussian noise corresponds to integrating the &lt;strong>Fokker–Planck equation&lt;/strong> forward in time.&lt;/li>
&lt;li>Learning to denoise is equivalent to learning the &lt;strong>score function&lt;/strong> $\nabla\log p_t$.&lt;/li>
&lt;li>DDPM is a discretised &lt;strong>reverse SDE&lt;/strong>; DDIM is the corresponding &lt;strong>probability-flow ODE&lt;/strong>.&lt;/li>
&lt;li>Stable Diffusion is the same machinery, executed in a low-dimensional latent space.&lt;/li>
&lt;/ul>
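&lt;p>The forward corruption has a closed form; a NumPy sketch under the standard DDPM linear noise schedule (the schedule is an assumption here, not something this excerpt fixes):&lt;/p>

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    # Closed form of the forward diffusion: x_t is Gaussian with mean
    # sqrt(alpha_bar_t) * x0 and variance 1 - alpha_bar_t.
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, alpha_bar

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # DDPM's linear schedule
x0 = np.ones(10000)                     # a toy "data" batch
xT, ab = forward_noise(x0, 999, betas, rng)
```

&lt;p>At the final step &lt;code>alpha_bar&lt;/code> is nearly zero, so $x_T$ is indistinguishable from a standard Gaussian &amp;ndash; the distribution the reverse SDE starts from.&lt;/p>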
&lt;p>&lt;strong>What you will learn&lt;/strong>&lt;/p></description></item><item><title>PDE and Machine Learning (6): Continuous Normalizing Flows and Neural ODE</title><link>https://www.chenk.top/en/pde-ml/06-continuous-normalizing-flows/</link><pubDate>Mon, 15 Jul 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/06-continuous-normalizing-flows/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Generative modeling reduces to one geometric question: &lt;strong>how do you transform a simple distribution (a Gaussian) into a complex one (faces, molecules, motion)?&lt;/strong> Discrete normalizing flows stack invertible blocks, but each block needs a Jacobian determinant at $O(d^3)$ cost. &lt;strong>Neural ODEs&lt;/strong> replace discrete depth with a continuous ODE; &lt;strong>Continuous Normalizing Flows (CNF)&lt;/strong> then push densities through that ODE using the &lt;em>instantaneous&lt;/em> change-of-variables formula, dropping density computation to $O(d)$. &lt;strong>Flow Matching&lt;/strong> removes the divergence integral altogether and turns training into plain regression on a target velocity field.&lt;/p></description></item><item><title>PDE and Machine Learning (5): Symplectic Geometry and Structure-Preserving Networks</title><link>https://www.chenk.top/en/pde-ml/05-symplectic-geometry/</link><pubDate>Sun, 30 Jun 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/05-symplectic-geometry/</guid><description>&lt;h2 id="what-this-article-covers">What this article covers&lt;/h2>
&lt;p>Train an unconstrained neural network on pendulum data and ask it to extrapolate. After a few seconds of integration the prediction is fine; after a minute the pendulum has either crept to a halt or, more often, accelerated to escape velocity. Energy was supposed to be conserved, but the network has no idea what energy is. The bug is not in the data, the optimizer, or the depth of the network. &lt;strong>The bug is in the architecture.&lt;/strong> A standard MLP can represent any vector field, including unphysical ones, and a tiny systematic bias in that vector field is amplified into macroscopic energy drift over a long rollout.&lt;/p></description></item><item><title>PDE and Machine Learning (4): Variational Inference and the Fokker-Planck Equation</title><link>https://www.chenk.top/en/pde-ml/04-variational-inference/</link><pubDate>Sat, 15 Jun 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/04-variational-inference/</guid><description>&lt;h2 id="seven-dimensions-of-this-article">Seven Dimensions of This Article&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Motivation&lt;/strong>: why VI and MCMC look different but solve the same PDE.&lt;/li>
&lt;li>&lt;strong>Theory&lt;/strong>: derivation of the Fokker-Planck equation from the SDE.&lt;/li>
&lt;li>&lt;strong>Geometry&lt;/strong>: KL divergence as a Wasserstein gradient flow.&lt;/li>
&lt;li>&lt;strong>Algorithms&lt;/strong>: Langevin Monte Carlo, mean-field VI, and SVGD.&lt;/li>
&lt;li>&lt;strong>Convergence&lt;/strong>: log-Sobolev inequality and exponential KL decay.&lt;/li>
&lt;li>&lt;strong>Numerical experiments&lt;/strong>: 7 figures with reproducible code.&lt;/li>
&lt;li>&lt;strong>Application&lt;/strong>: Bayesian neural networks via posterior sampling.&lt;/li>
&lt;/ol>
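&lt;p>The Langevin Monte Carlo of item 4 is a one-liner per step; a NumPy sketch on a standard Gaussian target (the target is my choice for illustration):&lt;/p>

```python
import numpy as np

def langevin_sample(grad_log_p, n_chains, n_steps, step, rng):
    # Unadjusted Langevin: drift along the score plus injected Gaussian noise.
    # As the step size shrinks, the stationary density approaches the target.
    x = np.full(n_chains, 3.0)   # deliberately bad initialisation
    for _ in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * rng.normal(size=n_chains)
    return x

rng = np.random.default_rng(0)
# Target: standard Gaussian, so the score is simply -x.
samples = langevin_sample(lambda x: -x, n_chains=5000, n_steps=500, step=0.01, rng=rng)
```

&lt;p>The discretisation introduces a small $O(\text{step})$ bias in the stationary variance &amp;ndash; exactly the error term the continuous-time Fokker-Planck view makes precise.&lt;/p>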
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>How the Fokker-Planck equation governs probability density evolution from any Itô SDE.&lt;/li>
&lt;li>Langevin dynamics as a practical sampling algorithm and its discretization error.&lt;/li>
&lt;li>Why minimizing $\mathrm{KL}(q\|p^\star)$ in Wasserstein space &lt;em>is&lt;/em> the Fokker-Planck PDE.&lt;/li>
&lt;li>The deep equivalence between variational inference and Langevin MCMC in continuous time.&lt;/li>
&lt;li>Stein Variational Gradient Descent (SVGD): a deterministic particle method that bridges both worlds.&lt;/li>
&lt;li>Practical posterior inference for Bayesian neural networks.&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Probability theory (Bayes&amp;rsquo; rule, KL divergence, expectations).&lt;/li>
&lt;li>Wasserstein gradient flows from Part 3.&lt;/li>
&lt;li>Light stochastic calculus intuition (Brownian motion, Itô integral).&lt;/li>
&lt;li>Python / PyTorch for the experiments.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-the-inference-problem">1. The Inference Problem&lt;/h2>
&lt;p>Bayesian inference asks for the posterior&lt;/p></description></item><item><title>PDE and Machine Learning (3): Variational Principles and Optimization</title><link>https://www.chenk.top/en/pde-ml/03-variational-principles/</link><pubDate>Fri, 31 May 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/03-variational-principles/</guid><description>&lt;p>What is the essence of neural-network training? When we run gradient descent in a high-dimensional parameter space, is there a deeper continuous-time dynamics at work? As the network width tends to infinity, does discrete parameter updating converge to some elegant partial differential equation? The answers live at the intersection of the calculus of variations, optimal transport, and PDE theory.&lt;/p>
&lt;p>The last decade of deep-learning success has rested mostly on engineering intuition. Recently, however, mathematicians have made a striking observation: &lt;strong>viewing a neural network as a particle system on the space of probability measures&lt;/strong>, and studying its evolution under Wasserstein geometry, exposes the global structure of training — convergence guarantees, the role of over-parameterization, the meaning of initialization. The tool that makes this visible is &lt;strong>the variational principle&lt;/strong> — from least action in physics, through the JKO scheme of modern optimal transport, to the mean-field limit of neural networks.&lt;/p></description></item><item><title>PDE and Machine Learning (2) — Neural Operator Theory</title><link>https://www.chenk.top/en/pde-ml/02-neural-operator-theory/</link><pubDate>Thu, 16 May 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/02-neural-operator-theory/</guid><description>&lt;p>A classical PDE solver — finite difference, finite element, spectral — is a function: feed it one initial condition and one set of coefficients, get back one solution. A PINN is the same kind of object dressed in neural-network clothes: each new initial condition demands a fresh round of training. Switch the inflow velocity on a wing or move a single sensor reading in a forecast and you reset the clock.&lt;/p></description></item><item><title>PDE and Machine Learning (1): Physics-Informed Neural Networks</title><link>https://www.chenk.top/en/pde-ml/01-physics-informed-neural-networks/</link><pubDate>Wed, 01 May 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/01-physics-informed-neural-networks/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Series chapter 1 — about a 35-minute read.&lt;/strong> This is the foundation of the entire series. Neural operators, variational principles, score matching — every later chapter is, at heart, &lt;em>the same idea&lt;/em>: how do we encode physical or mathematical constraints directly into the optimisation objective of a neural network? Get PINNs right and the rest is &amp;ldquo;swap one constraint for another&amp;rdquo;.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="1-prologue-a-metal-rod">1 Prologue: a metal rod&lt;/h2>
&lt;p>Suppose you want the temperature distribution $u(x,t)$ along a metal rod. Half a century of numerical analysis offers two standard answers:&lt;/p></description></item><item><title>Ordinary Differential Equations (18): Frontiers and Series Finale</title><link>https://www.chenk.top/en/ode/18-advanced-topics-summary/</link><pubDate>Mon, 15 Apr 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/18-advanced-topics-summary/</guid><description>&lt;p>&lt;strong>The journey ends here.&lt;/strong> Eighteen chapters ago we picked up a falling apple. Today we&amp;rsquo;re going to finish in the same vein in which we began &amp;ndash; by treating ODEs as the &lt;em>universal language of change&lt;/em> &amp;ndash; but standing on a much taller mountain.&lt;/p>
&lt;p>This chapter does three things. First, it surveys four active research frontiers that are reshaping how we &lt;em>model&lt;/em> dynamical systems: Neural ODEs, delay equations, stochastic differential equations, and fractional calculus. Second, it reviews the entire series with a problem-solving flowchart and a chapter-by-chapter map. Third, it draws explicit connections from the classical theory you have just mastered to modern machine learning &amp;ndash; the place where ODEs are most alive in 2025.&lt;/p></description></item><item><title>Ordinary Differential Equations (17): Physics and Engineering Applications</title><link>https://www.chenk.top/en/ode/17-physics-engineering-applications/</link><pubDate>Fri, 29 Mar 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/17-physics-engineering-applications/</guid><description>&lt;p>&lt;strong>Differential equations are not a pure mathematical game &amp;ndash; they are the language for understanding the physical world.&lt;/strong> From celestial motion to circuit response, from a swinging pendulum to vortex shedding behind a bridge cable, every dynamical system &amp;ldquo;speaks&amp;rdquo; ODE.&lt;/p>
&lt;p>This chapter is a deliberate tour through five canonical applications. Each one will pay back the entire ODE toolkit we built in chapters 1-16: phase planes, eigenvalues, Laplace transforms, modal analysis, conservation laws, numerical integration, control. None of the examples is a &amp;ldquo;toy&amp;rdquo; &amp;ndash; they are all genuine working physics, written tightly so that the structure remains visible.&lt;/p></description></item><item><title>Ordinary Differential Equations (16): Fundamentals of Control Theory</title><link>https://www.chenk.top/en/ode/16-control-theory/</link><pubDate>Tue, 12 Mar 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/16-control-theory/</guid><description>&lt;p>&lt;strong>When you steer a car you constantly correct based on lane position. A thermostat compares room temperature with the setpoint and adjusts a heater. A rocket gimbal nudges its thrust vector to keep the booster vertical.&lt;/strong> Strip away the hardware and the same idea remains: &lt;em>measure, compare, act&lt;/em>. Control theory is the mathematics of that loop &amp;ndash; and its native language is the ordinary differential equation.&lt;/p>
&lt;p>This chapter shows how the entire ODE toolkit &amp;ndash; Laplace transforms (Ch 4), linear systems (Ch 6), eigenvalue stability (Ch 7), nonlinear stability (Ch 8) &amp;ndash; collapses into a single unified discipline whose job is no longer to &lt;em>describe&lt;/em> dynamics, but to &lt;em>design&lt;/em> them.&lt;/p></description></item><item><title>Ordinary Differential Equations (15): Population Dynamics</title><link>https://www.chenk.top/en/ode/15-population-dynamics/</link><pubDate>Sat, 24 Feb 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/15-population-dynamics/</guid><description>&lt;p>&lt;strong>Why do lynx and snowshoe hare populations cycle with eerie regularity over a 10-year period?&lt;/strong> Why does introducing a single new species sometimes collapse an entire ecosystem? Why do similar competitors sometimes coexist and sometimes drive each other extinct? The answers are not in the species; they are in the &lt;em>equations&lt;/em> relating the species. This chapter walks through the canonical models of mathematical ecology: from the single-population logistic and Allee models to multi-species competition, predator-prey oscillations, age structure, and spatial spread.&lt;/p></description></item><item><title>Ordinary Differential Equations (14): Epidemic Models and Epidemiology</title><link>https://www.chenk.top/en/ode/14-epidemiology/</link><pubDate>Wed, 07 Feb 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/14-epidemiology/</guid><description>&lt;p>&lt;strong>In early 2020 the entire world watched a small system of ordinary differential equations decide policy.&lt;/strong> &amp;ldquo;Flatten the curve&amp;rdquo; was not a slogan; it was the intuition of a specific equation. &lt;em>Herd immunity&lt;/em> was not a guess; it was the threshold $1 - 1/R_0$ derived in a single line. 
The SIR model &amp;ndash; four lines of math, written down in 1927 by Kermack and McKendrick &amp;ndash; turned out to be precise enough to drive trillion-dollar decisions.&lt;/p></description></item><item><title>Ordinary Differential Equations (13): Introduction to Partial Differential Equations</title><link>https://www.chenk.top/en/ode/13-pde-introduction/</link><pubDate>Sun, 21 Jan 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/13-pde-introduction/</guid><description>&lt;p>&lt;strong>Once a quantity depends on more than one variable, the ODE world splinters into a vastly richer one: partial differential equations.&lt;/strong> Heat in a metal rod is a function of position &lt;em>and&lt;/em> time; a vibrating string moves in space &lt;em>and&lt;/em> time; a steady electrostatic potential lives in three spatial dimensions. ODE techniques become tools, not solutions &amp;ndash; separation of variables turns one PDE into a &lt;em>family&lt;/em> of ODEs, the eigenvalues of those ODEs become the spectrum of the operator, and superposition stitches everything back together.&lt;/p></description></item><item><title>Ordinary Differential Equations (12): Boundary Value Problems</title><link>https://www.chenk.top/en/ode/12-boundary-value-problems/</link><pubDate>Thu, 04 Jan 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/12-boundary-value-problems/</guid><description>&lt;p>An initial value problem hands you a starting state and asks you to march forward. A boundary value problem (BVP) hands you partial information at two different points and asks you to find a path that fits both. The change is small in wording, large in consequence: BVPs can have a unique solution, no solution at all, or infinitely many. 
They demand a fundamentally different toolkit &amp;ndash; one that is iterative, global, and intimately connected to linear algebra.&lt;/p></description></item><item><title>Ordinary Differential Equations (11): Numerical Methods</title><link>https://www.chenk.top/en/ode/11-numerical-methods/</link><pubDate>Mon, 18 Dec 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/11-numerical-methods/</guid><description>&lt;p>Almost every interesting differential equation in science and engineering refuses to yield a closed-form solution. Nonlinear vector fields, variable coefficients, ten thousand coupled state variables &amp;ndash; pen and paper give up long before the problem does. Numerical integration is the way through. This chapter builds, evaluates, and compares the small set of algorithms that solve essentially every ODE you will meet, and gives you the diagnostics to know when an integrator is lying to you.&lt;/p></description></item><item><title>HCGR: Hyperbolic Contrastive Graph Representation Learning for Session-based Recommendation</title><link>https://www.chenk.top/en/standalone/hcgr/</link><pubDate>Sat, 16 Dec 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/hcgr/</guid><description>&lt;p>A user opens a sneaker app, taps &amp;ldquo;running shoes&amp;rdquo;, drills into a brand, then a price band, then a single SKU. That trajectory is a &lt;em>tree&lt;/em>: each click narrows the candidate set roughly multiplicatively. In Euclidean space you need many dimensions to keep all the leaves of that tree apart, because Euclidean volume only grows polynomially with radius. 
In hyperbolic space volume grows &lt;em>exponentially&lt;/em> with radius, so the tree fits naturally — a few dimensions are enough to keep the whole long tail untangled.&lt;/p></description></item><item><title>Ordinary Differential Equations (10): Bifurcation Theory</title><link>https://www.chenk.top/en/ode/10-bifurcation-theory/</link><pubDate>Fri, 01 Dec 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/10-bifurcation-theory/</guid><description>&lt;p>A lake stays clear for decades, then turns murky in a single season. A power grid hums along stably, then trips into a cascading blackout in seconds. A column under slowly increasing load is straight, straight, straight &amp;ndash; and then suddenly buckles.&lt;/p>
&lt;p>These are not failures of prediction. They are the universe doing exactly what dynamical systems theory says it must do: cross a &lt;strong>bifurcation&lt;/strong>. When a parameter drifts past a critical value, the topology of phase space rearranges itself, and what was once impossible becomes inevitable. This chapter is about classifying those rearrangements. There turn out to be only a handful of them, and once you see the catalogue you start spotting them everywhere.&lt;/p></description></item><item><title>ODE Chapter 9: Chaos Theory and the Lorenz System</title><link>https://www.chenk.top/en/ode/09-bifurcation-chaos/</link><pubDate>Tue, 14 Nov 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/09-bifurcation-chaos/</guid><description>&lt;p>&lt;strong>In 1961, Edward Lorenz restarted a weather simulation from a rounded-off number &amp;ndash; 0.506 instead of 0.506127.&lt;/strong> Within simulated weeks the forecast was unrecognisable. That single accident gave us &lt;strong>the butterfly effect&lt;/strong> and turned chaos from a metaphor into a science. The lesson is profound and sober: equations that are &lt;em>exactly&lt;/em> deterministic can still be &lt;em>practically&lt;/em> unpredictable.&lt;/p>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The four conditions that &lt;em>together&lt;/em> define chaos&lt;/li>
&lt;li>The Lorenz system: paradigm of deterministic chaos&lt;/li>
&lt;li>Butterfly effect, visualised on the attractor itself&lt;/li>
&lt;li>Lyapunov exponents: numerical fingerprint of chaos&lt;/li>
&lt;li>Bifurcation cascades and the period-doubling route to chaos&lt;/li>
&lt;li>Other chaotic systems: Rossler and the double pendulum&lt;/li>
&lt;li>Strange attractors, fractal dimension, stretching-and-folding&lt;/li>
&lt;li>Applications: weather, encryption, controlling chaos, ensemble forecasting&lt;/li>
&lt;/ul>
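&lt;p>The butterfly effect is easy to reproduce numerically; a forward-Euler NumPy sketch with the classic parameters $\sigma=10$, $\rho=28$, $\beta=8/3$ (a crude integrator chosen for brevity, not the chapter&amp;rsquo;s code):&lt;/p>

```python
import numpy as np

def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # One forward-Euler step of the Lorenz system; dt is kept tiny
    # because Euler is a blunt instrument on a chaotic flow.
    x, y, z = state
    return state + dt * np.array([sigma * (y - x),
                                  x * (rho - z) - y,
                                  x * y - beta * z])

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-8, 0.0, 0.0])   # a perturbation far below any measurement
for _ in range(40000):               # 40 time units at dt = 0.001
    a = lorenz_step(a, 0.001)
    b = lorenz_step(b, 0.001)
separation = float(np.linalg.norm(a - b))
```

&lt;p>A $10^{-8}$ difference in one coordinate grows to the scale of the attractor itself &amp;ndash; Lorenz&amp;rsquo;s rounded-off restart, replayed in a dozen lines.&lt;/p>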
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Chapter 8: nonlinear systems, phase portraits, limit cycles&lt;/li>
&lt;li>Chapter 7: stability and bifurcation basics&lt;/li>
&lt;li>Comfort with 3D visualization&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="what-is-chaos">What Is Chaos?&lt;/h2>
&lt;p>A chaotic system satisfies &lt;strong>all four&lt;/strong> of:&lt;/p></description></item><item><title>ODE Chapter 8: Nonlinear Systems and Phase Portraits</title><link>https://www.chenk.top/en/ode/08-nonlinear-stability/</link><pubDate>Sat, 28 Oct 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/08-nonlinear-stability/</guid><description>&lt;p>&lt;strong>The real world is nonlinear.&lt;/strong> Predator-prey cycles, heartbeat rhythms, neuron firing &amp;ndash; none of these can be captured by linear equations. When superposition fails, the world acquires &lt;em>new&lt;/em> behaviors: limit cycles, multiple equilibria, bistability, hysteresis. This chapter gives you the geometric and analytic tools to read those behaviors directly off a 2D phase portrait.&lt;/p>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why nonlinear systems are &lt;em>fundamentally&lt;/em> different from linear ones&lt;/li>
&lt;li>Lyapunov stability visualized: level sets, bowls, and basins&lt;/li>
&lt;li>Linearization vs. the full nonlinear picture (Hartman-Grobman in action)&lt;/li>
&lt;li>Lotka-Volterra predator-prey: closed orbits and conserved quantities&lt;/li>
&lt;li>Competition models: four canonical outcomes&lt;/li>
&lt;li>Van der Pol oscillator and the geometry of limit cycles&lt;/li>
&lt;li>Gradient and Hamiltonian systems&lt;/li>
&lt;li>Poincaré-Bendixson: why 2D systems cannot be chaotic&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Chapter 6: linear systems, phase portrait classification&lt;/li>
&lt;li>Chapter 7: stability, linearization, Lyapunov functions&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="from-linear-to-nonlinear">From Linear to Nonlinear&lt;/h2>
&lt;p>Linear systems obey &lt;strong>superposition&lt;/strong>: if $\mathbf{x}_1$ and $\mathbf{x}_2$ are solutions, so is $c_1\mathbf{x}_1 + c_2\mathbf{x}_2$. This is the engine that powers the entire toolkit of Chapters 1-6 &amp;ndash; exponential ansatz, eigenvectors, fundamental matrices.&lt;/p></description></item><item><title>Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)</title><link>https://www.chenk.top/en/standalone/kernel-methods/</link><pubDate>Sun, 15 Oct 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/kernel-methods/</guid><description>&lt;p>You have non-linear data and a linear algorithm. The kernel trick lets you run that linear algorithm on the non-linear data &amp;ndash; without ever writing down the high-dimensional feature map. This guide builds the intuition first, then the math, then a practical toolkit you can ship.&lt;/p>
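&lt;p>Before the theory, the trick itself in a dozen lines. A toy verification with points of my choosing (not the guide's example): for the homogeneous degree-2 polynomial kernel on the plane, the kernel value equals a dot product in an explicit 3-D feature space, computed without ever constructing that space.&lt;/p>

```python
# Kernel trick sanity check: k(x, z) = (x . z)**2 on R^2 corresponds to
# the explicit feature map phi(x) = (x1**2, sqrt(2)*x1*x2, x2**2).
# The kernel does O(d) work; the feature map is what it lets us skip.

import math

def k_poly2(x, z):
    # one 2-D dot product, squared
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # the 3-D feature map the kernel implicitly uses
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
assert math.isclose(k_poly2(x, z), dot(phi(x), phi(z)))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, which is why the implicit form is not just a convenience but the only option.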
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The kernel trick: why it works and what it actually buys you&lt;/li>
&lt;li>Mathematical foundations: positive-definite kernels, RKHS, Mercer&amp;rsquo;s theorem&lt;/li>
&lt;li>Common kernels: RBF, polynomial, linear, Matérn, periodic, sigmoid&lt;/li>
&lt;li>Hyperparameter tuning: grid search, random search, marginal likelihood&lt;/li>
&lt;li>Troubleshooting: overfitting, underfitting, numerical instability, feature scaling&lt;/li>
&lt;li>A kernel-selection decision tree for SVM, GP, and Kernel PCA&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Linear algebra basics (dot products, eigendecomposition)&lt;/li>
&lt;li>Familiarity with SVM or Gaussian Processes (conceptual)&lt;/li>
&lt;li>Python + scikit-learn&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="why-kernel-methods-matter">Why Kernel Methods Matter&lt;/h1>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/standalone/kernel-methods/fig1_kernel_trick.png" alt="The Kernel Trick: a 2D ring becomes linearly separable in 3D" loading="lazy" decoding="async">
 
&lt;/figure>
&lt;/p></description></item><item><title>ODE Chapter 7: Stability Theory</title><link>https://www.chenk.top/en/ode/07-systems-and-phase-plane/</link><pubDate>Wed, 11 Oct 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/07-systems-and-phase-plane/</guid><description>&lt;p>&lt;strong>A small push hits a system. Does it return to rest, drift away, or break entirely?&lt;/strong> That single question decides whether bridges survive storms, ecosystems recover from droughts, and economies bounce back from crises. Stability theory answers it &amp;ndash; and it does so &lt;em>without ever solving the differential equation&lt;/em>. We will learn to read the destiny of a system off the geometry of its phase plane.&lt;/p>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Three precise notions: Lyapunov stable, asymptotically stable, unstable&lt;/li>
&lt;li>Linearization via the Jacobian and the Hartman-Grobman theorem&lt;/li>
&lt;li>Lyapunov&amp;rsquo;s direct method &amp;ndash; proving stability with energy-like functions&lt;/li>
&lt;li>LaSalle&amp;rsquo;s invariance principle for borderline cases&lt;/li>
&lt;li>Trace-determinant classification of all 2D linear systems&lt;/li>
&lt;li>Four canonical bifurcations: saddle-node, transcritical, pitchfork, Hopf&lt;/li>
&lt;li>Worked applications: pendulum, predator-prey, inverted pendulum control&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Chapter 6: linear systems, eigenvalues, phase portraits&lt;/li>
&lt;li>Multivariable calculus: partial derivatives, Jacobian matrix&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="a-visual-tour-before-the-theory">A Visual Tour Before the Theory&lt;/h2>
&lt;p>Stability is, at heart, a &lt;em>geometric&lt;/em> statement about how trajectories move in phase space. Six pictures tell the entire story of 2D linear systems.&lt;/p></description></item><item><title>ODE Chapter 6: Linear Systems and the Matrix Exponential</title><link>https://www.chenk.top/en/ode/06-power-series/</link><pubDate>Sun, 24 Sep 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/06-power-series/</guid><description>&lt;p>&lt;strong>One equation describes one quantity. The world is rarely that obliging.&lt;/strong> Predator and prey populations push each other up and down. Currents and voltages in an RLC network oscillate together. Chemical species in a reaction network feed into one another. The moment two unknowns share an equation, you have a &lt;em>system&lt;/em>, and a single $y'=ay$ is no longer enough.&lt;/p>
&lt;p>The miracle of the linear case is this: the scalar formula $y(t)=e^{at}y_0$ generalizes verbatim once you learn what $e^{At}$ means for a &lt;em>matrix&lt;/em> $A$. Linear algebra and ODEs fuse into one object — the matrix exponential — and its eigenstructure tells you everything about the long-term behavior, the geometry of the flow, and the physics of normal modes and beats.&lt;/p></description></item><item><title>Position Encoding Brief: From Sinusoidal to RoPE and ALiBi</title><link>https://www.chenk.top/en/standalone/position-encoding-brief/</link><pubDate>Wed, 20 Sep 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/position-encoding-brief/</guid><description>&lt;p>Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.&lt;/p>
&lt;p>That single design decision — &lt;em>how&lt;/em> to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner&amp;rsquo;s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — &lt;strong>length extrapolation&lt;/strong>, the ability to handle sequences longer than anything seen in training.&lt;/p></description></item><item><title>ODE Chapter 5: Power Series and Special Functions</title><link>https://www.chenk.top/en/ode/05-laplace-transform/</link><pubDate>Thu, 07 Sep 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/05-laplace-transform/</guid><description>&lt;p>&lt;strong>Some ODEs have no solutions in terms of familiar functions.&lt;/strong> The Bessel equation, the Legendre equation, the Airy equation &amp;ndash; all arise naturally in physics (heat conduction in cylinders, gravitational fields of planets, quantum tunneling). Their solutions &lt;em>define&lt;/em> entirely new functions. This chapter shows you how to find them using power series, why the Frobenius extension is forced upon us at singular points, and why the same handful of &amp;ldquo;special functions&amp;rdquo; keeps appearing across physics and engineering.&lt;/p></description></item><item><title>LAMP Stack on Alibaba Cloud ECS: From Fresh Instance to Production-Ready Web Server</title><link>https://www.chenk.top/en/standalone/lamp-on-ecs/</link><pubDate>Fri, 01 Sep 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/lamp-on-ecs/</guid><description>&lt;p>You have a fresh ECS instance and SSH access. Your goal is a public website running Apache, PHP and MySQL. Between you and that goal sit three classes of problems that catch every beginner the first time:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Network reachability&lt;/strong> &amp;ndash; packets are silently dropped at the cloud security group, the OS firewall, or the listening socket, and the symptom is the same in all three cases: nothing happens.&lt;/li>
&lt;li>&lt;strong>Service wiring&lt;/strong> &amp;ndash; Apache, PHP and MySQL are three separate processes that have to find each other through file extensions, Unix sockets and TCP ports. Each interface has its own failure mode.&lt;/li>
&lt;li>&lt;strong>Identity and permissions&lt;/strong> &amp;ndash; Apache runs as &lt;code>www-data&lt;/code>, MySQL runs as &lt;code>mysql&lt;/code>, files are owned by &lt;code>root&lt;/code> after &lt;code>wget&lt;/code>. The wrong combination produces 403, &amp;ldquo;Access denied&amp;rdquo;, or &lt;code>chmod 777&lt;/code> desperation.&lt;/li>
&lt;/ol>
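&lt;p>From a browser, all three layers fail the same way; a raw TCP probe can at least separate silent drops from refusals. The sketch below is illustrative (the host is a placeholder, and a real diagnosis would also run &lt;code>ss -tlnp&lt;/code> on the instance): a timeout points at the security group or OS firewall, a refusal means the packet arrived but nothing was listening.&lt;/p>

```python
# A minimal TCP probe that distinguishes the three failure layers from
# the outside. Diagnostic sketch, not part of the setup steps themselves.

import socket

def probe(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # reachable and something is listening
    except socket.timeout:
        return "filtered"      # silent drop: suspect security group or firewall
    except ConnectionRefusedError:
        return "closed"        # packet arrived, but no process is listening

# probe("203.0.113.10", 80) -- placeholder address; "filtered" vs "closed"
# tells you which layer to look at first.
```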
&lt;p>This guide walks through all of them in the order you actually hit them on day one, then keeps going into the things that show up on day thirty: TLS, virtual hosts, backups, source compilation, and when to stop running everything on a single box.&lt;/p></description></item><item><title>Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting</title><link>https://www.chenk.top/en/standalone/vae-guide/</link><pubDate>Sat, 26 Aug 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/vae-guide/</guid><description>&lt;p>A plain autoencoder compresses and reconstructs. A variational autoencoder learns something far more useful: a smooth, structured latent space you can &lt;em>sample&lt;/em> from to generate genuinely new data. That single change — making the encoder output a &lt;em>distribution&lt;/em> instead of a vector — turns the network from a fancy compressor into a generative model with a tractable likelihood lower bound.&lt;/p>
&lt;p>This guide walks the full path: why autoencoders fail at generation, how the ELBO derivation gets you to the loss function, why the reparameterization trick is the trick that makes everything trainable, a complete PyTorch implementation, and a tour of every common failure mode with concrete fixes.&lt;/p></description></item><item><title>paper2repo: GitHub Repository Recommendation for Academic Papers</title><link>https://www.chenk.top/en/standalone/paper2repo-github-repository-recommendation/</link><pubDate>Tue, 22 Aug 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/paper2repo-github-repository-recommendation/</guid><description>&lt;p>You read a paper, want the code, and the &amp;ldquo;code available at&amp;rdquo; link is dead, missing, or points to a stub. Search engines fall back to keyword matching over the README, which works for popular repos with descriptive names and dies on everything else. paper2repo (WWW 2020) frames this as a cross-platform recommendation problem: learn one embedding space in which a paper abstract and a GitHub repository are directly comparable by dot product, then rank.&lt;/p></description></item><item><title>ODE Chapter 4: The Laplace Transform</title><link>https://www.chenk.top/en/ode/04-constant-coefficients/</link><pubDate>Mon, 21 Aug 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/04-constant-coefficients/</guid><description>&lt;p>&lt;strong>The Laplace transform turns calculus into algebra.&lt;/strong> Instead of grinding through integration, guessing trial solutions, and bolting on initial conditions at the end, you transform the entire ODE — equation, forcing, and initial data — into a single polynomial equation in a complex variable $s$. You solve it like a high-school problem, then transform back. 
Along the way, the &lt;em>shape&lt;/em> of the solution becomes geometry: poles in the left half of the complex plane decay, poles on the right blow up, poles on the imaginary axis ring forever. This chapter develops that picture from first principles and connects it to the engineering tools — transfer functions, Bode plots, PID control — that turned the Laplace transform into the lingua franca of dynamics.&lt;/p></description></item><item><title>ODE Chapter 3: Higher-Order Linear Theory</title><link>https://www.chenk.top/en/ode/03-linear-theory/</link><pubDate>Fri, 04 Aug 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/03-linear-theory/</guid><description>&lt;p>&lt;strong>A first-order ODE has memory of one number; a second-order ODE has memory of two.&lt;/strong> That tiny extra degree of freedom is what lets the same equation describe a plucked guitar string, the suspension of your car, the L-C tank circuit inside an FM radio, and the swaying of a tall building in the wind. In every case the same three regimes appear &amp;ndash; oscillate, return-with-a-touch-of-overshoot, or crawl back &amp;ndash; and the same algebraic gadget, the &lt;em>characteristic equation&lt;/em>, predicts which one happens.&lt;/p></description></item><item><title>ODE Chapter 2: First-Order Methods</title><link>https://www.chenk.top/en/ode/02-first-order-methods/</link><pubDate>Tue, 18 Jul 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/02-first-order-methods/</guid><description>&lt;p>A bank account, a drug clearing the bloodstream, a tank of brine, a charging capacitor — they all obey the same kind of equation: a first-order ODE. The trick is recognising which of four shapes you are looking at, because each shape has a closed-form move that solves it cleanly. 
By the end of this chapter you will pattern-match an unfamiliar first-order equation in seconds and know exactly which lever to pull.&lt;/p></description></item><item><title>Session-based Recommendation with Graph Neural Networks (SR-GNN)</title><link>https://www.chenk.top/en/standalone/session-based-recommendation-with-graph-neural-networks/</link><pubDate>Thu, 13 Jul 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/session-based-recommendation-with-graph-neural-networks/</guid><description>&lt;p>A user clicks &lt;strong>A, B, C, B, D&lt;/strong>. A sequence model reads this as five tokens and folds them into a hidden state. &lt;strong>SR-GNN&lt;/strong> sees a &lt;em>graph&lt;/em> in which the edge &lt;code>B -&amp;gt; C&lt;/code> survives even after the user returns to &lt;code>B&lt;/code>, the node &lt;code>B&lt;/code> is reused (so its in/out neighbours both inform its embedding), and the geometry of the click stream is preserved as adjacency. That structural insight is why &lt;a href="https://arxiv.org/abs/1811.00855" target="_blank" rel="noopener noreferrer">SR-GNN (Wu et al., AAAI 2019) &lt;span aria-hidden="true" style="font-size:0.75em; opacity:0.55; margin-left:2px;">↗&lt;/span>&lt;/a>
 outperforms purely sequential baselines such as GRU4Rec and NARM on standard session-based recommendation (SBR) benchmarks.&lt;/p></description></item><item><title>ODE Chapter 1: Origins and Intuition</title><link>https://www.chenk.top/en/ode/01-origins-and-intuition/</link><pubDate>Sat, 01 Jul 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ode/01-origins-and-intuition/</guid><description>&lt;p>&lt;strong>Everything around you is changing.&lt;/strong> Coffee cools, populations grow, pendulums swing, viruses spread, stocks oscillate, planets orbit. None of these systems are described by &lt;em>what something equals&lt;/em> — they are described by &lt;em>how fast something changes&lt;/em>. That second mode of description is what differential equations are for, and learning to read them is, quite literally, learning to read the language physics and biology are written in.&lt;/p>
&lt;p>This chapter rebuilds your intuition from scratch. We start with a single cup of coffee, derive the same equation that governs radioactive decay and capacitor discharge, then climb upward to direction fields, classification, and the existence-and-uniqueness theorem that tells you when an ODE has a sensible answer at all.&lt;/p></description></item><item><title>Multi-Cloud and Hybrid Architecture</title><link>https://www.chenk.top/en/cloud-computing/multi-cloud-hybrid/</link><pubDate>Wed, 14 Jun 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/multi-cloud-hybrid/</guid><description>&lt;p>The first article in this series asked &amp;ldquo;what is the cloud, and why does it matter?&amp;rdquo; Eight articles later, the question has matured into something more practical: &lt;strong>which clouds, in what combination, and how do you operate the result without losing your mind?&lt;/strong> Multi-cloud and hybrid architectures are how serious organizations answer that question. They distribute workloads across providers and on-premises infrastructure for resilience, cost optimization, and strategic flexibility &amp;ndash; but they introduce a new class of problems that single-cloud architectures never face.&lt;/p></description></item><item><title>Cloud Operations and DevOps Practices</title><link>https://www.chenk.top/en/cloud-computing/operations-devops/</link><pubDate>Fri, 26 May 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/operations-devops/</guid><description>&lt;p>In 2017 GitLab lost six hours of database state. An engineer, exhausted, ran &lt;code>rm -rf&lt;/code> on the wrong server during an incident. The backup procedures had silently been broken for months; nobody noticed because no one was restoring from backups. The lesson is not &amp;ldquo;be careful with rm&amp;rdquo;. The lesson is that operations is a &lt;em>system&lt;/em> - tools, runbooks, monitoring, automation, and the rituals around them. 
When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.&lt;/p></description></item><item><title>Cloud Security and Privacy Protection</title><link>https://www.chenk.top/en/cloud-computing/security-privacy/</link><pubDate>Sun, 07 May 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/security-privacy/</guid><description>&lt;p>In 2019 Capital One lost a hundred million customer records. The exploit chain was small: a misconfigured WAF allowed server-side request forgery against the EC2 metadata endpoint, that endpoint handed back IAM credentials, and the IAM role those credentials belonged to had wildcard &lt;code>s3:*&lt;/code> on every bucket in the account. One misconfiguration, one over-broad role, one rule the security team had not written. The bill, before legal: more than 80 million dollars.&lt;/p></description></item><item><title>Cloud Network Architecture and SDN</title><link>https://www.chenk.top/en/cloud-computing/networking-sdn/</link><pubDate>Tue, 18 Apr 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/networking-sdn/</guid><description>&lt;p>A cloud platform is, in the end, a network with computers attached. The compute layer scales by adding boxes; the storage layer scales by adding disks; the &lt;em>network&lt;/em> layer is what makes those boxes and disks behave as a single coherent system. Get the network right and the rest of the stack feels effortless. 
Get it wrong &amp;ndash; a missing route, a 5-tuple mismatch on a security group, an under-provisioned load balancer &amp;ndash; and the whole platform goes dark.&lt;/p></description></item><item><title>Cloud Storage Systems and Distributed Architecture</title><link>https://www.chenk.top/en/cloud-computing/storage-systems/</link><pubDate>Thu, 30 Mar 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/storage-systems/</guid><description>&lt;p>When Netflix stores petabytes of video, when Instagram serves billions of photos, when a quant fund replays a year of market data in minutes &amp;ndash; behind every one of these workloads is a &lt;em>distributed storage system&lt;/em>. Storage looks deceptively simple from a developer&amp;rsquo;s window (&lt;code>PUT key&lt;/code>, &lt;code>GET key&lt;/code>), but the moment you cross the boundary of a single machine, you inherit a stack of problems that has driven decades of research: how to survive disk failures, how to scale linearly, how to provide a consistency model that does not surprise the application, and how to do all of this while paying cents per gigabyte rather than dollars.&lt;/p></description></item><item><title>Learning Rate: From Basics to Large-Scale Training</title><link>https://www.chenk.top/en/standalone/learning-rate-guide/</link><pubDate>Mon, 13 Mar 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/learning-rate-guide/</guid><description>&lt;p>Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, &lt;strong>learning rate&lt;/strong> is the one that most often decides whether training converges, crawls, or blows up. 
This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.&lt;/p></description></item><item><title>Cloud-Native and Container Technologies</title><link>https://www.chenk.top/en/cloud-computing/cloud-native-containers/</link><pubDate>Sat, 11 Mar 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/cloud-native-containers/</guid><description>&lt;p>The shift from monolithic applications to cloud-native architectures is one of the most consequential changes in software engineering this decade. The headline &amp;ndash; containers and Kubernetes &amp;ndash; is well known. The interesting story is &lt;em>why&lt;/em> this stack won, what each layer actually does, and where the seams are that determine whether your platform feels effortless or feels like a maze.&lt;/p>
&lt;p>This article walks the cloud-native stack from first principles. We start with the architectural shift that motivates everything else, then dig into what a container really is at the Linux kernel level, climb up to Kubernetes orchestration, examine when a service mesh earns its complexity, and finish with packaging and delivery via Helm and GitOps. Examples are deliberately concrete: copy-pastable Dockerfiles, real manifests, and the trade-offs that matter when you run this in production.&lt;/p></description></item><item><title>Virtualization Technology Deep Dive</title><link>https://www.chenk.top/en/cloud-computing/virtualization/</link><pubDate>Mon, 20 Feb 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/virtualization/</guid><description>&lt;p>Without virtualization, there is no cloud. Every EC2 instance, every Lambda invocation, every Kubernetes pod ultimately stands on the same trick: lying convincingly to an operating system about the hardware underneath it. This article walks the full stack &amp;ndash; from the CPU instructions that make the trick cheap, through the four hypervisors that dominate the market, to the production-grade tuning knobs that decide whether your VMs run at 70 % or 99 % of bare metal.&lt;/p></description></item><item><title>Cloud Computing Fundamentals and Architecture</title><link>https://www.chenk.top/en/cloud-computing/fundamentals/</link><pubDate>Wed, 01 Feb 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/fundamentals/</guid><description>&lt;p>Every team building software in 2025 inherits the same buy-or-rent question their predecessors faced &amp;ndash; only the answer has flipped. Twenty years ago you put hardware in a closet; today you describe the hardware in YAML and a global provider conjures it up in seconds, bills it by the second, and tears it down when you stop paying. Cloud computing is not just &amp;ldquo;someone else&amp;rsquo;s computer&amp;rdquo;. 
It is a programmable, metered, multi-tenant abstraction over compute, storage and networking that has fundamentally changed how businesses are built and how engineers spend their day.&lt;/p></description></item><item><title>Graph Contextualized Self-Attention Network (GC-SAN) for Session-based Recommendation</title><link>https://www.chenk.top/en/standalone/gcsan/</link><pubDate>Sun, 15 Jan 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/gcsan/</guid><description>&lt;p>In session-based recommendation you only see a short anonymous click sequence &amp;ndash; no user profile, no long history, no demographics. Every signal you have lives inside that single window. &lt;strong>GC-SAN&lt;/strong> (IJCAI 2019) takes the strongest two ideas of the time &amp;ndash; SR-GNN&amp;rsquo;s session graph and the Transformer&amp;rsquo;s self-attention &amp;ndash; and stacks them: a &lt;em>graph&lt;/em> view captures local transition patterns and loops, a &lt;em>sequence&lt;/em> view captures long-range intent, and a tiny weighted sum decides how much of each to trust. The result is a clean &amp;ldquo;best of both worlds&amp;rdquo; baseline that is genuinely hard to beat at its parameter budget.&lt;/p></description></item><item><title>Computer Fundamentals: Deep Dive and System Integration</title><link>https://www.chenk.top/en/computer-fundamentals/06-deep-dive/</link><pubDate>Sat, 14 Jan 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/computer-fundamentals/06-deep-dive/</guid><description>&lt;p>We&amp;rsquo;ve spent five chapters opening one box at a time — the CPU, the cache hierarchy, storage, the motherboard and GPU, the network and power supply. Each part is interesting on its own, but a computer is not its components. A computer is what happens when those components have to agree, every nanosecond, on what to do next.&lt;/p>
&lt;p>This finale is about that conversation. We&amp;rsquo;ll wire everything together into a single picture, look at the system through the eyes of a profiler, revisit the 80-year-old design tension that still shapes every chip you buy, and end by looking forward — chiplets, photonic interconnects, and the quietly arriving quantum era.&lt;/p></description></item><item><title>Lipschitz Continuity, Strong Convexity &amp; Nesterov Acceleration</title><link>https://www.chenk.top/en/standalone/lipschitz-continuity-strong-convexity-nesterov/</link><pubDate>Tue, 27 Dec 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/lipschitz-continuity-strong-convexity-nesterov/</guid><description>&lt;p>A surprising amount of &amp;ldquo;optimizer folklore&amp;rdquo; collapses into three concepts:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>How fast can the gradient change?&lt;/strong> Lipschitz smoothness ($L$-smoothness) caps the usable step size at $1/L$.&lt;/li>
&lt;li>&lt;strong>How sharp is the bottom?&lt;/strong> $\mu$-strong convexity sets the convergence rate and forces the minimizer to be unique.&lt;/li>
&lt;li>&lt;strong>Can we get there faster without losing stability?&lt;/strong> Nesterov acceleration and adaptive restart cut the iteration count&amp;rsquo;s dependence on the condition number from $\kappa$ to $\sqrt{\kappa}$.&lt;/li>
&lt;/ul>
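&lt;p>All three bullets show up in about thirty lines. A minimal sketch (not the post's least-squares experiment): gradient descent and Nesterov's method on a 2-D quadratic with condition number 100, both using step size $1/L$, with the standard momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ for Nesterov.&lt;/p>

```python
# GD vs Nesterov on f(p) = (L*p1**2 + mu*p2**2) / 2, kappa = L/mu = 100.
# Illustrative toy problem; constants chosen for readability.

import math

L, mu = 100.0, 1.0
kappa = L / mu
step = 1.0 / L                                        # largest safe step
beta = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)  # momentum

def grad(p):
    return (L * p[0], mu * p[1])

def gd(p, iters):
    for _ in range(iters):
        g = grad(p)
        p = (p[0] - step * g[0], p[1] - step * g[1])
    return p

def nesterov(p, iters):
    p_prev = p
    for _ in range(iters):
        # extrapolate (look ahead), then take a gradient step from there
        y = tuple(pi + beta * (pi - qi) for pi, qi in zip(p, p_prev))
        g = grad(y)
        p_prev, p = p, (y[0] - step * g[0], y[1] - step * g[1])
    return p

def dist(p):
    return math.hypot(p[0], p[1])

x0 = (1.0, 1.0)
# After 100 iterations, plain GD still carries most of the slow coordinate
# (0.99**100 is about 0.37), while Nesterov has contracted it below 1e-3.
```

The experiment at the end of the post plays out the same contrast on a full least-squares problem, with Heavy Ball added for comparison.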
&lt;p>This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer &amp;ldquo;what step size, what rate, is acceleration worth it?&amp;rdquo;&lt;/p></description></item><item><title>Computer Fundamentals: Network, Power, and Troubleshooting</title><link>https://www.chenk.top/en/computer-fundamentals/05-network-power/</link><pubDate>Sat, 24 Dec 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/computer-fundamentals/05-network-power/</guid><description>&lt;p>Why does the gigabit NIC on your motherboard sometimes negotiate down to 100 Mbps? Why does a brand-new build with a 650 W &amp;ldquo;Gold&amp;rdquo; PSU randomly reboot under heavy GPU load? Why does the room next to the server rack always feel warm? These are the everyday consequences of two systems that most people never look at: &lt;strong>the network I/O pipeline&lt;/strong> and &lt;strong>the power-and-cooling chain&lt;/strong> that keeps the silicon alive.&lt;/p></description></item><item><title>Optimizer Evolution: From Gradient Descent to Adam (and Beyond, 2025)</title><link>https://www.chenk.top/en/standalone/optimizer-evolution-gd-to-adam/</link><pubDate>Fri, 09 Dec 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/optimizer-evolution-gd-to-adam/</guid><description>&lt;p>Why is &amp;ldquo;tuning the LR is an art&amp;rdquo; a meme for ResNet, while every modern LLM paper just writes &amp;ldquo;AdamW, $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$&amp;rdquo; and moves on? It is not an accident — it is the &lt;strong>end-point of three decades of optimizer evolution&lt;/strong>.&lt;/p>
&lt;p>This post walks the lineage end-to-end on a single thread: each step exists because of a &lt;strong>specific failure&lt;/strong> of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.&lt;/p></description></item><item><title>Computer Fundamentals: Motherboard, Graphics, and Expansion</title><link>https://www.chenk.top/en/computer-fundamentals/04-motherboard-gpu/</link><pubDate>Sat, 03 Dec 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/computer-fundamentals/04-motherboard-gpu/</guid><description>&lt;p>A modern desktop motherboard is an unusually honest object. Every important design decision — how many PCIe lanes the CPU exposes, which slots are wired straight to the CPU and which tunnel through the chipset, how the VRM is sized to feed a 250 W processor, why the second long PCIe slot only runs at ×4 — is laid out in plain copper on the PCB. If you can read the board, you can predict almost every performance cliff a user will hit. This fourth instalment of the &lt;strong>Computer Fundamentals Deep Dive Series&lt;/strong> teaches that reading skill, then turns it inward to the GPU, where the same lesson applies in miniature: a GPU is a chip whose entire architecture exists to keep thousands of arithmetic lanes fed with data, and almost everything else — caches, schedulers, tensor cores, HBM stacks — is in service of that goal.&lt;/p></description></item><item><title>LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation</title><link>https://www.chenk.top/en/standalone/llmgr/</link><pubDate>Sat, 26 Nov 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/llmgr/</guid><description>&lt;p>Session-based recommendation lives or dies on the click graph. New items have no edges. Long-tail items have a handful of noisy edges. Yet every item ships with a title and a description that the model never reads. 
&lt;strong>LLMGR&lt;/strong> plugs that hole: treat the LLM as a &amp;ldquo;semantic engine&amp;rdquo; that turns text into representations a graph encoder can fuse with, then let a GNN do what it does best &amp;ndash; rank. The headline result on Amazon Music/Beauty/Pantry: HR@20 up ~8.68%, NDCG@20 up ~10.71%, MRR@20 up ~11.75% over the strongest GNN baseline, with the largest uplift concentrated on cold-start items.&lt;/p></description></item><item><title>Computer Fundamentals: Storage Systems (HDD vs SSD)</title><link>https://www.chenk.top/en/computer-fundamentals/03-storage/</link><pubDate>Sat, 12 Nov 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/computer-fundamentals/03-storage/</guid><description>&lt;p>Why can a single SSD swap &amp;ldquo;resurrect&amp;rdquo; a five-year-old laptop? Why does a TLC drive rated for only 1 000 P/E cycles still last more than a decade for normal users? Why does a brand-new SSD that benchmarks at 3 500 MB/s sometimes collapse to 50 MB/s after a few weeks? This third instalment of the &lt;strong>Computer Fundamentals Deep Dive Series&lt;/strong> answers those questions from first principles. We will look at how rotating magnetic platters compare with charge-trap NAND cells, how the bandwidth of an interface (SATA, PCIe Gen 3/4/5) interacts with the parallelism of a protocol (AHCI vs NVMe), how RAID levels trade capacity for fault tolerance, how a file system organises bytes on a raw block device, and how to keep all of this fast and safe in production.&lt;/p></description></item><item><title>Computer Fundamentals: Memory and Cache Systems</title><link>https://www.chenk.top/en/computer-fundamentals/02-memory/</link><pubDate>Sat, 22 Oct 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/computer-fundamentals/02-memory/</guid><description>&lt;p>A CPU core can complete a multiplication in roughly &lt;strong>0.3 ns&lt;/strong>. A spinning hard disk needs &lt;strong>10 ms&lt;/strong> to seat its head over a sector. 
Between those two numbers sits a factor of about &lt;strong>30 million&lt;/strong>. Every line of memory engineering — caches, DRAM cells, page tables, TLBs, ECC, NUMA, channels — is a coordinated answer to that single, brutal asymmetry.&lt;/p>
&lt;p>This is part 2 of the &lt;strong>Computer Fundamentals Deep Dive&lt;/strong>. We will not stop at &amp;ldquo;DDR is fast and RAM is volatile&amp;rdquo;. We will trace a single load instruction from the CPU pipeline through the L1, L2, L3 caches, the TLB, the page table, the memory controller, the channels, and finally the DRAM cells themselves — and look at what each layer is actually doing, and why.&lt;/p></description></item><item><title>Computer Fundamentals: CPU and the Computing Core</title><link>https://www.chenk.top/en/computer-fundamentals/01-cpu/</link><pubDate>Sat, 01 Oct 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/computer-fundamentals/01-cpu/</guid><description>&lt;p>Why does your 100 Mbps internet only download at about 12 MB/s? Why does a &amp;ldquo;1 TB&amp;rdquo; hard drive show only 931 GB in Windows? Why does a 32-bit system top out around 3.2 GB of usable RAM? And what &lt;em>actually&lt;/em> happens, cycle by cycle, when the CPU runs your code?&lt;/p>
&lt;p>This is part 1 of the &lt;strong>Computer Fundamentals&lt;/strong> series. We start from bits and bytes, then go down into the CPU itself: pipelines, caches, branch prediction, out-of-order execution, multiple cores, and SMT. By the end you should be able to read a CPU spec sheet — or a perf profile — and know what each number is paying for.&lt;/p></description></item><item><title>LeetCode Patterns: Greedy Algorithms</title><link>https://www.chenk.top/en/leetcode/09-greedy-algorithms/</link><pubDate>Tue, 13 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/09-greedy-algorithms/</guid><description>&lt;p>Greedy is the algorithm paradigm that feels too good to be true: at every step, take the choice that looks best right now, never look back, and somehow end up at the global optimum. When it works, the code is almost embarrassingly short. When it doesn&amp;rsquo;t, it produces confidently wrong answers — which is why the real skill is not writing greedy code, but recognising &lt;strong>when greedy is allowed&lt;/strong>.&lt;/p>
&lt;p>This article walks through the structural reason greedy is correct on some problems and broken on others, then applies that lens to seven LeetCode classics: &lt;strong>Jump Game&lt;/strong>, &lt;strong>Jump Game II&lt;/strong>, &lt;strong>Gas Station&lt;/strong>, &lt;strong>Best Time to Buy and Sell Stock II&lt;/strong>, &lt;strong>Non-overlapping Intervals&lt;/strong>, &lt;strong>Task Scheduler&lt;/strong>, and &lt;strong>Partition Labels&lt;/strong>.&lt;/p></description></item><item><title>LeetCode Patterns: Stack and Queue</title><link>https://www.chenk.top/en/leetcode/stack-and-queue/</link><pubDate>Mon, 29 Aug 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/stack-and-queue/</guid><description>&lt;p>Stacks and queues look unassuming next to graphs or DP, but they sit underneath an astonishing fraction of interview problems. The reason is simple: most algorithmic questions are really questions about &lt;em>order of access&lt;/em>. Stacks give you LIFO (last in, first out); queues give you FIFO (first in, first out); and once you add the variants — monotonic stack, deque, priority queue — you have efficient answers for bracket matching, next-greater-element, sliding-window extrema, top-K, BFS, and a long tail of &amp;ldquo;implement X using Y&amp;rdquo; puzzles.&lt;/p></description></item><item><title>LeetCode Patterns: Backtracking Algorithms</title><link>https://www.chenk.top/en/leetcode/backtracking/</link><pubDate>Sun, 14 Aug 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/backtracking/</guid><description>&lt;p>Backtracking is the algorithm you reach for whenever a problem asks you to &lt;em>enumerate&lt;/em> something — every permutation, every subset, every legal board, every path through a grid. 
It is brute force with a brain: you build a candidate solution one decision at a time, abandon it the moment a constraint says &amp;ldquo;this cannot work&amp;rdquo;, and undo your last move so the next branch sees a clean slate. The whole technique fits in three lines:&lt;/p></description></item><item><title>Multimodal LLMs and Downstream Tasks: A Practitioner's Guide</title><link>https://www.chenk.top/en/standalone/multimodal-llm-downstream-tasks/</link><pubDate>Fri, 05 Aug 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/multimodal-llm-downstream-tasks/</guid><description>&lt;p>Stuffing pixels, audio, and video into a language model so it can &amp;ldquo;see,&amp;rdquo; &amp;ldquo;hear,&amp;rdquo; and reason &amp;ndash; that was a research curiosity before CLIP landed in 2021. Today it&amp;rsquo;s table stakes for most consumer-facing AI products. But shipping a Multimodal LLM (MLLM) in production turns out to be hard in places people rarely talk about. Almost never the vision encoder. Almost always these four:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Alignment.&lt;/strong> How does the language model &amp;ldquo;understand&amp;rdquo; what the vision encoder produces? Is the projector a 2-layer MLP or a Q-Former? Which parameters thaw during training?&lt;/li>
&lt;li>&lt;strong>Task framing.&lt;/strong> The same MLLM has to do captioning, VQA, grounding, OCR. Each needs a prompt template that doesn&amp;rsquo;t quietly drop several points of accuracy.&lt;/li>
&lt;li>&lt;strong>Cost.&lt;/strong> A 1024x1024 image becomes hundreds of visual tokens. Prefill is brutal. Stretch that to video and the bill goes vertical. Token compression, KV cache reuse, and batching are not optional.&lt;/li>
&lt;li>&lt;strong>Evaluation.&lt;/strong> A model that scores 80 on MMBench can still hallucinate confidently on your customer&amp;rsquo;s invoice. Public benchmarks are the easy part.&lt;/li>
&lt;/ol>
&lt;p>This post follows the natural research arc &amp;ndash; architecture, model families, downstream tasks, fine-tuning, evaluation, deployment &amp;ndash; and tries to be specific enough at each stop that you can act on it. Less &amp;ldquo;what&amp;rsquo;s possible,&amp;rdquo; more &amp;ldquo;what to actually pick.&amp;rdquo;&lt;/p></description></item><item><title>Operating System Fundamentals: A Deep Dive</title><link>https://www.chenk.top/en/standalone/operating-system-fundamentals-deep-dive/</link><pubDate>Mon, 01 Aug 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/operating-system-fundamentals-deep-dive/</guid><description>&lt;p>Open a terminal and type &lt;code>cat hello.txt&lt;/code>. The instant you press Enter, at least seven layers of machinery wake up: bash parses the line, fork+execve launches the cat process, the kernel hands it a virtual address space, cat issues a &lt;code>read()&lt;/code> syscall, the CPU traps into kernel mode, VFS dispatches to ext4, the block layer queues an NVMe request, the SSD DMA-writes the bytes back, an interrupt wakes cat, the bytes are copied through the page cache into the user buffer, and finally something appears on your screen.&lt;/p></description></item><item><title>LeetCode Patterns: Dynamic Programming Basics</title><link>https://www.chenk.top/en/leetcode/dynamic-programming-basics/</link><pubDate>Sat, 30 Jul 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/dynamic-programming-basics/</guid><description>&lt;p>Dynamic programming has a reputation for being the algorithm topic that
separates &amp;ldquo;competent coder&amp;rdquo; from &amp;ldquo;interview wizard&amp;rdquo;. A lot of that
reputation is unearned. DP is not a bag of clever tricks; it is a single
recipe applied to problems that happen to have repeated subproblems. If
you can answer three questions cleanly, you can solve almost any DP
problem on LeetCode:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>What does &lt;code>dp[i]&lt;/code> actually mean?&lt;/strong> (state)&lt;/li>
&lt;li>&lt;strong>How do I build &lt;code>dp[i]&lt;/code> from smaller answers?&lt;/strong> (transition)&lt;/li>
&lt;li>&lt;strong>What are the smallest answers I already know?&lt;/strong> (base case)&lt;/li>
&lt;/ol>
&lt;p>This article walks through that recipe, then applies it to the seven
problems every DP study list eventually converges on: Climbing Stairs,
House Robber, Coin Change, Longest Increasing Subsequence, 0/1 Knapsack,
Longest Common Subsequence, and Edit Distance.&lt;/p></description></item><item><title>Proximal Operator: From Moreau Envelope to ISTA/FISTA and ADMM</title><link>https://www.chenk.top/en/standalone/proximal-operator/</link><pubDate>Mon, 25 Jul 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/proximal-operator/</guid><description>&lt;p>When your objective contains a non-smooth piece (sparse regularisation, total variation, an indicator of a constraint set) or a constraint that is hard to handle directly, &amp;ldquo;just do gradient descent&amp;rdquo; stalls &amp;ndash; there is no gradient at the kink, or every step violates feasibility. The &lt;strong>proximal operator&lt;/strong> is the engineered, beautiful workaround: think of each update as &amp;ldquo;take a step on the smooth part, then run a tiny penalised minimisation that pulls the iterate back toward a structured solution&amp;rdquo;.&lt;/p></description></item><item><title>Graph Neural Networks for Learning Equivariant Representations of Neural Networks</title><link>https://www.chenk.top/en/standalone/gnn-equivariant-representations/</link><pubDate>Fri, 22 Jul 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/gnn-equivariant-representations/</guid><description>&lt;p>You can shuffle the hidden neurons of a trained MLP and get the &lt;em>exact&lt;/em> same function back &amp;ndash; but the flat parameter vector now looks completely different. This single fact ruins most attempts at &amp;ldquo;learning over neural networks&amp;rdquo;: naive representations treat two functionally identical models as two unrelated points in parameter space, and the downstream learner wastes capacity rediscovering a symmetry it should have for free. 
This paper &amp;ndash; &lt;em>Graph Neural Networks for Learning Equivariant Representations of Neural Networks&lt;/em> (Kofinas et al., ICML 2024) &amp;ndash; proposes the clean fix: turn the network itself into a graph, then use a GNN whose architecture &lt;em>natively&lt;/em> respects the relevant permutation symmetry.&lt;/p></description></item><item><title>LeetCode Patterns: Binary Tree Traversal and Construction</title><link>https://www.chenk.top/en/leetcode/binary-tree-traversal/</link><pubDate>Fri, 15 Jul 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/binary-tree-traversal/</guid><description>&lt;p>A binary tree problem is rarely about the tree. It is about &lt;em>the order in which you touch nodes&lt;/em> and &lt;em>what you remember from the children before deciding what to do at the parent&lt;/em>. Once those two ideas click, the four traversal orders, the iterative rewrites, the construction problems, and even classics like Validate BST and Maximum Depth all collapse into a handful of variations on the same recipe. This article builds that recipe end to end.&lt;/p></description></item><item><title>LeetCode Patterns: Binary Search</title><link>https://www.chenk.top/en/leetcode/binary-search/</link><pubDate>Thu, 30 Jun 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/binary-search/</guid><description>&lt;p>Binary search is the algorithm everyone thinks they understand until they have to write it under interview pressure. The idea is one sentence — &lt;em>halve the search space at every step&lt;/em> — but the implementation is a minefield of off-by-one errors, infinite loops, and subtly wrong return values. 
The goal of this article is not to give you yet another recitation of the standard template; it is to give you a &lt;strong>mental model&lt;/strong> that explains why each template looks the way it does, and a small toolkit (three templates plus the answer-space pattern) that covers the vast majority of LeetCode problems.&lt;/p></description></item><item><title>LeetCode Patterns: Sliding Window Technique</title><link>https://www.chenk.top/en/leetcode/sliding-window/</link><pubDate>Wed, 15 Jun 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/sliding-window/</guid><description>&lt;p>If you have ever caught yourself writing a double &lt;code>for&lt;/code> loop to inspect every contiguous subarray, &lt;strong>sliding window&lt;/strong> is probably the optimisation you are missing. It turns an $O(nk)$ or $O(n^2)$ scan into a single linear pass by &lt;em>reusing the work&lt;/em> it has already done. This article walks through the technique from first principles, then drills four canonical LeetCode problems plus a monotonic-deque variant.&lt;/p>
&lt;h2 id="1-the-idea-in-one-picture">1. The Idea in One Picture&lt;/h2>
&lt;p>A sliding window is a contiguous range &lt;code>[left, right]&lt;/code> over an array or string. Instead of recomputing everything when the range moves, we &lt;strong>add the element entering on the right&lt;/strong> and &lt;strong>remove the element leaving on the left&lt;/strong>. Each element is touched at most twice, so the total cost is $O(n)$.&lt;/p></description></item><item><title>LeetCode Patterns: Linked List Operations</title><link>https://www.chenk.top/en/leetcode/linked-list-operations/</link><pubDate>Tue, 31 May 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/linked-list-operations/</guid><description>&lt;p>A linked list is the simplest data structure that forces you to &lt;strong>think in pointers&lt;/strong>. Arrays let you index in $O(1)$ and forget about layout; linked lists hand you a head pointer and ask, &lt;em>&amp;ldquo;now what?&amp;rdquo;&lt;/em> That single shift — from indices to references — is what makes linked-list problems so common in interviews. They are short to state, brutal to get right, and reward exactly the habits good engineers build: drawing pictures, naming pointers, and &lt;strong>never dereferencing without checking for &lt;code>None&lt;/code>&lt;/strong>.&lt;/p></description></item><item><title>LeetCode Patterns: Two Pointers</title><link>https://www.chenk.top/en/leetcode/two-pointers/</link><pubDate>Mon, 16 May 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/two-pointers/</guid><description>&lt;p>Hash tables buy you speed by spending memory. Two pointers is the opposite trade: spend a little structural assumption — the array is sorted, the list might have a cycle, the answer lives in a contiguous window — and you get $O(n)$ time with $O(1)$ extra space. 
The pattern looks trivial in code (two indices and a &lt;code>while&lt;/code> loop) but it has more failure modes than any other beginner technique: off-by-one indices, infinite loops, missed duplicates, wrong pointer moved on tie. The cure is to think in &lt;strong>invariants&lt;/strong> rather than in moves.&lt;/p></description></item><item><title>LeetCode Patterns: Hash Tables</title><link>https://www.chenk.top/en/leetcode/hash-tables/</link><pubDate>Sun, 01 May 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/leetcode/hash-tables/</guid><description>&lt;p>A hash table is the cheapest superpower in your toolbox. You spend a constant amount of memory per stored item, and in return every &amp;ldquo;is &lt;em>x&lt;/em> in here?&amp;rdquo; question costs roughly one CPU instruction. Whole families of &lt;code>O(n²)&lt;/code> brute-force solutions collapse into a single &lt;code>O(n)&lt;/code> pass once you reach for one.&lt;/p>
&lt;p>This article is the first instalment of the &lt;strong>LeetCode Patterns&lt;/strong> series. We will build hash table intuition from scratch, then work through four template problems — &lt;strong>Two Sum&lt;/strong>, &lt;strong>Group Anagrams&lt;/strong>, &lt;strong>Longest Substring Without Repeating Characters&lt;/strong>, and &lt;strong>Top K Frequent Elements&lt;/strong> — each illustrating a reusable pattern you will see again and again on harder problems.&lt;/p></description></item><item><title>Linux Pipelines and File Operations: Composing Tools into Data Flows</title><link>https://www.chenk.top/en/linux/pipelines/</link><pubDate>Sat, 02 Apr 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/pipelines/</guid><description>&lt;p>The biggest productivity jump on Linux is not memorising more commands. It is learning to &lt;strong>compose small tools&lt;/strong> into clean data flows. The pipe operator &lt;code>|&lt;/code> is the embodiment of the Unix philosophy: each tool does one thing and does it well (&lt;code>grep&lt;/code> only filters, &lt;code>awk&lt;/code> only extracts fields, &lt;code>sort&lt;/code> only sorts), and you chain them into a pipeline that is readable, debuggable, and obvious to maintain. 
This article starts from the data-flow model &amp;ndash; &lt;code>stdin&lt;/code>, &lt;code>stdout&lt;/code>, &lt;code>stderr&lt;/code> and the file descriptors behind them &amp;ndash; then walks through every common redirection form (&lt;code>&amp;gt;&lt;/code>, &lt;code>&amp;gt;&amp;gt;&lt;/code>, &lt;code>&amp;lt;&lt;/code>, &lt;code>2&amp;gt;&lt;/code>, &lt;code>2&amp;gt;&amp;amp;1&lt;/code>, &lt;code>&amp;amp;&amp;gt;&lt;/code>), builds up the text-processing toolchain (&lt;code>grep&lt;/code>, &lt;code>awk&lt;/code>, &lt;code>sed&lt;/code>, &lt;code>cut&lt;/code>, &lt;code>tr&lt;/code>, &lt;code>sort&lt;/code>, &lt;code>uniq&lt;/code>, &lt;code>xargs&lt;/code>, &lt;code>tee&lt;/code>), and ends with two patterns most introductions skip: named pipes (FIFOs) and process substitution. By the end you should be able to replace many &amp;ldquo;I need to write a script&amp;rdquo; tasks with one or two readable command lines, and read other people&amp;rsquo;s one-liners without squinting.&lt;/p></description></item><item><title>Linux Process and Resource Management: From `top` to cgroups</title><link>https://www.chenk.top/en/linux/process-resource-management/</link><pubDate>Sun, 20 Mar 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/process-resource-management/</guid><description>&lt;p>The job of a Linux operator is rarely &amp;ldquo;memorise more commands&amp;rdquo;. It is to take a fuzzy symptom — &lt;em>the site feels slow, the API timed out, the box is unresponsive&lt;/em> — and quickly &lt;strong>map it to the right axis&lt;/strong>: is the CPU saturated, is memory being eaten by cache (which is fine) or by a runaway process (which is not), is the disk queue full, is some socket leaking? 
Once the axis is named, the tool follows almost mechanically.&lt;/p></description></item><item><title>Linux Service Management: systemd, systemctl, and journald</title><link>https://www.chenk.top/en/linux/service-management/</link><pubDate>Mon, 07 Mar 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/service-management/</guid><description>&lt;p>A &amp;ldquo;service&amp;rdquo; on Linux is a long-running background process whose
job is to be there when something needs it: synchronise the clock,
listen for SSH connections, accept HTTP requests, run a backup at 3 AM.
You almost never start one of these by hand. Something has to start
them at boot, restart them when they crash, capture their logs, decide
what depends on what, and shut everything down cleanly when the machine
powers off. On every modern distribution that something is
&lt;strong>systemd&lt;/strong>.&lt;/p></description></item><item><title>Linux User Management: Users, Groups, sudo, and Security</title><link>https://www.chenk.top/en/linux/user-management/</link><pubDate>Tue, 22 Feb 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/user-management/</guid><description>&lt;p>If you only ever ran &lt;code>useradd&lt;/code> and &lt;code>passwd&lt;/code> on a single laptop, you can probably get away without thinking about any of this. The moment more than one human (or more than one service) shares a host, &amp;ldquo;user management&amp;rdquo; stops being paperwork and starts being the security model: it decides who can log in, which UID owns the files a process writes, which commands &lt;code>sudo&lt;/code> will lift to root, and how long a stolen password remains useful.&lt;/p></description></item><item><title>Linux Package Management: apt, dnf, pacman, and Building from Source</title><link>https://www.chenk.top/en/linux/package-management/</link><pubDate>Wed, 09 Feb 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/package-management/</guid><description>&lt;p>Most people learn package management as three commands: &lt;code>install&lt;/code>, &lt;code>remove&lt;/code>, &lt;code>upgrade&lt;/code>. That works until something goes wrong - a dependency conflict, an upgrade that won&amp;rsquo;t apply, a kernel that doesn&amp;rsquo;t boot, a mirror that times out from inside China. At that point you need a model of what is actually happening: what a &lt;em>package&lt;/em> contains, what the &lt;em>manager&lt;/em> is solving for, where it stores state, and how the difference between Debian&amp;rsquo;s &lt;code>apt/dpkg&lt;/code> and Red Hat&amp;rsquo;s &lt;code>dnf/rpm&lt;/code> shows up at 2 a.m. 
on a production box.&lt;/p></description></item><item><title>Linux Disk Management: Partitions, Filesystems, LVM, and the Mount Stack</title><link>https://www.chenk.top/en/linux/disk-management/</link><pubDate>Thu, 27 Jan 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/disk-management/</guid><description>&lt;p>Disk problems in production almost never have a one-line fix. You are
usually navigating a layered stack: the &lt;strong>block device&lt;/strong> (a physical
or virtual disk), the &lt;strong>partition table&lt;/strong> (MBR or GPT), an optional
&lt;strong>LVM&lt;/strong> layer that decouples filesystems from disks, the
&lt;strong>filesystem driver&lt;/strong> (ext4, xfs, btrfs) that gives meaning to the
raw bytes, and finally the &lt;strong>mount point&lt;/strong> in the directory tree that
applications actually open files through. Most outages I have seen
become tractable the moment you can name which layer is misbehaving.&lt;/p></description></item><item><title>Linux File Permissions: rwx, chmod, chown, and Beyond</title><link>https://www.chenk.top/en/linux/file-permissions/</link><pubDate>Fri, 14 Jan 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/file-permissions/</guid><description>&lt;p>File permissions look elementary — &lt;code>chmod 755&lt;/code>, done — but they remain one of the top causes of production incidents I see: a service won&amp;rsquo;t start, a deploy script silently does nothing, Nginx returns &lt;code>403&lt;/code>, a shared directory leaks, or &lt;code>rm&lt;/code> refuses on a file that &amp;ldquo;should&amp;rdquo; be removable. Memorising magic numbers does not get you out of any of these. What does is understanding three things at the same time:&lt;/p></description></item><item><title>Linux Basics: Core Concepts and Essential Commands</title><link>https://www.chenk.top/en/linux/basics/</link><pubDate>Sat, 01 Jan 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linux/basics/</guid><description>&lt;p>The &amp;ldquo;difficulty&amp;rdquo; of Linux rarely lives in the commands themselves. The hard part is whether you have a clear &lt;em>map&lt;/em> of the system: why it dominates servers, what multi-user and per-file permissions actually buy you, what changes when you switch between Debian and Red Hat lineages, and what to do in the first ten minutes after an SSH prompt opens. This post is the &lt;strong>entry guide&lt;/strong> for the entire Linux series. It first builds the mental model &amp;ndash; philosophy, distributions, the FHS tree &amp;ndash; and then walks you through the commands you will use ten times an hour: &lt;code>cd ls pwd&lt;/code>, &lt;code>cp mv rm mkdir&lt;/code>, &lt;code>cat less head tail&lt;/code>, &lt;code>find grep&lt;/code>, plus pipelines, redirection, SSH, and a quick taste of permissions and processes. 
Each topic is intentionally &lt;strong>kept short&lt;/strong>; depth lives in the dedicated articles (File Permissions, Disk Management, User Management, Service Management, Process Management, Package Management, Advanced File Operations).&lt;/p></description></item><item><title>About</title><link>https://www.chenk.top/en/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://www.chenk.top/en/about/</guid><description/></item><item><title>Archives</title><link>https://www.chenk.top/en/archives/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://www.chenk.top/en/archives/</guid><description/></item><item><title>Projects</title><link>https://www.chenk.top/en/projects/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://www.chenk.top/en/projects/</guid><description/></item><item><title>Series</title><link>https://www.chenk.top/en/series/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://www.chenk.top/en/series/</guid><description/></item></channel></rss>