<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Standalone Articles on Chen Kai Blog</title><link>https://www.chenk.top/en/standalone/</link><description>Recent content in Standalone Articles on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 21 Jan 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/standalone/index.xml" rel="self" type="application/rss+xml"/><item><title>Solving Constrained Mean-Variance Portfolio Optimization Using Spiral Optimization</title><link>https://www.chenk.top/en/standalone/solving-constrained-mean-variance-portfolio-optimization-pro/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/solving-constrained-mean-variance-portfolio-optimization-pro/</guid><description>&lt;p>Markowitz&amp;rsquo;s mean-variance model is elegant until you add real trading constraints: &amp;ldquo;if you buy a stock at all, hold at least 5% of it&amp;rdquo; and &amp;ldquo;pick exactly 10 names from the S&amp;amp;P 500.&amp;rdquo; The closed-form quadratic program quietly mutates into a &lt;em>mixed-integer nonlinear program&lt;/em> (MINLP), and the standard solver chain (Lagrange multipliers, KKT conditions, interior-point methods) stops working. The paper reviewed here applies the &lt;strong>Spiral Optimization Algorithm&lt;/strong> (SOA), a population-based metaheuristic, to this problem and shows it can find competitive feasible solutions where gradient methods fail outright.&lt;/p></description></item><item><title>AI Agents Complete Guide: From Theory to Industrial Practice</title><link>https://www.chenk.top/en/standalone/ai-agents-complete-guide/</link><pubDate>Wed, 31 Dec 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/ai-agents-complete-guide/</guid><description>&lt;p>A chatbot answers questions. 
An &lt;em>agent&lt;/em> gets things done &amp;ndash; it browses, runs code, calls APIs, queries databases, and iterates until the job is finished. The same LLM sits behind both, but the wrapper is different: an agent runs inside a loop with tools, memory, and the ability to inspect its own work.&lt;/p>
&lt;p>This guide is the long-form version of that idea. It covers the four core capabilities (planning, memory, tool use, reflection), the major framework families, multi-agent collaboration, evaluation, and the production concerns that decide whether an agent ships or quietly fails on a Tuesday afternoon.&lt;/p></description></item><item><title>Prompt Engineering Complete Guide: From Zero to Advanced Optimization</title><link>https://www.chenk.top/en/standalone/prompt-engineering-complete-guide/</link><pubDate>Wed, 15 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/prompt-engineering-complete-guide/</guid><description>&lt;p>The same model, two prompts: one gets 17% accuracy on grade-school math, the other gets 78%. The difference is not magic — it is prompt engineering. This guide shows you the techniques that work, the research behind them, and how to systematically optimize prompts for production.&lt;/p>
&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Foundations&lt;/strong> — zero-shot, few-shot, many-shot, task decomposition, and the five-block prompt skeleton.&lt;/li>
&lt;li>&lt;strong>Reasoning techniques&lt;/strong> — Chain-of-Thought, Self-Consistency, Tree of Thoughts, Graph of Thoughts, ReAct.&lt;/li>
&lt;li>&lt;strong>Automation&lt;/strong> — Automatic Prompt Engineering (APE), DSPy, LLMLingua compression.&lt;/li>
&lt;li>&lt;strong>Practical templates&lt;/strong> — structured output, code generation, data extraction, multi-turn chat.&lt;/li>
&lt;li>&lt;strong>Evaluation and debugging&lt;/strong> — metrics, A/B testing, error analysis, the failure-mode toolkit.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Prerequisites.&lt;/strong> Basic Python; experience calling any LLM API. No math background required.&lt;/p></description></item><item><title>Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization</title><link>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</link><pubDate>Mon, 22 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</guid><description>&lt;p>Real data matrices are almost never both square and full rank: correlated features, too few samples, and noise-induced ill-conditioning all make &amp;ldquo;matrix inverse&amp;rdquo; either undefined or numerically useless. The &lt;strong>pseudoinverse&lt;/strong> (Moore-Penrose inverse) preserves the &lt;em>spirit&lt;/em> of an inverse while dropping the impossible-to-meet requirements: it redefines the &amp;ldquo;solution&amp;rdquo; of a linear system as the &lt;strong>least-squares solution&lt;/strong>, breaking ties by picking the one with &lt;strong>minimum norm&lt;/strong>. This post derives the pseudoinverse from that least-squares viewpoint, gives the four Penrose conditions, builds it from the SVD, and connects this single object to &lt;strong>the Eckart-Young low-rank approximation theorem&lt;/strong>, &lt;strong>PCA&lt;/strong>, &lt;strong>recommender-system matrix factorization&lt;/strong>, and &lt;strong>LoRA fine-tuning&lt;/strong>.&lt;/p></description></item><item><title>Reparameterization Trick &amp; Gumbel-Softmax: A Deep Dive</title><link>https://www.chenk.top/en/standalone/reparameterization-gumbel-softmax/</link><pubDate>Thu, 24 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/reparameterization-gumbel-softmax/</guid><description>&lt;p>The moment your model contains a sampling step, training hits a hard wall: &lt;strong>how do gradients flow through a random node?&lt;/strong>&lt;/p>
&lt;p>The reparameterization trick has a clean answer — rewrite $z\sim p_\theta(z)$ as $z=g_\theta(\epsilon)$, isolating the randomness in a parameter-free noise variable $\epsilon$, so backprop can flow through $g_\theta$. The trouble starts with discrete variables: operations like $\arg\max$ are not differentiable. &lt;strong>Gumbel-Softmax&lt;/strong> (a.k.a. the Concrete distribution) replaces the discrete sample with a tempered softmax over Gumbel-perturbed logits, giving you a smooth, differentiable surrogate that you can train end-to-end.&lt;/p></description></item><item><title>LLM Workflows and Application Architecture: Enterprise Implementation Guide</title><link>https://www.chenk.top/en/standalone/llm-workflows-architecture/</link><pubDate>Sat, 21 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/llm-workflows-architecture/</guid><description>&lt;p>Most LLM tutorials end where the interesting work begins. They show you how to call a chat completion endpoint, attach a vector store, and wrap the whole thing in a Streamlit demo. None of that is wrong, but none of it is what breaks at 3 a.m. when 10,000 users hit your service at once and every other answer is a hallucination.&lt;/p>
&lt;p>This article is about everything that comes after the demo. It is opinionated on purpose: production LLM systems are mostly plain distributed systems with one non-deterministic component bolted on, and most of the engineering effort goes into containing that non-determinism. We will work through seven dimensions — application architecture, workflow patterns, the RAG-vs-fine-tune decision, deployment topology, cost, observability, and enterprise integration — keeping each one short, concrete, and grounded in the levers that actually move the needle.&lt;/p></description></item><item><title>Symplectic Geometry and Structure-Preserving Neural Networks</title><link>https://www.chenk.top/en/standalone/symplectic-geometry-and-structure-preserving-neural-networks/</link><pubDate>Sat, 21 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/symplectic-geometry-and-structure-preserving-neural-networks/</guid><description>&lt;p>Train a vanilla feedforward network to predict a one-dimensional harmonic oscillator. Validate it on the next ten time steps &amp;ndash; the error is fine. Now roll it out for a thousand steps. The orbit no longer closes, the energy creeps upward, and what should be a periodic motion turns into a slow spiral. The network learned to fit data points; it never learned the &lt;em>physics&lt;/em>. 
Structure-preserving networks fix this by baking geometric invariants &amp;ndash; energy conservation, the symplectic 2-form, the Euler-Lagrange equations &amp;ndash; directly into the architecture, so the learned model cannot violate them no matter how long you integrate.&lt;/p></description></item><item><title>Prefix-Tuning: Optimizing Continuous Prompts for Generation</title><link>https://www.chenk.top/en/standalone/prefix-tuning/</link><pubDate>Mon, 31 Mar 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/prefix-tuning/</guid><description>&lt;p>Fine-tuning a 1.5B-parameter GPT-2 model for each downstream task means saving a fresh 1.5B-parameter checkpoint every time. Across a dozen tasks that is a substantial storage and serving headache, and it makes sharing a single base model essentially impossible. &lt;em>Prefix-Tuning&lt;/em> (Li &amp;amp; Liang, 2021) takes the opposite stance: freeze every weight of the language model, and learn a tiny block of continuous vectors — the &lt;em>prefix&lt;/em> — that is fed into the attention layers as if it were context the model already attended to. The model never changes; only the prefix does, and a different prefix produces a different &amp;ldquo;personality&amp;rdquo; on demand.&lt;/p></description></item><item><title>Vim Essentials: Modal Editing, Motions, and a Repeatable Workflow</title><link>https://www.chenk.top/en/standalone/vim-essentials/</link><pubDate>Fri, 06 Dec 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/vim-essentials/</guid><description>&lt;p>Most people quit Vim because they try to memorize shortcuts. That is the wrong frame. Vim is a &lt;em>small language&lt;/em>: learn the grammar &amp;ndash; &lt;strong>operator + motion&lt;/strong> &amp;ndash; and you can express any edit without ever opening a cheat sheet again. 
This guide walks you through the 80% of Vim you will use daily, then shows how the remaining 20% composes naturally from the same handful of rules.&lt;/p>
&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>The single core idea: &lt;strong>modes&lt;/strong> plus &lt;strong>composable operations&lt;/strong> (operator + motion)&lt;/li>
&lt;li>The handful of motions, text objects, and operators that cover almost everything&lt;/li>
&lt;li>File operations, search &amp;amp; replace, macros, marks, registers&lt;/li>
&lt;li>Buffers vs windows vs tabs &amp;ndash; the mental model people most often get wrong&lt;/li>
&lt;li>A minimal &lt;code>.vimrc&lt;/code> and a one-week deliberate-practice plan to build muscle memory&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Any terminal (Vim ships with virtually every Unix-like system)&lt;/li>
&lt;li>A willingness to feel slow for about a week&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-the-core-idea----modes-plus-a-tiny-grammar">1. The core idea &amp;ndash; modes plus a tiny grammar&lt;/h2>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/standalone/vim-essentials/fig1_mode_state_diagram.png" alt="The Four Modes of Vim" loading="lazy" decoding="async">
 
&lt;/figure>
&lt;/p></description></item><item><title>MoSLoRA: Mixture-of-Subspaces in Low-Rank Adaptation</title><link>https://www.chenk.top/en/standalone/moslora/</link><pubDate>Sat, 12 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/moslora/</guid><description>&lt;p>LoRA is the default tool for adapting a frozen base model: cheap, stable, mergeable, and good enough for most single-task settings. But the moment your fine-tuning data is genuinely heterogeneous &amp;ndash; code mixed with math, instruction following mixed with creative writing, several domains in one adapter &amp;ndash; a single low-rank subspace starts to feel cramped. You can grow $r$, but cost grows with it and you still get &lt;em>one&lt;/em> subspace, just a fatter one.&lt;/p></description></item><item><title>Tennis-Scene Computer Vision: From Paper Survey to Production</title><link>https://www.chenk.top/en/standalone/tennis-cv-system-design/</link><pubDate>Mon, 07 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/tennis-cv-system-design/</guid><description>&lt;p>A 6.7 cm tennis ball travels at over 200 km/h. Reconstructing its 3D trajectory from eight 4K cameras in real time, while simultaneously classifying what stroke each player is making, is a system problem that touches &lt;strong>small-object detection, multi-view geometry, Kalman filtering, physics modelling, and human-pose estimation&lt;/strong> — all at once. 
This post walks the same path you&amp;rsquo;d walk at deployment time: state the constraints, survey the literature, choose, then build, and finally lay out a millisecond-by-millisecond budget for what runs in production.&lt;/p></description></item><item><title>HCGR: Hyperbolic Contrastive Graph Representation Learning for Session-based Recommendation</title><link>https://www.chenk.top/en/standalone/hcgr/</link><pubDate>Sat, 16 Dec 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/hcgr/</guid><description>&lt;p>A user opens a sneaker app, taps &amp;ldquo;running shoes&amp;rdquo;, drills into a brand, then a price band, then a single SKU. That trajectory is a &lt;em>tree&lt;/em>: each click narrows the candidate set roughly multiplicatively. In Euclidean space you need many dimensions to keep all the leaves of that tree apart, because Euclidean volume only grows polynomially with radius. In hyperbolic space volume grows &lt;em>exponentially&lt;/em> with radius, so the tree fits naturally — a few dimensions are enough to keep the whole long tail untangled.&lt;/p></description></item><item><title>Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)</title><link>https://www.chenk.top/en/standalone/kernel-methods/</link><pubDate>Sun, 15 Oct 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/kernel-methods/</guid><description>&lt;p>You have non-linear data and a linear algorithm. The kernel trick lets you run that linear algorithm on the non-linear data &amp;ndash; without ever writing down the high-dimensional feature map. This guide builds the intuition first, then the math, then a practical toolkit you can ship.&lt;/p>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The kernel trick: why it works and what it actually buys you&lt;/li>
&lt;li>Mathematical foundations: positive-definite kernels, RKHS, Mercer&amp;rsquo;s theorem&lt;/li>
&lt;li>Common kernels: RBF, polynomial, linear, Matérn, periodic, sigmoid&lt;/li>
&lt;li>Hyperparameter tuning: grid search, random search, marginal likelihood&lt;/li>
&lt;li>Troubleshooting: overfitting, underfitting, numerical instability, scale&lt;/li>
&lt;li>A kernel-selection decision tree for SVM, GP, and Kernel PCA&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Linear algebra basics (dot products, eigendecomposition)&lt;/li>
&lt;li>Familiarity with SVM or Gaussian Processes (conceptual)&lt;/li>
&lt;li>Python + scikit-learn&lt;/li>
&lt;/ul>
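Before the tour begins, a minimal sketch of the trick itself (my own illustration, not code from the article): for the degree-2 polynomial kernel on 2-D inputs, $k(x,y)=(x\cdot y)^2$ equals a plain dot product in an explicit 3-D feature space, yet the kernelized computation never builds that space.

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input, chosen so that
    # (x . y)^2 == phi(x) . phi(y)
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

k_implicit = np.dot(x, y) ** 2        # kernel trick: stays in 2-D
k_explicit = np.dot(phi(x), phi(y))   # explicit 3-D feature space

print(np.isclose(k_implicit, k_explicit))  # True
```

The same identity is what lets an SVM separate the ring dataset in the figure below without ever materializing the lifted coordinates.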
&lt;hr>
&lt;h1 id="why-kernel-methods-matter">Why Kernel Methods Matter&lt;/h1>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/standalone/kernel-methods/fig1_kernel_trick.png" alt="The Kernel Trick: a 2D ring becomes linearly separable in 3D" loading="lazy" decoding="async">
 
&lt;/figure>
&lt;/p></description></item><item><title>Position Encoding Brief: From Sinusoidal to RoPE and ALiBi</title><link>https://www.chenk.top/en/standalone/position-encoding-brief/</link><pubDate>Wed, 20 Sep 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/position-encoding-brief/</guid><description>&lt;p>Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.&lt;/p>
&lt;p>That single design decision — &lt;em>how&lt;/em> to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner&amp;rsquo;s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — &lt;strong>length extrapolation&lt;/strong>, the ability to handle sequences longer than anything seen in training.&lt;/p></description></item><item><title>LAMP Stack on Alibaba Cloud ECS: From Fresh Instance to Production-Ready Web Server</title><link>https://www.chenk.top/en/standalone/lamp-on-ecs/</link><pubDate>Fri, 01 Sep 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/lamp-on-ecs/</guid><description>&lt;p>You have a fresh ECS instance and SSH access. Your goal is a public website running Apache, PHP and MySQL. Between you and that goal sit three classes of problems that catch every beginner the first time:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Network reachability&lt;/strong> &amp;ndash; packets are blocked at the cloud security group, the OS firewall, or because nothing is listening on the port, and to a beginner all three failures look the same: nothing happens.&lt;/li>
&lt;li>&lt;strong>Service wiring&lt;/strong> &amp;ndash; Apache, PHP and MySQL are three separate processes that have to find each other through file extensions, Unix sockets and TCP ports. Each interface has its own failure mode.&lt;/li>
&lt;li>&lt;strong>Identity and permissions&lt;/strong> &amp;ndash; Apache runs as &lt;code>www-data&lt;/code>, MySQL runs as &lt;code>mysql&lt;/code>, files are owned by &lt;code>root&lt;/code> after &lt;code>wget&lt;/code>. The wrong combination produces 403, &amp;ldquo;Access denied&amp;rdquo;, or &lt;code>chmod 777&lt;/code> desperation.&lt;/li>
&lt;/ol>
&lt;p>This guide walks through all of them in the order you actually hit them on day one, then keeps going into the things that show up on day thirty: TLS, virtual hosts, backups, source compilation, and when to stop running everything on a single box.&lt;/p></description></item><item><title>Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting</title><link>https://www.chenk.top/en/standalone/vae-guide/</link><pubDate>Sat, 26 Aug 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/vae-guide/</guid><description>&lt;p>A plain autoencoder compresses and reconstructs. A variational autoencoder learns something far more useful: a smooth, structured latent space you can &lt;em>sample&lt;/em> from to generate genuinely new data. That single change — making the encoder output a &lt;em>distribution&lt;/em> instead of a vector — turns the network from a fancy compressor into a generative model with a tractable likelihood lower bound.&lt;/p>
&lt;p>This guide walks the full path: why autoencoders fail at generation, how the ELBO derivation gets you to the loss function, why the reparameterization trick is the trick that makes everything trainable, a complete PyTorch implementation, and a tour of every common failure mode with concrete fixes.&lt;/p></description></item><item><title>paper2repo: GitHub Repository Recommendation for Academic Papers</title><link>https://www.chenk.top/en/standalone/paper2repo-github-repository-recommendation/</link><pubDate>Tue, 22 Aug 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/paper2repo-github-repository-recommendation/</guid><description>&lt;p>You read a paper, want the code, and the &amp;ldquo;code available at&amp;rdquo; link is dead, missing, or points to a stub. Search engines fall back to keyword matching over the README, which works for popular repos with descriptive names and dies on everything else. paper2repo (WWW 2020) frames this as a cross-platform recommendation problem: learn one embedding space in which a paper abstract and a GitHub repository are directly comparable by dot product, then rank.&lt;/p></description></item><item><title>Session-based Recommendation with Graph Neural Networks (SR-GNN)</title><link>https://www.chenk.top/en/standalone/session-based-recommendation-with-graph-neural-networks/</link><pubDate>Thu, 13 Jul 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/session-based-recommendation-with-graph-neural-networks/</guid><description>&lt;p>A user clicks &lt;strong>A, B, C, B, D&lt;/strong>. A sequence model reads this as five tokens and folds them into a hidden state. &lt;strong>SR-GNN&lt;/strong> sees a &lt;em>graph&lt;/em> in which the edge &lt;code>B -&amp;gt; C&lt;/code> survives even after the user returns to &lt;code>B&lt;/code>, the node &lt;code>B&lt;/code> is reused (so its in/out neighbours both inform its embedding), and the geometry of the click stream is preserved as adjacency. 
That structural insight is why &lt;a href="https://arxiv.org/abs/1811.00855" target="_blank" rel="noopener noreferrer">SR-GNN (Wu et al., AAAI 2019) &lt;span aria-hidden="true" style="font-size:0.75em; opacity:0.55; margin-left:2px;">↗&lt;/span>&lt;/a>
 outperforms purely sequential baselines such as GRU4Rec and NARM on standard session-based recommendation (SBR) benchmarks.&lt;/p></description></item><item><title>Learning Rate: From Basics to Large-Scale Training</title><link>https://www.chenk.top/en/standalone/learning-rate-guide/</link><pubDate>Mon, 13 Mar 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/learning-rate-guide/</guid><description>&lt;p>Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, &lt;strong>learning rate&lt;/strong> is the one that most often decides whether training converges, crawls, or blows up. This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.&lt;/p></description></item><item><title>Graph Contextualized Self-Attention Network (GC-SAN) for Session-based Recommendation</title><link>https://www.chenk.top/en/standalone/gcsan/</link><pubDate>Sun, 15 Jan 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/gcsan/</guid><description>&lt;p>In session-based recommendation you only see a short anonymous click sequence &amp;ndash; no user profile, no long history, no demographics. Every signal you have lives inside that single window. &lt;strong>GC-SAN&lt;/strong> (IJCAI 2019) takes the strongest two ideas of the time &amp;ndash; SR-GNN&amp;rsquo;s session graph and the Transformer&amp;rsquo;s self-attention &amp;ndash; and stacks them: a &lt;em>graph&lt;/em> view captures local transition patterns and loops, a &lt;em>sequence&lt;/em> view captures long-range intent, and a tiny weighted sum decides how much of each to trust. 
The result is a clean &amp;ldquo;best of both worlds&amp;rdquo; baseline that is genuinely hard to beat at its parameter budget.&lt;/p></description></item><item><title>Lipschitz Continuity, Strong Convexity &amp; Nesterov Acceleration</title><link>https://www.chenk.top/en/standalone/lipschitz-continuity-strong-convexity-nesterov/</link><pubDate>Tue, 27 Dec 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/lipschitz-continuity-strong-convexity-nesterov/</guid><description>&lt;p>A surprising amount of &amp;ldquo;optimizer folklore&amp;rdquo; collapses into three concepts:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>How fast can the gradient change?&lt;/strong> Lipschitz smoothness ($L$-smoothness) caps the safe step size.&lt;/li>
&lt;li>&lt;strong>How sharp is the bottom?&lt;/strong> $\mu$-strong convexity sets the convergence rate and forces the minimizer to be unique.&lt;/li>
&lt;li>&lt;strong>Can we get there faster without losing stability?&lt;/strong> Nesterov acceleration and adaptive restart cut the iteration count&amp;rsquo;s dependence on the condition number from $\kappa$ to $\sqrt{\kappa}$.&lt;/li>
&lt;/ul>
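To make the first two bullets concrete, a small numeric sketch (my own illustration, not the post's experiment): for least squares $f(x)=\tfrac12\lVert Ax-b\rVert^2$ the Hessian is $A^\top A$, so $L$ and $\mu$ are simply its extreme eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))

# Hessian of 0.5 * ||Ax - b||^2 is A^T A, so:
#   L  = largest eigenvalue  (Lipschitz constant of the gradient)
#   mu = smallest eigenvalue (strong-convexity parameter)
eigs = np.linalg.eigvalsh(A.T @ A)
mu, L = eigs[0], eigs[-1]
kappa = L / mu

print(f"safe GD step size 1/L      = {1 / L:.4f}")
print(f"condition number kappa     = {kappa:.1f}")
print(f"Nesterov scales as sqrt(k) = {np.sqrt(kappa):.1f}")
```

Reading off $1/L$, $\kappa$, and $\sqrt{\kappa}$ like this is exactly the "what step size, what rate, is acceleration worth it?" reflex the post aims to build.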
&lt;p>This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer &amp;ldquo;what step size, what rate, is acceleration worth it?&amp;rdquo;&lt;/p></description></item><item><title>Optimizer Evolution: From Gradient Descent to Adam (and Beyond, 2025)</title><link>https://www.chenk.top/en/standalone/optimizer-evolution-gd-to-adam/</link><pubDate>Fri, 09 Dec 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/optimizer-evolution-gd-to-adam/</guid><description>&lt;p>Why is &amp;ldquo;tuning the LR is an art&amp;rdquo; a meme for ResNet, while every modern LLM paper just writes &amp;ldquo;AdamW, $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$&amp;rdquo; and moves on? It is not an accident — it is the &lt;strong>end-point of three decades of optimizer evolution&lt;/strong>.&lt;/p>
&lt;p>This post walks the lineage end-to-end on a single thread: each step exists because of a &lt;strong>specific failure&lt;/strong> of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.&lt;/p></description></item><item><title>LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation</title><link>https://www.chenk.top/en/standalone/llmgr/</link><pubDate>Sat, 26 Nov 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/llmgr/</guid><description>&lt;p>Session-based recommendation lives or dies on the click graph. New items have no edges. Long-tail items have a handful of noisy edges. Yet every item ships with a title and a description that the model never reads. &lt;strong>LLMGR&lt;/strong> plugs that hole: treat the LLM as a &amp;ldquo;semantic engine&amp;rdquo; that turns text into representations a graph encoder can fuse with, then let a GNN do what it does best &amp;ndash; rank. The headline result on Amazon Music/Beauty/Pantry: HR@20 up ~8.68%, NDCG@20 up ~10.71%, MRR@20 up ~11.75% over the strongest GNN baseline, with the largest uplift concentrated on cold-start items.&lt;/p></description></item><item><title>Multimodal LLMs and Downstream Tasks: A Practitioner's Guide</title><link>https://www.chenk.top/en/standalone/multimodal-llm-downstream-tasks/</link><pubDate>Fri, 05 Aug 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/multimodal-llm-downstream-tasks/</guid><description>&lt;p>Stuffing pixels, audio, and video into a language model so it can &amp;ldquo;see,&amp;rdquo; &amp;ldquo;hear,&amp;rdquo; and reason &amp;ndash; that was a research curiosity before CLIP landed in 2021. Today it&amp;rsquo;s table stakes for most consumer-facing AI products. But shipping a Multimodal LLM (MLLM) in production turns out to be hard in places people rarely talk about. Almost never the vision encoder. 
Almost always these four:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Alignment.&lt;/strong> How does the language model &amp;ldquo;understand&amp;rdquo; what the vision encoder produces? Is the projector a 2-layer MLP or a Q-Former? Which parameters thaw during training?&lt;/li>
&lt;li>&lt;strong>Task framing.&lt;/strong> The same MLLM has to do captioning, VQA, grounding, OCR. Each needs a prompt template that doesn&amp;rsquo;t quietly drop several points of accuracy.&lt;/li>
&lt;li>&lt;strong>Cost.&lt;/strong> A 1024x1024 image becomes hundreds of visual tokens. Prefill is brutal. Stretch that to video and the bill goes vertical. Token compression, KV cache reuse, and batching are not optional.&lt;/li>
&lt;li>&lt;strong>Evaluation.&lt;/strong> A model that scores 80 on MMBench can still hallucinate confidently on your customer&amp;rsquo;s invoice. Public benchmarks are the easy part.&lt;/li>
&lt;/ol>
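To put a number on the cost bullet, a back-of-the-envelope sketch (assuming a ViT-style patchifier with square inputs and no token compression; the patch size of 14 is an assumption, not a claim about any specific model):

```python
def visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Rough visual-token count for a ViT-style patchifier.
    Assumes a square image, no resizing, no token compression."""
    per_side = image_size // patch_size
    return per_side * per_side

# A 1024x1024 image at patch size 14: thousands of prefill tokens.
print(visual_tokens(1024))  # 5329
# A 336x336 input at patch size 14: 576 tokens.
print(visual_tokens(336))   # 576
```

Multiply the first number by frames per second and it is clear why video makes the bill go vertical without aggressive compression.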
&lt;p>This post follows the natural research arc &amp;ndash; architecture, model families, downstream tasks, fine-tuning, evaluation, deployment &amp;ndash; and tries to be specific enough at each stop that you can act on it. Less &amp;ldquo;what&amp;rsquo;s possible,&amp;rdquo; more &amp;ldquo;what to actually pick.&amp;rdquo;&lt;/p></description></item><item><title>Operating System Fundamentals: A Deep Dive</title><link>https://www.chenk.top/en/standalone/operating-system-fundamentals-deep-dive/</link><pubDate>Mon, 01 Aug 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/operating-system-fundamentals-deep-dive/</guid><description>&lt;p>Open a terminal and type &lt;code>cat hello.txt&lt;/code>. The instant you press Enter, at least seven layers of machinery wake up: bash parses the line, fork+execve launches the cat process, the kernel hands it a virtual address space, cat issues a &lt;code>read()&lt;/code> syscall, the CPU traps into kernel mode, VFS dispatches to ext4, the block layer queues an NVMe request, the SSD DMA-writes the bytes back, an interrupt wakes cat, the bytes are copied through the page cache into the user buffer, and finally something appears on your screen.&lt;/p></description></item><item><title>Proximal Operator: From Moreau Envelope to ISTA/FISTA and ADMM</title><link>https://www.chenk.top/en/standalone/proximal-operator/</link><pubDate>Mon, 25 Jul 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/proximal-operator/</guid><description>&lt;p>When your objective contains a non-smooth piece (sparse regularisation, total variation, an indicator of a constraint set) or a constraint that is hard to handle directly, &amp;ldquo;just do gradient descent&amp;rdquo; stalls &amp;ndash; there is no gradient at the kink, or every step violates feasibility. 
The &lt;strong>proximal operator&lt;/strong> is the engineered, beautiful workaround: think of each update as &amp;ldquo;take a step on the smooth part, then run a tiny penalised minimisation that pulls the iterate back toward a structured solution&amp;rdquo;.&lt;/p></description></item><item><title>Graph Neural Networks for Learning Equivariant Representations of Neural Networks</title><link>https://www.chenk.top/en/standalone/gnn-equivariant-representations/</link><pubDate>Fri, 22 Jul 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/gnn-equivariant-representations/</guid><description>&lt;p>You can shuffle the hidden neurons of a trained MLP and get the &lt;em>exact&lt;/em> same function back &amp;ndash; but the flat parameter vector now looks completely different. This single fact ruins most attempts at &amp;ldquo;learning over neural networks&amp;rdquo;: naive representations treat two functionally identical models as two unrelated points in parameter space, and the downstream learner wastes capacity rediscovering a symmetry it should have for free. This paper &amp;ndash; &lt;em>Graph Neural Networks for Learning Equivariant Representations of Neural Networks&lt;/em> (Kofinas et al., ICML 2024) &amp;ndash; proposes the clean fix: turn the network itself into a graph, then use a GNN whose architecture &lt;em>natively&lt;/em> respects the relevant permutation symmetry.&lt;/p></description></item></channel></rss>