Distributed Systems on Chen Kai Blog

Product Thinking (1): Architecture Design — From Monolith to Autonomous Agents

Sat, 30 May 2026 09:00:00 +0000

The Shape of a System

Every architecture is a frozen argument. It records what you believed about the problem at the time you committed the code. Looking back across four systems I built over eighteen months — a marketing content platform (~70k lines TypeScript), a zero-dependency skill routing engine, an autonomous research agent (~315k lines Python), and a multi-model coding orchestrator — I can trace how my architectural instincts shifted. Not always forward. Sometimes sideways. But there is a clear progression: from “keep it in one process” to “let the agents govern themselves.”

System Design (8): Case Studies — URL Shortener, Chat System, News Feed

Sun, 27 Jul 2025 09:00:00 +0000

The best way to learn system design is to practice it. Reading about individual components — caching, queues, load balancers — builds your vocabulary, but designing a complete system is where you learn to compose those components into something that actually works.

This article walks through three classic system design problems end to end. Each follows the framework from the first article in this series: clarify requirements, estimate scale, design the architecture, deep dive into critical components, and identify bottlenecks.

System Design (6): Microservices vs Monoliths — The Honest Tradeoff

Tue, 22 Jul 2025 09:00:00 +0000

In 2018, the team behind Segment (a customer data platform processing billions of events per month) published a blog post titled “Goodbye Microservices.” They had decomposed their monolith into over 140 microservices, and the result was not the engineering utopia they expected. Instead, they spent most of their time fighting the complexity of the distributed system itself: service discovery failures, cascading timeouts, inconsistent deployment pipelines, and an explosion of inter-service communication bugs. They consolidated back to a monolith and reported dramatic improvements in developer productivity and system reliability.

Databases (7): Distributed Transactions — 2PC, Saga, and Why Consensus Is Hard

Sun, 28 Apr 2024 09:00:00 +0000

Everything we covered about transactions in Article 3 assumed a single database server: one machine, one transaction log, one lock manager. When your data spans multiple machines (through sharding, microservices with separate databases, or strongly consistent replication), you face the hardest problem in distributed systems: how do you get multiple machines to agree?

The Distributed Transaction Problem

Consider an e-commerce system with separate services for orders and inventory, each with its own database:
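As a minimal sketch of why this setup is risky, the Python below simulates the two services with in-memory dictionaries standing in for their databases. All class and function names here (`OrderService`, `InventoryService`, `place_order`, `demo`) are hypothetical and chosen only for illustration; they are not taken from the article. The sketch shows a naive "dual write": each service commits locally, and nothing coordinates the two commits.

```python
class ServiceUnavailable(Exception):
    """Raised when a (simulated) remote service cannot be reached."""


class OrderService:
    def __init__(self):
        self.orders = {}  # stands in for the orders database

    def create_order(self, order_id, sku, qty):
        # Local commit: visible immediately, independent of inventory.
        self.orders[order_id] = {"sku": sku, "qty": qty}


class InventoryService:
    def __init__(self, stock, available=True):
        self.stock = dict(stock)    # stands in for the inventory database
        self.available = available  # False simulates an outage or timeout

    def reserve(self, sku, qty):
        if not self.available:
            raise ServiceUnavailable("inventory service timed out")
        if self.stock.get(sku, 0) < qty:
            raise ValueError("insufficient stock")
        self.stock[sku] -= qty  # local commit


def place_order(orders, inventory, order_id, sku, qty):
    """Naive dual write: two independent commits, no coordination."""
    orders.create_order(order_id, sku, qty)
    # If this call raises, the order above is already committed.
    inventory.reserve(sku, qty)


def demo():
    orders = OrderService()
    inventory = InventoryService({"widget": 5}, available=False)
    try:
        place_order(orders, inventory, "o-1", "widget", 2)
    except ServiceUnavailable:
        pass  # the caller sees an error, but half the work has persisted
    return orders.orders, inventory.stock
```

In this simulated failure, the order record exists while the stock count is unchanged, so the two databases disagree. Closing that gap is what coordination schemes such as 2PC and sagas aim to do.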

Databases (6): Replication and Partitioning — Scaling Beyond One Machine

Fri, 26 Apr 2024 09:00:00 +0000

A single database server can handle a remarkable amount of load: a well-tuned PostgreSQL instance can serve tens of thousands of queries per second. But eventually you hit a wall. Maybe you need more read throughput than one machine can provide. Maybe you need your data to survive a data center fire. Maybe your dataset exceeds what fits on a single disk. That is when you need replication and partitioning.