System Design on Chen Kai Blog

System Design (8): Case Studies — URL Shortener, Chat System, News Feed

Sun, 27 Jul 2025 09:00:00 +0000

The best way to learn system design is to practice it. Reading about individual components — caching, queues, load balancers — builds your vocabulary, but designing a complete system is where you learn to compose those components into something that actually works.

This article walks through three classic system design problems end to end. Each follows the framework from the first article in this series: clarify requirements, estimate scale, design the architecture, deep dive into critical components, and identify bottlenecks.

System Design (7): Data Pipelines — Batch, Stream, and the Lambda Architecture

Thu, 24 Jul 2025 09:00:00 +0000

Every second, a large e-commerce platform generates thousands of data points: page views, search queries, add-to-cart events, purchases, inventory changes, price updates, and delivery status changes. This raw data is useless in its original form — scattered across dozens of services, stored in different formats, and arriving at unpredictable rates. The system that transforms this raw data into actionable insights — real-time dashboards, personalized recommendations, fraud detection alerts, business reports — is the data pipeline.

System Design (6): Microservices vs Monoliths — The Honest Tradeoff

Tue, 22 Jul 2025 09:00:00 +0000

In 2018, the team behind Segment (a customer data platform processing billions of events per month) published a blog post titled “Goodbye Microservices.” They had decomposed their monolith into over 140 microservices, and the result was not the engineering utopia they expected. Instead, they spent most of their time fighting the complexity of the distributed system itself: service discovery failures, cascading timeouts, inconsistent deployment pipelines, and an explosion of inter-service communication bugs. They consolidated back to a monolith and reported dramatic improvements in developer productivity and system reliability.

System Design (5): Message Queues and Event-Driven Architecture

Sat, 19 Jul 2025 09:00:00 +0000

In 2011, LinkedIn’s engineering team was struggling with a problem that many growing companies face. What had started as a monolithic application had become a web of tightly coupled services, each making synchronous calls to half a dozen others. When any single service went down, cascading failures rippled through the entire system. Deploying a change to one service required coordinating with every team whose service it called.

Their solution was Apache Kafka — a distributed event log that decoupled producers from consumers. Instead of Service A calling Service B directly, Service A writes an event to Kafka, and Service B reads it when it is ready. If Service B is down, the events wait. If Service B is slow, it processes at its own pace. The producer does not need to know or care about the consumer.
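
The decoupling described above can be illustrated with a toy in-memory log. This is a minimal sketch, not Kafka's API: the `EventLog` class, its `append` and `poll` methods, and the consumer names are all hypothetical, chosen only to show that each consumer tracks its own read position in a shared append-only log.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only log (hypothetical; stands in for one Kafka partition)."""

    def __init__(self):
        self._events = []                 # events are only ever appended
        self._offsets = defaultdict(int)  # next unread position per consumer

    def append(self, event):
        """Producer side: store the event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer, max_events=10):
        """Consumer side: return unread events and advance this consumer's offset."""
        start = self._offsets[consumer]
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
log.append("order_created")
log.append("order_paid")       # Service B is down; the events simply wait
fast = log.poll("service_c")   # another consumer reads at its own pace
late = log.poll("service_b")   # Service B comes back and catches up
```

The producer never references a consumer, and a consumer that was offline still sees every event in order once it resumes polling.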

System Design (4): Caching — Where to Cache, What to Evict, and When Caching Hurts

Thu, 17 Jul 2025 09:00:00 +0000

There is an old joke in computer science that the two hardest problems are cache invalidation, naming things, and off-by-one errors. The joke works because cache invalidation really is that hard. But caching is also the single most effective technique for improving system performance. A well-placed cache can reduce latency by 100x, cut database load by 90%, and save thousands of dollars in infrastructure costs per month.

The trick is knowing where to cache, what patterns to use, and — critically — when caching will make your system worse instead of better.
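
A quick back-of-the-envelope calculation shows how hit ratio drives those numbers. This is a rough sketch under assumed figures (a 1 ms cache read, a 100 ms database read, a 90% hit ratio); the function names are illustrative and not taken from the article.

```python
def average_latency_ms(hit_ratio, cache_ms, db_ms):
    """Expected read latency when a fraction hit_ratio of reads is served from cache."""
    return hit_ratio * cache_ms + (1 - hit_ratio) * db_ms


def db_load_fraction(hit_ratio):
    """Fraction of reads that still reach the database (the cache misses)."""
    return 1 - hit_ratio


# Assumed figures for illustration only.
hit_ratio, cache_ms, db_ms = 0.9, 1.0, 100.0

avg = average_latency_ms(hit_ratio, cache_ms, db_ms)   # roughly 10.9 ms
remaining_load = db_load_fraction(hit_ratio)           # roughly 0.1 of reads
speedup = db_ms / avg                                  # roughly 9x on average
```

Under these assumptions a cache hit is 100x faster than a database read and database load drops by about 90%, but the average request is only about 9x faster, because the remaining misses dominate the mean.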

System Design (3): API Design — REST, gRPC, GraphQL, and Choosing Wisely

Tue, 15 Jul 2025 09:00:00 +0000

In 2015, Facebook published a blog post introducing GraphQL, describing how their mobile app was drowning in REST API calls. A single news feed screen required data from posts, users, comments, likes, and media, and each came from a separate endpoint that returned far more data than the client needed. The over-fetching was killing mobile performance on slow networks. GraphQL was their answer, but it is not a universal one.

Every API style exists because it solves a specific set of problems well, and every API style creates new problems. The skill is matching the right protocol to the right context.

System Design (2): DNS, CDN, and Load Balancing — The First Three Hops

Sat, 12 Jul 2025 09:00:00 +0000

In 2017, a single misconfigured DNS record at a major cloud provider took down a significant portion of the internet for several hours. Thousands of websites became unreachable — not because their servers were down, but because the system that translates domain names into IP addresses stopped working correctly. The incident was a stark reminder that the infrastructure we take for granted — DNS, CDN, load balancers — is the foundation everything else rests on.

System Design (1): Thinking in Systems — Load, Latency, and the Art of Estimation

Thu, 10 Jul 2025 09:00:00 +0000

A friend once asked me to help debug a performance problem. Their photo-sharing app worked fine in development but collapsed under production traffic. The database was melting, the API gateway was timing out, and users were seeing 504 errors. When I asked how many requests per second the system was handling, the answer was “I don’t know.” When I asked what the expected load was, the answer was “I didn’t think about that.”