About

Chen Kai

Engineer · Writer · Building long-running agent systems @ Alibaba Cloud

chenkai66 chenkai.nb.666@gmail.com usually replies within 24h

4.8 yr writing & building

2 languages, each from scratch

1 problem, many disguises

24/7 systems never sleep

I build long-running agent systems that recover from their own failures — the kind that take over an entire workflow and run for days or weeks, not one-question-one-answer chatbots.

Underneath, it’s one set of problems: getting several models to divide work and check each other instead of grading their own homework; compressing each failure into something reusable; keeping shared memory from drifting across sessions; and closing the gap between a demo that works and a system you’d actually leave running overnight.

I also write technical articles, separately in Chinese and English — though that’s a byproduct.

The through-line

Marketing, research, games — different surfaces, one machine underneath

My work spans surfaces that look unrelated — cross-border marketing, autonomous research, game generation, agent control planes. To me they're one problem wearing different clothes: getting a crowd of models to divide labor and correct each other, compressing every failure into something reusable, and keeping a system running honestly for days or weeks without collapsing. Each new domain is the same foundation, tested once more against reality.

See the systems, one by one

The question I keep returning to

How do long-running systems stay resilient through failure, model swaps, cost pressure, and infrastructure migration?

Concrete levers: dynamic token budget across providers, compressing failures into reusable Skills, type-converging shared memory, and bridging the gap between demo and production-ready observability, replay, and trust.

Principles

A few principles I code by

Judgment firstRules exist to be broken — once you can say why. No principle outranks the specific call in front of you.
Docs firstIf you cannot write it down clearly, you have not thought it through. The design lands as prose first; code is one implementation of it.
No early abstractionThree similar lines beat one premature abstraction. Let the pattern surface on its own, then close it up.
Stability over flashRunning for a week without collapsing beats a dazzling demo. The hard part is the last mile, not the first screen.
Agents are systemsDo not treat an agent as a chat toy — design it as a long-running distributed system that fails, recovers, and audits itself.

Get in touch

If you have read this far, we probably think alike. Ideas, disagreements, or just a hello — write to me.

chenkai.nb.666@gmail.com See projects → Follow on GitHub