
Product Thinking (1): Architecture Design — From Monolith to Autonomous Agents

How my architectural thinking evolved from a single Next.js app to distributed autonomous agent systems — and the patterns that emerged along the way.

The Shape of a System

Every architecture is a frozen argument. It records what you believed about the problem at the time you committed the code. Looking back across four systems I built over eighteen months — a marketing content platform (~70k lines TypeScript), a zero-dependency skill routing engine, an autonomous research agent (~315k lines Python), and a multi-model coding orchestrator — I can trace how my architectural instincts shifted. Not always forward. Sometimes sideways. But there is a clear progression: from “keep it in one process” to “let the agents govern themselves.”

Architectural evolution across four projects — each shape encodes a different assumption about state.

This is not a tutorial on distributed systems. It is a reflection on why I chose the structures I did, what broke, and what principles survived contact with production. The four systems span a wide range of complexity — from a single-file skill router to a 24/7 autonomous research pipeline — but they share a common thread: every architectural decision that aged well was grounded in a specific observable failure mode. Every decision that aged poorly was grounded in an imagined one.


Act I: The Monolith That Worked (AI4Marketing)

AI4Marketing started as a weekend prototype: a Next.js app that called Qwen to generate marketing copy. Eighteen months later it has 32 API route directories (121 individual handlers), 21 database tables, a video production pipeline, a short-drama generation engine, a GEO optimization system, and a payment flow integrating Alipay, WeChat Pay, Stripe, and PayPal. The production TypeScript codebase is ~70k lines. It is still a single Node.js process managed by PM2 in fork mode — not even cluster mode. One process, one JavaScript thread, one event loop.

I did not plan a monolith. I planned to ship fast. The architecture is the residue of that velocity.

Why it stayed monolithic

AI4Marketing is fundamentally a request-response application with long-running side effects. A user submits a content generation request; the system validates, deducts quota, calls an LLM, and returns results. The video pipeline can take fifteen minutes, but it is still triggered by a single HTTP request and writes its progress to the same PostgreSQL database. There is no inter-service communication because there are no services — just function calls within a process.

There was never a compelling reason to split. The “microservices would be cleaner” argument kept losing to “I can grep the entire codebase in one terminal.” When the video pipeline hangs, I check lib/video-pipeline-v2.ts (1,684 lines). When quota logic is wrong, I look at lib/quota-checker.ts (389 lines). When the payment webhook misbehaves, the handler is right there in app/api/webhook/alipay/route.ts. No service mesh, no message queue, no container orchestration. One process, one database, one deploy target, and pm2 restart ai4m.

The database schema is a star model with User at the center. Posts, VideoProjects, DramaProjects, Orders, and Subscriptions radiate outward as direct foreign keys — no event sourcing, no CQRS, no eventual consistency. When I need a user’s quota, I read one row. When I need their video projects, I join two tables. Boring technology. It works.

The star schema also makes quota enforcement simple. Every resource consumption — content generation, video rendering, drama production, GEO optimization — reduces to: read user.quotaUsed and user.quotaLimit, check the difference against the cost of the operation. One table, two columns, one comparison. The complexity lives in the conditional update, not the data model.

The patterns that made the monolith survivable

The ~70k lines of TypeScript in AI4Marketing are organized into four tiers: app/api/ (route handlers), lib/ (domain logic), components/ (React UI), and prisma/ (schema and migrations). Each tier has a strict no-upward-import rule: domain logic never imports from route handlers, and UI never imports from domain logic directly (it goes through route handlers). This is not enforced by tooling; it is enforced by discipline and code review. After 18 months, no violations have crept in. The architecture is stable.

What kept 121 routes from becoming unmaintainable spaghetti was discipline at the handler level. Every route follows an identical layered pattern:

Auth → Validation (Zod) → Rate Limit → Quota Check → Business Logic → Failure Refund

This is not a framework feature. It is a convention enforced by reading my own code until the pattern became muscle memory. The withMetrics higher-order function wraps every handler, recording http_request_duration_seconds and http_requests_total per route without requiring opt-in. Route labels are explicitly specified — never auto-derived from dynamic path parameters — to prevent cardinality explosion in the metrics store.

The quota system solves a real concurrency problem. The naive approach — read balance, check sufficiency, deduct — has a textbook TOCTOU race: two concurrent requests both read “balance = 1”, both pass the check, both deduct. The fix makes check and deduction atomic:

const claim = await tx.user.updateMany({
  where: { id: user.id, quotaUsed: { lte: effectiveLimit - pointCost } },
  data: { quotaUsed: { increment: pointCost } },
})
if (claim.count === 0) return { allowed: false }

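This single WHERE clause eliminates the race. Two concurrent requests cannot both claim the last unit of quota: the UPDATE takes a row-level lock, and under PostgreSQL’s default READ COMMITTED isolation the second request waits, then re-evaluates the condition against the already-incremented row and matches zero rows. One conditional update fixes a class of bug that has cost production systems millions of dollars.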
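
The same claim pattern can be sketched outside Prisma. The example below is a minimal illustration in Python with SQLite, not the production TypeScript and PostgreSQL code; the users table, the quota_used column, and the claim_quota helper are hypothetical names chosen for the sketch.

```python
import sqlite3


def claim_quota(conn: sqlite3.Connection, user_id: int, cost: int, limit: int) -> bool:
    """Atomically reserve `cost` points; return False if there is no headroom."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            # Check and deduction happen in one statement, so there is no
            # window between "read balance" and "write balance".
            "UPDATE users SET quota_used = quota_used + ? "
            "WHERE id = ? AND quota_used <= ? - ?",
            (cost, user_id, limit, cost),
        )
    return cur.rowcount == 1


# Hypothetical schema for the demo: one user with 9 of 10 points already used.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, quota_used INTEGER NOT NULL)")
conn.execute("INSERT INTO users VALUES (1, 9)")
conn.commit()

first = claim_quota(conn, 1, cost=1, limit=10)   # 9 <= 10 - 1, so the claim succeeds
second = claim_quota(conn, 1, cost=1, limit=10)  # 10 <= 9 is false, so zero rows match
```

The caller only looks at the affected-row count, which mirrors the claim.count === 0 check in the Prisma version.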

The rate limiter uses ten presets drawn from real production traffic patterns:

GENERATE:    { maxRequests: 10,  windowMs: 60_000   },  // 10/min
VIDEO:       { maxRequests: 5,   windowMs: 3_600_000 },  // 5/hour
ENHANCE:     { maxRequests: 20,  windowMs: 60_000   },  // 20/min
LOGIN:       { maxRequests: 5,   windowMs: 60_000   },  // 5/min
REGISTER:    { maxRequests: 3,   windowMs: 300_000  },  // 3/5min
QUERY:       { maxRequests: 60,  windowMs: 60_000   },  // 60/min
STATUS_POLL: { maxRequests: 120, windowMs: 60_000   },  // 120/min
SEARCH:      { maxRequests: 30,  windowMs: 60_000   },  // 30/min
ADMIN:       { maxRequests: 30,  windowMs: 60_000   },  // 30/min
WEBHOOK:     { maxRequests: 20,  windowMs: 60_000   },  // 20/min

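These are in-memory sliding windows with no Redis dependency. The store is bounded: a cleanup job runs every five minutes and evicts the oldest entries once the store exceeds 10,000. This works for a single-process app because every request is counted in the same process memory; with multiple instances, each would keep its own counters and the limits would no longer hold globally.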
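
A bounded in-memory sliding window can be sketched in a few dozen lines. This is an illustration in Python, not the production TypeScript limiter; the class name, the injected clock, and the eviction order (least recently seen key first) are assumptions made for the sketch.

```python
import time
from collections import OrderedDict, deque


class SlidingWindowLimiter:
    """Per-key sliding window kept in process memory, with a bounded key store."""

    def __init__(self, max_requests, window_ms, max_keys=10_000, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_ms / 1000
        self.max_keys = max_keys
        self.clock = clock
        self.hits = OrderedDict()  # key -> deque of hit timestamps, least recent key first

    def allow(self, key):
        now = self.clock()
        window = self.hits.setdefault(key, deque())
        self.hits.move_to_end(key)  # mark this key as most recently seen
        # Drop timestamps that have slid out of the window.
        while window and now - window[0] >= self.window_s:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True

    def cleanup(self):
        """Evict least recently seen keys until at most max_keys remain."""
        while len(self.hits) > self.max_keys:
            self.hits.popitem(last=False)


# Usage with the GENERATE preset from the table above (10 requests per minute).
generate = SlidingWindowLimiter(max_requests=10, window_ms=60_000)
allowed = [generate.allow("user-1") for _ in range(11)]  # the 11th call is rejected
```

In a real deployment cleanup() would run on a timer; here it is left as an explicit call so the behavior is easy to test.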

The graceful shutdown handler registers on SIGTERM, sets a shuttingDown flag to reject new requests, then polls the activePipelines set every two seconds. All pipelines drain within 10 minutes or the process force-exits. PM2’s kill_timeout matches. This means I can deploy without killing in-progress video renders — they finish before the process recycles.

This matters more than it sounds. AI4Marketing video renders take 8-15 minutes. Without graceful shutdown, a pm2 restart mid-render would kill the process, leave an orphaned render job in the database with status processing, and require manual cleanup. With graceful shutdown, the next deploy waits for the current render to finish. The 10-minute window is the maximum observed render time plus a 2-minute buffer. If it ever exceeds that, the process force-exits and the render is marked failed — which is better than hanging forever.

The hidden complexity: API key rotation

One place the monolith does real distributed-system work is lib/api-key-manager.ts (483 lines). AI4Marketing makes hundreds of DashScope calls daily across sixteen keys split between CN and INTL regions. The rotation logic handles three failure modes differently: on 429 it immediately switches to the next key with no backoff; on other errors it retries with exponential backoff; on sustained failures it marks the key unhealthy and skips it for 30 minutes. The keys are also partitioned by region — a CN key fails fast on INTL endpoints rather than wasting a retry budget.

This is micro-distributed-systems thinking inside a monolith. The process boundary is fixed; the routing logic inside it gets surprisingly sophisticated.

The API key manager also handles the case where all keys are exhausted — either all marked unhealthy or all rate-limited — by returning a structured error that the caller can surface to the user (“Service temporarily unavailable; please retry in a moment”) rather than crashing or returning a cryptic provider error. Graceful degradation under total resource exhaustion is a design choice, not an afterthought. It has to be built in from the beginning because adding it later requires understanding every call site.
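
The rotation rules can be sketched as a small key pool. This is a simplified Python illustration, not the contents of lib/api-key-manager.ts: KeyPool, RateLimited, AllKeysExhausted, and call_with_rotation are hypothetical names, region partitioning is omitted, and the choices that a 429 does not count toward the unhealthy threshold and that a retry moves on to the next key are assumptions.

```python
import time


class RateLimited(Exception):
    """Raised by the caller-supplied function when the provider answers 429."""


class AllKeysExhausted(Exception):
    """Structured error the caller can surface to the user."""


class KeyPool:
    def __init__(self, keys, cooldown_s=30 * 60, max_failures=3, clock=time.monotonic):
        self.keys = list(keys)
        self.cooldown_s = cooldown_s
        self.max_failures = max_failures
        self.clock = clock
        self.failures = {k: 0 for k in self.keys}
        self.unhealthy_until = {k: float("-inf") for k in self.keys}
        self.cursor = 0

    def next_key(self):
        """Round-robin over keys, skipping those still in their cooldown."""
        now = self.clock()
        for _ in range(len(self.keys)):
            key = self.keys[self.cursor]
            self.cursor = (self.cursor + 1) % len(self.keys)
            if self.unhealthy_until[key] <= now:
                return key
        return None

    def report_failure(self, key):
        self.failures[key] += 1
        if self.failures[key] >= self.max_failures:  # sustained failure
            self.unhealthy_until[key] = self.clock() + self.cooldown_s
            self.failures[key] = 0

    def report_success(self, key):
        self.failures[key] = 0


def call_with_rotation(pool, fn, max_attempts=5, base_delay_s=0.5, sleep=time.sleep):
    delay = base_delay_s
    for _ in range(max_attempts):
        key = pool.next_key()
        if key is None:
            break
        try:
            result = fn(key)
        except RateLimited:
            continue  # 429: switch to the next key immediately, no backoff
        except Exception:
            pool.report_failure(key)
            sleep(delay)  # other errors: exponential backoff before retrying
            delay *= 2
            continue
        pool.report_success(key)
        return result
    raise AllKeysExhausted("Service temporarily unavailable; please retry in a moment")
```

The single AllKeysExhausted exit is the point of the sketch: every call site sees the same structured failure instead of a raw provider error.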

The video pipeline (lib/video-pipeline-v2.ts, 1,684 lines) is the most complex single file in the codebase. It tracks six phases — QUEUED, PREPARING, GENERATING, COMPOSITING, ENCODING, DONE — with explicit state recorded in the database after each phase completes. If the process crashes mid-pipeline, the next run reads the phase from the database and skips completed steps. This is a hand-rolled FSM inside a monolith — the same pattern that would later become the formal FSM in Research Agent.

The insight that connected AI4Marketing to later systems: explicit phase tracking is what separates a pipeline from a function call. A function call either completes or it doesn’t. A pipeline has intermediate states that survive process death, and those states must be durable and unambiguous.
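
Durable phase tracking reduces to one rule: persist the phase only after its work has finished, and on restart skip everything up to the persisted phase. The sketch below is a Python illustration of that rule, not the code in lib/video-pipeline-v2.ts; run_pipeline, the dict standing in for the database, and the steps mapping are hypothetical.

```python
PHASES = ["QUEUED", "PREPARING", "GENERATING", "COMPOSITING", "ENCODING", "DONE"]


def run_pipeline(project_id, store, steps):
    """Run the phases that have not completed yet.

    `store` maps project_id -> last completed phase (a dict here, a database
    row in a real system). `steps` maps a phase name to a callable that does
    the work for that phase; phases without a step are pure markers.
    """
    last_completed = store.get(project_id, "QUEUED")
    for phase in PHASES[PHASES.index(last_completed) + 1:]:
        step = steps.get(phase)
        if step is not None:
            step(project_id)  # may raise; the stored phase is then left untouched
        store[project_id] = phase  # record the phase only once its work is done
    return store[project_id]


# Usage: a first run with no stored state walks every phase in order.
store = {}
trace = []
steps = {p: (lambda pid, p=p: trace.append(p)) for p in PHASES[1:5]}
final_phase = run_pipeline("project-1", store, steps)
```

If a step raises midway, the store still names the last phase that fully completed, so calling run_pipeline again resumes from the failed phase and does not repeat earlier work.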

The video pipeline also has a progress reporting mechanism: each phase writes its completion percentage to the database as a float, and the client polls /api/video/status?projectId=X to display a progress bar. The STATUS_POLL rate limit (120/min) exists specifically because early users would open 40 browser tabs each polling at maximum frequency, generating enough load to slow the actual video processing. The limit is high enough to support a smooth progress bar and low enough to prevent accidental self-DDoS. This calibration came from production observation, not prior reasoning.

AI4Marketing’s architecture is not interesting in the sense that it breaks new ground. It is interesting in the sense that it demonstrates how far disciplined conventions can take you inside a single process. The limit I hit was not code complexity — it was resource isolation. When one user’s video render chewed through all available memory, it degraded every other user’s requests too. In a monolith, you cannot give the video pipeline its own memory budget without giving it its own process. That is the moment I moved on.


Act II: Zero-Dependency as Architecture (DaaS)

DaaS is a marketing-copy skill router. It receives a task description, matches it against a library of skill definitions, executes the right tool chain, and returns structured copy. It has zero npm dependencies in its runtime path. Not “few dependencies” — zero.

The constraint was deliberate. AI4Marketing’s package.json has 47 direct dependencies and the full node_modules directory occupies 1.2 GB. Every dependency is a supply-chain risk, an upgrade burden, and a potential source of runtime surprises. DaaS was built as a counter-experiment: what happens if you start from scratch with nothing?

The answer is more boilerplate and a radically simpler failure surface. Every function that DaaS calls is code I wrote. When something breaks, the stacktrace ends in my file, not in a transitive dependency I have never read.

The 1.2 GB node_modules in AI4Marketing is not just disk space — it is operational surface area. Each package can have its own bugs, security vulnerabilities, breaking changes in minor versions, and implicit assumptions about the runtime environment. I have hit all four categories: a Prisma minor version that changed how it handled timezone-naive dates, a Next.js patch that altered how it passed headers to route handlers, a sharp image processing library that assumed a specific libvips version on the host. Each incident cost hours. DaaS has had zero such incidents because it has zero third-party code that can change.

State as atomic files

DaaS tracks ingest job state through a small pipeline: queued → processing → done or failed. The state is persisted as atomic JSON files: a rename (mv within the same filesystem) is atomic on Linux, so a job is either in its old state or its new state, never in a corrupted intermediate. The routing cache is mtime-invalidated: if a skill definition file changes, the cached compiled form is treated as stale on the next request.

This is the simplest possible persistent state that is also correct. No ORM, no migration runner, no schema file. The “database” is a directory of JSON files.
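
The write-then-rename idiom looks roughly like this in Python. It is a sketch under the assumption that each job lives in its own JSON file; write_state, read_state, and the file naming are hypothetical, and os.replace plays the role of mv.

```python
import json
import os
import tempfile


def write_state(path, state):
    """Write JSON to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_state(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Usage: readers see either the "queued" file or the "processing" file, never a mix.
workdir = tempfile.mkdtemp()
job_path = os.path.join(workdir, "job-1.json")
write_state(job_path, {"id": "job-1", "state": "queued"})
write_state(job_path, {"id": "job-1", "state": "processing"})
current = read_state(job_path)
```

Creating the temp file in the target directory matters: a rename across filesystems is a copy followed by a delete, which is not atomic.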

The mtime-invalidation pattern is worth dwelling on because it solves a cache invalidation problem with zero infrastructure. The classic approach is a cache with a TTL: if the entry is older than N seconds, re-fetch. TTL caches have an inherent tradeoff: a short TTL means frequent re-computation, a long TTL means stale data. Mtime-invalidation sidesteps that tradeoff at the cost of one stat call per lookup: the cache is as fresh as the underlying file’s modification time. The only requirement is that writes to the source file are atomic, which mv on the same filesystem guarantees. DaaS’s routing cache recompiles a skill definition once per modification, with no TTL lag. TTL-based caches cannot match this.
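
A minimal mtime-invalidated cache, sketched in Python. MtimeCache and compile_skill are hypothetical names, and the "compilation" here is a stand-in (uppercasing the text) for whatever the router really does with a skill definition.

```python
import os
import tempfile


class MtimeCache:
    """Cache compiled results per file, keyed on the file's modification time."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.entries = {}  # path -> (mtime_ns, compiled result)

    def get(self, path):
        mtime_ns = os.stat(path).st_mtime_ns  # one stat call per lookup
        cached = self.entries.get(path)
        if cached is not None and cached[0] == mtime_ns:
            return cached[1]
        with open(path, encoding="utf-8") as f:
            compiled = self.compile_fn(f.read())
        self.entries[path] = (mtime_ns, compiled)
        return compiled


compile_calls = []


def compile_skill(text):
    compile_calls.append(text)
    return text.upper()


cache = MtimeCache(compile_skill)
with tempfile.TemporaryDirectory() as d:
    skill_path = os.path.join(d, "skill.txt")
    with open(skill_path, "w", encoding="utf-8") as f:
        f.write("headline")
    first = cache.get(skill_path)   # compiles
    second = cache.get(skill_path)  # served from cache, no recompilation
    with open(skill_path, "w", encoding="utf-8") as f:
        f.write("tagline")
    st = os.stat(skill_path)
    # Bump the mtime explicitly so the demo does not depend on timestamp granularity.
    os.utime(skill_path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    third = cache.get(skill_path)   # mtime changed, so it recompiles
```

One caveat of the approach: two writes that land within the filesystem's timestamp granularity can share an mtime, which is why the demo bumps it explicitly.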

The god-file problem

DaaS started with a single server.py that handled routing, parsing, execution, caching, logging, and health checks. At around 3,500 lines it became genuinely dangerous — changes to the routing logic required understanding the execution layer, changes to caching required understanding both. The coupling was real and the test surface was opaque.

The fix was mechanical: extract each concern into its own module, enforce that cross-module imports go through a single public interface, and make server.py a pure HTTP dispatcher. The result was 17 extracted modules and server.py dropping to ~1,200 lines of route dispatch and orchestration. No new features. Just structure.

The lesson: zero-dependency does not mean zero structure. Structure must be self-imposed precisely because no framework imposes it. And the extraction must happen before the god-file becomes too entangled to safely refactor.

The god-file problem is symmetric: both zero-dependency and heavy-dependency codebases accumulate it. In heavy-dependency projects, the god-file often hides behind framework magic — the framework stitches together disparate concerns so you never notice the coupling until you try to change one piece and discover it pulls on six others. In zero-dependency projects, the coupling is nakedly visible from the first day, which makes it easier to address early but also easier to ignore because there is no red squiggle from a linter. The discipline has to come from you.

DaaS as a forcing function

DaaS forced me to answer questions I had deferred in AI4Marketing by hiding them behind libraries. How do you implement a streaming HTTP response without a framework? How do you parse a JSON body with only Node built-ins? How do you retry a function with exponential backoff without a retry library? The answers are not hard — they are 20-40 lines each — but writing them made me understand what the libraries were actually doing.

The side effect: when something breaks in DaaS, I have a mental model of the entire stack. There are no black boxes. In AI4Marketing, I have replaced maybe 30% of the library surface with that same understanding. The rest is still opaque, which is fine for stable dependencies that I trust. The distinction is: I know which parts I trust and why.

DaaS also taught me that the zero-dependency constraint creates a useful filter for feature creep. If adding a feature requires pulling in a library, I have to seriously justify it. Most of the time the feature either gets implemented in 50 lines of plain code or gets cut. The constraint is productive friction.

The one place DaaS makes a real tradeoff is around testing. No testing framework means no describe/it blocks, no beforeEach/afterEach, no test runner progress bars. Tests are just functions that throw on failure. This is fine — actually fine, not “acceptable” — because the tests are simple enough that a plain function call is sufficient. A test that needs a framework is probably testing too many things at once. The zero-dependency constraint applies to tests too, and it enforces the same discipline: small, focused, single-responsibility functions that are easy to test with five lines of assertion code.

DaaS and AI4Marketing represent two poles: one with everything, one with nothing. Research Agent needed something between them — a system with enough structure to manage genuinely complex multi-agent workflows, but without the weight of a full application framework that would impose conventions designed for request-response systems, not autonomous pipelines. The middle path turned out to be the FSM.


Act III: State Machines as the Control Plane (Research Agent)

Research Agent is where my architecture took its largest conceptual jump. It is fully autonomous: it reads papers from arXiv, builds a knowledge graph (65,000+ nodes, 197,000+ edges in SQLite), generates research ideas, filters them through three rounds of adversarial debate, designs experiments, dispatches execution to worker nodes, runs statistical analysis, and writes papers. It runs 24/7 with no human intervention on a 4-vCPU, 7.5 GB main server with two 128 GB compute workers. The Python codebase is ~315k lines across ~200 source files, grown through ten months of autonomous operation and incremental patching.

The scale of the codebase is a direct consequence of the architecture. The coordinator alone (coordinator.py) is 1,264 lines. supervisor_lib/heal_rules.py is 2,279 lines. framework/dispatch_decision.py is 658 lines. These files are large not because the code is sloppy, but because the domain genuinely requires managing a large number of distinct states and transitions. A 40-rule heal file is the price of 40 distinct failure modes that the system can now handle without human intervention.

The core insight that made this tractable: build an explicit finite-state machine for every entity lifecycle, enforced at the data layer.

Why the FSM

Before the FSM, the coordinator used a 15-level if/elif chain to decide what to do next. It checked flags: “if there’s an idea with status approved and no protocol.json, dispatch the experiment designer.” This worked at three states. By twelve states with failure paths, retries, and infrastructure errors, it became a combinatorial explosion. Real bugs appeared: an experiment would get stuck in executed forever because the statistician was dispatched only when analysis.json did not exist, and a crash mid-analysis had left a partial file behind. The file existed, so the statistician was never dispatched again, yet it contained no valid results.

The deeper problem with flag-checking is that it encodes state implicitly in the combination of multiple fields. “The experiment is done” means “status is executed AND analysis.json exists AND the file is parseable AND the file contains a decision field.” That is four conditions, each of which can fail independently, and failure in any one leaves the system in an undefined state that no single check detects. The FSM compresses this into one value: protocol.status == 'analyzed'. If it is analyzed, all four conditions have been verified at transition time. If it is not, you know exactly which step failed and when from the history log.

The FSM (framework/pipeline_state.py, 222 lines) eliminates this whole class of bug by making transitions explicit and exhaustive:

Idea lifecycle:
  proposed → approved → experiment_designed → experiment_completed
      |          |              |                        |
      |          |              └─ negative_result       └─ paper_written
      |          └─ killed                                       |
      └─ merged / subsumed / failed                       paper_reviewed

Protocol lifecycle:
  designed → running → executed → analyzed
      |          |          |           |
      |          └─ failed  └─ failed   └─ pending_failure_analysis
      └─ needs_redesign                 └─ needs_redesign / needs_rerun
      └─ abandoned / dataset_failed     └─ closed_negative

Every transition(entity, from_state, to_state) call validates that the move is legal, appends a timestamped entry to .history.jsonl, and raises InvalidTransition on illegal moves. You cannot move an idea from proposed directly to paper_written — it must pass through design, execution, and analysis. The FSM replaces 35+ scattered proto['status'] = ... assignment sites. That alone was worth the refactor.

The FSM also serves as documentation. The IDEA_TRANSITIONS and PROTOCOL_TRANSITIONS dictionaries in pipeline_state.py are the authoritative specification of what the system can do. When I want to understand whether an idea can move directly from approved to paper_written, I look at IDEA_TRANSITIONS['approved']. The answer is no — only experiment_designed, killed, and failed are reachable from approved. This is not in a README somewhere. It is enforced at runtime and the enforcement is the documentation.
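
A transition table with runtime enforcement can be sketched as follows. This is not the contents of framework/pipeline_state.py: the table is reconstructed from the idea lifecycle diagram above and may not match the real IDEA_TRANSITIONS entry for entry, and the entity dict, the history argument, and the clock parameter are assumptions.

```python
import io
import json
import time


class InvalidTransition(Exception):
    pass


# Reconstructed from the diagram; states missing from the keys are terminal.
IDEA_TRANSITIONS = {
    "proposed": {"approved", "merged", "subsumed", "failed"},
    "approved": {"experiment_designed", "killed", "failed"},
    "experiment_designed": {"experiment_completed", "negative_result"},
    "experiment_completed": {"paper_written"},
    "paper_written": {"paper_reviewed"},
}


def transition(entity, from_state, to_state, table=IDEA_TRANSITIONS,
               history=None, clock=time.time):
    """Validate and apply one state change, appending it to the history log."""
    if entity["status"] != from_state:
        raise InvalidTransition(
            f"{entity['id']}: expected {from_state!r}, found {entity['status']!r}")
    if to_state not in table.get(from_state, set()):
        raise InvalidTransition(
            f"{entity['id']}: {from_state!r} -> {to_state!r} is not a legal move")
    entity["status"] = to_state
    if history is not None:
        # One JSON object per line, in the spirit of .history.jsonl
        record = {"id": entity["id"], "from": from_state, "to": to_state, "ts": clock()}
        history.write(json.dumps(record) + "\n")
    return entity


idea = {"id": "idea-001", "status": "proposed"}
log = io.StringIO()  # stands in for the on-disk history file
transition(idea, "proposed", "approved", history=log)
try:
    transition(idea, "approved", "paper_written", history=log)
    shortcut_allowed = True
except InvalidTransition:
    shortcut_allowed = False  # the shortcut is rejected and nothing is logged
```

Because callers must name the state they believe the entity is in, a stale reader fails loudly instead of overwriting a newer status.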

The dispatch decision layer

The Boss Agent is a ReAct LLM orchestrator that decides what to dispatch. But it does not decide from scratch — it evaluates nine structured predicates from framework/dispatch_decision.py (658 lines):

from dataclasses import dataclass

@dataclass
class Decision:
    dispatch: bool
    reason: str
    artifact_required: str  # path the agent MUST produce
    priority: int           # higher = dispatched first
    target: str             # idea/experiment ID

Each predicate (e.g., should_dispatch_experimenter(experiment_id)) reads the FSM state, checks preconditions (protocol exists, no active lock, slot available), and returns a structured decision. The Boss reads all nine, sorts by priority, dispatches the top one.

The priorities encode an implicit scheduling policy derived from measured bottlenecks:

'statistician':    200   # analysis unblocks writer + reviewer + formatter
'failure_analyst': 150   # failure analysis triggers redesign, unblocks experimenter
'merger':           85   # pre-debate deduplication
'experiment_designer': 70  # starvation-boosted if queue empties

These numbers live in a JSON file that hot-reloads every 30 seconds. I can reprioritize the entire system’s behavior without restarting any process. More importantly, when something fails to dispatch, the reason field tells you exactly why: “Protocol not yet designed” or “All experimenter slots occupied (3/3)” or “Target locked by PID 12847.” There is no mystery. Every non-action has an auditable explanation.
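
Put together, the selection step is small. The sketch below reuses the Decision shape and the priority values quoted above; the two predicates, their arguments, the artifact paths, and pick_next are hypothetical simplifications of what framework/dispatch_decision.py is described as doing.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    dispatch: bool
    reason: str
    artifact_required: str  # path the agent must produce
    priority: int           # higher = dispatched first
    target: str             # idea/experiment ID


PRIORITIES = {"statistician": 200, "failure_analyst": 150,
              "merger": 85, "experiment_designer": 70}


def should_dispatch_statistician(experiment_id, status, locked_by=None):
    prio = PRIORITIES["statistician"]
    if status != "executed":
        return Decision(False, f"Protocol status is {status!r}, not 'executed'",
                        "", prio, experiment_id)
    if locked_by is not None:
        return Decision(False, f"Target locked by PID {locked_by}", "", prio, experiment_id)
    return Decision(True, "Executed protocol awaiting analysis",
                    f"{experiment_id}/analysis.json", prio, experiment_id)


def should_dispatch_designer(idea_id, status):
    prio = PRIORITIES["experiment_designer"]
    if status != "approved":
        return Decision(False, f"Idea status is {status!r}, not 'approved'", "", prio, idea_id)
    return Decision(True, "Approved idea has no protocol yet",
                    f"{idea_id}/protocol.json", prio, idea_id)


def pick_next(decisions):
    """Return the highest-priority dispatchable decision plus every skipped one."""
    ready = [d for d in decisions if d.dispatch]
    skipped = [d for d in decisions if not d.dispatch]
    best = max(ready, key=lambda d: d.priority, default=None)
    return best, skipped


decisions = [
    should_dispatch_designer("idea-3", "approved"),
    should_dispatch_statistician("exp-7", "executed"),
    should_dispatch_statistician("exp-8", "executed", locked_by=12847),
]
best, skipped = pick_next(decisions)
```

Returning the skipped decisions alongside the winner is what makes every non-action explainable: the reason strings can be logged as they are.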

Fleet: distributed execution via HTTP polling

The main server has 7.5 GB of RAM. LLM-driven experiments — loading large datasets, making hundreds of API calls — would OOM it immediately. The fleet architecture solves this with radical simplicity: workers poll the main server every 5 seconds for commands, execute locally, and POST results back. Workers also POST heartbeats every 60 seconds with system metrics (load, memory, disk, running agent count). A separate sync loop pushes generated artifacts — papers, KG patches, experiment data — every 60 seconds.

That is it. HTTP polling. No WebSocket, no gRPC stream, no message broker. A worker can go offline and come back — it resumes polling. The main server tracks health by heartbeat freshness: stale beyond 600 seconds, the worker is marked unselectable and its dispatched tasks are reaped by the orphan dispatch reaper. The entire fleet server is 680 lines of Python with four HTTP endpoints. I can debug any dispatch failure with curl.
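
The health rule is simple enough to state as two pure functions. This is a sketch of the freshness check only, with hypothetical names and in-memory dicts; the HTTP layer and the real reaper are not shown.

```python
STALE_AFTER_S = 600  # heartbeat age beyond which a worker is unselectable


def selectable_workers(last_heartbeat, now, stale_after_s=STALE_AFTER_S):
    """Workers whose most recent heartbeat is at most stale_after_s old."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts <= stale_after_s)


def orphaned_tasks(dispatched, last_heartbeat, now, stale_after_s=STALE_AFTER_S):
    """Tasks assigned to workers that are stale or were never seen."""
    alive = set(selectable_workers(last_heartbeat, now, stale_after_s))
    return sorted(task for task, worker in dispatched.items() if worker not in alive)


# Usage: worker-b last reported 700 seconds ago, so its task is reaped.
now = 10_000.0
heartbeats = {"worker-a": now - 30, "worker-b": now - 700}
dispatched = {"exp-7": "worker-a", "exp-9": "worker-b"}
healthy = selectable_workers(heartbeats, now)
orphans = orphaned_tasks(dispatched, heartbeats, now)
```

Passing now in as an argument keeps both functions deterministic, which makes the staleness boundary easy to test.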

The four endpoints are /heartbeat (POST, worker reports status), /commands (GET, worker polls for work), /result (POST, worker returns output), and /dispatch (POST, coordinator queues work for a specific worker). That is the complete API surface. Adding a new worker is: clone the code, set the fleet server URL and secret, run the worker. The worker self-registers via its first heartbeat.

Why not Celery or Temporal? Because those systems are themselves complex distributed systems requiring Redis or RabbitMQ, failure modes I would have to learn, and operational surfaces I would have to monitor. My fleet protocol has no dependencies except the Python standard library. Its failure modes are exactly as complex as I made them.

The two 128 GB compute workers exist because LLM experiments sometimes need to load datasets that are 40-60 GB in memory. The main server cannot do that. The workers can. The fleet protocol is the narrow bridge between the coordinator’s decision-making (7.5 GB, state-heavy) and the workers’ raw execution capacity (128 GB, stateless). This asymmetry — a small smart brain coordinating large dumb muscles — is a recurring pattern in distributed systems that I keep rediscovering.

Self-healing: 40+ rules for automatic recovery

On a 7.5 GB machine running six systemd services — coordinator, pipeline, dashboard, supervisor, kg-merger, dingtalk-listener — OOM kills are not theoretical. They happen weekly. The architecture does not try to prevent OOM; it assumes OOM and designs for recovery.

OOM priority triage. The pipeline gets OOMScoreAdjust=+500 (Linux sacrifices it first). The coordinator gets -500 (protected). When memory pressure hits, the pipeline dies, not the coordinator. Since all pipeline state lives on disk, the pipeline restarts and resumes from where it was.

Memory watchdogs. Both the coordinator and pipeline have background threads that read /proc/self/status every 2–3 minutes. The coordinator triggers aggressive GC at 1,500 MB swap and force-exits at 2,400 MB RSS+swap. The pipeline force-exits at 1,900 MB. Force-exit is deliberate: restarting from a known state is cheaper than limping along with degraded memory.

Supervisor reconciliation. supervisor_lib/heal_rules.py (2,279 lines) contains 40+ rules that fire every minute. Rule 1: if analysis.json exists but status is still executed, force transition to analyzed. Rule 2: if failure_analysis.json has a decision, apply it. Rule 3: if redesign count exceeds 5, kill the idea (backstop against infinite redesign loops). Rule 4: if needs_redesign has been stuck 24+ hours with 5+ dispatch attempts, set abandoned.

Infrastructure error triage. If failure analysis finds “timeout” or “429” or “404” in the root cause, it resets the protocol to designed for retry rather than triggering a full redesign. Transient errors should not burn a redesign cycle.

Stale lock reaping. Locks use fcntl.flock() with PID and timestamp written inside. The supervisor checks whether the PID is still alive; dead PIDs get their locks removed. Locks older than 4 hours are force-removed regardless.

The invariant checker (framework/invariant_checker.py) runs after every boss cycle and enforces global consistency: status/artifact alignment, no duplicate dispatches, no orphaned locks. Violations are logged at ERROR level and written to data/invariant_violations.json. Even if a bug introduces an inconsistency, the next boss cycle detects and flags it.
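
The memory watchdog's decision logic can be separated from the thread that polls it. The sketch below parses the VmRSS and VmSwap fields that Linux exposes in /proc/self/status and applies the coordinator thresholds quoted earlier; parse_status_kb and memory_action are hypothetical names, and a real watchdog would run this on a timer and act on the result.

```python
def parse_status_kb(status_text):
    """Extract VmRSS and VmSwap (reported in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if name in ("VmRSS", "VmSwap") and len(parts) >= 2 and parts[1] == "kB":
            fields[name] = int(parts[0])
    return fields


def memory_action(status_text, gc_swap_mb=1500, exit_total_mb=2400):
    """Return 'exit', 'gc', or 'ok' for the given status snapshot."""
    fields = parse_status_kb(status_text)
    rss_mb = fields.get("VmRSS", 0) / 1024
    swap_mb = fields.get("VmSwap", 0) / 1024
    if rss_mb + swap_mb >= exit_total_mb:
        return "exit"  # restart from known on-disk state
    if swap_mb >= gc_swap_mb:
        return "gc"    # try an aggressive collection first
    return "ok"


# A snapshot shaped like the real file: 512 MB resident, 1536 MB swapped.
snapshot = "Name:\tpython3\nVmRSS:\t  524288 kB\nVmSwap:\t 1572864 kB\n"
action = memory_action(snapshot)
```

On a Linux host the snapshot would come from reading /proc/self/status; keeping the parser pure means the thresholds can be tested without a memory-hungry process.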

What the FSM actually replaced

Before framework/pipeline_state.py existed, the coordinator had 35+ sites of direct proto['status'] = 'some_string' assignment scattered across different files. There was no validation, no audit trail, and no enforcement of legal transitions. An experiment could move from designed directly to analyzed if a bug in one code path accidentally called the wrong update function.

The FSM refactor took two days: writing the transition tables, replacing the 35+ assignment sites with transition() calls, and letting the audit log fill in. It immediately surfaced three bugs: an idea that had been silently stuck in proposed for nine days because a merge check was reading approved instead of proposed, a protocol that had two simultaneous status values because two processes had both written to it, and a failure analysis that had been applied twice because the transition guard had a race condition.

The FSM did not prevent these bugs — they had already happened. But it made them visible. The audit log showed exactly when each illegal state occurred, what code path caused it, and which transitions had been attempted. That is the real value of explicit state: not that it prevents all errors, but that it makes errors observable and diagnosable.

Research Agent solves the coordination problem for a fixed set of agent types with a fixed lifecycle. Every idea goes through the same FSM; every experiment goes through the same protocol lifecycle. The challenge Elevator addresses is different: the task structure is dynamic. A coding goal has a different subgoal graph than a debugging goal, which has a different graph than a refactoring goal. You cannot predefine the FSM because you do not know the states until you see the goal. The DAG replaces the FSM as the coordination primitive.


Act IV: DAG Decomposition for Complex Tasks (Elevator)

Elevator orchestrates multiple Chinese LLMs — Qwen, DeepSeek, Kimi, GLM, MiniMax — through a structured pipeline: goal decomposition into a subgoal DAG, parallel batch execution, cross-model verification, and experience learning. All models are accessed via DashScope’s unified OpenAI-compatible API; model switching is a string change.

The DAG as execution plan

When a user submits a goal — “Build a WeChat mini-game idle RPG” — the planner (planner.py, 474 lines) converts it into a structured JSON DAG. Before generating the plan, it injects four context sources: global lessons (past failures to avoid), skills (successful plan templates from prior tasks), workspace state (current files), and project memory (cross-milestone decisions). The resulting DAG for a real project:

m1:  Canvas framework + game loop        (no deps)
m2:  Character system + stats            (no deps)
m3:  Configuration-driven architecture   (depends: m1, m2)
m4:  Combat system                       (depends: m3)
m5:  Inventory + equipment               (depends: m3)
m6:  Cultivation progression             (depends: m4)
m7:  NPC + dialogue system               (depends: m3)
m8:  Map + exploration                   (depends: m7)
m9:  UI polish                           (depends: m5, m6)
m10: Art assets                          (depends: m9)
m11: Sound effects                       (depends: m10)
m12: Save/load system                    (depends: m8)
m13: Performance optimization            (depends: m11, m12)
m14: Final QA + packaging                (depends: m13)

The runtime executes this via topological batching: m1 and m2 run in parallel (no deps), then m3 waits for both, then m4, m5, and m7 run in parallel (they depend only on m3), and so on. A ThreadPoolExecutor manages parallel execution. With enough workers, wall-clock time tracks the critical path, not the sum of all milestones.
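
Topological batching over the plan above can be sketched with the standard library alone. DEPS transcribes the dependency list; topo_batches and run_dag are hypothetical names, and run_one stands in for whatever executes a single milestone.

```python
from concurrent.futures import ThreadPoolExecutor

DEPS = {
    "m1": [], "m2": [], "m3": ["m1", "m2"], "m4": ["m3"], "m5": ["m3"],
    "m6": ["m4"], "m7": ["m3"], "m8": ["m7"], "m9": ["m5", "m6"],
    "m10": ["m9"], "m11": ["m10"], "m12": ["m8"], "m13": ["m11", "m12"],
    "m14": ["m13"],
}


def topo_batches(deps):
    """Group milestones into batches whose dependencies are all in earlier batches."""
    remaining = {mid: set(d) for mid, d in deps.items()}
    done = set()
    batches = []
    while remaining:
        ready = sorted(mid for mid, d in remaining.items() if d <= done)
        if not ready:
            # Nothing can start: a cycle or a reference to an unknown milestone.
            raise ValueError("unsatisfiable dependencies among: " + ", ".join(sorted(remaining)))
        batches.append(ready)
        done.update(ready)
        for mid in ready:
            del remaining[mid]
    return batches


def run_dag(deps, run_one, max_workers=4):
    """Run each batch in parallel; a batch starts only after the previous one ends."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in topo_batches(deps):
            for mid, output in zip(batch, pool.map(run_one, batch)):
                results[mid] = output
    return results


batches = topo_batches(DEPS)
results = run_dag(DEPS, lambda mid: f"{mid} built")
```

For this plan the 14 milestones collapse into 9 batches, which is also the length of the longest dependency chain (m1, m3, m4, m6, m9, m10, m11, m13, m14).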

DAG validation is non-trivial because LLMs produce invalid plans. The planner performs cycle detection, dangling reference removal, and ID uniqueness enforcement before accepting a plan. An optional critique step sends the plan to a cross-family model; if critical issues surface, revise_plan() rewrites it (bounded by MAX_CRITIQUE_ROUNDS).

The cycle detection matters more than it sounds. Early Elevator plans sometimes had circular dependencies (“m3 depends on m5 depends on m3”) because the LLM was thinking about logical relationships rather than strict build ordering. A cyclic graph has no topological order: a batch scheduler would stall on milestones that never become ready. The planner detects the cycle before the sort and forces a replan. This is one of the few places where LLM output validation has a formal correctness criterion: a dependency graph is either acyclic or it is not. No fuzzy scoring required.
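
The three validation steps can be sketched as one pass over the plan followed by a depth-first search. This is an illustration, not the code in planner.py: the plan shape (a list of dicts with id and deps), the policy of keeping the first occurrence of a duplicated ID, and both function names are assumptions.

```python
def find_cycle(deps):
    """Return one cycle as a list of IDs (first == last), or None if acyclic.

    Every dependency must also be a key of `deps`.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in deps}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in deps[node]:
            if color[dep] == GRAY:  # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def validate_plan(milestones):
    """Deduplicate IDs, drop dangling references, then look for a cycle."""
    deps = {}
    for m in milestones:
        if m["id"] not in deps:  # ID uniqueness: keep the first occurrence (assumed policy)
            deps[m["id"]] = list(m.get("deps", []))
    for mid in deps:
        deps[mid] = [d for d in deps[mid] if d in deps]  # remove dangling references
    return deps, find_cycle(deps)


plan = [
    {"id": "m1", "deps": []},
    {"id": "m3", "deps": ["m5", "m99"]},  # m99 does not exist
    {"id": "m5", "deps": ["m3"]},         # closes the loop m3 -> m5 -> m3
    {"id": "m1", "deps": ["m3"]},         # duplicate ID, ignored
]
clean_deps, cycle = validate_plan(plan)
```

Returning the cycle itself, not only a boolean, gives the replan prompt something concrete to point at.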

Cross-family verification: the anti-sycophancy pattern

Elevator’s most consequential architectural decision: no model evaluates its own output. The execution pipeline enforces this structurally:

Executor (qwen3.6-plus) produces code
    │
    ▼
Reviewer (deepseek-v4-pro) checks for bugs, edge cases, test gaps
    │
    ├─ issues found → inject feedback, loop back to executor (max 2 rounds)
    └─ no issues → proceed to verification
        │
        ▼
Verifier (qwen3.6-plus, different instance) checks each criterion independently
    │
    ├─ low confidence → ensemble panel (3 models, majority vote)
    ├─ all criteria pass → git commit, mark done
    └─ criteria fail → escalate model tier, retry

Why cross-family? Models within the same family share training data, architecture biases, and failure modes. When Qwen produces buggy code, another Qwen instance often misses the bug — it has the same blind spots. DeepSeek’s MoE architecture processes information differently enough to catch errors that dense transformers miss. I discovered this empirically: within-family review had a ~15% miss rate on bugs that cross-family review caught.

The most common class of bugs caught by cross-family review is off-by-one errors in array indexing and incorrect boundary conditions in loop termination, precisely the kinds of errors that look plausible on a surface read. A reviewer trained on code patterns similar to the author model’s will tend to read the buggy version as correct. A model with a different architectural inductive bias reads it fresh and notices the boundary condition is wrong. The cost is real: review adds one full LLM call per subgoal. For a 14-milestone project, that is at least 14 extra expensive calls. The benefit is catching bugs that would otherwise propagate through dependent milestones and require expensive rewrites later.

The cross-family verification pattern also revealed something about how I think about model selection. I had been treating model choice as a capability question: “which model is smartest?” The verification architecture reframes it as a diversity question: “which models have the most different failure modes?” A dumber model from a different family is more valuable as a reviewer than a smarter model from the same family. DeepSeek v4-pro is not the cheapest or the fastest model I use. It is the most architecturally different from Qwen, which is why it is the cross-family reviewer of choice.
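The "diversity, not capability" rule can be enforced in a few lines. The sketch below is an assumption about shape only: the `FAMILY` registry and both function names are hypothetical, and the model names are the ones quoted in the text.

```python
# Hypothetical registry: model name -> family.
FAMILY = {
    "qwen3.6-plus": "qwen",
    "qwen3.6-max-preview": "qwen",
    "deepseek-v4-pro": "deepseek",
}

def pick_reviewer(executor, candidates):
    """Return the first candidate whose family differs from the executor's."""
    for model in candidates:
        if FAMILY[model] != FAMILY[executor]:
            return model
    raise LookupError(f"no cross-family reviewer available for {executor}")

def majority_vote(verdicts):
    """Ensemble panel: pass only if more than half of the verdicts are True."""
    return sum(verdicts) * 2 > len(verdicts)
```

Raising when no cross-family candidate exists, rather than silently falling back to a same-family reviewer, keeps the "no model evaluates its own output" rule structural instead of advisory.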

Progressive escalation: cost optimization as reliability

The three-tier escalation path is designed so 80% of tasks complete on the cheapest model:

Attempt 0: qwen3.6-plus        ( 4 CNY/M in,  12 CNY/M out) — handles most tasks
Attempt 1: qwen3.6-max-preview (20 CNY/M in,  60 CNY/M out) — harder problems
Attempt 2: deepseek-v4-pro     ( 4 CNY/M in,  16 CNY/M out) — cross-family fallback

Attempt 0 to attempt 1 is a 5x cost jump on both input and output tokens. Task difficulty is heavily skewed: most coding tasks are straightforward and the cheap model handles them. Only the tail (architectural decisions, subtle concurrency bugs, complex algorithm implementations) needs the expensive model. When the expensive model from the same family also fails, switching families often succeeds because the failure mode was family-specific.
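To make the arithmetic concrete, here is a small sketch of the ladder with the prices from the table above. `call_cost`, `run_with_escalation`, and the `attempt_fn` callback are illustrative names, and the token counts in the usage note are made up for the example.

```python
# Prices in CNY per million tokens, copied from the ladder above.
TIERS = [
    ("qwen3.6-plus", 4, 12),
    ("qwen3.6-max-preview", 20, 60),
    ("deepseek-v4-pro", 4, 16),
]

def call_cost(tier, tokens_in, tokens_out):
    """Cost in CNY of one call at the given tier."""
    _, price_in, price_out = TIERS[tier]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

def run_with_escalation(task, attempt_fn):
    """Try each tier in order; attempt_fn(model, task) returns True on success."""
    for tier, (model, _, _) in enumerate(TIERS):
        if attempt_fn(model, task):
            return tier, model
    return None, None  # every tier failed; the caller decides what happens next
```

For a hypothetical call with 100,000 input tokens and 20,000 output tokens, `call_cost(0, ...)` is 0.64 CNY and `call_cost(1, ...)` is 3.2 CNY, which is the 5x jump described above.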

Complexity is estimated before execution: keywords like “fix”, “typo”, “rename” classify as light (12 turns max); “architecture”, “refactor”, “full-stack” classify as heavy (35 turns max). This prevents burning expensive model budget on trivial work.
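A keyword classifier of this kind might look like the sketch below. The keyword lists and the 12 and 35 turn limits come from the text; the `default` budget of 20 and the rule that heavy keywords win over light ones are assumptions added for the example.

```python
import re

LIGHT_WORDS = {"fix", "typo", "rename"}
HEAVY_WORDS = {"architecture", "refactor", "full-stack"}
# "default" is an assumed value; the text only gives the light and heavy limits.
TURN_BUDGET = {"light": 12, "heavy": 35, "default": 20}

def classify(goal):
    """Classify a goal string by keyword; heavy keywords take precedence."""
    words = set(re.findall(r"[a-z][a-z-]*", goal.lower()))
    if words & HEAVY_WORDS:
        return "heavy"
    if words & LIGHT_WORDS:
        return "light"
    return "default"
```

Letting heavy keywords win is the conservative choice: "fix the architecture" gets the larger budget rather than being cut off at 12 turns.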

Tripwires: behavioral anomaly detection

The executor runs an agent loop — an LLM repeatedly choosing and invoking tools until it declares the task complete. This loop fails in characteristic ways: infinite read loops, repeated identical tool calls, or premature submission without running tests.

The tripwire system (tripwires.py, 309 lines) monitors the loop in real time:

  • shallow_no_test: Code written but no pytest or bash ever run? Block submission.
  • repeat_loop: Same tool call repeated N times? Warn, then force a different action.
  • no_progress_streak: Multiple turns with no file writes? Inject a reflection prompt.
  • explore_overload: Extended reading without producing code? Prompt to start writing.
  • consecutive_read_only: N+ consecutive read-only turns? Force action.

Escalation path: warn → force_action → auto-submit (after 3 forced actions). This prevents the most expensive failure mode: an agent loop burning 35 turns of expensive model calls while accomplishing nothing.
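Two of the tripwires and the escalation path can be sketched in one small class. This is not the code in tripwires.py: the tool names (`write_file`, `bash`, `pytest`), the `repeat_limit` of 3, and the class itself are hypothetical; only the warn, force_action, auto-submit ordering and the limit of 3 forced actions come from the text.

```python
from dataclasses import dataclass

@dataclass
class TripwireState:
    repeat_limit: int = 3      # assumed value for "N" in repeat_loop
    max_forced: int = 3        # from the text: auto-submit after 3 forced actions
    last_call: object = None
    repeats: int = 0
    warned: bool = False
    forced: int = 0
    wrote_code: bool = False
    ran_tests: bool = False

    def observe(self, tool, args):
        """Record one tool call; return 'ok', 'warn', 'force_action' or 'auto_submit'."""
        call = (tool, args)
        self.repeats = self.repeats + 1 if call == self.last_call else 1
        self.last_call = call
        if tool == "write_file":
            self.wrote_code = True
        if tool in ("bash", "pytest"):
            self.ran_tests = True
        if self.repeats < self.repeat_limit:
            return "ok"
        if not self.warned:
            self.warned = True
            return "warn"
        if self.forced >= self.max_forced:
            return "auto_submit"
        self.forced += 1
        return "force_action"

    def may_submit(self):
        """shallow_no_test: block submission when code was written but never run."""
        return not (self.wrote_code and not self.ran_tests)
```

Seven identical calls in a row produce ok, ok, warn, three force_action results, and then auto_submit, so a stuck loop is cut off long before a 35-turn budget is spent.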

The tripwire system is the Elevator equivalent of Research Agent’s self-healing rules. Both systems face the problem of autonomous agents that can get into unproductive states without external intervention. Research Agent’s agents are long-running background processes; Elevator’s are short foreground loops. The solution is the same: explicit behavioral contracts, monitored in real time, with automatic corrective action that escalates from gentle nudges to hard stops.

Skill evolution: accumulated institutional knowledge

After every successful task, Elevator distills a “skill” — a plan template abstracted from the specific task. Concrete file paths, variable names, and API endpoints are stripped; structural patterns are retained. Skills follow a lifecycle:

candidate (score 0.5) → validation gate → active (score grows on reuse)
    │                                           │
    └─ rejected → archived                      ├─ fail_streak ≥ 3: needs_review
                                                ├─ score < threshold: archived
                                                └─ unused 180+ days: archived
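The lifecycle above maps onto a small state machine. In this sketch the initial score of 0.5, the fail-streak limit of 3, and the 180-day cutoff come from the diagram; `ARCHIVE_BELOW`, `REUSE_BONUS`, and `FAIL_PENALTY` are placeholder numbers, since the text does not give them.

```python
from dataclasses import dataclass

ARCHIVE_BELOW = 0.2   # assumed threshold; the text does not state the number
REUSE_BONUS = 0.1     # assumed increment per successful reuse
FAIL_PENALTY = 0.1    # assumed decrement per failed reuse

@dataclass
class Skill:
    status: str = "candidate"
    score: float = 0.5
    fail_streak: int = 0
    days_unused: int = 0

def validate(skill, passed):
    """Validation gate: a candidate becomes active or is archived."""
    if skill.status == "candidate":
        skill.status = "active" if passed else "archived"

def review(skill):
    """Apply the three exits from the active state."""
    if skill.status != "active":
        return
    if skill.fail_streak >= 3:
        skill.status = "needs_review"
    elif skill.score < ARCHIVE_BELOW or skill.days_unused >= 180:
        skill.status = "archived"

def record_use(skill, success):
    """Update an active skill after the planner reused it."""
    if skill.status != "active":
        return
    skill.days_unused = 0
    if success:
        skill.score += REUSE_BONUS
        skill.fail_streak = 0
    else:
        skill.score -= FAIL_PENALTY
        skill.fail_streak += 1
    review(skill)
```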

Retrieval uses dual strategies: semantic (embedding cosine similarity) and keyword (overlap between query keywords and skill context keywords). Higher-scoring skills rank above lower-scoring ones. Only active skills are visible to the planner.

When the skill store exceeds a threshold, similar skills (Jaccard coefficient ≥ 0.3 on keyword sets) are clustered and merged into aggregate meta-skills. The system retains what works generally and forgets what worked only once.
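The Jaccard coefficient is the size of the intersection of two sets divided by the size of their union. A minimal clustering sketch follows; the greedy single-pass strategy is an assumption for illustration, since the text only specifies the 0.3 threshold.

```python
def jaccard(a, b):
    """Jaccard coefficient of two keyword collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_skills(keyword_sets, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for idx, kws in enumerate(keyword_sets):
        for cluster in clusters:
            if jaccard(kws, keyword_sets[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])  # no similar cluster found: start a new one
    return clusters
```

With keyword sets `{canvas, game, loop}` and `{game, loop, rpg}`, the coefficient is 2/4 = 0.5, so the two skills land in the same cluster and become candidates for merging.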

The skill system is Research Agent’s knowledge graph applied to the execution layer. Research Agent builds a graph of scientific concepts and their relationships; Elevator builds a graph of engineering patterns and their applicability scores. Both are trying to solve the same underlying problem: how does a system avoid making the same mistake twice? Research Agent uses FSM-audited history to know what has been tried. Elevator uses scored skill templates to know what has worked. The shapes are different. The intent is identical.

In practice, the skill system means that Elevator’s second attempt at an idle RPG is materially better than the first. The planner pulls the skill template from the first project and uses its structure as a starting point. The architecture choices that worked (configuration-driven design, separation of the combat engine from the UI layer, save/load as a pure serialization concern) are baked into the retrieved skill. The mistakes, such as a combat system tightly coupled to the character system’s internal data structure, are annotated in the lesson store and injected as negative guidance. The system accumulates institutional memory without human curation.

Elevator’s skill evolution lifecycle — from candidate through active to archived.


The Principles That Emerged

Looking across these four systems, certain patterns recur regardless of language, framework, or problem domain. They are not best practices from textbooks — they are conclusions I reached by making specific mistakes and then finding specific fixes. The textbooks confirm them in retrospect, which is reassuring but not the source.

1. State machines are not just for network protocols

Every system that manages multi-step workflows benefits from an explicit FSM. AI4Marketing’s video pipeline tracks six phases. Research Agent’s idea lifecycle has 15 legal states with audited transitions. Elevator’s task/subgoal/milestone hierarchy is a nested three-level state machine. DaaS tracks ingest job states through pipeline phases.

The alternative — checking boolean flags and timestamps to infer state — is the single largest source of bugs I have encountered across all four systems. “Is this experiment done?” should be answered by protocol.status == 'analyzed', not inferred from file existence and modification times.

A concrete example from Research Agent: before the FSM, I spent an afternoon debugging an experiment that had been stuck for three days. The protocol had a failure_analysis.json file written by an agent that crashed halfway through. The coordinator’s “done” check was os.path.exists('failure_analysis.json') — so it marked the experiment complete. But the file was malformed and contained no usable decision. The FSM fix: transition() to analyzed requires not just that the file exists, but that it parses and contains a decision field with a valid value. File existence is not state. State is state.
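The guard described here can be sketched as a function that returns both a verdict and a reason. The function name and the set of valid decision values are hypothetical; the text only says the file must parse and carry a `decision` field with a valid value.

```python
import json
from pathlib import Path

VALID_DECISIONS = {"retry", "redesign", "abandon"}  # hypothetical values

def can_mark_analyzed(path):
    """Return (ok, reason). File existence alone is not enough."""
    p = Path(path)
    if not p.exists():
        return False, "analysis file is missing"
    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        return False, f"file does not parse: {exc}"
    decision = data.get("decision") if isinstance(data, dict) else None
    if not isinstance(decision, str) or decision not in VALID_DECISIONS:
        return False, "no valid decision field"
    return True, "ok"
```

A half-written file from a crashed agent fails the parse step, so the experiment stays in its previous state instead of being marked complete.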

2. The monolith earns its keep until lifecycle boundaries emerge

AI4Marketing at 121 routes is still a monolith and still the right choice. The moment I needed multi-step, multi-agent orchestration with independent failure domains and independent resource budgets (Research Agent), the monolith became untenable. The inflection point is not lines of code, team size, or throughput — it is whether different parts of the system need independent lifecycle management: startup, shutdown, crash recovery, resource allocation, deployment cadence.

Research Agent’s coordinator and pipeline are separate processes because they need different OOM priorities, different memory budgets, and different restart behaviors. The coordinator manages state and is protected from OOM; the pipeline executes LLM calls and is sacrificed first. Those requirements are genuinely incompatible inside a single process. If the coordinator could tolerate being killed as freely as the pipeline, they would be one process. They cannot. So they are two.

The same logic applies in the other direction: AI4Marketing has a video pipeline that takes 15 minutes and a quota check that takes 20 milliseconds. Both live in the same process. This is fine because they share the same lifecycle requirements — same database, same deployment, same crash recovery behavior, same memory budget. The video pipeline does not need to be a separate service just because it takes longer. It needs to be a separate service only if it needs a separate lifecycle. Right now, it does not.

3. Zero-dependency is a form of reliability engineering

DaaS has been running for months with zero dependency-related incidents. No npm audit alerts, no upstream breaking changes, no mystery crashes in someone else’s event loop. The cost is more boilerplate. The benefit is that every failure mode is my own code — debuggable, fixable, and predictable.

This is most valuable under time pressure. When something breaks in production at 11 PM, the last thing I want is a stack trace that ends in node_modules/some-library/dist/index.js:347. I want a stack trace that ends in lib/my-code.ts:89. The former requires understanding someone else’s code under pressure. The latter requires understanding my own code, which I already wrote and presumably understand. Zero-dependency is risk management for debugging sessions.

The complementary lesson is that dependency-free is not always right. DaaS is simple enough to implement dependency-free. AI4Marketing is not — the Prisma ORM alone handles PostgreSQL connection pooling, migration running, and type-safe query generation in ways that would take months to reimplement correctly. The question is not “should I have dependencies?” but “which dependencies am I willing to trust completely, and which parts of my system do I want to fully own?”

4. Dispatch predicates beat priority queues

Both Research Agent and Elevator use declarative predicates (“should this happen? why or why not?”) rather than imperative scheduling (“do this next”). Predicates are composable (add new ones without modifying existing ones), testable (unit test each in isolation), and debuggable (read the reason field when dispatch fails). Priority queues tell you what ran. Predicates tell you why something did not run.

The practical benefit showed up when I was debugging a Research Agent stall: four approved ideas, zero dispatches for 12 hours. With a priority queue I would have seen an empty queue and no further information. With predicates, I read nine reason fields: all nine said “All experimenter slots occupied (3/3).” Three experiments were locked by PIDs that no longer existed — the workers had died without releasing their locks. The lock reaper had not fired because the lock timestamps were 3.5 hours old, just under the 4-hour force-removal threshold. The reason field gave me the diagnosis in 30 seconds. Fixing the threshold to 2 hours took another 30 seconds. Total debug time: under two minutes. A priority queue would have given me no signal at all.
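A dispatch predicate in this style is a function that returns a verdict plus a human-readable reason. The sketch below is illustrative: the `Verdict` type, the two predicates, and the shape of the `state` dict are assumptions, with the slot message modeled on the one quoted above.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    reason: str

def idea_approved(state):
    if state["idea_status"] != "approved":
        return Verdict(False, f"idea status is {state['idea_status']}")
    return Verdict(True, "idea approved")

def slots_free(state):
    used, total = state["slots_used"], state["slots_total"]
    if used >= total:
        return Verdict(False, f"All experimenter slots occupied ({used}/{total})")
    return Verdict(True, f"{total - used} slot(s) free")

# Adding a predicate means appending to this list; existing ones are untouched.
PREDICATES = [idea_approved, slots_free]

def should_dispatch(state):
    """Evaluate every predicate so the log shows all blocking reasons."""
    verdicts = [p(state) for p in PREDICATES]
    return all(v.ok for v in verdicts), [v.reason for v in verdicts if not v.ok]
```

Evaluating all predicates instead of stopping at the first failure costs a few microseconds and means one log line shows every reason a dispatch was refused.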

5. Self-healing is not optional for autonomous systems

Research Agent runs without human intervention for weeks at a time. This is possible only because it assumes failure as a normal operating condition: memory watchdogs, stale lock cleanup, state/artifact reconciliation, invariant checking after every boss cycle, and 40+ auto-repair rules. An autonomous system without self-healing is a system that breaks silently and accumulates invisible state corruption.

The distinction that matters is between reactive healing (detect a broken state, fix it) and preventive healing (detect conditions that will lead to broken state, fix them first). Research Agent does both. The invariant checker is reactive: it runs after every boss cycle and fixes what it finds broken. The memory watchdog is preventive: it detects memory pressure before OOM and exits cleanly rather than being killed chaotically. The lock reaper is both: it cleans up dead locks (reactive) and checks heartbeat freshness to mark workers unselectable before they accumulate more work (preventive).
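As an illustration of the reactive half, here is a sketch of a lock reaper. It is an assumption about one reasonable design rather than Research Agent's implementation: it reaps a lock when its owner process is gone or when the lock is older than a maximum age, and the liveness check is injected as a callable so it can be tested without real processes. The lock dict shape and the 2-hour default are illustrative.

```python
import time

def reap_stale_locks(locks, pid_alive, now=None, max_age_s=2 * 3600):
    """Return (kept, reaped). A lock is reaped if its owner is dead or it is too old.

    locks: list of dicts with 'pid' and 'acquired_at' (epoch seconds).
    pid_alive: callable(pid) -> bool, injected so the check is testable.
    """
    now = time.time() if now is None else now
    kept, reaped = [], []
    for lock in locks:
        dead = not pid_alive(lock["pid"])
        expired = now - lock["acquired_at"] > max_age_s
        (reaped if dead or expired else kept).append(lock)
    return kept, reaped
```

Checking owner liveness as well as age means a lock held by a crashed worker is released on the next reaper pass instead of waiting out the full age threshold.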

The 40+ rules in heal_rules.py are not clever code. Each one is a specific failure mode I encountered, wrote down, and automated. Rule 1 exists because I once found an experiment stuck in executed for six days. Rule 3 exists because an idea was once redesigned seventeen times before I realized the protocol was fundamentally incompatible with the available dataset. Rule 4 exists because needs_redesign can be a sink state if the designer keeps getting killed before it can write a new protocol. Each rule has a comment explaining the bug that prompted it.

6. Cross-family verification prevents model hallucination laundering

If the same model family both produces and evaluates output, errors compound rather than cancel. The model has systematic blind spots it cannot detect in its own output. Elevator’s cross-family verification caught real bugs that within-family review missed. The cost is an extra full LLM call per subgoal for review; the benefit is catching bugs before they compound through dependent subgoals.

7. The architecture reflects the operator, not just the problem

Looking at these four systems together, I notice they encode specific anxieties. AI4Marketing is obsessed with atomicity — every quota operation is an atomic conditional update, every video phase is durably recorded, every shutdown is graceful. Research Agent is obsessed with observability — every state transition is audited, every dispatch has a logged reason, every anomaly writes to an invariant log. DaaS is obsessed with surface area reduction — fewer dependencies means fewer failure modes to understand. Elevator is obsessed with error compounding — cross-family review exists specifically because I have seen how bugs cascade through dependent subgoals.

These are not just technical choices. They are answers to the question “what keeps me awake at 3 AM?” Quota races kept me awake in early AI4Marketing. Opaque FSM state kept me awake in Research Agent. Unknown library behavior kept me awake in earlier projects. The architecture is a record of which failures I have personally experienced and promised myself not to experience again.

This is probably why architecture advice from other engineers is so often unsatisfying. The advice is correct for the anxieties the adviser has experienced. It may not address the anxieties that are specific to your system and your production history.

The flip side: the patterns I enforce most rigidly are the ones I once violated and paid for. The rate limiter cleanup job in AI4Marketing (evict entries above 10,000) exists because I once saw the rate limiter leak memory until the process was killed. The lock TTL in Research Agent (originally 4 hours, since tightened to 2) exists because I once had a system that held locks forever after a worker crash and never recovered. The shallow_no_test tripwire in Elevator exists because I once watched an LLM write 300 lines of convincing-looking code that never executed a single test. Architecture is scar tissue with better naming.


The Meta-Pattern

If I had to compress everything above into a single architectural principle:

Make the system’s state explicit, its transitions auditable, and its recovery automatic. Everything else is optimization.

AI4Marketing makes quota state explicit with atomic conditional updates. DaaS makes routing state explicit with mtime-invalidated caches and atomic file persistence. Research Agent makes research progress explicit with FSM transitions audited to history files. Elevator makes task execution explicit with DAG batches and structured verification checkpoints.
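An atomic conditional update folds the check and the write into a single statement, so two concurrent requests cannot both spend the last unit of quota. The sketch below uses Python's built-in sqlite3 purely to stay self-contained; AI4Marketing itself runs on PostgreSQL, and the `quota` table and column names are hypothetical.

```python
import sqlite3

def consume_quota(conn, user_id, amount=1):
    """Decrement quota only if enough remains; check and write are one statement."""
    cur = conn.execute(
        "UPDATE quota SET remaining = remaining - ? "
        "WHERE user_id = ? AND remaining >= ?",
        (amount, user_id, amount),
    )
    conn.commit()
    # rowcount is 1 when the condition held and the row was updated, else 0.
    return cur.rowcount == 1

# Demo setup with an in-memory database and a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quota (user_id TEXT PRIMARY KEY, remaining INTEGER NOT NULL)")
conn.execute("INSERT INTO quota VALUES ('u1', 2)")
conn.commit()
```

The alternative, reading the remaining quota and then writing the decremented value in a second statement, leaves a window in which another request can read the same starting value.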

The systems that cause me the least operational pain are the ones where I can answer “what is happening right now?” by reading a single file or querying a single endpoint. The ones that wake me up at 3 AM are the ones where state is implicit — scattered across log files, inferred from timestamps, or encoded in the absence of other state (“if this file does not exist, we must be in phase 3”).

The progression across the four systems is not really monolith → microservices → distributed agents. It is better described as a progression in where state lives and how explicitly it is managed:

  • AI4Marketing: state in PostgreSQL rows, managed via conditional SQL updates, queried by direct foreign-key joins.
  • DaaS: state in JSON files, managed via atomic filesystem operations, invalidated by mtime comparison.
  • Research Agent: state in JSON files with .history.jsonl audit logs, managed via an FSM that rejects illegal transitions.
  • Elevator: state in a JSON DAG with explicit phase tracking per node, managed by a coordinator that enforces topological ordering.
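The Research Agent entry in the list above (an FSM that rejects illegal transitions and appends to a .history.jsonl log) can be sketched in a few lines. The transition table here is a hypothetical four-state subset, and the `designed` state is invented for the example; the real lifecycle has 15 states.

```python
import json
import time

# Hypothetical subset of legal transitions, for illustration only.
LEGAL = {
    "approved": {"designed"},
    "designed": {"executed", "needs_redesign"},
    "executed": {"analyzed"},
    "needs_redesign": {"designed"},
}

class IllegalTransition(Exception):
    pass

def transition(record, new_status, history_path, reason):
    """Move record['status'] to new_status and append one audit line, or raise."""
    old = record["status"]
    if new_status not in LEGAL.get(old, set()):
        raise IllegalTransition(f"{old} -> {new_status} is not allowed")
    record["status"] = new_status
    entry = {"ts": time.time(), "from": old, "to": new_status, "reason": reason}
    with open(history_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")  # one JSON object per line
    return entry
```

Because rejected transitions raise before anything is written, the history file contains only transitions that actually happened, which is what makes it usable as an audit trail.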

Each step is more explicit than the last. The complexity did not go down — Research Agent is dramatically more complex than AI4Marketing. But the state became easier to reason about at each step, even as the system grew more sophisticated. That is the real payoff of explicit state: you can add capability without adding proportional opacity.

Explicit state also makes onboarding easier, in the sense of “onboarding your future self to the system six months from now.” When I return to Research Agent after two weeks away, I can answer “what is the system doing right now?” in about 90 seconds: read the coordinator’s last boss cycle log, check data/invariant_violations.json, look at the active locks. Without the FSM and the audit logs, reconstructing that picture would take an hour of log spelunking. The 222 lines of pipeline_state.py save me hours every month.

Architecture is not about choosing between monolith and microservices, SQL and NoSQL, sync and async. Those are implementation details that follow from a more fundamental choice: where does state live, how does it change, and who is allowed to change it? Answer those three questions with clarity and discipline, and the rest designs itself.

The secondary question — one I kept getting wrong in early iterations — is: what are the failure modes of your state management, and can the system detect and recover from them automatically? PostgreSQL’s transactions handle AI4Marketing’s failure modes. DaaS’s atomic file operations handle its own. Research Agent’s FSM + invariant checker + heal rules handle the rest. None of these systems is failure-free. All of them recover from their common failures without human intervention. That gap — between “failure-free” and “failure-recoverable” — is where most of the real engineering work happens.

State explicitness progression across the four systems — from implicit database rows to audited FSM transitions.


This is Part 1 of Product Thinking (5 parts in total). Next: Part 2 — Security Engineering
