
Product Thinking (4): Self-Healing Systems — Teaching Machines to Fix Themselves
The philosophy and engineering of systems that detect, diagnose, and fix their own failures — from watchdog anti-patterns to autonomous improvement engines.
The Bug That Fixed Itself#
One morning in late May 2026, I woke up to a DingTalk notification from my research agent system:
“self_heal Rule 37 triggered: restarted research-pipeline after 3 consecutive OOM kills. Root cause: scanner thread retained full PDF buffers across iterations. Applied patch: explicit del after extraction. Validation: 45 minutes post-patch, RSS stable at 1.2 GB (was 2.4 GB pre-patch).”

I had not written that patch. I had not even known there was a problem. The system noticed a pattern — three OOM kills within six hours — correlated kill timing with the scanner’s activity log, diagnosed the leak, proposed a targeted fix, applied it as a git commit, ran a 45-minute validation window, and notified me only after the RSS numbers confirmed stability.
This is self-healing. Not error-swallowing. Not blind retries. A structured, verified response to failure that leaves the system stronger than before.
I did not arrive here overnight. It took months of building, breaking things, learning anti-patterns the hard way, and slowly distilling those experiences into something the system could consume autonomously. This essay traces that journey — from a naive watchdog that killed healthy processes, to a kaizen autopilot that proposes, guards, applies, validates, and learns.
Why Self-Healing Beats Reliability Engineering#
Traditional reliability engineering asks: “How do we prevent failures?” Self-healing asks a different question: “Given that failures are inevitable, how do we make the system’s response to failure an inherent capability — not an external intervention?”
The distinction matters in practice. I run a research agent system on a single 4-vCPU, 7.5 GB server. It manages 12 concurrent agents (scanner, reader, ideator, experimenter, designer, writer, reviewer, statistician, and more), coordinates fleet workers across machines, maintains a knowledge graph with tens of thousands of nodes, and publishes research papers — all autonomously, 24/7. I am one person. I sleep.
If every failure mode required me, the system would be dead half the time. The math is brutal: with 40+ known failure modes, each with even a 5% daily probability and assuming they fail independently, something breaks on roughly 87% of days (1 − 0.95^40 ≈ 0.87), about two incidents a day on average. At 30 minutes of diagnosis plus fix per incident, and with many incidents landing while I sleep, firefighting would crowd out everything else I do.
Self-healing changes the economics. Each failure mode, once encountered and fixed by a human, gets encoded as a rule that fires automatically next time. Over months, the system accumulates institutional knowledge — not in documentation nobody reads, but in executable logic that runs every minute. The human’s role shifts from firefighter to rule author to auditor.
The Master Principle#
Every system I build now runs under what I call the Master Principle (师傅原则). The metaphor is deliberate: a master craftsman does not just fix the roof — he teaches the apprentice to recognize when a roof needs fixing, how to diagnose the failure mode, and how to repair it correctly. The knowledge compounds.
The principle has three layers:
Layer 1: Immediate Fix. Something breaks. You fix it. Table stakes — every engineer does this.
Layer 2: Lesson Extraction. After fixing, you extract the pattern as structured data: problem signature (what observable signals indicated the failure), diagnosis path (what you checked and in what order), fix template (the repair, parameterized for reuse). Not a vague “lessons learned” document. Machine-readable structure.
Layer 3: Automated Scan Rule. You encode the lesson into something the system executes. A proactive scan that detects the same failure class before it manifests. An auto-fix rule that applies the same repair when the pattern recurs. A guard that prevents the condition from arising.
Most engineers stop at Layer 1. Good engineers reach Layer 2. The systems I build are designed to reach Layer 3 automatically — the kaizen autopilot reads outcomes from past interventions, distills lessons via LLM, and proposes new scan rules based on observed patterns. The lesson format is structured specifically for this:
scope: supervisor_rules
when: scanner thread RSS exceeds 2x normal baseline for >10 minutes
prefer: send SIGTERM + wait 60s + SIGKILL (not immediate kill)
because: immediate kill leaves shared mmap segments open;
graceful shutdown releases them cleanly
falsifiable: true iff RSS returns to baseline within 5 minutes post-restart
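The “because” clause is what separates a lesson from a rule. It lets the system know when the lesson no longer applies: for instance, if the scanner switches from mmap to a streaming reader, there are no shared segments left to release, so the preference for a graceful shutdown loses its justification and rules calibrated on it, such as Rule 15, are due for review.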
The meta-principle: your code should be your apprentice, not just your tool. Teach it the reasoning, not just the action.
What the Lesson Format Looks Like in Practice#
The distillation step is where most engineers exit the loop too early. They write a commit message, close the ticket, and move on. What the kaizen proposer actually needs is different — it needs to reason about whether a past lesson applies to a new situation. That requires structure.
The research agent’s lesson store uses this schema:
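A minimal sketch of what such a schema could look like, assuming only the fields this essay names (scope, when, prefer, because, falsifiable, retired_when). The Lesson class, the created_at and calibrated_for fields, and the retired_when example value are hypothetical illustrations, not the agent's actual store.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Lesson:
    """One distilled lesson. Field names follow the essay; types are assumed."""
    scope: str                # which component the lesson constrains
    when: str                 # observable condition that makes it relevant
    prefer: str               # the recommended action
    because: str              # the reasoning, so the lesson can be re-evaluated
    falsifiable: str          # observable outcome that confirms or refutes it
    retired_when: str         # condition under which the lesson stops applying
    created_at: str = ""      # hypothetical: ISO date the lesson was distilled
    calibrated_for: str = ""  # hypothetical: system size at calibration time

    def missing_fields(self) -> list:
        """Names of the six core fields that were left empty."""
        core = ("scope", "when", "prefer", "because", "falsifiable", "retired_when")
        return [name for name in core if not getattr(self, name).strip()]


lesson = Lesson(
    scope="supervisor_rules",
    when="scanner thread RSS exceeds 2x normal baseline for >10 minutes",
    prefer="send SIGTERM + wait 60s + SIGKILL (not immediate kill)",
    because="immediate kill leaves shared mmap segments open",
    falsifiable="RSS returns to baseline within 5 minutes post-restart",
    retired_when="scanner no longer uses mmap",  # illustrative value
)
incomplete = Lesson("supervisor_rules", "w", "p", "", "f", "r")

print(json.dumps(asdict(lesson), indent=2))
print(incomplete.missing_fields())
```

Rejecting a lesson with an empty because or retired_when at write time is one way to keep the store from filling up with rules that cannot be re-evaluated later.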
The retired_when field is the least obvious but most important. Without it, lessons outlive their validity silently. A rule calibrated for a 10,000-node knowledge graph becomes wrong when the graph grows to 100,000 nodes and query latency naturally increases. Rule 22 hit exactly this case — kaizen caught the calibration drift and updated the threshold, preserving the lesson’s structure while changing its parameter.
The proposer reads the lesson store before generating candidates. If an intervention type has a “prefer Y because Z” lesson, the proposer either follows it (and cites it in the proposal) or explicitly argues against it (and records why). This is the mechanism by which past experience constrains future proposals without hard-coding behavior.
Architecture: Four Levels, Four Time Horizons#
My research agent’s self-healing operates at four levels, each handling failures the level below cannot catch.
Level 0: Process Supervision (seconds)#
The supervisor runs every minute via cron. Its job is blunt: are the essential processes alive? If not, restart them. Seven processes are watched:
research-coordinator (systemd, restart on-failure)
research-pipeline (systemd, restart on-failure)
research-dashboard (systemd, restart always)
research-supervisor (systemd, restart always)
research-kg-merger (systemd, restart on-failure)
research-dingtalk (systemd, restart on-failure)
research-fleet (supervisor child)
This handles crashes, OOM kills, and hung processes. Crude — but necessary. Without a stable substrate, higher-level healing has nothing to run on.
The supervisor also handles stale code. When source files in agents/, framework/, or lib/ are modified, the coordinator (which imported those modules at startup) is running old code. The supervisor detects this by comparing source mtime against process start time, then sends SIGTERM with a 5-minute debounce. If graceful shutdown takes longer than 60 seconds, SIGKILL follows.
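The stale-code check can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the process start time is passed in (however the supervisor obtains it), and the 5-minute debounce is read here as "wait until the source tree has been quiet for 5 minutes", which may differ from the real supervisor's meaning.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def newest_source_mtime(roots: list) -> float:
    """Most recent mtime of any .py file under the given roots (0.0 if none)."""
    newest = 0.0
    for root in roots:
        for path in Path(root).rglob("*.py"):
            newest = max(newest, path.stat().st_mtime)
    return newest


def is_running_stale_code(process_start: float, roots: list,
                          debounce_s: float = 300.0,
                          now: Optional[float] = None) -> bool:
    """True if source changed after the process started AND the tree has been
    quiet for at least debounce_s, so a restart does not land mid-edit."""
    now = time.time() if now is None else now
    newest = newest_source_mtime(roots)
    changed_after_start = newest > process_start
    settled = (now - newest) >= debounce_s
    return changed_after_start and settled


# Demo with a throwaway directory and hand-set timestamps.
with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "agent.py"
    src.write_text("print('v2')\n")
    os.utime(src, (2_000.0, 2_000.0))  # file edited at t=2000
    started = 1_000.0                  # process started at t=1000
    just_edited = is_running_stale_code(started, [root], now=2_060.0)
    settled = is_running_stale_code(started, [root], now=2_400.0)

print(just_edited, settled)
```

A caller that gets True would then send SIGTERM and fall back to SIGKILL after the 60-second grace period described above.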
Level 1: Signal Monitoring (minutes)#
Every 5 minutes, a health snapshot captures time-series signals:
papers_accepted_24h (output throughput)
ideas_approved_rate_7d (pipeline yield)
api_error_rate_1h (upstream health)
coordinator_rss_mb (memory trend)
pipeline_rss_mb (memory trend)
worker_sync_lag_s (fleet consistency)
self_heal_recoveries_24h (self-heal success rate)
self_heal_failures_24h (escalation count)
Each signal carries metadata: unit, directionality (“is rising good or bad?”), alert threshold. The kaizen system uses these signals both as observation inputs (what does the system look like now?) and as validation outputs (did my intervention actually move the needle?).
Level 2: Self-Heal Rules (minutes to hours)#
The research agent runs 40+ self-heal rules, split across the supervisor and coordinator. Every rule follows the same structure:
- Trigger: pattern match on logs, metrics, or state
- Diagnosis: confirm the problem is real, not a transient blip
- Action: concrete, bounded repair
- Validation: confirm the fix worked
- Escalation: notify human if fix failed
Rules are numbered (Rule 1 through Rule 41+) and logged with a consistent prefix: self_heal Rule N: <action>. This makes them greppable, countable, and auditable.
Before these rules, I spent months in a reactive bug-fix cycle — non-atomic writes causing corruption, PID-based locks that became stale after crashes, no state validation across agent transitions. The systemic fix in May 2026 addressed these at the foundation level: atomic file writes (temp-file + rename), fcntl.flock-based locks that auto-release on crash, a state_machine.py that validates protocol/idea status transitions with history tracking, and an invariant_checker.py that runs every 10 coordinator cycles checking experiment consistency, stale locks, and resource usage.
The invariant checker is the most important piece. Without it, inconsistencies accumulate silently until they manifest as confusing symptoms hours or days later. With it, violations are detected and logged immediately — and self-heal rules can act on the violation log before it compounds.
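The two foundation fixes named above, temp-file + rename writes and fcntl.flock locks, can be sketched with the standard library alone. The helper names atomic_write_json and try_lock are hypothetical, and fcntl is POSIX-only.

```python
import fcntl
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Readers see either the old file or the new one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # force bytes to disk before the rename
        os.replace(tmp_path, path)   # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def try_lock(lock_path: str):
    """Take an exclusive, non-blocking flock. Returns the open file object
    (keep a reference to hold the lock) or None if someone else holds it.
    The kernel drops the lock when the holder exits, even on a crash."""
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


with tempfile.TemporaryDirectory() as d:
    state_file = os.path.join(d, "state.json")
    atomic_write_json(state_file, {"status": "running"})
    with open(state_file) as f:
        loaded = json.load(f)
    first = try_lock(os.path.join(d, "pipeline.lock"))
    second = try_lock(os.path.join(d, "pipeline.lock"))  # already held
    first.close()                                        # closing releases it

print(loaded, first is not None, second is None)
```

Because the lock lives in the kernel rather than in the file's contents, there is no PID to go stale: a crashed holder releases it automatically.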
A rule in practice:
Rule 12: Stale Lock Sweep
trigger: lock file exists longer than 2x expected process lifetime
diagnose: check if PID in lock file is still running
action: if PID is dead, delete the lock file
validate: blocked process resumes within 60 seconds
escalate: if process still stalled after lock removal, restart it
Before these rules existed, I would get paged with “pipeline stalled” and spend 20 minutes discovering a stale .lock file from a process that died 4 hours earlier. Now Rule 12 clears it automatically, usually before I notice anything is wrong.
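A minimal sketch of the trigger, diagnose, and action steps of Rule 12, assuming the lock file contains only the owner's PID as text. The function names and return values are hypothetical; the validate and escalate steps are left to the caller.

```python
import os
import subprocess
import sys
import tempfile
import time
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """Signal 0 performs existence and permission checks without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def sweep_stale_lock(lock_path: str, max_age_s: float,
                     now: Optional[float] = None) -> str:
    """Return 'absent', 'fresh', 'unreadable', 'held' or 'removed'."""
    if not os.path.exists(lock_path):
        return "absent"
    now = time.time() if now is None else now
    if now - os.path.getmtime(lock_path) <= max_age_s:
        return "fresh"                       # trigger not met
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return "unreadable"                  # escalate rather than guess
    if pid <= 0:
        return "unreadable"                  # 0 and negatives address groups
    if pid_is_alive(pid):
        return "held"                        # old, but legitimately owned
    os.unlink(lock_path)
    print(f"self_heal Rule 12: removed stale lock {lock_path} (pid {pid} dead)")
    return "removed"


# Demo: a child that has exited and been reaped gives us a dead PID.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
with tempfile.TemporaryDirectory() as d:
    dead_lock = os.path.join(d, "dead.lock")
    live_lock = os.path.join(d, "live.lock")
    for path, pid in ((dead_lock, child.pid), (live_lock, os.getpid())):
        with open(path, "w") as f:
            f.write(str(pid))
        os.utime(path, (0, 0))               # pretend the lock is very old
    dead_outcome = sweep_stale_lock(dead_lock, max_age_s=3600)
    live_outcome = sweep_stale_lock(live_lock, max_age_s=3600)
    dead_still_exists = os.path.exists(dead_lock)
```

PIDs can be reused by the operating system, so a long-dead owner may occasionally look alive; that failure mode is conservative (the lock is kept and the rule escalates) rather than destructive.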
Level 3: Kaizen Autopilot (hours to days)#
This is the interesting part. The kaizen system is a separate daemon that observes the research agent from outside, proposes improvements, applies them with safety guards, validates results, and distills lessons for future proposals. It runs on its own systemd unit and cannot modify itself (that path is hard-blocked).
The loop: Observe → Propose → Guard → Apply → Validate → Learn.
Observe: every 5 minutes, sample all signals and build time-series history.
Propose: every hour, an LLM (“the proposer”) reviews recent observations, current lessons, and past interventions. It generates candidate interventions — concrete actions with explicit hypotheses and numerical predictions.
Guard (pre-flight): before any intervention is applied, it must clear six checks:
- Not in a blocked path (kaizen cannot touch itself, git internals, or secrets)
- Not a blocked action kind
- Within daily budget (max 5 interventions per day on research agent; max 3 on AI4Marketing’s production-facing variant)
- Below concurrent limit (max 2 simultaneous)
- Not a duplicate of recent actions (2-hour dedup window)
- Has a concrete rollback defined
For risk >= medium: canary infrastructure must exist. If it does not, the intervention is proposed into backlog but never applied — which forces me to build canary capability before the system can self-modify in that area. The constraint is generative.
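The pre-flight checks can be sketched as a pure function that returns the reasons an intervention is blocked. Everything named here is hypothetical: the Intervention and GuardState shapes, the blocked path prefixes, the blocked action kind, and the failure labels. Only the limits (5 per day, 2 concurrent, dedup, rollback, canary for risk >= medium) come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

BLOCKED_PATH_PREFIXES = ("kaizen/", ".git/", "secrets/")  # hypothetical paths
BLOCKED_KINDS = {"delete_data"}                           # hypothetical kind
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Intervention:
    kind: str
    target_path: str
    risk: str
    rollback: Optional[str]
    fingerprint: str            # identity used for the dedup window


@dataclass
class GuardState:
    applied_today: int
    running_now: int
    recent_fingerprints: set    # fingerprints seen in the last 2 hours
    canary_areas: set           # path prefixes that have canary infrastructure


def preflight(iv: Intervention, st: GuardState,
              daily_budget: int = 5, max_concurrent: int = 2) -> list:
    """Return the list of failed checks; an empty list means 'may apply'."""
    failures = []
    if iv.target_path.startswith(BLOCKED_PATH_PREFIXES):
        failures.append("blocked_path")
    if iv.kind in BLOCKED_KINDS:
        failures.append("blocked_kind")
    if st.applied_today >= daily_budget:
        failures.append("daily_budget_exhausted")
    if st.running_now >= max_concurrent:
        failures.append("concurrency_limit")
    if iv.fingerprint in st.recent_fingerprints:
        failures.append("duplicate_within_window")
    if not iv.rollback:
        failures.append("no_rollback")
    needs_canary = RISK_ORDER[iv.risk] >= RISK_ORDER["medium"]
    has_canary = any(iv.target_path.startswith(a) for a in st.canary_areas)
    if needs_canary and not has_canary:
        failures.append("backlog_until_canary_exists")
    return failures


state = GuardState(applied_today=1, running_now=0,
                   recent_fingerprints=set(), canary_areas={"agents/"})
ok = Intervention("tune_threshold", "agents/scanner.py", "medium",
                  "git revert", "scanner-rss")
bad = Intervention("tune_threshold", "kaizen/guard.py", "high",
                   None, "guard-edit")
print(preflight(ok, state))
print(preflight(bad, state))
```

Returning every failed check, rather than stopping at the first, keeps the audit log complete: a blocked proposal records all the reasons it was blocked.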
Apply: the adapter executes the action as a git commit. The commit SHA is recorded for deterministic rollback.
Validate: after a configurable window (minimum 60 minutes), check whether the intervention’s predictions held. Did the target signal move in the predicted direction by the predicted magnitude?
Learn: every 24 hours, the distiller reviews all finalized interventions — validated, no_effect, reverted — and extracts structured lessons. These feed back into the next proposer cycle.
Meta-review: weekly, a higher-level analysis examines calibration drift (are predictions getting more or less accurate over time?), recurring blind spots, and highest-value next moves.
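The Validate step's question (did the signal move in the predicted direction by the predicted magnitude?) can be sketched as a small classifier. The function name, the 50% tolerance, and the mapping of a wrong-direction move to "reverted" are assumptions for illustration; the three outcome labels are the ones the Learn step mentions.

```python
def classify_outcome(before: float, after: float, predicted_delta: float,
                     min_fraction: float = 0.5) -> str:
    """Compare the observed change in a signal against the prediction.

    'validated': moved the predicted way by at least min_fraction of the
                 predicted magnitude.
    'reverted':  moved the opposite way by a comparable amount, so the
                 change should be rolled back.
    'no_effect': anything in between.
    """
    if predicted_delta == 0:
        raise ValueError("a prediction must name a direction and a magnitude")
    observed = after - before
    same_direction = observed != 0 and (observed > 0) == (predicted_delta > 0)
    big_enough = abs(observed) >= min_fraction * abs(predicted_delta)
    if same_direction and big_enough:
        return "validated"
    if not same_direction and big_enough:
        return "reverted"
    return "no_effect"


# RSS in MB, with a prediction that the patch lowers it by about 1000 MB.
print(classify_outcome(2400, 1200, -1000))
print(classify_outcome(2400, 2380, -1000))
print(classify_outcome(2400, 3100, -1000))
```

Requiring a numeric prediction up front is what makes the later calibration review possible: the weekly meta-review can compare predicted and observed deltas across many interventions.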
War Stories: Six Failures That Taught Me Everything#
1. The Watchdog That Killed Healthy Processes#
Early on, I had a naive watchdog for the DingTalk listener. It checked whether the log file’s modification time was recent. If the log had not been written to in 5 minutes, the watchdog assumed the process was dead and killed it.
DingTalk is message-driven. If no messages arrive for 5 minutes — normal at night — the process writes nothing to the log. It is perfectly healthy: just waiting. The watchdog faithfully killed it every night, generating a restart loop that fired spurious “process restarted” alerts, which in turn triggered downstream noise.
The lesson: liveness signals must be intrinsic to the process, not inferred from its output. A message-driven process with no messages is idle, not dead. The fix was a periodic heartbeat inside the listener and supervision via systemd’s WatchdogSec= protocol (the process pings systemd’s socket every N seconds; systemd kills and restarts it if the ping stops).
The deeper pattern: using output artifacts as liveness probes conflates “doing work” with “being alive.” These are orthogonal. A process can be alive but idle. A process can be busy-looping while effectively dead (processing nothing useful). Good liveness checks test the control plane. Productivity checks test the data plane. Never conflate them.
This became a kaizen scan rule: flag any monitor that uses file mtime as a liveness signal.
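The heartbeat side of systemd's watchdog protocol needs no extra library: the process sends the datagram WATCHDOG=1 to the socket named in the NOTIFY_SOCKET environment variable. A minimal sketch, with hypothetical function names, followed by a local demo that stands in for systemd with a throwaway socket:

```python
import os
import socket
import tempfile


def sd_notify(message: str) -> bool:
    """Send one notification datagram to systemd. Returns False when the
    process is not running under a unit that provides NOTIFY_SOCKET."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):           # abstract-namespace socket
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def heartbeat() -> bool:
    """Call from the listener's own loop, idle or busy, at an interval
    shorter than WatchdogSec (half of it is a common choice)."""
    return sd_notify("WATCHDOG=1")


# Demo: a temporary datagram socket plays the role of systemd.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "notify.sock")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as receiver:
        receiver.bind(path)
        os.environ["NOTIFY_SOCKET"] = path
        sent = heartbeat()
        received = receiver.recv(64)
    del os.environ["NOTIFY_SOCKET"]

print(sent, received)
```

The important property is where the call sits: inside the process's main loop, so it fires when the loop is alive and waiting, and stops when the loop is wedged. Log activity plays no part in it.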
2. The Reward Guard That Was 0% Effective#
The research agent has a reward guard — a quality gate that validates experiment results before they enter the knowledge graph. It checks effect sizes, sample sizes, and replication attempts.
For two days, the guard had a 0% hit rate. Every single experiment sailed through unchecked. The system was accumulating garbage data while reporting everything as healthy.
The root cause was an async race. The guard ran at the “point of produce”, immediately after the experimenter agent reported completion, checking for the result artifact on disk. But the experimenter wrote results asynchronously: the agent signaled done, the coordinator advanced to the next step, and the file write happened milliseconds later via an async flush.
Result: the guard checked for a file that had not yet landed. It found nothing. Its logic was “if no artifact exists, there is nothing to guard” — reasonable for the case where an experiment produces no output. So it passed. Every time.
The fix: move the guard to the “point of use.” Validate the artifact where the next consumer (the statistician, the knowledge engine) reads it. At that point, the file is guaranteed to exist (or the consumer would have failed). Guard at consumption, not production.
The 14 papers accepted during the 2-day window were retroactively revalidated. Several were flagged as requiring replication. The kaizen lesson now reads: “in async pipelines, prefer point-of-use validation; point-of-produce guards can race async flushes.”
3. The Experiment That Lost 57% of Its Data#
The experimenter agent runs Python experiments via the Claude Code CLI. Each experiment has 4–8 conditions. The CLI has a timeout. When the timeout fires, the process is killed.
For months, 57% of experiments had no usable data. The reason: a timeout kill discarded everything. Conditions 1–3 might have completed with valid data — but none was preserved. The next run started from scratch, repeated conditions 1–3, timed out again at condition 4, and the cycle repeated. Wasted compute, zero progress.
The fix had two parts:
Checkpoint awareness: inject per-condition raw counts into the next prompt invocation. The agent could then see: “condition 1: 200 samples complete, condition 2: 200 samples complete, condition 3: 45 samples (partial), condition 4: not started.” It would resume from condition 3 rather than restarting.
Partial data preservation: rather than discarding incomplete experiments, the statistician learned to work with partial datasets — flagging them as lower-powered but still informative. Data accumulated across runs instead of evaporating on each timeout.
Result: completion rate went from 43% to 89%. The remaining 11% are experiments that genuinely require more compute than the timeout allows — and even those now accumulate data across runs, building toward eventual completion.
The principle: never design a system where a crash means total data loss. Every long-running operation should be checkpointable. This is obvious in database design (write-ahead logs) but almost never applied to LLM agent workflows, where practitioners treat each invocation as stateless. It is not.
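The checkpoint-awareness half of the fix can be sketched as a function that turns per-condition sample counts into the text injected into the next invocation, plus the condition to resume from. The function name and the dictionary shape are hypothetical; the wording of the summary mirrors the example quoted above.

```python
def summarize_progress(counts: dict, target: int) -> tuple:
    """Return (summary text for the next prompt, first unfinished condition).

    counts maps condition name -> samples collected so far, in run order.
    The second element is None when every condition has reached target.
    """
    parts = []
    resume_from = None
    for name, done in counts.items():
        if done >= target:
            status = f"{done} samples complete"
        elif done > 0:
            status = f"{done} samples (partial)"
        else:
            status = "not started"
        if done < target and resume_from is None:
            resume_from = name
        parts.append(f"{name}: {status}")
    return ", ".join(parts), resume_from


counts = {"condition 1": 200, "condition 2": 200,
          "condition 3": 45, "condition 4": 0}
summary, resume_from = summarize_progress(counts, target=200)
print(summary)
print("resume from:", resume_from)
```

For this to survive a timeout kill, the counts have to be written to disk as each batch of samples lands, not at the end of the run.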
4. The Pipeline That Starved for 28 Days#
The research pipeline dispatches work to agents via an FSM-based coordinator. Ideas transition through states: proposed → debating → approved → assigned → running → complete. One branch, finance ideas, showed zero throughput for 28 consecutive days: its approved ideas almost never reached the experimenter.
The surface symptom was “finance ideas not approved.” The investigation required tracing the full dispatch chain: from ideator generation through debate gate through classifier through coordinator assignment. The root cause was enumerate_candidates() in the coordinator: it filtered ideas by a naming convention, and finance ideas generated by the debate pass were tagged *_debate — a suffix the filter skipped by design to avoid re-debating already-debated ideas.
But the filtering condition was wrong. It skipped *_debate ideas at the assignment stage, not just the debate stage. So finance ideas that had successfully passed debate were never assigned to the experimenter. They sat in “approved” state indefinitely. 40 out of 41 approved finance ideas were in this state — starved for 28 days.
The systemic lesson: FSM diagnosis requires tracing the full state chain from source to sink, not just checking individual transitions. The kaizen lesson: “when a dispatch pipeline shows 0% throughput for a specific category, trace enumerate_candidates filtering logic — naming conventions that exclude states for one purpose can accidentally exclude them everywhere.”
5. The Logger That Was Never Logging#
The research agent daemons each set up their own logger with a SQLiteHandler — writing structured log records to a local database for later analysis. For a period of two days, the DingTalk listener’s error logs were silent even when errors were occurring. The handler was configured correctly. The daemon was running. But no records were appearing in the database.
The cause: the root logger already had a handler attached, a StreamHandler from an earlier import. When the daemon called logging.basicConfig(...) to configure its own handler, the call did nothing, because basicConfig is a no-op when the root logger already has handlers (unless force=True is passed). The daemon’s SQLiteHandler was never registered. All log output went to the existing root handler (the StreamHandler writing to stdout), which was not being captured in the analysis pipeline.
The fix: explicit handler attachment instead of relying on basicConfig. Every daemon that needs structured logging now does:
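A minimal sketch of that pattern, assuming a simple SQLite-backed handler. The SQLiteHandler class below is a hypothetical stand-in for the real one, and the table layout and the configure_daemon_logger helper are illustrative.

```python
import logging
import sqlite3


class SQLiteHandler(logging.Handler):
    """Hypothetical stand-in for the essay's handler: one row per record."""

    def __init__(self, db_path: str):
        super().__init__()
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS logs "
            "(created REAL, logger TEXT, level TEXT, message TEXT)")

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.conn.execute(
                "INSERT INTO logs VALUES (?, ?, ?, ?)",
                (record.created, record.name, record.levelname,
                 record.getMessage()))
            self.conn.commit()
        except Exception:
            self.handleError(record)


def configure_daemon_logger(name: str, db_path: str) -> logging.Logger:
    """Attach the handler to the daemon's own named logger, never via
    basicConfig, so pre-existing root handlers cannot silently win."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not any(isinstance(h, SQLiteHandler) for h in logger.handlers):
        logger.addHandler(SQLiteHandler(db_path))
    logger.propagate = False  # do not also hand records to root handlers
    return logger


# Simulate the bug's precondition: the root logger already has a handler.
logging.getLogger().addHandler(logging.StreamHandler())

log = configure_daemon_logger("dingtalk_listener", ":memory:")
log.error("webhook returned 502")
rows = log.handlers[0].conn.execute(
    "SELECT level, message FROM logs").fetchall()
print(rows)
```

The isinstance check makes the setup idempotent, so importing the daemon module twice does not attach two handlers.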
The propagate = False line is the second thing engineers miss. Without it, a record handled by the daemon’s own logger also propagates to the root logger and gets handled there too — double logging, double storage cost, confusing analysis.
The lesson: “never use basicConfig in a daemon that might be imported by a process that already called basicConfig (or configured the root logger). Use explicit logger-level handler attachment with propagate = False.” This is now a kaizen scan rule: detect any daemon-init code using basicConfig when a root logger handler is already present.
6. The Circuit Breaker That Never Tripped#
The system has a circuit breaker for API calls (DashScope, Claude, external services). When a service returns too many errors, the breaker opens and redirects to a fallback.
For weeks, write operations to one service failed at ~30%, but the breaker never triggered. Reads succeeded. The breaker showed a clean bill of health.
The bug was in record_success(). On a successful call, it cleared the entire failure window: it did not decrement a counter, it wiped the window. With alternating read-success / write-failure patterns, every successful read reset the failure count to zero before write failures could accumulate to the threshold. The breaker could never trip.
The fix distinguished between states:
- CLOSED (normal): failures tracked with a time window; successes do not clear failures — they expire by time only.
- HALF_OPEN (testing recovery): a single probe success → move back to CLOSED and reset.
- OPEN (tripped): all calls short-circuited to fallback.
The key insight: in CLOSED state, successes and failures are independent evidence. A successful read does not undo a failed write — they may be hitting different endpoints, different code paths, different backend shards. Only in HALF_OPEN — where you are explicitly testing “has the problem resolved?” — does a success mean “clear the failure record.”
The popular pattern “N failures in M seconds, reset on any success” is broken for mixed-operation services. The kaizen lesson captures this: “circuit breaker CLOSED state must age out failures by time, not clear them on success; success-based clear is correct only in HALF_OPEN.”
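The corrected state logic can be sketched as below. The class, its parameter names, and the default thresholds are hypothetical; the behavior follows the three states described above. One simplification: HALF_OPEN here lets every call through until the first result arrives, where a production breaker would admit a single probe.

```python
import time
from collections import deque


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, threshold=5, window_s=60.0, cooldown_s=30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = deque()   # timestamps of failures inside the window
        self.state = self.CLOSED
        self.opened_at = 0.0

    def _expire(self, now: float) -> None:
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()

    def allow(self) -> bool:
        """False means: short-circuit to the fallback."""
        now = self.clock()
        if self.state == self.OPEN and now - self.opened_at >= self.cooldown_s:
            self.state = self.HALF_OPEN   # cooldown over, test recovery
        return self.state != self.OPEN

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == self.HALF_OPEN:
            self.state, self.opened_at = self.OPEN, now  # probe failed
            return
        self.failures.append(now)
        self._expire(now)
        if len(self.failures) >= self.threshold:
            self.state, self.opened_at = self.OPEN, now

    def record_success(self) -> None:
        if self.state == self.HALF_OPEN:
            self.failures.clear()         # recovery confirmed
            self.state = self.CLOSED
        # CLOSED: deliberately nothing. Failures age out by time only.


# Demo with a fake clock: reads succeed, writes fail, interleaved.
t = [0.0]
breaker = CircuitBreaker(threshold=3, window_s=60.0, cooldown_s=30.0,
                         clock=lambda: t[0])
for _ in range(3):
    breaker.record_success()   # read
    breaker.record_failure()   # write
    t[0] += 1.0
state_after_mixed_traffic = breaker.state
t[0] = 40.0
probe_allowed = breaker.allow()
breaker.record_success()
state_after_probe = breaker.state
print(state_after_mixed_traffic, probe_allowed, state_after_probe)
```

Under the original reset-on-success logic, the same interleaved traffic would leave the failure count at one forever.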
Self-Healing Is NOT Error-Swallowing#
I need to be clear about this, because the most common misunderstanding about self-healing systems is that they are elaborate try/except: pass blocks.
Self-healing is the opposite of error-swallowing. Error-swallowing hides problems. Self-healing surfaces them, diagnoses them, fixes them, validates the fix, and records the lesson. The system is more observable after self-healing is implemented, not less.
Every self-heal action is logged with its complete decision chain:
- What triggered it (the anomaly or pattern match)
- What it diagnosed (root cause hypothesis)
- What it did (the specific action)
- What it predicted (expected outcome with quantified magnitude)
- What actually happened (validation result)
- What it learned (distilled lesson)
grep "self_heal Rule" /data/research-agent/logs/*.log | wc -l gives me a count of every autonomous intervention. I can audit each one. I can disagree with the system’s choices. I can tighten guard conditions. The kaizen guards (max 5/day, mandatory rollback plans, area freezes after failed fixes) exist specifically to prevent the system from spiraling into unobservable self-modification.
Design principle: self-healing must be more transparent than manual fixing, not less. When I fix something by hand, I might forget to document it. When the system fixes something, it is structurally incapable of not documenting it — the documentation is the mechanism.
The Health Endpoint: Observability as a First-Class Feature#
None of the self-healing above works without a comprehensive view of system state. Before building any auto-fix logic, I built the /health endpoint — a single URL that returns a structured JSON snapshot of the entire research agent’s state: pipeline status, agent activity, fleet worker sync lag, memory usage per process, knowledge graph size and recent growth, experiment throughput, and the self-heal rule firing history for the last 24 hours.
GET http://113.249.102.134:8081/health
{
"pipeline": "running",
"coordinator_rss_mb": 812,
"pipeline_rss_mb": 614,
"graph_nodes": 47532,
"graph_edges": 129968,
"papers_accepted_24h": 3,
"self_heal_recoveries_24h": 2,
"self_heal_failures_24h": 0,
"worker_sync_lag_s": 4.2,
"invariant_violations_open": 0
}
I check this before making any intervention, manual or automated. Kaizen’s observe step samples it every 5 minutes to build its signal history. The invariant_violations_open field is the one that deserves the most attention: a nonzero count means the invariant checker found an inconsistency that no rule has yet resolved. That is the signal that something new is happening.
The /health endpoint is also the primary way I do a sanity check from my laptop without SSH: curl -s http://....:8081/health | jq .pipeline. If it returns “running”, everything at Level 0 is healthy. If it returns “crashed” or the request times out, Level 0 has already handled it or is about to. The speed and specificity of that check are what make the difference between a 5-minute recovery and a 30-minute recovery.
From Reactive to Proactive to Autonomous#
Looking at how the system evolved, three phases are clear.
Phase 1: Reactive (Months 1–2)#
Everything was manual. Supervisor ran systemctl restart on crashed processes. I got DingTalk alerts. I logged in, diagnosed, fixed. Average time-to-recovery: 30–60 minutes, depending on when I noticed.
The system had roughly 5 known failure modes, all handled by “restart the thing.” Adequate when the system was simple — one pipeline, no fleet, no knowledge graph.
Phase 2: Proactive (Months 3–4)#
I started encoding patterns. Instead of just restarting crashed processes, the supervisor could detect pre-crash conditions: memory climbing toward OOM, disk filling, API rate limits approaching. It took preemptive action: garbage collection, log rotation, API backoff.
Self-heal rules grew from 5 to 20+. Each rule had documentation baked in. When I added Rule 15 (restart pipeline if scanner RSS exceeds 2 GB), I wrote down why, what triggered the threshold choice, and what the expected post-restart behavior was. This documentation was not for me — it was for the kaizen proposer to read later.
Time-to-recovery for known failure modes dropped to 1–5 minutes. Most resolved before I noticed anything. Unknown failure modes still required me.
Phase 3: Autonomous (Month 5+)#
The kaizen autopilot moved the system from “fixed repertoire of responses” to “proposes novel responses.” It is not limited to the 40+ rules I wrote. It observes the system, hypothesizes about problems it has not seen before, and proposes fixes — subject to the guard rails.
The most interesting emergent behavior: kaizen started proposing improvements to the self-heal rules themselves. Rule 22 was restarting a process that was actually healthy but slow — high restart count, no improvement in output metrics. Kaizen noticed, proposed relaxing the threshold, applied it, validated that restart count dropped with no degradation in output quality, and distilled a lesson: “Rule 22 threshold was calibrated for a smaller knowledge graph; as the graph grows, query times naturally increase and the old threshold becomes too aggressive.”
This is meta-learning. The system is not just fixing bugs — it is improving its own bug-fixing apparatus.
The Three Hard Rules of Autonomous Modification#
Any system that modifies itself needs hard constraints. Without them, you get a system that “fixes” itself into an unrecoverable state. My rules, learned through failure:
Rule 1: Every Intervention Has a Rollback#
No rollback plan, no apply — enforced at the guard level. For the research agent, rollbacks are git reverts: deterministic and auditable. Early in kaizen’s life, it proposed a schema change with “restore from backup” as the rollback. There was no backup. If the schema change had caused corruption, recovery would have been impossible. Now the adapter requires that any schema-modifying action triggers a backup as part of the rollback specification.
Rule 2: Risk ≥ Medium Requires Canary Infrastructure#
Changes that could affect system stability (process parameters, API routing, pipeline logic) need a canary — a subset of traffic or a shadow mode where the change can be validated without full commitment. If no canary infrastructure exists in that area, the intervention is proposed into backlog but blocked from application. This forces me to build canary capability before the system can self-modify there. The safety constraint is also a product roadmap.
Rule 3: Three Consecutive Rollbacks Freeze the Area for 24 Hours#
If an area receives three consecutive rollbacks, no further interventions are proposed or applied for 24 hours. This prevents thrashing: repeatedly attempting the same kind of fix in an area where the problem is structural and requires human insight. The freeze is the system saying “I have tried and failed three times. I need help.” That is the right behavior.
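The freeze rule is small enough to sketch in full. The AreaFreeze class and its method names are hypothetical, and treating any non-rollback outcome as breaking the streak is an assumption about what "consecutive" means here.

```python
class AreaFreeze:
    """Track consecutive rollbacks per area and freeze an area after three."""

    def __init__(self, max_rollbacks: int = 3, freeze_s: float = 24 * 3600):
        self.max_rollbacks = max_rollbacks
        self.freeze_s = freeze_s
        self.streak = {}         # area -> consecutive rollback count
        self.frozen_until = {}   # area -> timestamp when the freeze lifts

    def record(self, area: str, outcome: str, now: float) -> None:
        if outcome == "reverted":
            self.streak[area] = self.streak.get(area, 0) + 1
            if self.streak[area] >= self.max_rollbacks:
                self.frozen_until[area] = now + self.freeze_s
                self.streak[area] = 0   # start fresh once the freeze lifts
        else:
            self.streak[area] = 0       # assumption: any other outcome resets

    def is_frozen(self, area: str, now: float) -> bool:
        return now < self.frozen_until.get(area, 0.0)


freeze = AreaFreeze()
for ts in (0.0, 10.0, 20.0):
    freeze.record("pipeline_params", "reverted", now=ts)
freeze.record("api_routing", "reverted", now=0.0)
freeze.record("api_routing", "validated", now=5.0)   # breaks the streak
freeze.record("api_routing", "reverted", now=10.0)

print(freeze.is_frozen("pipeline_params", now=21.0))
print(freeze.is_frozen("pipeline_params", now=20.0 + 24 * 3600))
print(freeze.is_frozen("api_routing", now=11.0))
```

The guard would call is_frozen before both proposing and applying, and the transition into the frozen state is the natural place to send the escalation notification.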
The Four-Step Meta-Pattern#
Across all the incidents above, the same pattern recurs. I call it fix → distill → encode → verify.
Fix: you encounter a failure and repair it. The immediate cost is already paid. The fix itself is not the lesson — it is the raw material.
Distill: you extract the lesson — not “what I changed” but “when to detect this, how to diagnose it, why this fix works, when it might not apply.” This step takes 15–20 minutes and most engineers skip it. Skipping it means the next engineer (or the next version of you, six months later) starts from scratch.
Encode: you write the lesson into something the system can execute. A scan rule. A guard condition. A validation threshold. The lesson format matters: when X / prefer Y / because Z / falsifiable by W. The “because Z” clause makes a lesson reusable beyond the original context. The “falsifiable by W” clause makes it testable. Without both, the lesson is a comment, not a rule.
Verify: you run the system with the new rule active and confirm it catches the failure class it was designed for. This can be done by injecting a synthetic instance of the failure (seeding a stale lock file, introducing an out-of-range RSS spike in a test environment) and checking that the rule fires correctly. Without this step, you have faith, not engineering.
The fix for Rule 12 (stale lock sweep) took 2 hours to diagnose and implement. The distill, encode, verify loop took 25 minutes. In the three months since, the rule has fired 47 times automatically, each firing replacing a manual debugging session (about 20 minutes even when I already knew where to look). The extra 25 minutes paid for themselves within the first few firings. Everything after that is pure return.
This is why the Master Principle insists on all three layers. Layer 1 (fix) generates the raw experience. Layer 2 (distill) extracts reusable knowledge from it. Layer 3 (encode) deploys that knowledge as automated capability. Missing any layer breaks the compound interest.
Lessons for Anyone Building Self-Healing Systems#
Start with supervision, not intelligence. Before building an LLM-powered proposer, make sure processes restart when they crash. In my experience, that alone resolves roughly 60% of production issues.
Make failure observable before making it fixable. You cannot heal what you cannot see. Invest in structured health snapshots, error aggregation, and signal time series before building auto-fix logic.
Guard at consumption, not production. In async systems, validate artifacts where they are read, not where they are written. The write might not have completed. The artifact might have been corrupted after writing. The consumer knows what it needs.
Separate liveness from productivity. A process can be alive but idle, or producing output while effectively dead (infinite retry loop with no progress). Liveness probes test the control plane. Productivity probes test the data plane. Do not conflate them.
Encode the WHY, not just the WHAT. The “because Z” clause in a lesson is what lets the system know when the lesson no longer applies. Without it, rules outlive their validity silently.
Accept that self-healing has a cost. Every self-heal rule is code that must be maintained. Every kaizen intervention is a change that could introduce new bugs. Bound the growth: max interventions per day, mandatory validation windows, area freezes. Unbounded self-modification is a liability, not a feature.
The goal is not zero human intervention. The goal is that human intervention addresses novel problems, not recurring ones. Every recurring problem that requires me is a failure of the self-healing system. Every novel problem escalated to me is a success.
Design the escalation path before the fix path. When self-healing fails — three consecutive rollbacks, a guard condition that can’t be satisfied, a prediction that never validates — the system needs to escalate gracefully. A 24-hour area freeze with a DingTalk notification is an escalation path. Kaizen proposing an intervention into backlog with a note “blocked on missing canary infra” is an escalation path. Silent failure is not.
Version your lessons. Lessons that reference specific thresholds (RSS > 2 GB, timeout > 300s, error rate > 5%) become stale as the system grows. Track when each lesson was created, what conditions it was calibrated for, and what would cause it to be retired. Without this, the lesson store becomes a graveyard of outdated rules that the proposer reads and misapplies.
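One way to make the "encode the WHY" and "version your lessons" points concrete is to store each lesson with its reason, its calibration conditions, and an explicit retirement test. The following is a minimal sketch, not the system's actual lesson schema: the `Lesson` fields, the `active_lessons` helper, and every value in the example lesson (date, thresholds, wording) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Lesson:
    rule: str                        # the WHAT: the action to take
    because: str                     # the WHY: lets a reader judge when it stops applying
    created: date                    # when the lesson was written
    calibrated_for: Dict[str, float]  # conditions that held when thresholds were chosen
    retire_when: Callable[[Dict[str, float]], bool]  # True once the lesson is stale


def active_lessons(lessons: List[Lesson], current: Dict[str, float]) -> List[Lesson]:
    """Keep only lessons whose retirement condition does not hold right now."""
    return [lesson for lesson in lessons if not lesson.retire_when(current)]


# Hypothetical example: a threshold tied to the size of the host it was tuned on.
rss_lesson = Lesson(
    rule="Restart the pipeline when RSS exceeds 2 GB",
    because="the threshold was tuned on a 7.5 GB host shared by all agents",
    created=date(2026, 5, 1),
    calibrated_for={"host_ram_gb": 7.5},
    retire_when=lambda env: env["host_ram_gb"] >= 15.0,
)

# On the original host the lesson is offered to the proposer; on a larger one it is not.
print(len(active_lessons([rss_lesson], {"host_ram_gb": 7.5})))
print(len(active_lessons([rss_lesson], {"host_ram_gb": 16.0})))
```

Filtering before the proposer reads the store means a stale threshold is never seen at all, rather than being seen and misapplied.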
What Comes Next: Cross-System Learning
The research agent currently distills lessons from its own interventions. The next step is cross-system learning — a lesson learned from a failure on one system propagates to other systems via the kaizen DaaS chain protocol. The architecture: each host maintains an outbox of signed lessons (Ed25519-signed JSON) in a shared GitHub branch. Every 15 minutes, each host pushes its new lessons and pulls its peers’. The proposer reads the inbox and checks whether any imported lesson applies to local code.
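The inbox side of this exchange can be sketched as follows. This is an illustration under stated assumptions, not the kaizen protocol itself: the envelope layout (`host`, `lesson`, `sig`), the function names, and the demo key are all hypothetical. Python's standard library has no Ed25519 support, so the demo signer below is an HMAC-SHA256 stand-in; a real deployment would plug an Ed25519 library into the same `sign` and `verify` slots.

```python
import hashlib
import hmac
import json
from typing import Callable, Dict, List, Set

Verifier = Callable[[bytes, str, str], bool]  # (payload, signature_hex, host) -> ok


def canonical(lesson: Dict) -> bytes:
    """Serialize deterministically so signer and verifier see identical bytes."""
    return json.dumps(lesson, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_envelope(lesson: Dict, host: str, sign: Callable[[bytes], str]) -> Dict:
    """Wrap a lesson for the outbox together with its origin and signature."""
    return {"host": host, "lesson": lesson, "sig": sign(canonical(lesson))}


def pull_inbox(envelopes: List[Dict], local_host: str,
               seen: Set[str], verify: Verifier) -> List[Dict]:
    """Return newly accepted peer lessons; skip our own, duplicates, bad signatures."""
    accepted = []
    for env in envelopes:
        payload = canonical(env["lesson"])
        digest = hashlib.sha256(payload).hexdigest()
        if env["host"] == local_host or digest in seen:
            continue
        if not verify(payload, env["sig"], env["host"]):
            continue  # tampered or unknown signer: never hand it to the proposer
        seen.add(digest)
        accepted.append(env["lesson"])
    return accepted


# Demo-only stand-in for Ed25519: HMAC-SHA256 with a shared, hypothetical key.
KEYS = {"daas": b"demo-key-daas"}


def demo_sign(payload: bytes) -> str:
    return hmac.new(KEYS["daas"], payload, hashlib.sha256).hexdigest()


def demo_verify(payload: bytes, sig: str, host: str) -> bool:
    key = KEYS.get(host)
    if key is None:
        return False
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Tracking a digest of each accepted lesson in `seen` keeps the 15-minute pull idempotent: re-reading the same shared branch yields no new work for the proposer.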
As of 2026, two rounds of cross-system learning have been verified:
Round 1: a DaaS system produced a lesson about JSON silent loss (a write that appeared to succeed but truncated the file). The lesson propagated to the research agent’s inbox. The proposer audited lib/pro_gate.py:81 and found the identical pattern — a file write without flush confirmation. A fix was seeded automatically.
Round 2: the same lesson also propagated to AI4Marketing. The proposer audited kaizen’s own intervention state file on that host and found the same bug — in kaizen’s own code. Meta-fragility: the self-healing system had the same vulnerability it was teaching other systems to avoid. The fix was proposed, cleared guards, and applied.
This is the compounding return at its most recursive. A lesson generated from a DaaS failure, propagated across two other systems, found a bug in the self-healing system itself. The loop is: one human fix → one lesson → three systems patched.
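The "file write without flush confirmation" pattern from Round 1 has a common remedy: write to a temporary file, flush and fsync it, then rename it over the target, and have the reader validate what it loads. The sketch below shows that general technique under assumptions. It is not the fix that was applied to lib/pro_gate.py, and the function names and the `required_keys` check are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Set


def write_json_durably(path: str, data: Any) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push it to disk
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        # Clean up the partial temp file; the original target is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_json_guarded(path: str, required_keys: Set[str]) -> dict:
    """Consumer-side guard: reject truncated or incomplete state."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError on a truncated file
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"state file {path} is missing keys: {sorted(missing)}")
    return data
```

The writer covers the producer side, while `read_json_guarded` follows the "guard at consumption" lesson: even if some other writer skips the fsync, the consumer refuses a truncated file instead of silently working with partial state.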
The deeper question is where the ceiling is. The system is demonstrably better at diagnosing failures than it was six months ago. It is improving faster than I am improving it, because it now generates more lessons per day than I could write manually. I do not know where that trajectory ends. I do know the current constraints: the proposer’s context window, the quality of the falsifiability conditions, and the coverage of the signal catalog. Those are the bottlenecks to work on next.
What It Means to Teach a Machine
I began this essay with a system that fixed a memory leak while I slept. The philosophical frame behind it: traditional software engineering treats code as a tool. You write it, it does what you told it, you fix it when it breaks. One-directional.
Self-healing systems change part of that relationship. The code observes itself, proposes changes, and executes them — within a framework of constraints, validation, and lesson accumulation that a human designed. The system is not autonomous in the sense of “does whatever it wants.” It is autonomous in the sense of “handles routine problems so the human can focus on hard ones.”
The right metaphor is not a robot replacing a human. It is an apprentice learning from a master. The master (me) fixes problems and, in doing so, teaches the apprentice (the system) how to recognize and fix that class of problem. The apprentice handles more and more of the routine work. The master’s role shifts from “doing” to “teaching” to “auditing.”
This is what the Master Principle means at its deepest level. Every time I fix a bug, I am not just fixing it — I am teaching. The quality of my teaching determines how capable my apprentice becomes. A vague lesson (“something was wrong with async”) produces a fragile rule. A precise lesson (“in async pipelines, guard at the consumer boundary because the producer’s write may not have flushed”) produces a rule that generalizes.
The aspiration: I disappear for a week and come back to a system healthier than when I left — not because nothing went wrong, but because everything that went wrong was handled, validated, documented, and learned from. We are not there yet. But every self-heal rule, every kaizen lesson, every guard validation is a step. And the system is learning faster than it was six months ago, because it is now learning from its own learning.
Your code should be your apprentice, not just your tool. Teach it the reasoning, not just the action.
This is Part 4 of Product Thinking (5 parts in total). Previous: Part 3 — UX & Design Systems · Next: Part 5 — Abstraction Thinking