
OpenClaw QuickStart (10): Production Deploy and the Failure Modes Nobody Warns You About
ECS, pm2, nginx, acme.sh — plus eight production failure modes and their fixes.
The local install gets you to ‘it works on my machine.’ The server install ensures it survives a kernel update.
This chapter walks through the deployment I use on a 2-core 4GB ECS box and the common failures I’ve documented.

Choosing your server#

Before deploying, choose the right server. Consider these four options:
Alibaba Cloud ECS: The path I use. A 2-core 4GB instance in the cn-beijing region costs around $15/month. The advantage is proximity to DashScope: when the gateway and the model sit in the same region, API round-trip latency drops from about 200ms to about 20ms. The disadvantage is that the Great Firewall can make outbound package installs flaky unless you set up a mirror.
DigitalOcean: Simplest onboarding. The $18/month “Basic” droplet (2vCPU, 4GB) is functionally identical to ECS. The dashboard is cleaner, the documentation is better, and package mirrors are unnecessary. The tradeoff is increased latency—150ms per turn—if your model provider is Alibaba or Tencent.
Hetzner: Best price-to-performance. A CX21 instance (2vCPU, 4GB, Nuremberg) runs $5.83/month. The catch is that Hetzner’s network is optimized for Europe, so if your users or model endpoints are in Asia, you’ll experience higher latency.
Home server: Free hardware, full control, and unlimited disk space. The downsides include uptime (your ISP doesn’t guarantee five nines), dynamic IPs (you’ll need DDNS), and port forwarding (which your router may block for inbound 443). A home server is suitable for prototyping but not for production unless you’re the only user.
Minimum spec: 2-core 4GB. Why? The gateway uses 200MB of resident memory, but model responses buffer in RAM before streaming to the client. When a sub-agent forks, the OS temporarily duplicates the parent process. I’ve seen a 2GB instance run out of memory during a code review task that spawned three sub-agents in parallel. 4GB provides headroom, while 8GB eliminates the problem entirely.
The deploy#
OS: Ubuntu 22.04. RAM: 4GB matters here; 2GB is enough for a single agent but runs short as soon as a sub-agent spawns.
| |
pm2 supervises the Gateway to prevent crashes from taking you down:
| |
For the Web Dashboard (port 18789), use nginx with certs from acme.sh:
| |
The 600-second read timeout is necessary—long-running agent turns will exceed the nginx default of 60 seconds and fail mid-stream.
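To make the proxy settings concrete, here is a minimal sketch that renders an nginx server block from Python. Only port 18789 and the 600-second read timeout come from the text above. The server name, the certificate paths, and the WebSocket upgrade headers are assumptions for illustration, not the config from my box.

```python
from string import Template

# Hypothetical template. "$$" escapes a literal "$" so nginx variables survive.
NGINX_TEMPLATE = Template("""\
server {
    listen 443 ssl;
    server_name $server_name;

    ssl_certificate     $cert_path;
    ssl_certificate_key $key_path;

    location / {
        proxy_pass http://127.0.0.1:$upstream_port;
        proxy_http_version 1.1;
        proxy_set_header Host $$host;
        # Assumption: the dashboard streams over an upgraded connection.
        proxy_set_header Upgrade $$http_upgrade;
        proxy_set_header Connection "upgrade";
        # Long agent turns outlive the 60-second nginx default.
        proxy_read_timeout ${read_timeout}s;
    }
}
""")


def render_nginx_conf(server_name, cert_path, key_path,
                      upstream_port=18789, read_timeout=600):
    """Return the server block as a string; writing it to disk is up to you."""
    return NGINX_TEMPLATE.substitute(
        server_name=server_name,
        cert_path=cert_path,
        key_path=key_path,
        upstream_port=upstream_port,
        read_timeout=read_timeout,
    )
```

Pass in whatever paths acme.sh installed your certificate and key to, then run nginx -t on the result before reloading.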
Docker alternative#
If you prefer containers, OpenClaw supports Docker. Here is the compose file I use for multi-service deployments:
| |
Three volumes are essential:
- config/ holds API keys, channel credentials, and model endpoints. Without this mount, each container restart erases your setup.
- workspace/ contains MEMORY.md and session logs. Losing this means the agent forgets everything between restarts.
- skills/ stores custom skills. If you mount this read-only, the agent can read skills but can’t write new ones during self-improvement.
When Docker is better: You run OpenClaw alongside other services (a database, a vector store, a monitoring stack) and you want one docker-compose up to bring everything online. The isolation also makes it easier to test config changes — spin up a second container with a different config, compare behavior, kill the worse one.
When Docker is worse: Debugging file permissions — the container runs as root, your host files may not be, and you will spend time with chown. Inspecting logs requires docker exec or volume mounts. Hot-reloading skills during development is slower because the filesystem sync has a delay. For a single-service deploy where you ssh into the box and tail logs directly, bare-metal is simpler.
The eight failures#

command not found: openclaw after reboot#
nvm doesn’t load in non-interactive shells. pm2 startup uses one. Either source nvm in /etc/profile.d/, or symlink the binary:
| |
Detection: pm2 status shows the gateway in errored state with exit code 127 immediately after boot.
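You can check for this failure before rebooting by resolving the binary against the PATH a boot-time shell is likely to see. A minimal sketch; the BOOT_PATH value is an assumption (a typical Ubuntu default), so compare it with your own system.

```python
import shutil

# Assumed PATH for a non-interactive boot shell. nvm's directory is absent on purpose.
BOOT_PATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"


def resolve_under_path(binary, path=BOOT_PATH):
    """Return the absolute path of `binary` on `path`, or None if it is not found."""
    return shutil.which(binary, path=path)
```

If resolve_under_path("openclaw") returns None, pm2 will hit the same exit code 127 after the next reboot. Once the symlink or the profile.d fix is in place, it should return a path.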
Node.js version too old#
OpenClaw needs Node.js >= 22.16. The error message is clear; the real mistake is running whatever Node version your distro ships.
Detection: Gateway fails to start, and pm2 logs openclaw-gateway --err shows a version mismatch in the first three lines.
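A pre-flight check catches this before pm2 does. The sketch below parses the output of node --version and compares it with the 22.16 minimum from the text. The function names are mine, and pre-release version strings are deliberately rejected rather than guessed at.

```python
import re

MIN_NODE = (22, 16)  # minimum stated in the text


def parse_node_version(output):
    """Parse `node --version` output such as 'v22.16.0' into (major, minor, patch)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", output.strip())
    if match is None:
        raise ValueError(f"unrecognized Node version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def node_is_new_enough(output, minimum=MIN_NODE):
    """Compare only major and minor, since the requirement is stated as 22.16."""
    return parse_node_version(output)[:2] >= minimum
```

Feed it the stdout of subprocess.run(["node", "--version"], capture_output=True, text=True), and run it as the same user and shell that pm2 uses, since that is the Node that counts.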
401 Unauthorized from DashScope#
Two causes: Coding Plan key used against the wrong endpoint, or key rotated and not replaced.
Detection: Every agent turn fails instantly with a 401 in the response. Check ~/.openclaw/agents/main/sessions/*.jsonl — if the last line of every session is an auth error, it’s your key.
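The last-line check is easy to script. A minimal sketch, assuming the final record of a failed session contains the text "401" somewhere; the session record schema is not specified here, so treat the substring match as a placeholder for whatever your logs actually contain.

```python
from pathlib import Path


def last_nonempty_line(path):
    """Return the last non-blank line of a text file, or '' if there is none."""
    last = ""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                last = line.strip()
    return last


def sessions_ending_in_auth_error(sessions_dir, marker="401"):
    """List session files whose final record mentions `marker` (assumed format)."""
    hits = []
    for path in sorted(Path(sessions_dir).expanduser().glob("*.jsonl")):
        if marker in last_nonempty_line(path):
            hits.append(path.name)
    return hits
```

Point it at ~/.openclaw/agents/main/sessions. If every recent session shows up in the result, suspect the key before anything else.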
Connection refused on Gateway start#
Port 18789 is taken:
| |
Detection: pm2 logs openclaw-gateway --err shows EADDRINUSE within the first second of startup. The gateway never reaches the “listening on 18789” log line.
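Before starting the gateway you can probe the port yourself. This sketch only answers whether something is already accepting connections on 18789; it does not tell you which process owns the port, for which you still need a tool such as lsof or ss.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0
```

port_in_use(18789) returning True while pm2 shows the gateway as errored points at a stale process or another service squatting on the port.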
DingTalk goes silent after 30 minutes#
The long-poll connection is being torn down by an upstream NAT:
| |
Detection: Gateway log shows [dingtalk] reconnecting... more than once per hour. Users report “the bot stopped responding” but manual messages from the web dashboard still work.
The agent forgets things mid-conversation#
Compaction ran without memoryFlush enabled. Set memoryFlush.enabled: true.
Detection: A multi-turn conversation suddenly loses context after turn 15. Check session length: cat ~/.openclaw/agents/main/sessions/<session-id>.jsonl | wc -l. If it’s exactly 20 lines (the default compaction threshold), compaction discarded turns instead of summarizing.
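To scan every session at once rather than one file at a time, here is a minimal sketch that flags session files sitting exactly at the threshold. The value 20 is the default named above; the function names are mine.

```python
from pathlib import Path

DEFAULT_COMPACTION_THRESHOLD = 20  # default stated in the text


def count_records(path):
    """Count lines in a file. Unlike wc -l, a final line without a newline still counts."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def sessions_at_threshold(sessions_dir, threshold=DEFAULT_COMPACTION_THRESHOLD):
    """Return the names of session files whose line count equals the threshold."""
    root = Path(sessions_dir).expanduser()
    return [path.name for path in sorted(root.glob("*.jsonl"))
            if count_records(path) == threshold]
```

A handful of hits can be coincidence. Many sessions pinned at exactly the threshold suggests compaction is discarding turns.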
Token consumption is way too high#
Three reasons: expensive default model, bloated MEMORY.md, or sub-agents spawning for trivial tasks.
Detection: Your bill doubles week-over-week despite stable usage. Run openclaw stats tokens --since 7d and compare the per-turn average. If it climbs above 8k tokens/turn for a conversational agent, something is wrong. Also check the length of MEMORY.md: wc -l ~/.openclaw/workspace/MEMORY.md. Anything above 500 lines is a red flag.
Memory grows unbounded#
What happens when you never archive sessions: MEMORY.md bloats past 100 lines, then 200, then 500. Every agent turn now includes half a kilobyte of irrelevant context (“three weeks ago the user asked about Docker”). Startup slows because the workspace loader parses the entire memory file on boot. At 1000 lines, startup takes 30 seconds. At 2000 lines, the agent begins timing out mid-turn because the context window is 80% memory and 20% actual task.
Fix: Automate the weekly cleanup. Add a cron job that archives old sessions and trims MEMORY.md:
| |
The archive command moves sessions older than 30 days into a .archive/ subdirectory (still readable, just not loaded by default). The compact command uses the LLM to summarize MEMORY.md down to 100 lines, preserving the most important facts and discarding low-value details.
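To show what "archive" means in practice, here is a rough Python equivalent of the behavior described above. It is not OpenClaw's implementation: using file modification time as the session's age and the *.jsonl pattern are both assumptions.

```python
import shutil
import time
from pathlib import Path


def archive_old_sessions(sessions_dir, older_than_days=30, now=None):
    """Move *.jsonl files not modified in `older_than_days` days into .archive/."""
    root = Path(sessions_dir).expanduser()
    archive = root / ".archive"
    cutoff = (time.time() if now is None else now) - older_than_days * 86400
    moved = []
    for path in sorted(root.glob("*.jsonl")):
        if path.stat().st_mtime < cutoff:
            archive.mkdir(exist_ok=True)  # created lazily, only when needed
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return moved
```

The files stay on disk and stay readable; they just leave the directory that gets loaded by default.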
Detection: openclaw gateway takes more than 10 seconds to print “Gateway listening on 18789”. Or check the line count directly: wc -l ~/.openclaw/workspace/MEMORY.md. Anything above 300 lines warrants a manual review. Above 500 lines, compact immediately.
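The two thresholds are simple enough to encode in a check you run from cron. A minimal sketch; the status labels are mine, and the cutoffs are the 300 and 500 lines from the text, read as strict "above".

```python
REVIEW_ABOVE = 300   # lines; manual review past this point
COMPACT_ABOVE = 500  # lines; compact immediately past this point


def count_lines(path):
    """Count lines in a text file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def memory_file_status(line_count):
    """Map a MEMORY.md line count to 'ok', 'review', or 'compact'."""
    if line_count > COMPACT_ABOVE:
        return "compact"
    if line_count > REVIEW_ABOVE:
        return "review"
    return "ok"
```

Call memory_file_status(count_lines(path)) on your MEMORY.md and alert on anything other than "ok".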
Upgrade path#
OpenClaw moves fast. New features ship weekly, and occasionally a release changes the config schema or deprecates a skill field. Here is how to upgrade safely:
- Check the changelog: openclaw changelog --since <current-version>. Look for breaking changes, deprecated fields, or new required config keys.
- Back up your config: cp -r ~/.openclaw/config ~/.openclaw/config.backup. If the upgrade breaks, you can restore in ten seconds.
- Upgrade the binary: npm i -g openclaw@latest. This pulls the new version but does not restart anything.
- Restart the gateway: pm2 restart openclaw-gateway. Watch the logs for the first 60 seconds: pm2 logs openclaw-gateway --lines 100. If you see repeated errors or a crash loop, roll back: npm i -g openclaw@<old-version> && pm2 restart openclaw-gateway.
- Verify health: Hit the dashboard at https://agent.example.com and send a test message. Check that skills load, memory persists, and the agent responds coherently.
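If you would rather script the backup, upgrade, restart, and rollback steps than type them at 2 AM, they can be sketched as a small dry-run-by-default wrapper. The commands are the ones listed above; the function names and the dry_run flag are mine, and the changelog check and dashboard test stay manual.

```python
import os
import subprocess

GATEWAY = "openclaw-gateway"  # pm2 process name used throughout the text


def upgrade_commands(version="latest"):
    """Backup, upgrade, restart: the mechanical steps, in order."""
    config = os.path.expanduser("~/.openclaw/config")
    return [
        ["cp", "-r", config, config + ".backup"],
        ["npm", "i", "-g", f"openclaw@{version}"],
        ["pm2", "restart", GATEWAY],
    ]


def rollback_commands(old_version):
    """Reinstall the previous version and restart the gateway."""
    return [
        ["npm", "i", "-g", f"openclaw@{old_version}"],
        ["pm2", "restart", GATEWAY],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failure
```

One thing to watch: if config.backup already exists, cp -r nests the new copy inside it instead of replacing it, so remove or rename the old backup first.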
What breaks during upgrades:
- Config schema changes: A field gets renamed (model.name becomes model.id), and the gateway fails to parse your config. The error message usually tells you which field is invalid. Fix: update the config, restart.
- Deprecated skill fields: Your custom skill uses skill.parameters but the new version expects skill.input. The skill loader throws a validation error. Fix: regenerate the skill with openclaw skill create or manually update the schema.
- Dependency conflicts: Rare, but it happens. A new OpenClaw version needs a library that conflicts with something else you installed globally. Symptom: the upgrade succeeds, but the gateway crashes on startup with a module resolution error. Fix: use nvm to isolate Node environments, or run OpenClaw in Docker to avoid global installs entirely.
Golden rule: Never upgrade during peak hours. Do it at 2 AM on a Sunday, when a five-minute outage is invisible.
Monitoring and alerting#
A production service you cannot monitor is a production service you do not control. Three layers:
Health check#
Set up a cron job that pings the gateway every five minutes and alerts if it goes down:
| |
Add to cron:
| |
If you run multiple services, use a real monitoring stack (Prometheus, Grafana, Uptime Kuma). But for a single-agent deploy, a bash script and a mail command are enough.
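The same check fits in a few lines of Python if you prefer that to bash. A minimal sketch, assuming the gateway answers plain HTTP on the root path of port 18789; the URL path and the "any status below 500 counts as up" rule are my assumptions, not documented behavior.

```python
import http.client
import urllib.error
import urllib.request

# Ignore any http_proxy environment settings: this probe is meant for localhost.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def gateway_is_up(url="http://127.0.0.1:18789/", timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        # The server answered; treat only 5xx as unhealthy.
        return err.code < 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, timed out, unreachable, or a malformed response.
        return False
```

In a cron script, exit non-zero when gateway_is_up() is False and let the wrapper send the alert.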
Log rotation#
The gateway writes to stdout, pm2 captures it, and without rotation, ~/.pm2/logs/openclaw-gateway-out.log grows forever. After three months, it hits 2GB and fills your disk.
Create /etc/logrotate.d/openclaw:
| |
This keeps seven days of logs, compresses old ones, and tells pm2 to re-open the log file after rotation.
Disk space alerts#
The agent writes session logs to ~/.openclaw/agents/main/sessions/. If you run a popular agent, this directory grows at 10MB/day. After a year, it’s 3.6GB. If your server has a 20GB root partition, you will eventually fill it.
Add a disk space check to the same healthcheck script:
| |
When the alert fires, either archive old sessions (openclaw memory archive --older-than 90d) or expand the disk.
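A disk check plus the growth arithmetic above can be sketched as follows. The 80% threshold is an assumption; pick whatever gives you enough lead time. Note that df computes its percentage slightly differently, so the two numbers can differ by a point or two.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Used space as a percentage of the whole filesystem."""
    return 100.0 * used_bytes / total_bytes


def disk_alert(path="/", threshold_percent=80.0):
    """Return (should_alert, percent_used) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct >= threshold_percent, pct


def days_until_full(free_bytes, growth_bytes_per_day):
    """Rough runway at a constant growth rate."""
    return free_bytes / growth_bytes_per_day
```

At the 10MB/day rate from the text, days_until_full(free, 10 * 1024**2) tells you how long you can ignore the alert, which is usually less time than you think.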
The “is it healthy” five-liner#
| |
Summary#
A production OpenClaw is not a clever piece of software; it is a boring piece of software that has been left running for thirty days. Do the deploy by the book, set up the supervisor, put a stable proxy in front, fix the eight failures above before they hit you, automate the monitoring, and you’ll get there.
That’s the end of the QuickStart. From here the path forks — into custom skills, custom channels, custom MCP servers, multi-agent topologies. All of it builds on the foundations these ten pieces laid down.