OpenClaw QuickStart (10): Production Deploy and the Failure Modes Nobody Warns You About
Putting OpenClaw on a real server: an ECS box, pm2 as the supervisor, nginx in front, certs renewed by acme.sh. Then the long tail — the seven failures I have seen at least twice in production, and what each one actually was.
The local install gets you to “it works on my machine.” The server install is what makes it survive a kernel update.
This chapter walks through the deploy I actually use on a 2-core 4G ECS box, then the failures I have seen often enough to put in writing.
The deploy
OS: Ubuntu 22.04. The 4G of RAM matters — 2G works for one agent and chokes the moment a sub-agent spawns.
pm2 supervises the Gateway so that crashes don’t take you down:
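A minimal sketch of that supervision setup, assuming the Gateway is launched with the openclaw gateway command mentioned later in this chapter. The process name openclaw-gateway and the wrapper function supervise_gateway are placeholders of mine, not OpenClaw conventions.

```shell
# Sketch: hand the Gateway to pm2. The process name is a placeholder.
supervise_gateway() {
  # pm2 runs the quoted command line as a managed process and restarts it when it exits
  pm2 start "openclaw gateway" --name openclaw-gateway || return 1
  # Persist the current process list so it can be restored after a reboot
  pm2 save
}
```

Call supervise_gateway once, then run pm2 startup and execute the command it prints so the saved process list comes back on boot.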
For the Web Dashboard (port 18789), nginx in front with certs from acme.sh:
The 600s read timeout is not optional — long-running agent turns will exceed the nginx default of 60s and fail mid-stream.
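A proxy block along these lines would cover the points above. It is a sketch under assumptions: example.com and the certificate paths are placeholders (use whatever paths you told acme.sh to install to), and the Upgrade headers assume the dashboard talks over WebSocket.

```nginx
# Sketch only: example.com and the certificate paths are placeholders.
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/nginx/ssl/example.com/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/example.com/privkey.pem;

    location / {
        # Forward to the Web Dashboard on the local port
        proxy_pass http://127.0.0.1:18789;
        proxy_http_version 1.1;

        # Pass WebSocket upgrades through (assumes the dashboard uses them)
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;

        # Long agent turns outlive the 60s default
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
    }
}
```

Run nginx -t before reloading so a typo in the block does not take the proxy down.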
The seven failures
1. command not found: openclaw after reboot
nvm doesn’t load in non-interactive shells. pm2 startup uses one. Either source nvm in /etc/profile.d/, or symlink the binary:
I prefer the symlink. Less magic.
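One way to write the symlink step, as a sketch. link_into_path is a hypothetical helper name; it resolves the binary in the current shell (where nvm is loaded) and links it into a directory that non-interactive shells already search.

```shell
# Sketch: expose a binary from an nvm-managed PATH in a fixed directory.
# link_into_path is a hypothetical helper, not part of OpenClaw or nvm.
link_into_path() {
  name=$1
  dest_dir=${2:-/usr/local/bin}
  # Resolve the binary in this shell, where nvm has already adjusted PATH
  src=$(command -v "$name") || { echo "$name not found on PATH" >&2; return 1; }
  # -f replaces a stale link left over from an older Node version
  ln -sf "$src" "$dest_dir/$name"
}
```

Writing to /usr/local/bin needs root. The one-line equivalent is sudo ln -sf "$(command -v openclaw)" /usr/local/bin/openclaw, where the substitution still runs in your own shell. If the openclaw launcher locates Node through env, node probably needs the same link.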
2. Node.js version too old
OpenClaw needs ≥ 22.16. The error is clear; the real bug is using whatever Node your distro ships. Pin Node 22 and check node -v in the same shell pm2 will use.
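A small guard for that check, sketched under one assumption: sort -V is available (it is part of GNU coreutils on Ubuntu). The function names are mine.

```shell
# Sketch: fail fast when Node is older than the 22.16 minimum stated above.
# Relies on `sort -V` from GNU coreutils.
version_ge() {
  # True when version $1 >= version $2: the smaller of the two must be $2
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_node() {
  current=$(node -v 2>/dev/null) || { echo "node not found" >&2; return 1; }
  # node -v prints something like v22.16.0; strip the leading "v"
  version_ge "${current#v}" 22.16.0
}
```

Run check_node from the non-interactive shell pm2 will use, not only from your login shell. The two can resolve different Node binaries.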
3. 401 Unauthorized from DashScope
Two real causes:
- The Coding Plan key (sk-sp-...) was used against the standard DashScope base URL. They’re different endpoints: Coding Plan goes to https://coding.dashscope.aliyuncs.com/v1, regular DashScope to https://dashscope.aliyuncs.com/compatible-mode/v1.
- The key was leaked, rotated, and not replaced. Check the dashboard.
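The first cause is easy to guard against in a deploy script. This sketch picks the base URL from the key prefix; the sk-sp- rule and both URLs come from the list above, while dashscope_base_url is a hypothetical helper name.

```shell
# Sketch: choose the DashScope base URL from the key prefix.
# dashscope_base_url is a hypothetical helper name.
dashscope_base_url() {
  case "$1" in
    # Coding Plan keys go to the coding endpoint
    sk-sp-*) echo "https://coding.dashscope.aliyuncs.com/v1" ;;
    # Everything else is treated as a regular DashScope key
    *)       echo "https://dashscope.aliyuncs.com/compatible-mode/v1" ;;
  esac
}
```

It only catches the endpoint mismatch. A rotated key still returns 401 against the right URL.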
4. Connection refused on Gateway start
Port 18789 is taken. Find the squatter and kill it:
If 18789 is taken by another OpenClaw, you forgot you ran openclaw gateway outside pm2 earlier today. Stop it, let pm2 own it.
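To see who holds the port, something like the following works, assuming lsof is installed (ss -ltnp is an alternative on a minimal image). port_owner is a hypothetical wrapper name.

```shell
# Sketch: show which process is listening on the Gateway port.
port_owner() {
  port=${1:-18789}
  # -n and -P skip host and port name lookups; -sTCP:LISTEN keeps only listeners
  lsof -nP -iTCP:"$port" -sTCP:LISTEN
}
```

The PID column tells you what to stop. Processes owned by another user only show up when you run it as root.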
5. DingTalk goes silent after 30 minutes
The long-poll connection is being torn down by an upstream NAT. Two fixes that compound:
If you control the network, also pin the egress to a single IP. Rotating egresses are the actual root cause more often than not.
6. The agent forgets things mid-conversation
Compaction ran without memoryFlush enabled. See chapter 7, then set memoryFlush.enabled: true and a sensible softThresholdTokens. This single config line is what turns “it forgot” into “it remembered.”
7. Token consumption is way too high
Three reasons, in descending order of likelihood:
- Every turn is using your most expensive model. Use a tiered config: qwen3.5-flash for routing, qwen3-max only when the task needs it.
- MEMORY.md ballooned past 40 lines and is being loaded every turn. Audit it.
- Sub-agents are being spawned for trivial tasks. Inline them.
If none of those: turn on the per-turn token log and look at the actual breakdown. It’s almost never what you guessed.
The “is it healthy” five-liner
I run this every morning the bot exists:
pm2 status confirms the supervisor is happy. openclaw doctor runs the built-in checks. wc -l on MEMORY.md catches creep. The session count tells me whether to archive. df -h catches the disk filling up from logs — which it will, eventually.
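A sketch of what such a check could look like, built from the five commands just described. Both default paths are assumptions about where the workspace and session files live; override MEMORY_FILE and SESSIONS_DIR to match your install.

```shell
# Sketch of the morning check. Both default paths are assumptions.
MEMORY_FILE=${MEMORY_FILE:-$HOME/.openclaw/workspace/MEMORY.md}
SESSIONS_DIR=${SESSIONS_DIR:-$HOME/.openclaw/sessions}

health_check() {
  pm2 status                             # supervisor state
  openclaw doctor                        # built-in checks
  wc -l < "$MEMORY_FILE"                 # line count of the memory file
  find "$SESSIONS_DIR" -type f | wc -l   # number of session files
  df -h /                                # free space on the root filesystem
}
```

The two bare numbers in the output are the MEMORY.md line count and the session file count, in that order.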
Closing
A production OpenClaw is not a clever piece of software; it is a boring piece of software that has been left running for thirty days. Do the deploy by the book, set up the supervisor, put a stable proxy in front, fix the seven failures above before they hit you, and you’ll get there.
That’s the end of the QuickStart. From here the path forks — into custom skills, custom channels, custom MCP servers, multi-agent topologies. All of it builds on the foundations these ten pieces laid down.