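OpenClaw QuickStart (10): Production Deployment, and the Failure Modes Nobody Warns You About

A local install gets you "works on my machine." A server install gets you "survives a kernel update."

This post first walks through my own deployment on a 2-core, 4 GB ECS instance, then covers the failures that are common enough to be worth writing down.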

Deployment

OS: Ubuntu 22.04. Don't skimp on the 4 GB of RAM: 2 GB is fine for a single Agent, but it gets squeezed out of memory as soon as sub-Agents spin up.
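The memory requirement is easy to check before installing anything. The following is a minimal preflight sketch, not part of OpenClaw: the function names and the 3500 MB threshold are my own assumptions (a "4 GB" instance reports somewhat less than 4096 MB once the kernel takes its share).

```shell
#!/usr/bin/env bash
# Print total RAM in MB, read from a meminfo-format file (default: /proc/meminfo).
mem_total_mb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed when total RAM is at least MIN_MB.
# Assumption: 3500 MB is a reasonable floor for a nominal 4 GB machine.
enough_ram() {
  local min_mb="${1:-3500}" file="${2:-/proc/meminfo}"
  [ "$(mem_total_mb "$file")" -ge "$min_mb" ]
}

# Usage:
#   enough_ram || echo "Less than ~4 GB of RAM: sub-Agents will likely be killed" >&2
```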

# 1. Install Node 22 with nvm; the Node that most distros ship is too old
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
. ~/.nvm/nvm.sh
nvm install 22 && nvm use 22

# 2. Install OpenClaw and pm2 globally
npm i -g openclaw@latest pm2

# 3. Initialize the workspace
openclaw init
# Edit ~/.openclaw/openclaw.json and set at least the model provider

pm2 supervises the Gateway, so a crash doesn't take you down with it:

pm2 start "openclaw gateway" --name openclaw-gateway --time
pm2 save
pm2 startup           # run the sudo line it prints

The Web Dashboard listens on 18789. Put nginx in front of it and get certificates through acme.sh:

server {
  listen 443 ssl http2;
  server_name agent.example.com;

  ssl_certificate     /etc/nginx/ssl/agent.example.com.cer;
  ssl_certificate_key /etc/nginx/ssl/agent.example.com.key;

  location / {
    proxy_pass http://127.0.0.1:18789;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 600s;
  }
}

That 600s read timeout is not optional: a single turn from a long-running Agent can exceed nginx's default of 60s, and at that point the connection gets cut.
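The certificate files referenced in the server block have to come from somewhere. One way to produce them with acme.sh is sketched below. The `run` and `install_cert` helpers are hypothetical names of mine; the acme.sh flags are its standard ones. It assumes acme.sh's nginx mode, which needs a port-80 server block for the domain to already exist; standalone or DNS validation would work as well.

```shell
#!/usr/bin/env bash
# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Issue a certificate, then copy it to the paths the nginx server block reads
# and reload nginx whenever the certificate is renewed.
install_cert() {
  local domain="$1" dir="${2:-/etc/nginx/ssl}"
  run acme.sh --issue -d "$domain" --nginx &&
  run acme.sh --install-cert -d "$domain" \
    --fullchain-file "$dir/$domain.cer" \
    --key-file "$dir/$domain.key" \
    --reloadcmd "systemctl reload nginx"
}

# Usage:
#   DRY_RUN=1 install_cert agent.example.com   # inspect the commands first
#   install_cert agent.example.com             # then run them for real
```

After any change to the server block, `sudo nginx -t` validates the configuration before a reload.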

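Seven Failures

1. `command not found: openclaw` after a reboot

nvm is not loaded automatically in non-interactive shells, and that is exactly the kind of shell pm2 startup uses. Either add nvm to /etc/profile.d/, or symlink the binaries out: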

sudo ln -sf $(which openclaw) /usr/local/bin/openclaw
sudo ln -sf $(which node) /usr/local/bin/node

I prefer the symlinks. Less magic.
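Whichever fix you choose, it is worth confirming before the next reboot that the command resolves without your login environment. A small sketch, under the assumption that a boot-time service sees roughly a minimal PATH and none of your shell startup files; `resolves_in_clean_path` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Succeed if the command resolves with an empty environment and a minimal PATH,
# which approximates what a service started at boot sees (no nvm, no ~/.bashrc).
resolves_in_clean_path() {
  env -i PATH=/usr/local/bin:/usr/bin:/bin sh -c 'command -v "$1"' _ "$1" >/dev/null
}

# Usage:
#   for cmd in node openclaw; do
#     resolves_in_clean_path "$cmd" || echo "$cmd will not be found after a reboot" >&2
#   done
```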

2. `Node.js version too old`

OpenClaw requires Node ≥ 22.16. The error itself is clear; the real bug is that you are running the distro's bundled Node. Pin Node 22, and confirm node -v in the shell that pm2 actually uses.
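The version comparison can be scripted so it runs as part of a deploy. This is a sketch that relies on GNU `sort -V`, which Ubuntu ships; `node_version_ok` is a hypothetical helper, and the 22.16.0 default is simply the minimum stated above.

```shell
#!/usr/bin/env bash
# Succeed when the given Node version (e.g. "v22.17.1") is >= the minimum.
node_version_ok() {
  local have="${1#v}" want="${2:-22.16.0}"
  # sort -V orders version strings; we pass if the minimum sorts first (or ties).
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Usage: check the Node that a clean, non-interactive environment resolves,
# rather than the one your login shell happens to have on PATH.
#   v=$(env -i PATH=/usr/local/bin:/usr/bin:/bin node -v 2>/dev/null)
#   node_version_ok "$v" || echo "Node '$v' is too old or missing" >&2
```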

3. DashScope returns `401 Unauthorized`

Two real causes:

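A Coding Plan key (sk-sp-...) was sent to the regular DashScope base URL. The two endpoints are separate: Coding Plan uses https://coding.dashscope.aliyuncs.com/v1, regular DashScope uses https://dashscope.aliyuncs.com/compatible-mode/v1.
The key leaked and was rotated, and the new one never made it into the config. Check it in the console.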
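The first cause comes down to matching the key prefix with the base URL, which is easy to encode. A sketch using only the two URLs and the sk-sp- prefix given above; both function names are hypothetical, and the probe assumes the endpoint answers an OpenAI-style `GET /models`, which I have not verified for either one.

```shell
#!/usr/bin/env bash
# Print the base URL that matches the key type.
dashscope_base_url() {
  case "$1" in
    sk-sp-*) echo "https://coding.dashscope.aliyuncs.com/v1" ;;              # Coding Plan key
    *)       echo "https://dashscope.aliyuncs.com/compatible-mode/v1" ;;     # regular key
  esac
}

# Print the HTTP status the matching endpoint returns for this key.
# Assumption: GET <base>/models exists; swap in any cheap authenticated call.
probe_key() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer $1" "$(dashscope_base_url "$1")/models"
}
```

A 401 from the matching endpoint points at the second cause, a rotated or revoked key.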

4. `Connection refused` when the Gateway starts

Port 18789 is already taken. Find whatever is holding it and kill it:

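# -sTCP:LISTEN limits the match to the process listening on the port,
# so clients connected to it (such as nginx workers) are not killed too
lsof -i :18789 -sTCP:LISTEN
kill $(lsof -t -i :18789 -sTCP:LISTEN)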
pm2 restart openclaw-gateway

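If the process holding the port is another OpenClaw, you ran openclaw gateway by hand outside pm2 this morning. Stop it and let pm2 take over.

5. DingTalk goes quiet after 30 minutes of use

The long-lived connection was dropped by an upstream NAT. Two fixes that reinforce each other: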

"dingtalk": {
  "reconnectMs": 60000,
  "heartbeatMs":  30000
}

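If you control the network, pin egress to a single IP. Rotating egress is the real root cause, and a far more plausible one than "network jitter."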
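Before asking anyone to change the network, you can check whether egress actually rotates by sampling the public IP a few times and counting distinct values. A sketch: both helper names are hypothetical, and it assumes api.ipify.org is reachable from the server and returns the caller's IP as plain text; any equivalent echo service works.

```shell
#!/usr/bin/env bash
# Count distinct non-empty lines on stdin (here: distinct egress IPs seen).
distinct_count() {
  grep -v '^$' | sort -u | wc -l | tr -d ' '
}

# Print the public egress IP N times, one per line, a second apart.
# Assumption: api.ipify.org is reachable and answers with the bare IP.
sample_egress_ips() {
  local n="${1:-10}" i
  for i in $(seq 1 "$n"); do
    curl -fsS --max-time 5 https://api.ipify.org || true
    echo
    sleep 1
  done
}

# Usage:
#   sample_egress_ips 20 | distinct_count    # more than 1 means egress rotates
```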

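6. The Agent forgets things mid-task

Compaction ran, but memoryFlush was not enabled. See Part 7 of this series: set memoryFlush.enabled: true and pair it with a sensible softThresholdTokens. That one line of config decides between "it forgot" and "it remembered."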

7. Tokens burn too fast

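Three causes, from most to least likely:

Every turn uses the most expensive model. Switch to a tiered setup: qwen3.5-flash for routing, qwen3-max only when the task really needs it.
MEMORY.md has grown past 40 lines and is loaded on every turn. Audit it.
Sub-Agents get spun up for trivial work. Inline that work instead.

If it is none of these: turn on per-turn token logging and look at the real distribution. It is almost never the one you guessed.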
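Once you have one token count per turn, a few lines of awk are enough to see the shape of the distribution. This sketch deliberately leaves out the extraction step, because the log format is not specified here; `token_stats` is a hypothetical helper that expects one integer per line on stdin.

```shell
#!/usr/bin/env bash
# Summarize per-turn token counts read from stdin, one integer per line.
token_stats() {
  awk 'NF { n++; sum += $1; if ($1 > max) max = $1 }
       END {
         if (n) printf "turns=%d total=%d max=%d mean=%.0f\n", n, sum, max, sum / n
         else   print "turns=0"
       }'
}

# Usage (the extraction command depends on your log format):
#   <your-extraction-command> | token_stats
```

A max far above the mean usually means a handful of turns dominate the bill, which is a different problem from every turn being slightly too expensive.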

A Five-Line Health Check

The first thing I do every day:

pm2 status openclaw-gateway
openclaw doctor
wc -l ~/.openclaw/workspace/MEMORY.md
ls ~/.openclaw/agents/main/sessions/*.jsonl | wc -l
df -h /

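pm2 status confirms the supervisor is still up. openclaw doctor runs the built-in checks. wc -l shows whether MEMORY.md has quietly grown. The session count tells me whether it is time to archive. df -h catches logs filling the disk, which they eventually will.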
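Two of those checks are numeric, so they can be turned into pass/fail tests for a cron job. A sketch with hypothetical helper names: the 40-line limit is the figure from failure 7, while the 85% disk threshold is my own assumption.

```shell
#!/usr/bin/env bash
# Succeed when FILE has at most LIMIT lines (default 40).
memory_lines_ok() {
  local file="$1" limit="${2:-40}" lines
  lines=$(wc -l < "$file" | tr -d ' ')
  [ "$lines" -le "$limit" ]
}

# Succeed when the filesystem holding PATH is below LIMIT percent used.
# Assumption: 85% leaves enough headroom to react before logs fill the disk.
disk_ok() {
  local path="${1:-/}" limit="${2:-85}" used
  used=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  [ "$used" -lt "$limit" ]
}

# Usage:
#   memory_lines_ok ~/.openclaw/workspace/MEMORY.md || echo "MEMORY.md needs an audit" >&2
#   disk_ok / || echo "Disk is filling up" >&2
```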

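Wrapping Up

Production OpenClaw is not clever software; it is boring software that has already run steadily for thirty days. Deploy by the book, install a supervisor, put a stable proxy in front, fix the seven failures above before they bite, and you are there.

The QuickStart series ends here. Beyond this point the road forks: custom Skills, custom channels, custom MCP servers, multi-Agent topologies. All of them build on the foundation these ten posts laid.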
