Voice
CrashRepairKit
gateway-watchdog is a self-healing skill for OpenClaw on Linux. It automatically detects and removes broken plugin entries, validates LLM model strings and API keys, and restarts the gateway cleanly β without getting stuck in the interactive doctor wizard. A systemd watchdog polls every 30 seconds, a startup hook fires on every gateway restart,
README
# gateway-watchdog
An OpenClaw skill and systemd daemon that automatically detects and repairs gateway disconnections, unresolvable plugin entries, and LLM model mismatches β with a live CLI dashboard to watch it all in real time.
Built and tested on OpenClaw 2026.x running on Ubuntu with systemd.
---
## What It Does
| Problem | How It's Handled |
|---|---|
| Gateway crashes | `Restart=always` brings it back via systemd |
| Plugin not found (e.g. `retell-voice`) | `fix-plugins.sh` detects via jq + `openclaw doctor`, removes entry, restarts |
| Deprecated or mismatched model string | `check-models.sh` validates against known-good list, rewrites if needed |
| Invalid or expired API key | `check-models.sh` probes each provider's `/v1/models` endpoint live |
| Broken escalation chain | Chain integrity checked across all fallback models |
| Agent disconnects mid-session | Watchdog polls every 30s, triggers repair if unhealthy |
| Network down at startup | Model probing skipped entirely, plugin fix still runs |
---
## File Structure
```
gateway-watchdog/
βββ install.sh # One-command installer
βββ openclaw-startup-fix.sh # Runs on every gateway start via ExecStartPost
βββ openclaw-watchdog.sh # Systemd daemon, polls every 30s
βββ openclaw-watchdog.service # Watchdog systemd unit
βββ openclaw-gateway.service # Gateway systemd unit (reference / patch target)
βββ openclaw-watch.sh # Live CLI dashboard
βββ skill/
βββ SKILL.md # OpenClaw skill definition
βββ fix-plugins.sh # Plugin detection and removal
βββ check-models.sh # Model string, API key, and provider health checks
```
---
## Install
```bash
git clone <this repo>
cd gateway-watchdog
bash install.sh
```
Requires: `jq`, `bash`, `systemd`, Linux. `jq` will be installed automatically if missing.
After install, verify the `ExecStartPost` hook was applied to the gateway service:
```bash
systemctl cat openclaw-gateway.service | grep -E "ExecStart|Restart"
```
You should see both `ExecStart` and `ExecStartPost=/usr/local/bin/openclaw-startup-fix.sh`.
---
## Auto-Repair Flow
Every time the gateway starts (whether after a crash or manually), `openclaw-startup-fix.sh` fires automatically:
```
Gateway starts
ββ ExecStartPost: openclaw-startup-fix.sh
ββ Backup config (timestamped, pruned after 7 days)
ββ Phase 1: Plugin fix β always runs, local only
ββ Phase 2: Model health
ββ Network check β skip probing if offline
ββ Previous state check β alert-only if last state was healthy
ββ Auto-fix β only if state was already degraded
```
The watchdog daemon runs in parallel and catches issues that develop *after* startup β rate limits, keys expiring mid-session, provider outages.
---
## Safety Mitigations
Three independent gates prevent incorrect config changes:
**1. Network gate**
Before probing any provider endpoint, connectivity is verified with a 3-second ping. If the network is not up, the entire model-probing phase is skipped. The plugin fix still runs because it is purely local.
**2. Previous state gate**
The last recorded watchdog state is checked before any model auto-fix. If the state was `healthy`, the models were working before the crash β the script alerts and logs but does not touch the config. A human decision is required to force a model fix in that case.
**3. Backup before every write**
A timestamped backup is saved to `~/.openclaw/backups/` before any config change. Backups older than 7 days are pruned automatically. The worst case is always one manual restore:
```bash
cp ~/.openclaw/backups/openclaw.json.startup.<timestamp> ~/.openclaw/openclaw.json
systemctl restart openclaw-gateway.service
```
---
## Live Dashboard
```bash
# Live refresh every 5s
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --follow
# One-shot snapshot
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh
# Full model health report
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --models
# Run repair with animated progress bar
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --repair
```
In `--follow` mode:
- **R + Enter** β run auto-repair with live progress bar
- **M + Enter** β full model health report
- **Q + Enter** β quit
---
## Manual Commands
```bash
# Plugin fix only
bash ~/.openclaw/skills/gateway-watchdog/fix-plugins.sh
# Model health check (report only)
bash ~/.openclaw/skills/gateway-watchdog/check-models.sh
# Model health check with auto-fix
bash ~/.openclaw/skills/gateway-watchdog/check-models.sh --fix
# Full startup fix (same as what runs automatically)
bash /usr/local/bin/openclaw-startup-fix.sh
# Watchdog service
systemctl status openclaw-watchdog
journalctl -u openclaw-watchdog -f
# Startup fix log
tail -f /var/log/openclaw-startup-fix.log
# Watchdog log
tail -f /var/log/openclaw-watchdog.log
```
---
## Agent Slash Command
Once installed, type `/gateway-watchdog` in the OpenClaw TUI to trigger a full health check and repair cycle from within the agent itself.
---
## How Plugin Detection Works
The script does **not** call `openclaw gateway start` to detect bad plugins β that command opens an interactive wizard and hangs. Instead it uses two methods:
1. **jq + filesystem scan** β reads every key from `plugins.entries` in `openclaw.json`, checks if a corresponding directory or node module exists on disk. Instant, no CLI calls.
2. **`openclaw doctor` with a 10-second timeout** β supplements method 1 for broken installs where files exist but the plugin still fails to resolve. `</dev/null` closes stdin to prevent interactive prompts.
The `State dir migration skipped` warning that appears in every `openclaw doctor` run is filtered out β it is a cosmetic warning, not an error.
---
## Model Validation
`check-models.sh` checks:
- Primary model string format (`provider/model-name`)
- Known deprecated model strings with suggested replacements
- Fallback / escalation chain integrity and provider diversity
- API key presence (credentials file β env var β config file)
- Live provider endpoint reachability (HTTP 200 = valid, 401/403 = bad key, 429 = rate limited)
- Subagent and TTS summary model strings
- Local providers (Ollama, LMStudio) via local endpoint ping
Supported providers: Anthropic, OpenAI, Google/Gemini, Moonshot/Kimi, Mistral, Groq, OpenRouter, Ollama, LMStudio.
---
## After an OpenClaw Update
OpenClaw may overwrite its own systemd service file on update, removing the `ExecStartPost` hook. If that happens, re-run the installer:
```bash
bash install.sh
```
It is idempotent β safe to run multiple times.
---
## License
MIT. No warranties. Review before running on production systems.
voice
Comments
Sign in to leave a comment