CrashRepairKit

By Millerderek

gateway-watchdog is a self-healing skill for OpenClaw on Linux. It automatically detects and removes broken plugin entries, validates LLM model strings and API keys, and restarts the gateway cleanly without getting stuck in the interactive doctor wizard. A systemd watchdog polls every 30 seconds, and a startup hook fires on every gateway restart.

# gateway-watchdog

An OpenClaw skill and systemd daemon that automatically detects and repairs gateway disconnections, unresolvable plugin entries, and LLM model mismatches, with a live CLI dashboard to watch it all in real time.

Built and tested on OpenClaw 2026.x running on Ubuntu with systemd.

---

## What It Does

| Problem | How It's Handled |
|---|---|
| Gateway crashes | `Restart=always` brings it back via systemd |
| Plugin not found (e.g. `retell-voice`) | `fix-plugins.sh` detects via jq + `openclaw doctor`, removes entry, restarts |
| Deprecated or mismatched model string | `check-models.sh` validates against known-good list, rewrites if needed |
| Invalid or expired API key | `check-models.sh` probes each provider's `/v1/models` endpoint live |
| Broken escalation chain | Chain integrity checked across all fallback models |
| Agent disconnects mid-session | Watchdog polls every 30s, triggers repair if unhealthy |
| Network down at startup | Model probing skipped entirely, plugin fix still runs |

---

## File Structure

```
gateway-watchdog/
├── install.sh                    # One-command installer
├── openclaw-startup-fix.sh       # Runs on every gateway start via ExecStartPost
├── openclaw-watchdog.sh          # Systemd daemon, polls every 30s
├── openclaw-watchdog.service     # Watchdog systemd unit
├── openclaw-gateway.service      # Gateway systemd unit (reference / patch target)
├── openclaw-watch.sh             # Live CLI dashboard
└── skill/
    ├── SKILL.md                  # OpenClaw skill definition
    ├── fix-plugins.sh            # Plugin detection and removal
    └── check-models.sh           # Model string, API key, and provider health checks
```

---

## Install

```bash
git clone <this repo>
cd gateway-watchdog
bash install.sh
```

Requires: `jq`, `bash`, `systemd`, Linux. `jq` will be installed automatically if missing.

After install, verify the `ExecStartPost` hook was applied to the gateway service:

```bash
systemctl cat openclaw-gateway.service | grep -E "ExecStart|Restart"
```

You should see both `ExecStart` and `ExecStartPost=/usr/local/bin/openclaw-startup-fix.sh`.

---

## Auto-Repair Flow

Every time the gateway starts (whether after a crash or manually), `openclaw-startup-fix.sh` fires automatically:

```
Gateway starts
  └─ ExecStartPost: openclaw-startup-fix.sh
       ├─ Backup config (timestamped, pruned after 7 days)
       ├─ Phase 1: Plugin fix       ← always runs, local only
       └─ Phase 2: Model health
            ├─ Network check        ← skip probing if offline
            ├─ Previous state check ← alert-only if last state was healthy
            └─ Auto-fix             ← only if state was already degraded
```
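
The Phase 2 branching can be sketched as a single decision function. This is an illustrative sketch, not the code in `openclaw-startup-fix.sh`: the function name `decide_model_action`, its arguments, and the `degraded` state label are assumptions made for the example.

```bash
# Hypothetical helper: decide what Phase 2 should do.
#   $1 = "up" or "down"  (result of the network check)
#   $2 = last recorded watchdog state, e.g. "healthy" or "degraded"
decide_model_action() {
    local network="$1" previous_state="$2"

    if [ "$network" != "up" ]; then
        echo "skip"          # offline: no provider probing at all
    elif [ "$previous_state" = "degraded" ]; then
        echo "auto-fix"      # state was already degraded: repair is allowed
    else
        echo "alert-only"    # healthy (or unknown): log, leave the config alone
    fi
}
```

Treating an unknown previous state as alert-only is a conservative choice for this sketch; the real script may handle a missing state file differently.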

The watchdog daemon runs in parallel and catches issues that develop *after* startup: rate limits, keys expiring mid-session, provider outages.
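
A poll-and-repair loop of this kind might look roughly like the following. It is a minimal sketch, not the contents of `openclaw-watchdog.sh`: `poll_gateway`, `check_health`, and `run_repair` are hypothetical names, and using `systemctl is-active` as the health probe is an assumption.

```bash
# Hypothetical health probe and repair hook; replace with the real checks.
check_health() { systemctl is-active --quiet openclaw-gateway.service; }
run_repair()   { bash /usr/local/bin/openclaw-startup-fix.sh; }

# Poll at a fixed interval and trigger a repair whenever the probe fails.
#   $1 = seconds between checks (the daemon described here uses 30)
#   $2 = number of checks to run; 0 means loop forever
poll_gateway() {
    local interval="$1" max_checks="$2" checks=0
    while [ "$max_checks" -eq 0 ] || [ "$checks" -lt "$max_checks" ]; do
        if ! check_health; then
            echo "$(date '+%F %T') gateway unhealthy, running repair"
            run_repair
        fi
        checks=$((checks + 1))
        sleep "$interval"
    done
}

# Intended use (not executed here):
#   poll_gateway 30 0
```

The bounded `max_checks` argument exists only so the loop can be exercised in a test without running indefinitely.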

---

## Safety Mitigations

Three independent gates prevent incorrect config changes:

**1. Network gate**
Before probing any provider endpoint, connectivity is verified with a 3-second ping. If the network is not up, the entire model-probing phase is skipped. The plugin fix still runs because it is purely local.

**2. Previous state gate**
The last recorded watchdog state is checked before any model auto-fix. If the state was `healthy`, the models were working before the crash, so the script alerts and logs but does not touch the config. A human decision is required to force a model fix in that case.

**3. Backup before every write**
A timestamped backup is saved to `~/.openclaw/backups/` before any config change. Backups older than 7 days are pruned automatically. The worst case is always one manual restore:

```bash
cp ~/.openclaw/backups/openclaw.json.startup.<timestamp> ~/.openclaw/openclaw.json
systemctl restart openclaw-gateway.service
```
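
The backup-then-prune step can be sketched as follows. This is an assumption-laden illustration rather than the script's actual code: `backup_config` is a hypothetical name, and the timestamp format is invented for the example. It relies on `find -mtime` and `-delete` as provided by GNU findutils.

```bash
# Hypothetical helper: copy a config file into a backup directory with a
# timestamp suffix, then delete older backups of the same file.
#   $1 = config file   (e.g. ~/.openclaw/openclaw.json)
#   $2 = backup dir    (e.g. ~/.openclaw/backups)
#   $3 = days to keep  (e.g. 7)
backup_config() {
    local config="$1" backup_dir="$2" keep_days="$3"
    local base stamp target

    base=$(basename "$config")
    stamp=$(date '+%Y%m%d-%H%M%S')      # timestamp format is an assumption
    target="$backup_dir/$base.startup.$stamp"

    mkdir -p "$backup_dir" || return 1
    # Plain cp (no -p) so the backup's mtime is "now"; preserving the
    # config's old mtime could get a fresh backup pruned immediately.
    cp "$config" "$target" || return 1

    # -mtime +N matches files modified more than N whole days ago.
    find "$backup_dir" -type f -name "$base.startup.*" \
        -mtime +"$keep_days" -delete

    echo "$target"
}
```

Calling `backup_config ~/.openclaw/openclaw.json ~/.openclaw/backups 7` prints the path of the new backup, which is the file to copy back in the restore command above.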

---

## Live Dashboard

```bash
# Live refresh every 5s
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --follow

# One-shot snapshot
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh

# Full model health report
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --models

# Run repair with animated progress bar
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --repair
```

In `--follow` mode:
- **R + Enter**: run auto-repair with live progress bar
- **M + Enter**: full model health report
- **Q + Enter**: quit

---

## Manual Commands

```bash
# Plugin fix only
bash ~/.openclaw/skills/gateway-watchdog/fix-plugins.sh

# Model health check (report only)
bash ~/.openclaw/skills/gateway-watchdog/check-models.sh

# Model health check with auto-fix
bash ~/.openclaw/skills/gateway-watchdog/check-models.sh --fix

# Full startup fix (same as what runs automatically)
bash /usr/local/bin/openclaw-startup-fix.sh

# Watchdog service
systemctl status openclaw-watchdog
journalctl -u openclaw-watchdog -f

# Startup fix log
tail -f /var/log/openclaw-startup-fix.log

# Watchdog log
tail -f /var/log/openclaw-watchdog.log
```

---

## Agent Slash Command

Once installed, type `/gateway-watchdog` in the OpenClaw TUI to trigger a full health check and repair cycle from within the agent itself.

---

## How Plugin Detection Works

The script does **not** call `openclaw gateway start` to detect bad plugins, because that command opens an interactive wizard and hangs. Instead it uses two methods:

1. **jq + filesystem scan**: reads every key from `plugins.entries` in `openclaw.json` and checks whether a corresponding directory or node module exists on disk. Instant, no CLI calls.
2. **`openclaw doctor` with a 10-second timeout**: supplements method 1 for broken installs where files exist but the plugin still fails to resolve. `</dev/null` closes stdin to prevent interactive prompts.
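
Method 1 can be approximated with a few lines of `jq` and a directory test. The sketch below is not `fix-plugins.sh` itself: the function names are hypothetical, the plugin search directories are passed in by the caller because the real lookup paths are not documented here, and it assumes `plugins.entries` is a JSON object keyed by plugin name.

```bash
# Hypothetical helper: print every key under plugins.entries that has no
# matching directory in any of the given search roots.
#   $1 = path to openclaw.json, remaining args = directories to search
find_missing_plugins() {
    local config="$1"
    shift
    jq -r '(.plugins.entries // {}) | keys[]' "$config" |
    while IFS= read -r name; do
        found=0
        for root in "$@"; do
            if [ -d "$root/$name" ]; then
                found=1
                break
            fi
        done
        [ "$found" -eq 1 ] || printf '%s\n' "$name"
    done
}

# Hypothetical helper: delete one entry, writing through a temp file so a
# failed jq run never truncates the config.
remove_plugin_entry() {
    local config="$1" name="$2" tmpfile
    tmpfile=$(mktemp) || return 1
    jq --arg name "$name" 'del(.plugins.entries[$name])' "$config" > "$tmpfile" \
        && mv "$tmpfile" "$config"
}
```

A caller would typically take a backup first, then loop over the output of `find_missing_plugins` and pass each name to `remove_plugin_entry`.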

The `State dir migration skipped` warning that appears in every `openclaw doctor` run is filtered out because it is a cosmetic warning, not an error.
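
The timeout, closed stdin, and warning filter can be combined in a small wrapper. This is a sketch under assumptions: `run_noninteractive` and `filter_doctor_output` are hypothetical names, and it relies on `timeout` from GNU coreutils, which exits with status 124 when the time limit is reached.

```bash
# Hypothetical wrapper: run a command with a time limit and stdin attached
# to /dev/null, so any prompt reads end-of-file instead of waiting for input.
#   $1 = time limit in seconds, remaining args = command to run
run_noninteractive() {
    local limit="$1"
    shift
    timeout "$limit" "$@" </dev/null 2>&1
}

# Drop the cosmetic warning; every other line passes through unchanged.
# "|| true" keeps the pipeline successful when grep filters out every line.
filter_doctor_output() {
    grep -v 'State dir migration skipped' || true
}

# Intended use (not executed here):
#   run_noninteractive 10 openclaw doctor | filter_doctor_output
```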

---

## Model Validation

`check-models.sh` checks:

- Primary model string format (`provider/model-name`)
- Known deprecated model strings with suggested replacements
- Fallback / escalation chain integrity and provider diversity
- API key presence (credentials file → env var → config file)
- Live provider endpoint reachability (HTTP 200 = valid, 401/403 = bad key, 429 = rate limited)
- Subagent and TTS summary model strings
- Local providers (Ollama, LMStudio) via local endpoint ping

Supported providers: Anthropic, OpenAI, Google/Gemini, Moonshot/Kimi, Mistral, Groq, OpenRouter, Ollama, LMStudio.
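
Two of these checks, the model string format and the HTTP status mapping, are simple enough to sketch directly. The function names and the returned labels are hypothetical, and the auth header is left to the caller because providers differ in how they expect the key to be sent. This is not the code in `check-models.sh`.

```bash
# Returns 0 when the string looks like provider/model-name. Only the first
# slash separates the provider, so nested names such as a/b/c are accepted.
is_valid_model_string() {
    case "$1" in
        *[[:space:]]*) return 1 ;;   # no whitespace allowed
        */*) ;;                      # must contain a slash
        *) return 1 ;;
    esac
    local provider="${1%%/*}" model="${1#*/}"
    [ -n "$provider" ] && [ -n "$model" ]
}

# Map an HTTP status code from a provider probe to a result label.
classify_http_status() {
    case "$1" in
        200)     echo "valid" ;;
        401|403) echo "bad-key" ;;
        429)     echo "rate-limited" ;;
        000)     echo "unreachable" ;;   # curl prints 000 when no response arrived
        *)       echo "unknown" ;;
    esac
}

# Fetch only the status code from an endpoint.
#   $1 = URL, $2 = full auth header line (provider-specific)
probe_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 -H "$2" "$1"
}
```

A probe would then read `classify_http_status "$(probe_status "$url" "$header")"`, with `$url` pointing at the provider's `/v1/models` endpoint.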

---

## After an OpenClaw Update

OpenClaw may overwrite its own systemd service file on update, removing the `ExecStartPost` hook. If that happens, re-run the installer:

```bash
bash install.sh
```

It is idempotent, so it is safe to run multiple times.

---

## License

MIT. No warranties. Review before running on production systems.