AgentSkillsCN

infra-guardian

OpenClaw Agent Infrastructure Guardian——守护您的智能体基础设施,确保其始终稳定运行。通过分离式执行与故障自动重启,实现进程生命周期管理;借助 Cron 定时器进行健康监测(按作业粒度检测、自动恢复);独立于 OpenClaw 的 Telegram/消息通知,可直接发送警报;以系统级看门狗的形式运行于 crontab 中,而非 OpenClaw 的 Cron 任务中。当您启动后台进程、监控 Cron 作业健康状况,或面对那些总是默默消亡的任务时,可使用此技能。

SKILL.md
--- frontmatter
name: infra-guardian
description: "OpenClaw Agent Infrastructure Guardian — keep your agent's infrastructure alive. Process lifecycle management with detached execution, auto-restart on failure. Cron scheduler health monitoring (per-job detection, auto-recovery). Direct Telegram/messaging alerts independent of OpenClaw. System-level watchdog that runs from crontab, not OpenClaw cron. Use when launching background processes, monitoring cron job health, or when things keep dying silently."

Infra Guardian

Keep your OpenClaw agent's infrastructure alive. Processes, cron jobs, the works.

What It Does

LayerWhatHow
Process ManagementLaunch, track, auto-restart background processessetsid + nohup + registry + healthcheck
Cron HealthDetect stalled OpenClaw cron jobs per-jobReads jobs.json from disk, checks nextRunAtMs vs interval
Auto-RecoveryRestart gateway when cron scheduler stallsSIGUSR1 when >50% jobs overdue
AlertingTelegram alerts independent of OpenClawDirect Bot API calls — works even if OpenClaw is down

Core principle: Monitoring components must not depend on the system they monitor.

Quick Start

bash
# Process management
bash scripts/managed-process.sh register my-bot "python3 /path/to/bot.py" 480
bash scripts/managed-process.sh start my-bot
bash scripts/managed-process.sh status

# Infrastructure watchdog (add to system crontab, NOT OpenClaw cron)
*/10 * * * * /path/to/scripts/managed-process.sh watchdog

Commands

CommandUsageDescription
registerregister <name> <command> [duration_min]Define a managed process. Duration 0 = indefinite.
startstart <name>Launch registered process (fully detached).
stopstop <name>Graceful shutdown via SIGTERM.
restartrestart <name>Stop then start.
statusstatus [name]Show all processes or one specific.
healthcheckhealthcheckCheck all registered processes, restart dead ones.
watchdogwatchdogUnified check: cron health + process health (for system crontab).
cron-healthcron-healthCheck OpenClaw cron scheduler per-job health.
proc-healthproc-healthCheck key process liveness (configurable patterns).
deregisterderegister <name>Remove process from registry + clean up files.

Infrastructure Watchdog

The watchdog command is the unified health check. Run it from system crontab — never from OpenClaw cron (you can't monitor the scheduler using the scheduler).

bash
# Unified (recommended)
bash scripts/managed-process.sh watchdog

# Individual checks
bash scripts/managed-process.sh cron-health
bash scripts/managed-process.sh proc-health

Cron Health (cron-health)

Reads OpenClaw's cron state directly from disk (~/.openclaw/cron/jobs.json). No API dependency.

Per-job detection:

  • Each enabled job checked independently
  • nextRunAtMs overdue by >2× its interval → STALE
  • Jobs with kind: "every" use their everyMs as interval
  • Jobs with kind: "cron" use 24h as max expected interval

Auto-recovery:

  • If >50% of jobs are stale → sends SIGUSR1 to restart gateway
  • Alerts via Telegram Bot API directly (reads bot token from OpenClaw config)
  • 30-minute cooldown between alerts (no spam)

Why this matters: OpenClaw has a known cron bug (#8424) where kind: "cron" jobs permanently stall after missing a run. This watchdog catches it within 20 minutes instead of discovering it 8 hours later.

Process Health (proc-health)

Checks known background processes via pgrep. Configurable patterns in the script.

  • Alerts via Telegram if any monitored process is down
  • 30-minute cooldown

Setup

bash
# Add to system crontab (not OpenClaw cron!)
crontab -e
# Add this line:
*/10 * * * * /home/clawdbot/clawd/scripts/managed-process.sh watchdog

Process Management

The Problem

When AI agents launch background processes via exec & or nohup, those processes are tied to the parent session. Session ends → child processes get SIGTERM → die silently → nobody knows.

The Rule

ALL long-running processes MUST go through this framework. No exceptions.

  • python script.py &
  • nohup python script.py &
  • exec with background: true
  • bash scripts/managed-process.sh register <name> <cmd> then start <name>

Detached Execution

Processes launch via setsid + nohup + disown, giving them:

  • Own session ID (SID) — not tied to any terminal
  • PPID=1 (init) — survives parent death
  • Immune to SIGHUP/session cleanup

Health Monitoring

The healthcheck command (can run from cron every 5 min):

  1. Checks each registered process is alive (PID exists)
  2. If dead: checks if it completed normally (ran ≥80% of expected duration)
  3. If premature death: auto-restarts and writes alert flag
  4. Alert cooldown: 15 min between alerts (no spam)

Alert Integration

When a process dies, the healthcheck writes:

  • /tmp/process_monitor_alert.flag — trigger file
  • /tmp/process_monitor_alert.txt — alert message

Configure your agent's heartbeat to check these files and forward alerts.

Process Registry

All processes are tracked in .process-registry.json:

json
{
  "my-bot": {
    "command": "python3 /path/to/bot.py",
    "duration_min": 480,
    "auto_restart": true,
    "max_restarts": 5,
    "restart_cooldown": 30
  }
}

Signal Handling Best Practice

Your scripts should log signals, not silently exit:

python
import signal
from datetime import datetime

def handler(signum, frame):
    sig_name = signal.Signals(signum).name
    print(f"⚠️ SIGNAL: {sig_name} at {datetime.now()}", flush=True)
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

Never call sys.exit(0) in signal handlers — it makes crashes look like clean exits, preventing watchdogs from restarting.

Architecture

code
System Crontab (every 10 min)
  └─ managed-process.sh watchdog
       ├─ cron-health: reads ~/.openclaw/cron/jobs.json
       │    ├─ per-job: nextRunAtMs overdue > 2× interval?
       │    ├─ if >50% stale → SIGUSR1 gateway restart
       │    └─ alert → Telegram Bot API (direct, no OpenClaw)
       └─ proc-health: pgrep known patterns
            └─ alert → Telegram Bot API (direct, no OpenClaw)

Independence chain: System crontab → bash script → disk read → direct Telegram API. Zero OpenClaw dependencies in the monitoring path.