Homelab Intelligence Skill
Purpose: Gather comprehensive system intelligence, analyze health, and provide actionable recommendations for the homelab infrastructure.
When to use: When you need to understand current system state, diagnose issues, or provide recommendations for maintenance/improvements.
Triggers:
- •User asks "how is the system?"
- •User requests health check or diagnostics
- •User mentions issues or performance concerns
- •User asks specific questions about services, resources, or configuration
- •Before making significant changes to the infrastructure
- •Periodically for proactive monitoring
Quick Query System (NEW: 2025-11-22)
For specific questions, use the natural language query system first:
~/containers/scripts/query-homelab.sh "Your question here"
Supported query types:
- •Resource usage: "What services are using the most memory?", "Show me disk usage"
- •Service status: "Is jellyfin running?", "Show me recent restarts"
- •Network topology: "What's on the reverse_proxy network?"
- •Configuration: "What's jellyfin's configuration?"
Benefits:
- •✅ Instant responses (<1s) from cache
- •✅ No need to run full intel script for simple questions
- •✅ Production-ready and safety-tested
When to use query system vs full intel:
- •Query system: Specific, quick questions about current state
- •Full intel: Comprehensive health assessment, troubleshooting, recommendations
Instructions
When this skill is invoked, follow this workflow:
Step 1: Run Intelligence Gathering
Execute the homelab intelligence script to collect current system state:
cd ~/containers
./scripts/homelab-intel.sh
Note: The script always writes a JSON report to docs/99-reports/intel-<timestamp>.json
This will:
- •Check system basics (uptime, SELinux, kernel updates)
- •Analyze disk usage (system SSD and BTRFS pool)
- •Verify all critical services are running
- •Measure resource usage (memory, swap, load average)
- •Check backup status (local logs, external drive, BTRFS snapshots)
- •Verify SSL certificate validity (Let's Encrypt)
- •Test monitoring stack health (Prometheus, Grafana, Loki via container exec)
- •Assess network connectivity (internet reachability)
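To read the report that Step 1 writes, the newest file has to be found first. The following is a minimal sketch, assuming reports are named intel-*.json and that the most recently modified file belongs to the latest run; the function name latest_intel_report is hypothetical and not part of the homelab scripts.

```shell
# Print the most recently modified intel report in a directory.
# Assumption: report files match intel-*.json and the newest mtime is the latest run.
latest_intel_report() {
    local dir="${1:-$HOME/containers/docs/99-reports}"
    local newest="" f
    for f in "$dir"/intel-*.json; do
        [ -e "$f" ] || continue              # glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    # Fail (non-zero status, no output) when no report exists yet.
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

Calling latest_intel_report with no argument looks in ~/containers/docs/99-reports; pass another directory to override it.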
Step 2: Analyze the Output
Read and parse the JSON output from the script. Pay special attention to:
Critical Issues (Priority 1):
- •These require immediate action
- •May indicate system instability or security concerns
- •Examples: Services down, disk >80%, SELinux disabled, no internet
Warnings (Priority 2):
- •Should be addressed soon
- •May become critical if ignored
- •Examples: Disk >70%, no recent backup, high memory usage
Info Items:
- •Informational status updates
- •Positive confirmations of healthy state
- •Examples: All services running, monitoring healthy
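The disk thresholds quoted in the examples above (>80% critical, >70% warning) can be expressed as a small classifier. This is only a sketch: the function names and the "ok" label are assumptions, and the real script may apply different or additional rules.

```shell
# Used-space percentage of the filesystem holding a path, from POSIX df output.
disk_used_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a percentage to the severity levels used in Step 2.
# Thresholds come from the examples above; "ok" is a hypothetical label.
disk_severity() {
    local pct="$1"
    if [ "$pct" -gt 80 ]; then
        echo critical
    elif [ "$pct" -gt 70 ]; then
        echo warning
    else
        echo ok
    fi
}
```

For instance, disk_severity "$(disk_used_pct /)" classifies the root filesystem.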
Health Score:
- •90-100: Excellent health
- •75-89: Good health, minor issues
- •50-74: Degraded, needs attention
- •0-49: Critical state, immediate action required
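The score bands above translate directly into a lookup. A minimal sketch, with health_band as a hypothetical helper name:

```shell
# Translate a 0-100 health score into the band descriptions listed above.
health_band() {
    local score="$1"
    if [ "$score" -ge 90 ]; then
        echo "Excellent health"
    elif [ "$score" -ge 75 ]; then
        echo "Good health, minor issues"
    elif [ "$score" -ge 50 ]; then
        echo "Degraded, needs attention"
    else
        echo "Critical state, immediate action required"
    fi
}
```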
Step 3: Provide Context-Aware Recommendations
Based on the analysis, provide specific, actionable recommendations:
For Critical Issues:
- •Explain the impact of the issue
- •Provide step-by-step resolution
- •Reference relevant documentation in docs/ if applicable
- •Mention related ADRs if architectural decisions are involved
For Warnings:
- •Explain when this might become critical
- •Suggest preventive actions
- •Provide commands to investigate further
For General Health:
- •Summarize overall system state
- •Highlight any trends (improving/degrading)
- •Suggest proactive improvements
Step 4: Check for Patterns
Look for common patterns that might indicate deeper issues:
Disk Space Issues:
- •Check if journal logs are growing (suggest rotation)
- •Look for container layer accumulation (suggest pruning)
- •Review backup log retention
Service Issues:
- •Check if services failed after recent changes (review git log)
- •Verify quadlet syntax if services won't start
- •Check network connectivity for monitoring stack
Resource Pressure:
- •High memory + swap = need to review container limits
- •High CPU + Jellyfin = likely transcoding (normal)
- •High disk I/O = check if backup is running
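Checking the "high memory + swap" pattern needs two numbers: how much RAM is in use and how much swap is in use. The sketch below reads them from /proc/meminfo on Linux. The function names are hypothetical, and what counts as "high" is left to the caller because the document does not state the thresholds.

```shell
# Percentage of RAM in use, computed from MemTotal and MemAvailable (kB).
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Swap currently in use, in kB.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}
```

Both functions accept an alternative file path, which makes them easy to try against a saved copy of /proc/meminfo.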
Step 5: Reference Documentation
When providing recommendations, link to relevant documentation:
Troubleshooting:
- •Service issues: Reference the CLAUDE.md Troubleshooting Workflow section
- •Disk issues: Reference docs/20-operations/guides/storage-layout.md
- •Backup issues: Reference docs/20-operations/guides/backup-strategy.md
Architecture Decisions:
- •If suggesting changes, check ADRs in docs/*/decisions/
- •Reference the CLAUDE.md ADR section for key decisions
Service-Specific Issues:
- •Traefik: docs/10-services/guides/traefik.md
- •Jellyfin: docs/10-services/guides/jellyfin.md
- •Monitoring: docs/40-monitoring-and-documentation/guides/monitoring-stack.md
Step 6: Suggest Follow-Up Actions
Based on findings, suggest next steps:
If Health Score < 75:
- •Run detailed diagnostics: ./scripts/homelab-diagnose.sh
- •Review service logs: journalctl --user -u <service>.service -n 100
- •Check recent changes: git log --oneline -10
If Backup Overdue:
- •Check external drive: df -h | grep WD-18TB
- •Run manual backup: ./scripts/btrfs-snapshot-backup.sh
- •Review backup logs
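Whether a backup counts as overdue depends on its age relative to the schedule. The sketch below assumes the caller already knows the time of the last backup as a Unix epoch; how that timestamp is obtained is not specified here. The 7-day default reflects the weekly schedule, and both function names are hypothetical.

```shell
# Whole days between two epoch timestamps (seconds): days_between <earlier> <later>
days_between() {
    echo $(( ($2 - $1) / 86400 ))
}

# Succeeds (exit 0) when the last backup is older than max_days.
# Usage: backup_overdue <last_backup_epoch> [now_epoch] [max_days]
backup_overdue() {
    local last="$1" now="${2:-$(date +%s)}" max_days="${3:-7}"
    [ "$(days_between "$last" "$now")" -gt "$max_days" ]
}
```

A typical call is backup_overdue "$last_epoch" && echo "backup overdue".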
If Disk Usage High:
- •Identify large files: du -sh ~/containers/data/* | sort -h
- •Check journal size: journalctl --user --disk-usage
- •Review cleanup options from CLAUDE.md
Example Workflow
User: "How is my homelab doing?"
Claude Response:
- •(Runs intelligence script)
- •(Reads JSON output)
- •(Analyzes results)
"I've run a comprehensive health check of your homelab. Here's the summary:
Overall Health: 87/100 ✅ Healthy
Good News:
- •All 12 services running normally
- •SSL certificates valid (62 days until renewal)
- •Monitoring stack (Prometheus, Grafana, Loki) fully operational
- •Memory usage normal at 24%
Warnings Found (2):
- •System SSD at 58% capacity (W001)
  - •Current: 75GB / 128GB used
  - •Trend: +2% since last week
  - •Recommendation: Review journalctl --user --disk-usage and consider rotating old logs
  - •Reference: See CLAUDE.md "High Disk Usage" troubleshooting
- •Backup slightly overdue (I002)
  - •Last backup: 9 days ago
  - •Expected: Weekly (Sunday)
  - •Recommendation: Verify external drive mounted and run ./scripts/btrfs-snapshot-backup.sh
  - •Reference: docs/20-operations/guides/backup-strategy.md
Recommended Actions:
- •[MEDIUM] Review disk usage and clean up old logs
- •[LOW] Run weekly backup when external drive is available
Would you like me to help with any of these items?"
Common Scenarios
Scenario 1: Critical Service Down
If intelligence script shows critical services failed:
- •Identify which service(s) failed
- •Check recent systemd journal: journalctl --user -u <service>.service -n 50
- •Look for error patterns (common: network, permissions, port conflicts)
- •Reference CLAUDE.md "Container Won't Start" troubleshooting
- •Suggest specific fix based on error
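Scanning the journal for the three common error families named above can be sketched as a filter over log text. The patterns are illustrative guesses at typical messages, not an exhaustive or authoritative list, and classify_errors is a hypothetical helper.

```shell
# Read log text on stdin and print one label per error family found.
# Patterns are assumptions about typical messages; extend them as needed.
classify_errors() {
    local log
    log="$(cat)"
    if printf '%s\n' "$log" | grep -qiE 'address already in use|port is already allocated'; then
        echo port-conflict
    fi
    if printf '%s\n' "$log" | grep -qiE 'permission denied|operation not permitted'; then
        echo permissions
    fi
    if printf '%s\n' "$log" | grep -qiE 'network .*not found|no such network|network is unreachable'; then
        echo network
    fi
}
```

It is meant to sit at the end of a pipeline, for example journalctl --user -u <service>.service -n 50 | classify_errors.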
Scenario 2: High Disk Usage
If system SSD >70%:
- •Run du -sh ~/containers/data/* | sort -h to identify culprits
- •Check journal size: journalctl --user --disk-usage
- •Suggest cleanup commands from CLAUDE.md "High Disk Usage"
- •Explain consequences if ignored (system may freeze at 100%)
Scenario 3: Monitoring Stack Issues
If Prometheus/Grafana/Loki health checks fail:
- •Check each service individually: systemctl --user status <service>.service
- •Verify network connectivity (services must be on monitoring network)
- •Check datasource UIDs in Grafana provisioning
- •Reference docs/40-monitoring-and-documentation/guides/monitoring-stack.md
Scenario 4: Everything Healthy
If health score is 90 or above and no issues are reported:
- •Acknowledge healthy state
- •Highlight any positive trends (e.g., disk usage stable, uptime high)
- •Suggest proactive actions (review Grafana dashboards, test backup restore)
- •Ask if user wants to work on planned improvements from docs/40-monitoring-and-documentation/journal/
Context Framework Integration
When troubleshooting, leverage the Context Framework for historical awareness:
Query Past Issues
cd ~/containers/.claude/context/scripts
# Check if this problem has occurred before
./query-issues.sh --category disk-space   # Disk issues
./query-issues.sh --category deployment   # Deployment issues
./query-issues.sh --status resolved       # See what worked before
Query Deployment History
# How was a service originally deployed?
./query-deployments.sh --service jellyfin
./query-deployments.sh --pattern monitoring-stack
Auto-Remediation
For common issues, use the remediation playbooks:
cd ~/containers/.claude/remediation/scripts
# Disk cleanup (safe, no confirmation needed)
./apply-remediation.sh --playbook disk-cleanup --dry-run   # Preview first
./apply-remediation.sh --playbook disk-cleanup             # Execute
# Service restart (with logging)
./apply-remediation.sh --playbook service-restart --service prometheus
Available playbooks: disk-cleanup, service-restart, drift-reconciliation, resource-pressure
See ~/.claude/QUICK-REFERENCE.md for full command reference.
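The preview-then-execute habit shown for disk-cleanup can be wrapped in a helper so that the real run never happens without a successful dry run first. This is a sketch under assumptions: the remediate function and the REMEDIATE variable are hypothetical, and --dry-run is only shown above for the disk-cleanup playbook, so support in other playbooks should be checked before relying on it.

```shell
# Command used to apply playbooks; override REMEDIATE to point elsewhere.
REMEDIATE="${REMEDIATE:-$HOME/containers/.claude/remediation/scripts/apply-remediation.sh}"

# Usage: remediate <playbook> [yes]
# Always previews with --dry-run; executes only when the second argument is "yes".
remediate() {
    local playbook="$1" confirm="${2:-no}"
    "$REMEDIATE" --playbook "$playbook" --dry-run || return 1
    if [ "$confirm" = "yes" ]; then
        "$REMEDIATE" --playbook "$playbook"
    else
        echo "Preview only; re-run with 'yes' to apply $playbook"
    fi
}
```

Running remediate disk-cleanup previews the changes, and remediate disk-cleanup yes previews and then applies them.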
Integration with Other Skills
This skill works well with:
- •Context Framework: Query issue history and deployment patterns
- •Auto-Remediation: Execute playbooks for common fixes
- •homelab-deployment: Verify system health before/after deployments
- •systematic-debugging: Use when issues require deeper investigation
Output Format
Always structure your response as:
- •Health Score & Status (with emoji for visual clarity)
- •Critical Issues (if any - these are urgent)
- •Warnings (if any - these need attention)
- •Positive Findings (what's working well)
- •Key Metrics (uptime, resource usage, service count)
- •Recommended Actions (prioritized list)
- •Offer to Help (ask if user wants assistance with any item)
Keep responses concise but actionable. Always provide specific commands or file references.
Notes
- •v2.0 improvements: Always generates JSON output, improved monitoring health checks via podman exec, better backup detection (3 locations), smarter swap threshold
- •v2.1 (2025-11-28): Added Context Framework and Auto-Remediation integration
- •JSON reports automatically saved to ~/containers/docs/99-reports/intel-<timestamp>.json
- •Script is safe to run frequently (read-only checks; the only side effect is the JSON report it writes)
- •Health scoring algorithm: Start at 100, -20 for critical issues, -5 for warnings
- •Exit codes: 0=healthy, 1=warning, 2=critical (useful for automation)
- •Full script reference: docs/20-operations/guides/automation-reference.md
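The scoring rule and exit codes from the notes above can be sketched as follows. Clamping the score at 0 is an assumption, since the notes do not say what happens when deductions exceed 100, and both function names are hypothetical.

```shell
# Health score: start at 100, subtract 20 per critical issue and 5 per warning.
# Assumption: the result is clamped at 0.
health_score() {
    local criticals="$1" warnings="$2"
    local score=$(( 100 - 20 * criticals - 5 * warnings ))
    [ "$score" -lt 0 ] && score=0
    echo "$score"
}

# Translate the script's documented exit code into a label.
exit_status_label() {
    case "$1" in
        0) echo healthy ;;
        1) echo warning ;;
        2) echo critical ;;
        *) echo unknown ;;
    esac
}
```

In automation, the label can be derived right after a run: ./scripts/homelab-intel.sh; exit_status_label $?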