Health Checks Skill

Overview

This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.

Health Check Philosophy

Why Regular Health Checks?

• Proactive Detection: Find issues before users do
• Trend Identification: Spot degradation early
• Capacity Planning: Know when to scale
• Compliance: Maintain system hygiene
• Documentation: Track system state over time

Health Check Frequency

| Check Type | Frequency | When |
|------------|-----------|------|
| Quick | Every deploy | After any deployment |
| Daily | Daily | Morning/start of business |
| Weekly | Weekly | Beginning of week |
| Deep | Monthly | Beginning of month |
| Full Audit | Quarterly | Scheduled maintenance window |

Health Status Framework

Traffic Light System

GREEN  - All systems healthy
         - No critical issues
         - Metrics within normal ranges
         - Advisory count: 0

YELLOW - Warning state
         - Non-critical issues present
         - Metrics approaching limits
         - Performance advisories present

RED    - Critical state
         - Service impaired or unavailable
         - Critical metrics exceeded
         - Security advisories present
         - Immediate action required

Status Determination Rules

| Condition | Status |
|-----------|--------|
| Security advisory exists | RED |
| Service unavailable | RED |
| Error rate > 5% | RED |
| Connection utilization > 85% | RED |
| CI success rate < 75% | RED |
| Performance advisory exists | YELLOW |
| Error rate 1-5% | YELLOW |
| Connection utilization 70-85% | YELLOW |
| CI success rate 75-90% | YELLOW |
| Long-running queries present | YELLOW |
| All metrics normal | GREEN |
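
The rules above can be sketched as a small function. This is a minimal illustration, not a definitive implementation: the `HealthSnapshot` container and its field names are hypothetical, and it assumes that any RED condition takes precedence over YELLOW and that range boundaries (1%, 5%, 70%, 85%, 75%, 90%) fall into YELLOW.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical container for the inputs named in the rules table."""
    security_advisories: int = 0
    performance_advisories: int = 0
    service_available: bool = True
    error_rate_pct: float = 0.0
    connection_utilization_pct: float = 0.0
    ci_success_rate_pct: float = 100.0
    long_running_queries: int = 0


def determine_status(s: HealthSnapshot) -> str:
    """Return RED, YELLOW, or GREEN; any RED condition wins over YELLOW."""
    red = (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_utilization_pct > 85
        or s.ci_success_rate_pct < 75
    )
    if red:
        return "RED"

    # Assumption: boundary values belong to the YELLOW band.
    yellow = (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_utilization_pct >= 70
        or s.ci_success_rate_pct <= 90
        or s.long_running_queries > 0
    )
    return "YELLOW" if yellow else "GREEN"
```

For example, `determine_status(HealthSnapshot(error_rate_pct=3, security_advisories=1))` yields RED because the security advisory outranks the YELLOW-level error rate.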

Health Metrics

Key Performance Indicators

| Platform | Metric | Good | Warning | Critical |
|----------|--------|------|---------|----------|
| Database | Connection % | <70% | 70-85% | >85% |
| Database | Query Duration | <100ms | 100-500ms | >500ms |
| Database | Dead Rows % | <10% | 10-20% | >20% |
| API | Error Rate | <1% | 1-5% | >5% |
| API | Response Time P95 | <500ms | 500-2000ms | >2000ms |
| CI/CD | Success Rate | >90% | 75-90% | <75% |
| CI/CD | Build Time | <5min | 5-15min | >15min |
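
One way to keep these thresholds in a single place is a lookup table plus a generic classifier. The sketch below is illustrative only: the metric keys are hypothetical names, and it assumes boundary values (for example exactly 70% or exactly 15min) count as Warning.

```python
# metric key -> (warning boundary, critical boundary, higher_is_worse)
# Keys are hypothetical; the numbers come from the KPI table.
THRESHOLDS = {
    "db_connection_pct": (70, 85, True),
    "db_query_duration_ms": (100, 500, True),
    "db_dead_rows_pct": (10, 20, True),
    "api_error_rate_pct": (1, 5, True),
    "api_response_p95_ms": (500, 2000, True),
    "ci_success_rate_pct": (90, 75, False),  # lower is worse
    "ci_build_time_min": (5, 15, True),
}


def classify(metric: str, value: float) -> str:
    """Return "Good", "Warning", or "Critical" for one metric reading."""
    warn, crit, higher_is_worse = THRESHOLDS[metric]
    if not higher_is_worse:
        # Negate so the same comparisons work when lower values are worse.
        value, warn, crit = -value, -warn, -crit
    if value > crit:
        return "Critical"
    if value >= warn:
        return "Warning"
    return "Good"
```

Keeping the direction flag in the table avoids a separate code path for CI success rate, the only metric here where a lower value is worse.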

Platform-Specific Metrics

Supabase

• API error rate
• Auth failure rate
• Storage utilization
• Edge function cold starts
• Realtime connection count
• Advisory count (security/performance)

GitHub Actions

• Workflow success rate
• Average build time
• Queue wait time
• Cache hit rate
• Failed workflow count
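
As an illustration of the first metric, workflow success rate can be computed from a list of run records. The sketch assumes each record is a dict with a `conclusion` field, as in GitHub's workflow run objects, and makes the judgment call (an assumption, not something this skill prescribes) of excluding cancelled, skipped, and still-running runs from the denominator.

```python
from typing import Iterable, Mapping, Optional

# Assumption: these outcomes say nothing about build health, so they are ignored.
IGNORED_CONCLUSIONS = {"cancelled", "skipped", None}


def success_rate_pct(runs: Iterable[Mapping]) -> Optional[float]:
    """Percentage of counted runs that concluded with "success".

    Returns None when there are no counted runs, so callers can tell
    "no data" apart from "0% success".
    """
    counted = [r for r in runs if r.get("conclusion") not in IGNORED_CONCLUSIONS]
    if not counted:
        return None
    succeeded = sum(1 for r in counted if r["conclusion"] == "success")
    return 100.0 * succeeded / len(counted)
```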

Railway

• Service uptime
• Deploy success rate
• Memory utilization
• CPU utilization
• Health check pass rate

Postgres

• Connection utilization
• Query duration distribution
• Lock contention
• Dead tuple ratio
• Index usage efficiency
• Table bloat
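
Two of these metrics can be read from Postgres's built-in statistics views. The sketch below keeps the SQL as plain strings so it can be run with any client; the constant and helper names are hypothetical. It assumes Postgres 10 or newer (for the `backend_type` column) and defines dead rows % as dead tuples over live plus dead tuples, which is one common definition rather than the only one.

```python
# Client connections in use versus the configured maximum.
CONNECTION_UTILIZATION_SQL = """
SELECT count(*) AS used,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';
"""

# Tables with the most dead tuples; the counts are estimates kept by the
# statistics collector, not exact row counts.
DEAD_TUPLES_SQL = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
"""


def _pct(part: float, whole: float) -> float:
    """Percentage helper that returns 0.0 instead of dividing by zero."""
    return 0.0 if whole == 0 else 100.0 * part / whole


def connection_utilization_pct(used: int, max_connections: int) -> float:
    return _pct(used, max_connections)


def dead_row_pct(n_live_tup: int, n_dead_tup: int) -> float:
    # Assumption: ratio is dead / (live + dead).
    return _pct(n_dead_tup, n_live_tup + n_dead_tup)
```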

Health Check Procedures

Quick Health Check (5 min)

Purpose: Verify basic system functionality

1. [ ] Check for active incidents (any platform)
2. [ ] Verify all services are responding
3. [ ] Check for critical advisories
4. [ ] Review last hour error rate
5. [ ] Check connection pool status
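
Step 2 of the checklist above could be automated along these lines. This is a sketch under stated assumptions: the endpoint names and URLs are placeholders you would supply, and a service counts as "responding" if its health URL returns a 2xx status within the timeout. The probe is passed in as a parameter so the logic can be exercised without network access.

```python
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts all derive from OSError.
        return False


def check_services(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, bool]:
    """Map each service name to whether its health URL responded."""
    return {name: probe(url) for name, url in endpoints.items()}


def unresponsive(results: Dict[str, bool]) -> list:
    """Names of services that failed the probe, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)
```

A call such as `check_services({"api": "https://example.invalid/health"})` uses a placeholder URL; substitute the health endpoints your services actually expose.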

Daily Health Check (15 min)

Purpose: Assess overall system health

1. [ ] Run quick health check
2. [ ] Review 24-hour error trends
3. [ ] Check CI/CD success rate
4. [ ] Review all advisories
5. [ ] Check slow query log
6. [ ] Verify backups completed
7. [ ] Review resource utilization

Weekly Health Check (30 min)

Purpose: Comprehensive review and trend analysis

1. [ ] Run daily health check
2. [ ] Analyze weekly error patterns
3. [ ] Review index usage stats
4. [ ] Check for table bloat
5. [ ] Review connection patterns
6. [ ] Assess capacity trends
7. [ ] Review deployment frequency
8. [ ] Check certificate expirations

Monthly Deep Check (1+ hours)

Purpose: Full system audit

1. [ ] Run weekly health check
2. [ ] Full index analysis
3. [ ] Query performance review
4. [ ] Security configuration audit
5. [ ] Capacity planning review
6. [ ] Cost analysis
7. [ ] Documentation review
8. [ ] Disaster recovery test

Alert Thresholds

Immediate Alerts (Page)

• Service unavailable > 1 minute
• Error rate > 10%
• Database connections > 90%
• Security advisory created
• Deployment failure (production)
• Health check failure > 5 minutes

Warning Alerts (Slack/Email)

• Error rate > 2%
• Database connections > 75%
• Performance advisory created
• Build time increase > 50%
• Response time P95 > 1s
• Disk usage > 80%

Info Alerts (Daily Digest)

• New advisory (any type)
• Build time change
• Resource trend change
• Configuration change

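
The numeric thresholds in the page and warning tiers can be expressed as an ordered rule list. The sketch below is illustrative: the metric keys and channel names are hypothetical, only the numeric conditions are covered (event-based alerts such as "security advisory created" are left out), and each metric is routed to the most severe tier it breaches.

```python
PAGE = "page"
WARNING = "warning"

# (metric key, limit, channel). PAGE rules come first so they win for a metric.
RULES = [
    ("error_rate_pct", 10, PAGE),
    ("db_connection_pct", 90, PAGE),
    ("error_rate_pct", 2, WARNING),
    ("db_connection_pct", 75, WARNING),
    ("build_time_increase_pct", 50, WARNING),
    ("response_p95_ms", 1000, WARNING),
    ("disk_usage_pct", 80, WARNING),
]


def route_alerts(metrics: dict) -> dict:
    """Map each breached metric to the most severe channel it triggers."""
    alerts = {}
    for metric, limit, channel in RULES:
        value = metrics.get(metric)
        if value is not None and value > limit and metric not in alerts:
            alerts[metric] = channel
    return alerts
```

Metrics missing from the input are skipped rather than treated as zero, so a partial snapshot does not hide the fact that a reading was unavailable.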
Health Report Template

# Infrastructure Health Report

**Generated**: {TIMESTAMP}
**Report Type**: {Quick | Daily | Weekly | Monthly}
**Overall Status**: {GREEN | YELLOW | RED}

## Executive Summary
{2-3 sentence overview}

## Platform Status

| Platform | Status | Issues | Warnings |
|----------|--------|--------|----------|
| GitHub Actions | {STATUS} | {N} | {N} |
| Railway | {STATUS} | {N} | {N} |
| Supabase | {STATUS} | {N} | {N} |
| Postgres | {STATUS} | {N} | {N} |

## Key Metrics

### Database
- Connections: {N}/{MAX} ({PCT}%)
- Query P95: {MS}ms
- Dead Rows: {PCT}%

### API
- Error Rate: {PCT}%
- Response Time P95: {MS}ms

### CI/CD
- Success Rate: {PCT}%
- Avg Build Time: {MIN}m

## Advisories

### Security
{List or "None"}

### Performance
{List or "None"}

## Issues Requiring Attention

### Immediate
{List or "None"}

### This Week
{List or "None"}

## Trends

{Notable changes from previous period}

## Recommendations

{Specific actions to improve health}

---
*Next health check: {TIMESTAMP}*
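
The Platform Status table in this template is straightforward to generate. The sketch below is one possible approach, with hypothetical function names; it also assumes the overall status is the worst status of any platform, which the template itself does not state.

```python
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(platforms: dict) -> str:
    """Worst status across platforms (assumption); GREEN when there are none."""
    statuses = (status for status, _issues, _warnings in platforms.values())
    return max(statuses, key=SEVERITY.__getitem__, default="GREEN")


def render_platform_table(platforms: dict) -> str:
    """Render {name: (status, issues, warnings)} as the template's table."""
    lines = [
        "| Platform | Status | Issues | Warnings |",
        "|----------|--------|--------|----------|",
    ]
    for name, (status, issues, warnings) in platforms.items():
        lines.append(f"| {name} | {status} | {issues} | {warnings} |")
    return "\n".join(lines)
```

Usage: pass a dict such as `{"Railway": ("GREEN", 0, 0), "Postgres": ("YELLOW", 0, 2)}` and substitute the returned string for the table body in the report.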

Remediation Playbooks

High Connection Utilization

1. Check for connection leaks
2. Identify idle connections
3. Review connection pool settings
4. Consider connection pooler (PgBouncer/Supavisor)
5. Optimize application connection handling
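
For steps 1 and 2, `pg_stat_activity` shows which sessions are sitting idle and for how long. The sketch below pairs an assumed query with a small summarizer; the five-minute cutoff and the helper name are illustrative choices, and rows are expected as dicts with `application_name` and `idle_for` (a `timedelta`), which is how most Python drivers would return this result with a dict-style cursor.

```python
from collections import Counter
from datetime import timedelta

# Sessions that hold a connection without doing work. "idle in transaction"
# sessions are the usual sign of a leak or a missing commit/rollback.
IDLE_CONNECTIONS_SQL = """
SELECT pid, usename, application_name, state,
       now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state IN ('idle', 'idle in transaction')
ORDER BY idle_for DESC;
"""


def idle_by_application(rows, min_idle=timedelta(minutes=5)):
    """Count long-idle sessions per application, most affected first."""
    counts = Counter(
        r["application_name"] or "(unnamed)"
        for r in rows
        if r["idle_for"] >= min_idle
    )
    return counts.most_common()
```

An application that dominates this list is a reasonable first place to review pool settings or look for unclosed connections.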

High Error Rate

1. Identify error types
2. Check recent deployments
3. Review affected endpoints
4. Check downstream dependencies
5. Roll back if deployment-related

Slow Queries

1. Identify slow queries (pg_stat_statements)
2. Run EXPLAIN ANALYZE
3. Check for missing indexes
4. Review query patterns
5. Consider query optimization or caching
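
Step 1 might look like the following. This is a sketch that assumes the `pg_stat_statements` extension is installed and enabled; the function name and defaults are hypothetical. The timing columns were renamed in Postgres 13 (`mean_time` became `mean_exec_time`, `total_time` became `total_exec_time`), so the helper takes a flag for that. The default cutoff of 100 ms matches the Warning boundary for Query Duration.

```python
def slow_query_sql(
    min_mean_ms: float = 100,
    limit: int = 20,
    pg13_or_newer: bool = True,
) -> str:
    """Build a query listing statements whose mean execution time exceeds a cutoff."""
    mean_col = "mean_exec_time" if pg13_or_newer else "mean_time"
    total_col = "total_exec_time" if pg13_or_newer else "total_time"
    # float()/int() coercion keeps non-numeric input out of the SQL text.
    return (
        f"SELECT query, calls, {mean_col} AS mean_ms, {total_col} AS total_ms\n"
        "FROM pg_stat_statements\n"
        f"WHERE {mean_col} > {float(min_mean_ms)}\n"
        f"ORDER BY {mean_col} DESC\n"
        f"LIMIT {int(limit)};"
    )
```

Statements returned here are candidates for step 2. Note that `EXPLAIN ANALYZE` actually executes the statement, so wrap data-modifying queries in a transaction that you roll back.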

Build Failures

1. Review failure logs
2. Check for flaky tests
3. Verify dependencies are available
4. Check for environment issues
5. Review recent changes
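
For step 2, one common heuristic is to flag any test that both passed and failed on the same commit, since the code did not change between those runs. The sketch below assumes you can export results as `(test_name, commit_sha, passed)` tuples; that record shape and the function name are hypothetical.

```python
from collections import defaultdict


def find_flaky_tests(results):
    """Return sorted names of tests with mixed outcomes on a single commit."""
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in results:
        outcomes[(test_name, commit_sha)].add(bool(passed))
    return sorted(
        {test for (test, _sha), seen in outcomes.items() if seen == {True, False}}
    )
```

A test that fails on every run of a commit is not reported here; that pattern points to a genuine regression, which steps 1 and 5 are meant to surface.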

See checklists.md for detailed health check checklists.