Set Up Uptime Checks
Monitor service availability from external vantage points and prevent SSL certificate expirations.
When to Use
- •Monitoring customer-facing endpoints (APIs, websites)
- •Tracking SSL certificate expiration
- •Validating service availability from multiple regions
- •Creating public status pages
- •Meeting SLA requirements for uptime reporting
Inputs
- •Required: List of HTTP/HTTPS endpoints to monitor
- •Required: Prometheus instance for metric collection
- •Optional: Multiple geographic probe locations
- •Optional: Status page tool (Statuspage.io, Cachet, custom)
- •Optional: Alert notification channels (PagerDuty, Slack)
Procedure
Step 1: Deploy Blackbox Exporter
Install Blackbox Exporter via Docker or Kubernetes:
# Docker deployment docker run -d \ --name blackbox-exporter \ -p 9115:9115 \ -v $(pwd)/blackbox.yml:/etc/blackbox_exporter/config.yml \ prom/blackbox-exporter:latest \ --config.file=/etc/blackbox_exporter/config.yml
Kubernetes deployment:
# blackbox-exporter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: blackbox-exporter
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: blackbox-exporter
template:
metadata:
labels:
app: blackbox-exporter
spec:
containers:
- name: blackbox-exporter
image: prom/blackbox-exporter:latest
ports:
- containerPort: 9115
volumeMounts:
- name: config
mountPath: /etc/blackbox_exporter
volumes:
- name: config
configMap:
name: blackbox-exporter-config
---
apiVersion: v1
kind: Service
metadata:
name: blackbox-exporter
namespace: monitoring
spec:
selector:
app: blackbox-exporter
ports:
- port: 9115
targetPort: 9115
Expected: Blackbox Exporter running and accessible on port 9115.
On failure: Check firewall rules, ensure config volume is mounted correctly.
Step 2: Configure Blackbox Modules
Create blackbox.yml with various probe types:
# blackbox.yml
modules:
# Basic HTTP 200 check
http_2xx:
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
follow_redirects: true
preferred_ip_protocol: "ip4"
# HTTP with authentication
http_2xx_auth:
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
headers:
Authorization: "Bearer ${AUTH_TOKEN}"
# API health check (expects JSON response)
http_json_health:
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
fail_if_body_not_matches_regexp:
- '"status":"healthy"'
# SSL certificate check
http_2xx_ssl:
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
tls_config:
insecure_skip_verify: false
fail_if_ssl_not_present: true
# TCP port check (e.g., database)
tcp_connect:
prober: tcp
timeout: 5s
tcp:
preferred_ip_protocol: "ip4"
# ICMP ping
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
# DNS resolution check
dns_google:
prober: dns
timeout: 5s
dns:
query_name: "google.com"
query_type: "A"
valid_rcodes:
- NOERROR
Load config into Kubernetes:
kubectl create configmap blackbox-exporter-config \ -n monitoring \ --from-file=blackbox.yml \ --dry-run=client -o yaml | kubectl apply -f -
Expected: Multiple probe modules configured for different check types.
On failure: Validate YAML syntax. Check Blackbox Exporter logs for config errors.
Step 3: Configure Prometheus Scrape
Add Blackbox targets to Prometheus config:
# prometheus.yml
scrape_configs:
# Blackbox exporter itself
- job_name: 'blackbox-exporter'
static_configs:
- targets: ['blackbox-exporter:9115']
# HTTP endpoint checks
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://api.company.com/health
- https://www.company.com
- https://app.company.com/login
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# SSL certificate expiry checks
- job_name: 'blackbox-ssl'
metrics_path: /probe
params:
module: [http_2xx_ssl]
static_configs:
- targets:
- https://api.company.com
- https://www.company.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# TCP connectivity checks (databases, etc.)
- job_name: 'blackbox-tcp'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- postgres.internal:5432
- redis.internal:6379
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Reload Prometheus config:
# Reload Prometheus (if running in Docker) docker exec prometheus kill -HUP 1 # Or Kubernetes kubectl rollout restart deployment/prometheus -n monitoring
Expected: Prometheus scraping Blackbox Exporter, metrics visible in Prometheus UI.
On failure: Check Prometheus logs for scrape errors. Verify Blackbox Exporter is reachable.
Step 4: Create Uptime Alerts
Define alerting rules:
# uptime-alerts.yml
groups:
- name: uptime
interval: 30s
rules:
- alert: EndpointDown
expr: probe_success == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Endpoint {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been unreachable for 2 minutes."
- alert: SSLCertificateExpiringSoon
expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
for: 1h
labels:
severity: warning
annotations:
summary: "SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
description: "Certificate expires on {{ $labels.instance }}. Renew soon."
- alert: SSLCertificateExpired
expr: (probe_ssl_earliest_cert_expiry - time()) < 0
for: 5m
labels:
severity: critical
annotations:
summary: "SSL certificate for {{ $labels.instance }} has EXPIRED"
description: "URGENT: Certificate expired. Service may be inaccessible."
- alert: SlowResponseTime
expr: probe_http_duration_seconds > 3
for: 5m
labels:
severity: warning
annotations:
summary: "Slow response from {{ $labels.instance }}"
description: "HTTP request took {{ $value }}s (threshold: 3s)."
- alert: HTTPStatusNot200
expr: probe_http_status_code != 200
for: 2m
labels:
severity: warning
annotations:
summary: "HTTP {{ $labels.instance }} returned {{ $value }}"
description: "Expected 200, got {{ $value }}."
Load into Prometheus:
# Add to prometheus.yml rule_files: - /etc/prometheus/uptime-alerts.yml # Reload docker exec prometheus kill -HUP 1
Expected: Alerts fire when endpoints are unreachable or SSL certs expiring.
On failure: Check Prometheus alerts page for rule evaluation errors.
Step 5: Build Uptime Dashboard
Create Grafana dashboard:
{
"dashboard": {
"title": "Uptime Monitoring",
"panels": [
{
"title": "Endpoint Availability (7 days)",
"type": "stat",
"targets": [
{
"expr": "avg_over_time(probe_success{job=\"blackbox-http\"}[7d]) * 100",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "value": 0, "color": "red" },
{ "value": 99, "color": "yellow" },
{ "value": 99.9, "color": "green" }
]
}
}
}
},
{
"title": "SSL Certificate Expiry (Days)",
"type": "graph",
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry - time()) / 86400",
"legendFormat": "{{ instance }}"
}
],
"yaxes": [
{ "label": "Days", "format": "short" }
],
"alert": {
"conditions": [
{
"evaluator": { "params": [14], "type": "lt" },
"query": { "params": ["A", "5m", "now"] }
}
]
}
},
{
"title": "Response Time (p95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(probe_http_duration_seconds_bucket[5m]))",
"legendFormat": "{{ instance }}"
}
]
},
{
"title": "HTTP Status Codes (Last Hour)",
"type": "table",
"targets": [
{
"expr": "probe_http_status_code",
"format": "table",
"instant": true
}
]
}
]
}
}
Expected: Dashboard showing uptime %, SSL expiry, response times.
On failure: Check Prometheus data source in Grafana, verify metrics are being scraped.
Step 6: Set Up Status Page
Option A: Use Statuspage.io (SaaS):
# Integrate with Statuspage.io API
curl -X POST https://api.statuspage.io/v1/pages/PAGE_ID/incidents \
-H "Authorization: OAuth YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"incident": {
"name": "API Degradation",
"status": "investigating",
"impact_override": "minor",
"body": "We are investigating elevated error rates on the API."
}
}'
Option B: Self-hosted Cachet:
# docker-compose.yml for Cachet
version: '3'
services:
cachet:
image: cachethq/docker:latest
ports:
- "8000:8000"
environment:
DB_DRIVER: pgsql
DB_HOST: postgres
DB_DATABASE: cachet
DB_USERNAME: cachet
DB_PASSWORD: secret
APP_KEY: base64:YOUR_APP_KEY
APP_URL: https://status.company.com
depends_on:
- postgres
postgres:
image: postgres:13
environment:
POSTGRES_DB: cachet
POSTGRES_USER: cachet
POSTGRES_PASSWORD: secret
volumes:
- postgres-data:/var/lib/postgresql/data
volumes:
postgres-data:
Option C: Custom status page from Prometheus metrics:
<!-- Simple status page (served via Nginx or GitHub Pages) -->
<!DOCTYPE html>
<html>
<head>
<title>Company Status</title>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
</head>
<body>
<h1>Service Status</h1>
<div id="services"></div>
<script>
async function fetchStatus() {
const response = await axios.get('https://prometheus.company.com/api/v1/query?query=probe_success');
const services = response.data.data.result;
const html = services.map(s => {
const status = s.value[1] == '1' ? '✅ Operational' : '❌ Down';
return `<p><strong>${s.metric.instance}</strong>: ${status}</p>`;
}).join('');
document.getElementById('services').innerHTML = html;
}
fetchStatus();
setInterval(fetchStatus, 60000); // Refresh every minute
</script>
</body>
</html>
Expected: Public status page shows current service status and incidents.
On failure: Ensure status page URL is reachable by customers, not behind VPN.
Validation
- • Blackbox Exporter deployed and accessible
- • Prometheus scraping Blackbox metrics
- • Uptime checks configured for all critical endpoints
- • SSL certificate expiry alerts configured (14-day warning)
- • Alerts tested (simulate endpoint down, check alert fires)
- • Grafana dashboard shows uptime and SSL expiry
- • Status page accessible to customers
- • Alert notifications reach on-call engineers
Common Pitfalls
- •Internal-only checks: Blackbox Exporter inside cluster can't detect external DNS/routing issues. Deploy probes in multiple clouds/regions.
- •Too frequent scraping: Checking every 10 seconds generates load. 30-60s is usually sufficient.
- •No SSL monitoring: Expired certificates are embarrassing and preventable. Always monitor.
- •Status page not automated: Manually updating status pages during incidents wastes time. Automate from Prometheus alerts.
- •False positives: Single failed check shouldn't alert. Use
for: 2mto avoid transient network blips.
Related Skills
- •
configure-alerting-rules- create alerts for uptime failures - •
setup-prometheus-monitoring- Prometheus backend for Blackbox Exporter