Grafana Dashboards
Build monitoring and observability dashboards in Grafana on top of Prometheus metrics.
Instructions
- Start with key metrics - CPU, memory, latency, error rates
- Use consistent time ranges - All panels should sync
- Add context with variables - Filter by environment, service, host
- Set up alerts - Proactive monitoring, not reactive
- Use templates - Consistent dashboard styling
Dashboard JSON Structure
json
{
  "dashboard": {
    "id": null,
    "uid": "my-dashboard",
    "title": "Service Overview",
    "tags": ["production", "service-name"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": { "list": [] },
    "panels": []
  }
}
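The structure above can also be assembled in code, which helps when several dashboards share the same defaults. The following is a minimal Python sketch; the function name `build_dashboard_payload` and its parameters are illustrative and not part of Grafana.

```python
import json


def build_dashboard_payload(title, uid, panels=None, tags=None):
    """Return a dict shaped like the dashboard JSON structure above.

    Hypothetical helper: the name and defaults are assumptions for this example.
    """
    return {
        "dashboard": {
            "id": None,  # serialized as null, as in the structure above
            "uid": uid,  # stable identifier chosen by the author
            "title": title,
            "tags": list(tags or []),
            "timezone": "browser",
            "refresh": "30s",
            "time": {"from": "now-1h", "to": "now"},
            "templating": {"list": []},
            "panels": list(panels or []),
        }
    }


payload = build_dashboard_payload(
    "Service Overview", "my-dashboard", tags=["production", "service-name"]
)
print(json.dumps(payload, indent=2))
```

Panels such as the ones in the next section can be passed in through `panels` as plain dicts.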
Panel Types
Time Series
json
{
  "type": "timeseries",
  "title": "Request Rate",
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  },
  "targets": [
    {
      "expr": "rate(http_requests_total{job=\"$job\"}[5m])",
      "legendFormat": "{{method}} {{status}}"
    }
  ]
}
Stat Panel
json
{
  "type": "stat",
  "title": "Total Requests",
  "options": {
    "colorMode": "value",
    "graphMode": "area",
    "reduceOptions": {
      "calcs": ["lastNotNull"]
    }
  }
}
Gauge
json
{
  "type": "gauge",
  "title": "CPU Usage",
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100,
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 90 }
        ]
      }
    }
  }
}
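The threshold steps in the gauge read as "from this value upward, use this color", with the `null` step acting as the base. Below is a rough Python model of that lookup, assuming the steps are sorted ascending and each lower bound is inclusive; `threshold_color` is a hypothetical name, and Grafana performs this resolution internally.

```python
import math


def threshold_color(value, steps):
    """Return the color of the last step whose lower bound is <= value.

    Assumes steps are sorted ascending; a null (None) bound is the base step.
    """
    color = None
    for step in steps:
        lower = -math.inf if step["value"] is None else step["value"]
        if value >= lower:
            color = step["color"]
    return color


STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 90},
]

for cpu in (42, 70, 95):
    print(cpu, threshold_color(cpu, STEPS))
```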
Prometheus Queries (PromQL)
Basic Queries
promql
# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])
# Sum by label
sum by (status) (rate(http_requests_total[5m]))
# 95th-percentile latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100
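To make the meaning of `rate()` concrete, here is a simplified Python sketch of a per-second rate over counter samples in a window. It handles counter resets but deliberately omits the extrapolation to the window boundaries that Prometheus applies, so treat it as an approximation of the idea rather than the actual algorithm. The name `simple_rate` is hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate from (unix_seconds, counter_value) samples.

    Samples must be sorted by time. A drop in value is treated as a counter
    reset, so the new value is counted as an increase from zero.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


steady = [(0, 100), (60, 160), (120, 220)]
with_reset = [(0, 100), (60, 130), (120, 10), (180, 40)]
print(simple_rate(steady))
print(simple_rate(with_reset))
```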
Aggregation & Filtering
promql
# Filter by label
http_requests_total{job="api", environment="production"}
# Regex match
http_requests_total{path=~"/api/v[0-9]+/.*"}
# Aggregations
sum(metric) # Total
avg(metric) # Average
max(metric) # Maximum
topk(5, metric) # Top 5 series
# Group by label
sum by (instance) (metric)
Variables (Templating)
json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus"
      },
      {
        "name": "environment",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(up, environment)",
        "refresh": 1,
        "multi": false,
        "includeAll": true
      }
    ]
  }
}
Usage in queries:
promql
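# $__rate_interval is a Grafana built-in range intended for rate() on Prometheus data sources
rate(http_requests_total{environment=~"$environment"}[$__rate_interval])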
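The `=~` matcher matters because a variable with "All" or several selected values is interpolated as a regex alternation rather than a single string. The Python sketch below is a simplified model of that substitution, assuming values are regex-escaped and joined with `|`; `expand_for_regex` is a hypothetical helper, and Grafana's own formatting rules may differ in detail.

```python
import re


def expand_for_regex(values):
    """Join the selected variable values into one regex alternation."""
    escaped = [re.escape(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


query = 'rate(http_requests_total{environment=~"$environment"}[5m])'
selected = ["production", "staging"]
print(query.replace("$environment", expand_for_regex(selected)))
```

With an equality matcher (`=`) the expanded alternation would be compared literally and match nothing, which is why the regex matcher is the safer default for variables that allow "All".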
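Alerting
The rule below uses the Prometheus alerting-rule fields (alert, expr, for, labels, annotations), written here as JSON; Prometheus rule files express the same fields in YAML. It fires when the share of 5xx responses stays above 5% for 5 minutes.
json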
{
  "alert": "HighErrorRate",
  "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05",
  "for": "5m",
  "labels": {
    "severity": "critical"
  },
  "annotations": {
    "summary": "High error rate detected"
  }
}
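The `for` field means the expression must hold continuously before the alert fires; a single dip below the threshold restarts the wait. A simplified Python sketch of that behaviour, assuming evenly spaced evaluations supplied as `(unix_seconds, error_ratio)` pairs; `first_firing_time` is a hypothetical name, and real evaluation timing depends on the rule group's interval.

```python
def first_firing_time(samples, threshold=0.05, hold_seconds=300):
    """Return the timestamp at which the alert would first fire, or None.

    samples: time-sorted (unix_seconds, error_ratio) evaluation results.
    """
    pending_since = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= hold_seconds:
                return timestamp  # held long enough: pending becomes firing
        else:
            pending_since = None  # condition cleared, start over
    return None


sustained = [(t * 60, r) for t, r in enumerate([0.01, 0.06, 0.07, 0.08, 0.06, 0.09, 0.10])]
flapping = [(t * 60, r) for t, r in enumerate([0.06, 0.06, 0.01, 0.06, 0.06, 0.06])]
print(first_firing_time(sustained))
print(first_firing_time(flapping))
```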
Dashboard Provisioning
File Structure
code
grafana/
+-- provisioning/
|   +-- dashboards/
|   |   +-- dashboards.yaml
|   +-- datasources/
|       +-- datasources.yaml
+-- dashboards/
    +-- overview.json
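Dashboard files loaded through file provisioning are generally expected to hold the dashboard model itself, without the outer `dashboard` wrapper shown in the JSON structure earlier. The Python sketch below writes such a file from either form; `write_provisioned_dashboard` and the `<uid>.json` naming are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def write_provisioned_dashboard(payload, directory):
    """Write the bare dashboard model to <directory>/<uid>.json.

    Hypothetical helper: accepts the wrapped form ({"dashboard": {...}})
    or an already-bare model, and leaves the input untouched.
    """
    model = dict(payload.get("dashboard", payload))
    model["id"] = None  # numeric ids are instance-specific; rely on the uid
    path = Path(directory) / f"{model['uid']}.json"
    # Sorted keys and a trailing newline keep version-control diffs stable.
    path.write_text(json.dumps(model, indent=2, sort_keys=True) + "\n")
    return path


wrapped = {
    "dashboard": {"id": 17, "uid": "overview", "title": "Service Overview", "panels": []}
}
with tempfile.TemporaryDirectory() as tmp:
    out = write_provisioned_dashboard(wrapped, tmp)
    print(out.name)
    print(out.read_text())
```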
Datasources Config
yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Common Dashboard Patterns
RED Method (Rate, Errors, Duration)
promql
# Request Rate
sum(rate(http_requests_total[5m]))
# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Duration (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
USE Method (Utilization, Saturation, Errors)
promql
# CPU Utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Saturation (fraction of swap in use)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes
# Network Errors
rate(node_network_receive_errs_total[5m])
Best Practices
- Use consistent colors - Red for errors, green for success
- Add descriptions - Panel descriptions explain what's shown
- Set meaningful thresholds - Color changes at important values
- Link related dashboards - Drill-down from overview to details
- Version control dashboards - Store JSON in git
- Use dashboard folders - Organize by team or service
When to Use This Skill
- Infrastructure monitoring
- Application performance monitoring
- Business metrics dashboards
- Real-time operational dashboards
- SLA/SLO tracking
- Building observability platforms