Monitoring Stack Deployer
This skill provides automated assistance for monitoring stack deployer tasks.
Overview
Deploys monitoring stacks (Prometheus/Grafana/Datadog) including collectors, scraping config, dashboards, and alerting rules for production systems.
Prerequisites
Before using this skill, ensure:
- •Target infrastructure is identified (Kubernetes, Docker, bare metal)
- •Metric endpoints are accessible from monitoring platform
- •Storage backend is configured for time-series data
- •Alert notification channels are defined (email, Slack, PagerDuty)
- •Resource requirements are calculated based on scale
Instructions
- •Select Platform: Choose Prometheus/Grafana, Datadog, or hybrid approach
- •Deploy Collectors: Install exporters and agents on monitored systems
- •Configure Scraping: Define metric collection endpoints and intervals
- •Set Up Storage: Configure retention policies and data compaction
- •Create Dashboards: Build visualization panels for key metrics
- •Define Alerts: Create alerting rules with appropriate thresholds
- •Test Monitoring: Verify metrics flow and alert triggering
Output
Prometheus + Grafana (Kubernetes):
yaml
# {baseDir}/monitoring/prometheus.yaml
## Overview
This skill provides automated assistance for the described functionality.
## Examples
Example usage patterns will be demonstrated in context.
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
replicas: 1
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
ports:
- containerPort: 9090
Grafana Dashboard Configuration:
json
{
"dashboard": {
"title": "Application Metrics",
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "rate(container_cpu_usage_seconds_total[5m])"
}
]
}
]
}
}
Error Handling
Metrics Not Appearing
- •Error: "No data points"
- •Solution: Verify scrape targets are accessible and returning metrics
High Cardinality
- •Error: "Too many time series"
- •Solution: Reduce label combinations or increase Prometheus resources
Alert Not Firing
- •Error: "Alert condition met but no notification"
- •Solution: Check Alertmanager configuration and notification channels
Dashboard Load Failure
- •Error: "Failed to load dashboard"
- •Solution: Verify Grafana datasource configuration and permissions
Examples
- •"Deploy Prometheus + Grafana on Kubernetes and add alerts for high error rate and latency."
- •"Install Datadog agents across hosts and configure a dashboard for CPU/memory saturation."
Resources
- •Prometheus documentation: https://prometheus.io/docs/
- •Grafana documentation: https://grafana.com/docs/
- •Example dashboards in {baseDir}/monitoring-examples/