Datadog Operations
Complete Datadog automation: query APIs, create infrastructure, manage incidents, and automate responses. 73% platform coverage with 17 working scripts.
What This Skill Does
Investigation & Analysis:
- Query APM traces to identify performance bottlenecks
- Search logs for error patterns and anomalies
- Detect security threats and attack attempts
- Analyze Watchdog anomaly detection alerts
- Query metrics with statistical analysis
- Analyze Datadog usage and costs (FinOps)
- Monitor LLM observability for GenAI applications
- Query SLO status and error budgets
- List services from service catalog
- Analyze database query performance
- Track frontend performance with RUM
Automation & Creation:
- Create and manage monitors with alert thresholds
- Generate dashboards for APM, security, costs, and LLM observability
- Trigger Datadog workflows for incident response
- Create and update incidents
- Mute/unmute monitors during maintenance
- Create synthetic uptime checks and browser tests
Prerequisites
Set environment variables:
export DD_API_KEY=your_api_key
export DD_APP_KEY=your_application_key
export DD_SITE=datadoghq.com # or datadoghq.eu, us3.datadoghq.com, etc.
Get keys from Datadog: Organization Settings > API Keys and Application Keys
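Before running any script, it can help to confirm that the variables above are actually set. The following is a minimal sketch, not the skill's own verify-setup.sh: `dd_check_env` and `dd_api_base` are hypothetical helper names, and the base URL assumes the usual `https://api.<site>` pattern for Datadog API hosts.

```shell
# Hypothetical preflight helpers; not part of the skill's scripts.

# Print the API base URL for the configured site (defaults to datadoghq.com).
dd_api_base() {
  printf 'https://api.%s\n' "${DD_SITE:-datadoghq.com}"
}

# Return non-zero and name each missing key on stderr.
dd_check_env() {
  local missing=0
  [ -n "${DD_API_KEY:-}" ] || { echo "DD_API_KEY is not set" >&2; missing=1; }
  [ -n "${DD_APP_KEY:-}" ] || { echo "DD_APP_KEY is not set" >&2; missing=1; }
  return "$missing"
}

# Example use (requires network access and a real key):
#   dd_check_env && curl -s -H "DD-API-KEY: $DD_API_KEY" "$(dd_api_base)/api/v1/validate"
```

Keeping the check in a function lets other scripts fail early with a clear message instead of surfacing an authentication error from the API.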
Working Scripts
1. Query APM Performance
Find slow endpoints and performance issues:
bash scripts/query-apm.sh --service my-service --duration 1h --limit 20
Returns:
- Endpoints sorted by P95 latency
- Request counts per endpoint
- P50, P95, P99 latency
- Problem endpoints (P95 > 500ms)
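Most scripts take a `--duration` value such as 30m, 1h, or 7d. How the scripts parse it internally is not documented here, so the sketch below only illustrates one plausible way to turn such a value into the from/to epoch timestamps that time-range queries typically need. `duration_to_seconds` and `time_window` are hypothetical names.

```shell
# Convert a duration such as 30m, 1h, or 7d to seconds (hypothetical helper).
duration_to_seconds() {
  local value unit
  value=${1%[smhd]}      # numeric part
  unit=${1#"$value"}     # trailing unit letter
  case "$value" in
    ''|*[!0-9]*|0?*) echo "invalid duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "invalid duration: $1" >&2; return 1 ;;
  esac
}

# Print "FROM TO" epoch seconds for a window ending now
# (or at the optional second argument).
time_window() {
  local secs to
  secs=$(duration_to_seconds "$1") || return 1
  to=${2:-$(date +%s)}
  echo "$((to - secs)) $to"
}
```

For example, `time_window 1h` prints the start and end of the last hour as two epoch values.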
2. Query Security Signals
Find security threats and attack attempts:
bash scripts/query-security-signals.sh --service my-service --duration 24h
Returns:
- Security signals by severity (critical, high, medium, low)
- Attack types (SQL injection, XSS, auth failures)
- Affected services and hosts
- Recent security events with details
3. Query Watchdog Anomalies
Automated anomaly detection from Datadog Watchdog:
bash scripts/query-watchdog.sh --service my-service --type latency --duration 7d
Returns:
- Anomalies by type (latency, error_rate, traffic)
- Affected services and resources
- Start timestamps and severity
- Baseline vs observed values
4. Search Logs
Search logs for error patterns:
bash scripts/search-logs.sh --query "status:error service:my-service" --duration 1h
Returns:
- Error messages grouped by frequency
- Associated trace IDs for investigation
- Service and host breakdowns
- Common error patterns
5. Query Metrics
Fetch metric data with statistical analysis:
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 24h
Returns:
- Time series data
- Statistics (min, max, avg, p50, p95, p99)
- Trend analysis (increasing, decreasing, stable)
- Anomaly detection (values > 2 std dev)
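The "values > 2 std dev" rule can be illustrated with a short awk filter. This is a sketch under assumptions, not the logic of query-metrics.sh itself: it reads one numeric value per line, uses the population standard deviation, and treats a value as anomalous when it lies more than two standard deviations from the mean in either direction. `flag_outliers` is a hypothetical name.

```shell
# Print every input value that is more than 2 standard deviations
# away from the mean of all input values (one number per line on stdin).
flag_outliers() {
  awk '
    { v[NR] = $1; sum += $1; sumsq += $1 * $1 }
    END {
      if (NR == 0) exit
      mean = sum / NR
      var = sumsq / NR - mean * mean          # population variance
      sd = (var > 0) ? sqrt(var) : 0
      for (i = 1; i <= NR; i++)
        if (sd > 0 && (v[i] - mean > 2 * sd || mean - v[i] > 2 * sd))
          print v[i]
    }'
}
```

With nine samples of 10 and one of 100, the mean is 19 and the standard deviation is 27, so only 100 is reported.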
6. Analyze Usage and Costs
FinOps cost analysis and optimization:
bash scripts/analyze-usage-cost.sh --duration 30d --product all
Returns:
- APM span ingestion (indexed vs ingested)
- Log volume breakdown
- Infrastructure hosts and container hours
- Custom metrics count
- Estimated monthly costs by product
- Cost optimization recommendations
7. Analyze LLM Performance
For GenAI applications, analyze LLM observability data:
bash scripts/analyze-llm.sh --service my-llm-app --duration 24h
Returns:
- Token usage statistics (prompt + completion)
- Cost estimates based on model pricing
- Model latency (P50, P95, P99)
- Error rates by model
- Most expensive operations
- Token usage trends
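A cost estimate based on model pricing usually reduces to simple arithmetic: tokens divided by 1,000, multiplied by the per-1,000-token price, summed over prompt and completion tokens. The sketch below shows that calculation only. The `estimate_cost` name is hypothetical, the prices in the example are placeholders rather than real model prices, and the pricing table analyze-llm.sh actually uses is not shown in this document.

```shell
# estimate_cost PROMPT_TOKENS COMPLETION_TOKENS PRICE_IN PRICE_OUT
# Prices are USD per 1,000 tokens; prints the estimated cost in USD.
estimate_cost() {
  awk -v p="$1" -v c="$2" -v pin="$3" -v pout="$4" \
    'BEGIN { printf "%.4f\n", p / 1000 * pin + c / 1000 * pout }'
}

# Example with placeholder prices (0.01 per 1K prompt tokens, 0.03 per 1K completion tokens):
#   estimate_cost 120000 30000 0.01 0.03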
8. Manage Monitors
Create, list, mute, and manage Datadog monitors:
# List all monitors
bash scripts/manage-monitors.sh list
# Create error rate monitor
bash scripts/manage-monitors.sh create \
--name "High Error Rate" \
--query "avg(last_5m):sum:trace.express.request.errors{service:my-service}.as_count() > 10" \
--message "Error rate is high @slack-alerts"
# Mute monitor for 2 hours
bash scripts/manage-monitors.sh mute --id 12345 --duration 2
# Unmute monitor
bash scripts/manage-monitors.sh unmute --id 12345
Returns:
- Monitor list with states (alert, warn, OK)
- Created monitor ID and details
- Mute/unmute confirmations
9. Create Dashboards
Generate dashboards from templates:
# Create APM performance dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm
# Create security monitoring dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Security Dashboard" --type security
# Create cost analysis dashboard
bash scripts/create-dashboard.sh --title "Infrastructure Costs" --type cost
# Create LLM observability dashboard
bash scripts/create-dashboard.sh --service my-genai-app --title "LLM Performance" --type llm
Dashboard types:
- apm: Latency, errors, throughput by endpoint
- logs: Log volume and error analysis
- security: Security threats and attack patterns
- cost: APM, logs, infrastructure costs
- llm: Token usage, costs, model performance
10. Query SLOs
Check Service Level Objectives and error budgets:
# List all SLOs
bash scripts/query-slos.sh
# List SLOs for service
bash scripts/query-slos.sh --service payment-api
# List SLOs with tag
bash scripts/query-slos.sh --tag team:backend
Returns:
- SLO status (breaching, warning, OK)
- Current value vs target threshold
- Error budget remaining
- Error budget status (exhausted, low, healthy)
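The relationship between the current value, the target, and the remaining error budget can be worked out by hand. The sketch below assumes both values are percentages and computes the remaining budget as (current - target) / (100 - target) * 100. The 20% cutoff for "low" mirrors the at-risk filter used in the weekly SLO report later in this document, but the thresholds query-slos.sh actually applies are an assumption, and both function names are hypothetical.

```shell
# error_budget_remaining TARGET_PCT CURRENT_PCT
# Prints the share of the error budget still unspent, as a percentage.
error_budget_remaining() {
  awk -v t="$1" -v c="$2" 'BEGIN {
    if (t >= 100) { print "target must be below 100" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (c - t) / (100 - t) * 100
  }'
}

# budget_status REMAINING_PCT
# Classifies a remaining-budget percentage (thresholds are assumptions).
budget_status() {
  awk -v r="$1" 'BEGIN {
    if (r <= 0) print "exhausted"
    else if (r < 20) print "low"
    else print "healthy"
  }'
}
```

For a 99.9% target and a current value of 99.95%, half of the allowed 0.1% of failures has been used, so `error_budget_remaining 99.9 99.95` prints 50.0 and `budget_status 50.0` prints healthy.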
11. Trigger Workflows
Execute Datadog workflow automation:
# List available workflows
bash scripts/trigger-workflow.sh list
# Trigger workflow
bash scripts/trigger-workflow.sh run --id abc123
# Trigger with input data
bash scripts/trigger-workflow.sh run --id abc123 --input '{"service": "payment-api", "severity": "high"}'
Returns:
- Workflow list with IDs and descriptions
- Workflow instance ID when triggered
- Execution status
12. Manage Incidents
Create and manage incident response:
# List active incidents
bash scripts/manage-incidents.sh list --status active
# Create critical incident
bash scripts/manage-incidents.sh create \
--title "Payment API Down" \
--service payment-api \
--severity SEV-1
# Update incident status
bash scripts/manage-incidents.sh update --id abc123 --status resolved
# Get incident details
bash scripts/manage-incidents.sh get --id abc123
Returns:
- Incident list with status and severity
- Created incident ID and details
- Incident timeline and updates
13. Query Service Catalog
List services and ownership metadata:
# List all services
bash scripts/query-service-catalog.sh list
# List services for team
bash scripts/query-service-catalog.sh list --team backend
# Get service details
bash scripts/query-service-catalog.sh get --service payment-api
Returns:
- Service metadata (kind, tier, lifecycle)
- Team ownership and contacts
- Repository links
- Integration details
14. Manage Synthetic Tests
Create uptime checks and API tests:
# List all synthetic tests
bash scripts/manage-synthetics.sh list
# Create API uptime check
bash scripts/manage-synthetics.sh create-api \
--name "Payment API Uptime" \
--url "https://api.example.com/health" \
--method GET
# Create browser test
bash scripts/manage-synthetics.sh create-browser \
--name "Login Flow" \
--url "https://app.example.com/login"
# Get test results
bash scripts/manage-synthetics.sh get --id abc-123-def
Returns:
- Test list with status (active, paused)
- Created test ID and configuration
- Test results and uptime status
15. Query Database Performance
Analyze database queries and performance:
# Query database performance
bash scripts/query-database.sh --host postgres-prod --duration 1h
# Get slow queries
bash scripts/query-database.sh --host mysql-01 --duration 24h
Returns:
- Slow query patterns
- P95/avg query duration
- Connection metrics
- Top queries by latency
16. Query RUM (Real User Monitoring)
Analyze frontend performance and user experience:
# Query RUM data for application
bash scripts/query-rum.sh --application abc-123-def --duration 1h
# Get page load performance
bash scripts/query-rum.sh --application abc-123-def --duration 24h
Returns:
- Page load times (avg, P95)
- Frontend errors
- Top pages by traffic
- Error rate and types
17. Verify Setup
Validate Datadog configuration:
bash scripts/verify-setup.sh
Returns:
- Environment variable validation
- Agent connectivity check
- Tracer installation detection
Incident Investigation Workflow
When investigating production issues:
1. Identify scope
# Check for security threats
bash scripts/query-security-signals.sh --severity critical --duration 1h
# Check for anomalies
bash scripts/query-watchdog.sh --service affected-service --duration 24h
2. Find performance issues
# Find slow endpoints
bash scripts/query-apm.sh --service affected-service --duration 1h
# Check error patterns
bash scripts/search-logs.sh --service affected-service --status error --duration 1h
3. Analyze metrics
# Check latency trends
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service affected-service --duration 24h
# Check error rate trends
bash scripts/query-metrics.sh --metric "trace.express.request.errors" --service affected-service --duration 24h
4. Get specific traces
# Get error traces
bash scripts/query-apm.sh --service affected-service --status error --limit 10
# Search logs for trace context
bash scripts/search-logs.sh --query "trace_id:abc123def456"
Security Analysis Workflow
Monitor and investigate security threats:
# Check critical security signals
bash scripts/query-security-signals.sh --severity critical --duration 7d
# Analyze specific service
bash scripts/query-security-signals.sh --service payment-api --duration 24h
# Search for attack patterns in logs
bash scripts/search-logs.sh --query "sql injection OR xss OR authentication failed" --duration 24h
Cost Optimization Workflow
Analyze and optimize Datadog costs:
# Get full cost breakdown
bash scripts/analyze-usage-cost.sh --duration 30d --product all
# Focus on APM costs
bash scripts/analyze-usage-cost.sh --duration 30d --product apm
# Extract high-priority recommendations
bash scripts/analyze-usage-cost.sh --duration 30d --product all | jq '.recommendations[] | select(.priority == "high")'
# Track weekly trends
bash scripts/analyze-usage-cost.sh --duration 7d --product all | jq '.cost_summary'
LLM Observability Workflow
For GenAI applications, monitor token usage and costs:
# Analyze LLM performance
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h
# Filter by specific model
bash scripts/analyze-llm.sh --service my-genai-app --model gpt-4 --duration 7d
# Find most expensive operations
bash scripts/analyze-llm.sh --service my-genai-app --duration 30d | jq '.operations | sort_by(.total_cost_usd) | reverse | .[0:5]'
# Track token usage trends
bash scripts/analyze-llm.sh --service my-genai-app --duration 7d | jq '.summary.token_usage'
Deployment Impact Analysis
Compare metrics before/after deployment:
# Before deployment
bash scripts/query-apm.sh --service my-service --duration 1h > before.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h > before_metrics.json
# Deploy...
# After deployment
bash scripts/query-apm.sh --service my-service --duration 1h > after.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h > after_metrics.json
# Compare latency (before minus after; a negative result means P95 latency increased)
jq -s '.[0].summary.avg_p95_ms - .[1].summary.avg_p95_ms' before.json after.json
# Check for new errors
bash scripts/search-logs.sh --service my-service --status error --duration 30m
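An absolute difference in milliseconds can be hard to judge without knowing the baseline, so a relative change is often easier to read. The helper below is a sketch with a hypothetical name, `pct_change`. It takes the before and after values as plain numbers, for example the `.summary.avg_p95_ms` fields extracted from each file with jq.

```shell
# pct_change BEFORE AFTER
# Prints the signed percentage change from BEFORE to AFTER.
pct_change() {
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b == 0) { print "n/a"; exit }   # no baseline to compare against
    printf "%+.1f\n", (a - b) / b * 100
  }'
}

# Example use with the files written above:
#   pct_change "$(jq '.summary.avg_p95_ms' before.json)" "$(jq '.summary.avg_p95_ms' after.json)"
```

A result of +15.0 means P95 latency rose by 15% after the deployment; a negative value means it improved.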
Monitor Creation Workflow
Set up monitoring for new services:
# Create latency monitor
bash scripts/manage-monitors.sh create \
--name "Payment API - High Latency" \
--query "avg(last_5m):avg:trace.express.request.duration{service:payment-api} > 500" \
--message "Average latency above threshold @slack-ops"
# Create error rate monitor
bash scripts/manage-monitors.sh create \
--name "Payment API - Error Rate" \
--query "avg(last_5m):sum:trace.express.request.errors{service:payment-api}.as_count() / sum:trace.express.request.hits{service:payment-api}.as_count() > 0.05" \
--message "Error rate above 5% @pagerduty"
# Create APM dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm
# Create security dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Security" --type security
Incident Response Workflow
Automated incident management:
# Check for SLO breaches
bash scripts/query-slos.sh --service payment-api | jq '.slos[] | select(.status == "breaching")'
# Create incident if SLO breached
bash scripts/manage-incidents.sh create \
--title "Payment API SLO Breach" \
--service payment-api \
--severity SEV-2
# Trigger remediation workflow
bash scripts/trigger-workflow.sh run --id remediation-workflow-123 --input '{"service": "payment-api"}'
# Mute non-critical monitors during incident
bash scripts/manage-monitors.sh list --service payment-api | \
jq '.monitors[] | select(.name | contains("non-critical")) | .id' | \
xargs -I {} bash scripts/manage-monitors.sh mute --id {} --duration 2
# Update incident when resolved
bash scripts/manage-incidents.sh update --id abc123 --status resolved
SLO Monitoring Workflow
Track service level objectives:
# Check all SLOs
bash scripts/query-slos.sh
# Alert if error budget exhausted
EXHAUSTED=$(bash scripts/query-slos.sh | jq -r '.summary.budget_exhausted // 0')
if [ "$EXHAUSTED" -gt 0 ]; then
bash scripts/manage-incidents.sh create \
--title "Error Budget Exhausted" \
--service affected-service \
--severity SEV-3
fi
# Weekly SLO report
bash scripts/query-slos.sh | jq '{
total: .total_slos,
breaching: .summary.breaching,
low_budget: .summary.budget_low,
at_risk: [.slos[] | select(.error_budget_remaining < 20) | {name, budget: .error_budget_remaining}]
}'
Output Format
All scripts return structured JSON for programmatic parsing:
{
"status": "ok|warning|critical|error",
"summary": {
"...": "aggregated metrics"
},
"data": [...],
"recommendations": [...]
}
Status messages go to stderr, JSON to stdout. This allows:
# Silent execution, capture JSON
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null | jq '.summary'
# Log messages only
bash scripts/query-apm.sh --service my-service --duration 1h >/dev/null
# Both
bash scripts/query-apm.sh --service my-service --duration 1h
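The stdout/stderr split is a convention any wrapper script can follow. The sketch below shows the pattern in isolation. `log` and `run_report` are hypothetical names, and the JSON it prints is a stand-in, not the real output of any script in this skill.

```shell
# Send human-readable progress messages to stderr.
log() { echo "[info] $*" >&2; }

# Emit exactly one JSON document on stdout; everything else goes to stderr.
run_report() {
  log "querying..."
  printf '{"status":"ok","summary":{"count":%d}}\n' "$1"
  log "done"
}

# Example use:
#   run_report 3 2>/dev/null    # JSON only
#   run_report 3 >/dev/null     # log messages only
```

Because only JSON ever reaches stdout, the output can be piped straight into jq without filtering out progress messages first.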
Best Practices
Query Optimization
- Use specific time ranges to reduce API calls
- Filter by service/environment early
- Paginate large result sets
- Cache results when appropriate
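Caching can be added around any of the scripts without modifying them. The sketch below is one possible approach, not something the skill ships: `cached_run` is a hypothetical wrapper that keys the cache on the full command line and reuses the stored output while the file is younger than a TTL given in minutes. The `DD_CACHE_DIR` variable is also an assumption.

```shell
# cached_run TTL_MINUTES COMMAND [ARGS...]
# Prints the command's stdout, reusing a cached copy if it is fresh enough.
cached_run() {
  local ttl=$1 key file
  shift
  key=$(printf '%s\n' "$*" | cksum | cut -d' ' -f1)
  file="${DD_CACHE_DIR:-${TMPDIR:-/tmp}/dd-cache}/$key.json"
  mkdir -p "$(dirname "$file")"
  # Reuse the cached output if the file was modified within the TTL.
  if [ -n "$(find "$file" -mmin "-$ttl" 2>/dev/null)" ]; then
    cat "$file"
    return 0
  fi
  # Write to a temp file first so a failed command never replaces good data.
  if "$@" > "$file.tmp"; then
    mv "$file.tmp" "$file"
    cat "$file"
  else
    rm -f "$file.tmp"
    return 1
  fi
}

# Example use:
#   cached_run 10 bash scripts/query-slos.sh --service payment-api
```

A short TTL suits dashboards that refresh often; during an active incident it is usually safer to bypass the cache and query live data.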
Alert Investigation
- Start with Watchdog anomalies (automated detection)
- Correlate security signals with application errors
- Check metrics for trend confirmation
- Search logs for detailed context
Cost Control
- Run analyze-usage-cost.sh monthly
- Implement high-priority recommendations first
- Monitor sampling rates for high-volume services
- Track custom metric growth
Security Monitoring
- Query security signals daily (automated check)
- Filter by critical severity for alerting
- Correlate with log patterns
- Track attack trends over time
Limitations
- API rate limits apply (varies by endpoint)
- Historical data retention depends on Datadog plan
- Recently ingested data may not appear in query results immediately (eventual consistency)
- Requires live Datadog data (APM, logs, security monitoring)
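Rate limiting is usually handled by retrying with an increasing delay. The sketch below assumes that a script exits non-zero when a request is rejected (for example with HTTP 429), which this document does not confirm. `retry` is a hypothetical wrapper, and `RETRY_BASE_DELAY` is an assumed variable giving the first delay in seconds.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs the command until it succeeds, doubling the delay after each failure.
retry() {
  local max=$1 attempt=1 delay=${RETRY_BASE_DELAY:-1}
  shift
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example use:
#   retry 4 bash scripts/query-apm.sh --service my-service --duration 1h
```

With the default base delay, four attempts wait 1, 2, and 4 seconds between tries. Retry messages go to stderr, so the JSON on stdout stays clean when the final attempt succeeds.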
Resources
- Datadog API Documentation
- APM Query Syntax
- Log Query Syntax
- Watchdog Alerts
- Security Monitoring
- Usage Metering API
Notes
This skill provides comprehensive Datadog automation: it queries live data to investigate issues and creates infrastructure (monitors, dashboards, incidents) for ongoing operations. It does not handle installation or initial setup; use the Datadog documentation for that.
Investigation: Query APM, logs, metrics, security signals, SLOs, costs, and LLM usage to debug production issues.
Automation: Create monitors, generate dashboards, trigger workflows, manage incidents, and mute alerts during maintenance.
All scripts return structured JSON for integration with CI/CD pipelines, ChatOps workflows, and automation platforms.