Datadog Operations

Complete Datadog automation: query APIs, create infrastructure, manage incidents, and automate responses. 73% platform coverage with 17 working scripts.

What This Skill Does

Investigation & Analysis:

•Query APM traces to identify performance bottlenecks
•Search logs for error patterns and anomalies
•Detect security threats and attack attempts
•Analyze Watchdog anomaly detection alerts
•Query metrics with statistical analysis
•Analyze Datadog usage and costs (FinOps)
•Monitor LLM observability for GenAI applications
•Query SLO status and error budgets
•List services from service catalog
•Analyze database query performance
•Track frontend performance with RUM

Automation & Creation:

•Create and manage monitors with alert thresholds
•Generate dashboards for APM, security, costs, and LLM observability
•Trigger Datadog workflows for incident response
•Create and update incidents
•Mute/unmute monitors during maintenance
•Create synthetic uptime checks and browser tests

Prerequisites

Set environment variables:

bash

export DD_API_KEY=your_api_key
export DD_APP_KEY=your_application_key
export DD_SITE=datadoghq.com  # or datadoghq.eu, us3.datadoghq.com, etc.

Get keys from Datadog: Organization Settings > API Keys and Application Keys

Working Scripts

1. Query APM Performance

Find slow endpoints and performance issues:

bash

bash scripts/query-apm.sh --service my-service --duration 1h --limit 20

Returns:

•Endpoints sorted by P95 latency
•Request counts per endpoint
•P50, P95, P99 latency
•Problem endpoints (P95 > 500ms)

2. Query Security Signals

Find security threats and attack attempts:

bash

bash scripts/query-security-signals.sh --service my-service --duration 24h

Returns:

•Security signals by severity (critical, high, medium, low)
•Attack types (SQL injection, XSS, auth failures)
•Affected services and hosts
•Recent security events with details

3. Query Watchdog Anomalies

Automated anomaly detection from Datadog Watchdog:

bash

bash scripts/query-watchdog.sh --service my-service --type latency --duration 7d

Returns:

•Anomalies by type (latency, error_rate, traffic)
•Affected services and resources
•Start timestamps and severity
•Baseline vs observed values

4. Search Logs

Search logs for error patterns:

bash

bash scripts/search-logs.sh --query "status:error service:my-service" --duration 1h

Returns:

•Error messages grouped by frequency
•Associated trace IDs for investigation
•Service and host breakdowns
•Common error patterns

5. Query Metrics

Fetch metric data with statistical analysis:

bash

bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 24h

Returns:

•Time series data
•Statistics (min, max, avg, p50, p95, p99)
•Trend analysis (increasing, decreasing, stable)
•Anomaly detection (values > 2 std dev)

6. Analyze Usage and Costs

FinOps cost analysis and optimization:

bash

bash scripts/analyze-usage-cost.sh --duration 30d --product all

Returns:

•APM span ingestion (indexed vs ingested)
•Log volume breakdown
•Infrastructure hosts and container hours
•Custom metrics count
•Estimated monthly costs by product
•Cost optimization recommendations

7. Analyze LLM Performance

For GenAI applications, analyze LLM observability data:

bash

bash scripts/analyze-llm.sh --service my-llm-app --duration 24h

Returns:

•Token usage statistics (prompt + completion)
•Cost estimates based on model pricing
•Model latency (P50, P95, P99)
•Error rates by model
•Most expensive operations
•Token usage trends

8. Manage Monitors

Create, list, mute, and manage Datadog monitors:

bash

# List all monitors
bash scripts/manage-monitors.sh list

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "High Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:my-service}.as_count() > 10" \
  --message "Error rate is high @slack-alerts"

# Mute monitor for 2 hours
bash scripts/manage-monitors.sh mute --id 12345 --duration 2

# Unmute monitor
bash scripts/manage-monitors.sh unmute --id 12345

Returns:

•Monitor list with states (alert, warn, OK)
•Created monitor ID and details
•Mute/unmute confirmations

9. Create Dashboards

Generate dashboards from templates:

bash

# Create APM performance dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security monitoring dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Security Dashboard" --type security

# Create cost analysis dashboard
bash scripts/create-dashboard.sh --title "Infrastructure Costs" --type cost

# Create LLM observability dashboard
bash scripts/create-dashboard.sh --service my-genai-app --title "LLM Performance" --type llm

Dashboard types:

•apm: Latency, errors, throughput by endpoint
•logs: Log volume and error analysis
•security: Security threats and attack patterns
•cost: APM, logs, infrastructure costs
•llm: Token usage, costs, model performance

10. Query SLOs

Check Service Level Objectives and error budgets:

bash

# List all SLOs
bash scripts/query-slos.sh

# List SLOs for service
bash scripts/query-slos.sh --service payment-api

# List SLOs with tag
bash scripts/query-slos.sh --tag team:backend

Returns:

•SLO status (breaching, warning, OK)
•Current value vs target threshold
•Error budget remaining
•Error budget status (exhausted, low, healthy)

11. Trigger Workflows

Execute Datadog workflow automation:

bash

# List available workflows
bash scripts/trigger-workflow.sh list

# Trigger workflow
bash scripts/trigger-workflow.sh run --id abc123

# Trigger with input data
bash scripts/trigger-workflow.sh run --id abc123 --input '{"service": "payment-api", "severity": "high"}'

Returns:

•Workflow list with IDs and descriptions
•Workflow instance ID when triggered
•Execution status

12. Manage Incidents

Create and manage incident response:

bash

# List active incidents
bash scripts/manage-incidents.sh list --status active

# Create critical incident
bash scripts/manage-incidents.sh create \
  --title "Payment API Down" \
  --service payment-api \
  --severity SEV-1

# Update incident status
bash scripts/manage-incidents.sh update --id abc123 --status resolved

# Get incident details
bash scripts/manage-incidents.sh get --id abc123

Returns:

•Incident list with status and severity
•Created incident ID and details
•Incident timeline and updates

13. Query Service Catalog

List services and ownership metadata:

bash

# List all services
bash scripts/query-service-catalog.sh list

# List services for team
bash scripts/query-service-catalog.sh list --team backend

# Get service details
bash scripts/query-service-catalog.sh get --service payment-api

Returns:

•Service metadata (kind, tier, lifecycle)
•Team ownership and contacts
•Repository links
•Integration details

14. Manage Synthetic Tests

Create uptime checks and API tests:

bash

# List all synthetic tests
bash scripts/manage-synthetics.sh list

# Create API uptime check
bash scripts/manage-synthetics.sh create-api \
  --name "Payment API Uptime" \
  --url "https://api.example.com/health" \
  --method GET

# Create browser test
bash scripts/manage-synthetics.sh create-browser \
  --name "Login Flow" \
  --url "https://app.example.com/login"

# Get test results
bash scripts/manage-synthetics.sh get --id abc-123-def

Returns:

•Test list with status (active, paused)
•Created test ID and configuration
•Test results and uptime status

15. Query Database Performance

Analyze database queries and performance:

bash

# Query database performance
bash scripts/query-database.sh --host postgres-prod --duration 1h

# Get slow queries
bash scripts/query-database.sh --host mysql-01 --duration 24h

Returns:

•Slow query patterns
•P95/avg query duration
•Connection metrics
•Top queries by latency

16. Query RUM (Real User Monitoring)

Analyze frontend performance and user experience:

bash

# Query RUM data for application
bash scripts/query-rum.sh --application abc-123-def --duration 1h

# Get page load performance
bash scripts/query-rum.sh --application abc-123-def --duration 24h

Returns:

•Page load times (avg, P95)
•Frontend errors
•Top pages by traffic
•Error rate and types

17. Verify Setup

Validate Datadog configuration:

bash

bash scripts/verify-setup.sh

Returns:

•Environment variable validation
•Agent connectivity check
•Tracer installation detection

Incident Investigation Workflow

When investigating production issues:

1. Identify scope

bash

# Check for security threats
bash scripts/query-security-signals.sh --severity critical --duration 1h

# Check for anomalies
bash scripts/query-watchdog.sh --service affected-service --duration 24h

2. Find performance issues

bash

# Find slow endpoints
bash scripts/query-apm.sh --service affected-service --duration 1h

# Check error patterns
bash scripts/search-logs.sh --service affected-service --status error --duration 1h

3. Analyze metrics

bash

# Check latency trends
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service affected-service --duration 24h

# Check error rate trends
bash scripts/query-metrics.sh --metric "trace.express.request.errors" --service affected-service --duration 24h

4. Get specific traces

bash

# Get error traces
bash scripts/query-apm.sh --service affected-service --status error --limit 10

# Search logs for trace context
bash scripts/search-logs.sh --query "trace_id:abc123def456"

Security Analysis Workflow

Monitor and investigate security threats:

bash

# Check critical security signals
bash scripts/query-security-signals.sh --severity critical --duration 7d

# Analyze specific service
bash scripts/query-security-signals.sh --service payment-api --duration 24h

# Search for attack patterns in logs
bash scripts/search-logs.sh --query "sql injection OR xss OR authentication failed" --duration 24h

Cost Optimization Workflow

Analyze and optimize Datadog costs:

bash

# Get full cost breakdown
bash scripts/analyze-usage-cost.sh --duration 30d --product all

# Focus on APM costs
bash scripts/analyze-usage-cost.sh --duration 30d --product apm

# Extract high-priority recommendations
bash scripts/analyze-usage-cost.sh --duration 30d --product all | jq '.recommendations[] | select(.priority == "high")'

# Track weekly trends
bash scripts/analyze-usage-cost.sh --duration 7d --product all | jq '.cost_summary'

LLM Observability Workflow

For GenAI applications, monitor token usage and costs:

bash

# Analyze LLM performance
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h

# Filter by specific model
bash scripts/analyze-llm.sh --service my-genai-app --model gpt-4 --duration 7d

# Find most expensive operations
bash scripts/analyze-llm.sh --service my-genai-app --duration 30d | jq '.operations | sort_by(.total_cost_usd) | reverse | .[0:5]'

# Track token usage trends
bash scripts/analyze-llm.sh --service my-genai-app --duration 7d | jq '.summary.token_usage'

Deployment Impact Analysis

Compare metrics before/after deployment:

bash

# Before deployment
bash scripts/query-apm.sh --service my-service --duration 1h > before.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h >> before_metrics.json

# Deploy...

# After deployment
bash scripts/query-apm.sh --service my-service --duration 1h > after.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h >> after_metrics.json

# Compare latency
jq -s '.[0].summary.avg_p95_ms - .[1].summary.avg_p95_ms' before.json after.json

# Check for new errors
bash scripts/search-logs.sh --service my-service --status error --duration 30m

Monitor Creation Workflow

Set up monitoring for new services:

bash

# Create latency monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - High Latency" \
  --query "avg(last_5m):avg:trace.express.request.duration{service:payment-api} > 500" \
  --message "P95 latency above 500ms @slack-ops"

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:payment-api}.as_count() / sum:trace.express.request.hits{service:payment-api}.as_count() > 0.05" \
  --message "Error rate above 5% @pagerduty"

# Create APM dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Security" --type security

Incident Response Workflow

Automated incident management:

bash

# Check for SLO breaches
bash scripts/query-slos.sh --service payment-api | jq '.slos[] | select(.status == "breaching")'

# Create incident if SLO breached
bash scripts/manage-incidents.sh create \
  --title "Payment API SLO Breach" \
  --service payment-api \
  --severity SEV-2

# Trigger remediation workflow
bash scripts/trigger-workflow.sh run --id remediation-workflow-123 --input '{"service": "payment-api"}'

# Mute non-critical monitors during incident
bash scripts/manage-monitors.sh list --service payment-api | \
  jq '.monitors[] | select(.name | contains("non-critical")) | .id' | \
  xargs -I {} bash scripts/manage-monitors.sh mute --id {} --duration 2

# Update incident when resolved
bash scripts/manage-incidents.sh update --id abc123 --status resolved

SLO Monitoring Workflow

Track service level objectives:

bash

# Check all SLOs
bash scripts/query-slos.sh

# Alert if error budget exhausted
EXHAUSTED=$(bash scripts/query-slos.sh | jq '.summary.budget_exhausted')
if [ "$EXHAUSTED" -gt 0 ]; then
  bash scripts/manage-incidents.sh create \
    --title "Error Budget Exhausted" \
    --service affected-service \
    --severity SEV-3
fi

# Weekly SLO report
bash scripts/query-slos.sh | jq '{
  total: .total_slos,
  breaching: .summary.breaching,
  low_budget: .summary.budget_low,
  at_risk: [.slos[] | select(.error_budget_remaining < 20) | {name, budget: .error_budget_remaining}]
}'

Output Format

All scripts return structured JSON for programmatic parsing:

json

{
  "status": "ok|warning|critical|error",
  "summary": {
    "...": "aggregated metrics"
  },
  "data": [...],
  "recommendations": [...]
}

Status messages go to stderr, JSON to stdout. This allows:

bash

# Silent execution, capture JSON
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null | jq '.summary'

# Log messages only
bash scripts/query-apm.sh --service my-service --duration 1h >/dev/null

# Both
bash scripts/query-apm.sh --service my-service --duration 1h

Best Practices

Query Optimization

•Use specific time ranges to reduce API calls
•Filter by service/environment early
•Paginate large result sets
•Cache results when appropriate

Alert Investigation

•Start with Watchdog anomalies (automated detection)
•Correlate security signals with application errors
•Check metrics for trend confirmation
•Search logs for detailed context

Cost Control

•Run analyze-usage-cost.sh monthly
•Implement high-priority recommendations first
•Monitor sampling rates for high-volume services
•Track custom metric growth

Security Monitoring

•Query security signals daily (automated check)
•Filter by critical severity for alerting
•Correlate with log patterns
•Track attack trends over time

Limitations

•API rate limits apply (varies by endpoint)
•Historical data retention depends on Datadog plan
•Real-time queries have eventual consistency
•Requires live Datadog data (APM, logs, security monitoring)

Resources

Notes

This skill provides comprehensive Datadog automation: query live data to investigate issues AND create infrastructure (monitors, dashboards, incidents) for ongoing operations. It does not handle installation or initial setup - use Datadog documentation for that.

Investigation: Query APM, logs, metrics, security signals, SLOs, costs, and LLM usage to debug production issues.

Automation: Create monitors, generate dashboards, trigger workflows, manage incidents, and mute alerts during maintenance.

All scripts return structured JSON for integration with CI/CD pipelines, ChatOps workflows, and automation platforms.