Datadog Observability
Overview
Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.
Core Capabilities:
- APM: Distributed tracing with automatic instrumentation for 8+ languages
- Infrastructure: Host, container, and cloud service monitoring
- Logs: Centralized collection with processing pipelines and configurable retention (up to 15 months, depending on storage tier)
- Metrics: Custom metrics via DogStatsD with cardinality management
- Synthetics: Proactive API and browser testing from 29+ global locations
- RUM: Frontend performance with Core Web Vitals and session replay
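To make the DogStatsD item concrete, here is a minimal sketch of the plain-text datagram a client sends to the Agent. It assumes the Agent listens for DogStatsD on UDP port 8125 on the local host (the default); the helper names `format_dogstatsd` and `send_metric` and the metric `checkout.orders` are made up for illustration. In practice the official client library does this for you.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None, sample_rate=None):
    """Build one DogStatsD datagram: name:value|type[|@rate][|#tag1,tag2]."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; assumes an Agent is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# "c" is the count type; the tags follow the env/service convention used in this skill.
datagram = format_dogstatsd(
    "checkout.orders", 1, "c",
    tags=["env:production", "service:api-gateway"],
)
print(datagram)  # checkout.orders:1|c|#env:production,service:api-gateway
```

Calling `send_metric(datagram)` would submit the count. Because the transport is UDP, the call does not confirm that an Agent received it.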
When to Use This Skill
Activate when:
- Setting up production monitoring and observability
- Implementing distributed tracing across microservices
- Configuring log aggregation and analysis pipelines
- Creating custom metrics and dashboards
- Setting up alerting and anomaly detection
- Optimizing Datadog costs
Do not use when:
- You are building on an open-source stack (use Prometheus/Grafana instead)
- Cost is the primary concern and the budget is limited
- You need maximum customization rather than a managed solution
Quick Start
1. Install Datadog Agent
Docker (simplest):
bash
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
Kubernetes (Helm):
bash
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true
2. Instrument Your Application
Python:
python
from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries
patch_all()

user_id = "12345"  # placeholder; use the real value from your request context

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here
Node.js:
javascript
// Must be the first import so dd-trace can instrument modules loaded after it
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});
3. Verify in Datadog UI
- Go to Infrastructure > Host Map to confirm the Agent is reporting
- Go to APM > Services to see traced services
- Go to Logs > Search to verify log collection
Core Concepts
Tagging Strategy
Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.
Required Tags:
| Tag | Purpose | Example |
|---|---|---|
| env | Environment | env:production |
| service | Service name | service:api-gateway |
| version | Deployment version | version:1.2.3 |
| team | Owning team | team:platform |
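One way to keep the required tags consistent is to check for them before telemetry is emitted, for example in a CI step or a shared helper. The sketch below is illustrative only: `REQUIRED_TAG_KEYS` mirrors the table above, and `missing_required_tags` is a hypothetical helper, not part of any Datadog library.

```python
REQUIRED_TAG_KEYS = ("env", "service", "version", "team")


def missing_required_tags(tags):
    """Return the required tag keys absent from a list of 'key:value' tags."""
    present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
    return [key for key in REQUIRED_TAG_KEYS if key not in present]


tags = ["env:production", "service:api-gateway"]
print(missing_required_tags(tags))  # ['version', 'team']
```

Failing a build or logging a warning when the returned list is non-empty is a cheap way to catch untagged services early.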
Avoid High-Cardinality Tags:
- User IDs, request IDs, timestamps
- Pod IDs in Kubernetes
- Build numbers, commit hashes
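The reason these tags are costly is multiplicative: each distinct combination of tag values on a metric is a separate time series. A rough upper bound, assuming every combination actually occurs (real counts are usually lower), is the product of the distinct values per tag key. The numbers below are invented for illustration.

```python
from math import prod


def estimate_series(tag_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per tag key."""
    return prod(tag_cardinalities.values())


low = estimate_series({"env": 3, "service": 20, "version": 5})
high = estimate_series({"env": 3, "service": 20, "version": 5, "user_id": 100_000})

print(low)   # 300
print(high)  # 30000000
```

Adding a single `user_id` tag turns a few hundred series into tens of millions, which is why such identifiers belong on spans or logs rather than on metrics.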
Unified Observability
Datadog correlates metrics, traces, and logs when they share the same env, service, and version tags:
- Traces carry span tags that link to the corresponding metrics
- Trace IDs are injected into logs so each log line can be tied to its trace
- Dashboards combine all data sources
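The log-to-trace link works by writing the active trace ID into each log record. The sketch below shows the idea with the standard `logging` module only. It assumes JSON logs using the `dd.trace_id`, `dd.span_id`, `dd.service`, and `dd.env` attribute names; `fake_context` is a stand-in for the tracer's active span, and the class names are hypothetical. The tracing libraries can inject these fields for you, so treat this as an illustration of the mechanism rather than the recommended setup.

```python
import io
import json
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace context to each record so logs can be joined to traces."""

    def __init__(self, get_context):
        super().__init__()
        self._get_context = get_context  # callable returning (trace_id, span_id)

    def filter(self, record):
        trace_id, span_id = self._get_context()
        record.dd_trace_id = str(trace_id)
        record.dd_span_id = str(span_id)
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the correlation attributes."""

    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            "dd.trace_id": getattr(record, "dd_trace_id", "0"),
            "dd.span_id": getattr(record, "dd_span_id", "0"),
            "dd.service": "my-service",
            "dd.env": "production",
        })


def fake_context():
    # Stand-in for the tracer's active span; real IDs come from the tracing library.
    return 1234567890, 987654321


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("correlation-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(TraceContextFilter(fake_context))

logger.info("order placed")
log_line = json.loads(stream.getvalue())
print(log_line["dd.trace_id"])  # 1234567890
```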
Best Practices
Start Simple
- Install the Agent with a basic configuration
- Enable automatic instrumentation
- Verify data in the Datadog UI
- Add custom spans and metrics as needed
Progressive Enhancement
code
Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM
Key Instrumentation Points
- HTTP entry/exit points
- Database queries
- External service calls
- Message queue operations
- Business-critical flows
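To show what a span around one of these points captures, here is a tracer-free sketch: a decorator that records the operation name, duration, error flag, and tags. The names `traced` and `recorded_spans` are hypothetical and the in-memory list stands in for the Agent. With a real tracing library you would use its own span or wrapper API instead.

```python
import functools
import time

recorded_spans = []  # in-memory stand-in; a real tracer ships spans to the Agent


def traced(operation, **tags):
    """Wrap a function so each call records a span-like dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = False
            try:
                return func(*args, **kwargs)
            except Exception:
                error = True
                raise  # record the failure but do not swallow it
            finally:
                recorded_spans.append({
                    "operation": operation,
                    "duration_s": time.perf_counter() - start,
                    "error": error,
                    "tags": tags,
                })
        return wrapper
    return decorator


@traced("db.query", service="my-service")
def load_user(user_id):
    return {"id": user_id}


load_user(42)
print(recorded_spans[-1]["operation"])  # db.query
```

Wrapping at the boundary (the query call, the outbound request) rather than deep inside helpers keeps span counts manageable and the resulting traces readable.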
Common Mistakes
- High-cardinality tags: Using user IDs or request IDs as metric tags can create millions of unique time series, each counted as a custom metric
- Missing log index quotas: Leads to unexpected bills when log volume spikes
- Over-alerting: Creates alert fatigue; alert on symptoms, not causes
- Missing service tags: Prevents correlation between metrics, traces, and logs
- No sampling for high-volume traces: Ingesting every trace drives costs up sharply
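For the sampling point, the core idea is a deterministic keep-or-drop decision made once per trace, so that every service seeing the same trace ID reaches the same answer. The sketch below uses a multiplicative hash purely for illustration; `keep_trace` and the constant are assumptions of this example, not a description of how Datadog's tracers implement sampling. In practice you configure a sample rate in the tracing library rather than writing this yourself.

```python
HASH_FACTOR = 1111111111111111111  # odd multiplier; illustrative choice
MAX_ID = 2 ** 64


def keep_trace(trace_id, sample_rate):
    """Deterministic head-based decision: the same trace ID always gets the same answer."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # Spread sequential IDs across the 64-bit space, then compare to the threshold.
    hashed = (trace_id * HASH_FACTOR) % MAX_ID
    return hashed < sample_rate * MAX_ID


kept = sum(keep_trace(trace_id, 0.1) for trace_id in range(1, 10_001))
print(f"kept {kept} of 10000 traces at a 10% rate")
```

Because the decision depends only on the trace ID and the rate, raising the rate never drops a trace that a lower rate would have kept.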
Navigation
For detailed implementation:
- Agent Installation: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
- APM Instrumentation: Python, Node.js, Go, Java instrumentation with code examples
- Log Management: Pipelines, Grok parsing, standard attributes, archives
- Custom Metrics: DogStatsD patterns, metric types, tagging best practices
- Alerting: Monitor types, anomaly detection, alert hygiene
- Cost Optimization: Metrics without Limits, sampling, index quotas
- Kubernetes: DaemonSet, Cluster Agent, autodiscovery
Complementary Skills
When using this skill, consider these related skills (if deployed):
- docker: Container instrumentation patterns
- kubernetes: K8s-native monitoring patterns
- python/nodejs/go: Language-specific APM setup
Resources
Official Documentation:
- APM: https://docs.datadoghq.com/tracing/
- Logs: https://docs.datadoghq.com/logs/
- Metrics: https://docs.datadoghq.com/metrics/
- DogStatsD: https://docs.datadoghq.com/developers/dogstatsd/
Cost Management: