Datadog Observability

Overview

Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.

Core Capabilities:

  • APM: Distributed tracing with automatic instrumentation for 8+ languages
  • Infrastructure: Host, container, and cloud service monitoring
  • Logs: Centralized collection with processing pipelines and configurable retention
  • Metrics: Custom metrics via DogStatsD with cardinality management
  • Synthetics: Proactive API and browser testing from 29+ global locations
  • RUM: Frontend performance with Core Web Vitals and session replay

When to Use This Skill

Activate when:

  • Setting up production monitoring and observability
  • Implementing distributed tracing across microservices
  • Configuring log aggregation and analysis pipelines
  • Creating custom metrics and dashboards
  • Setting up alerting and anomaly detection
  • Optimizing Datadog costs

Do not use when:

  • You are building on an open-source stack (use Prometheus/Grafana instead)
  • Cost is the primary concern and the budget is limited
  • You need maximum customization rather than a managed solution

Quick Start

1. Install Datadog Agent

Docker (simplest):

bash
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7

Kubernetes (Helm):

bash
helm repo add datadog https://helm.datadoghq.com
helm repo update
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true

2. Instrument Your Application

Python:

python
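from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries.
# Call this as early as possible, before importing the libraries to trace.
patch_all()

# Placeholder value; in a real service this comes from the request context
user_id = "12345"

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here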

Node.js:

javascript
// Must be the first import so that modules loaded afterwards can be instrumented
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});

3. Verify in Datadog UI

  1. Go to Infrastructure > Host Map to confirm the Agent is reporting
  2. Go to APM > Services to see traced services
  3. Go to Logs > Search to verify log collection

Core Concepts

Tagging Strategy

Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.

Required Tags:

Tag      Purpose             Example
env      Environment         env:production
service  Service name        service:api-gateway
version  Deployment version  version:1.2.3
team     Owning team         team:platform
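
One way to keep these tags consistent is to derive them from a single source. The sketch below reads the DD_ENV, DD_SERVICE, and DD_VERSION environment variables that Datadog uses for unified service tagging and turns them into key:value tags. The unified_tags helper and its team argument are hypothetical names introduced for illustration; they are not part of any Datadog library.

```python
import os


def unified_tags(environ=None, team=None):
    """Build key:value tags from the unified service tagging variables.

    Hypothetical helper: `environ` defaults to the process environment,
    and `team` is an optional extra tag supplied by the caller.
    """
    environ = os.environ if environ is None else environ
    variables = {"env": "DD_ENV", "service": "DD_SERVICE", "version": "DD_VERSION"}
    tags = []
    for key, variable in variables.items():
        value = environ.get(variable)
        if value:  # skip unset or empty variables instead of emitting "env:"
            tags.append(f"{key}:{value}")
    if team:
        tags.append(f"team:{team}")
    return tags


example_env = {
    "DD_ENV": "production",
    "DD_SERVICE": "api-gateway",
    "DD_VERSION": "1.2.3",
}
print(unified_tags(example_env, team="platform"))
```

Passing the same list to every metric, span, and log call is one simple way to avoid drift between telemetry types.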

Avoid High-Cardinality Tags on Metrics:

  • User IDs, request IDs, timestamps
  • Pod IDs in Kubernetes
  • Build numbers, commit hashes
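
To see why these tags are a problem, it helps to estimate how many time series a single metric can produce. The sketch below multiplies the number of distinct values per tag key, which gives an upper bound because not every combination occurs in practice. The function names and the limit of 1000 are assumptions chosen for illustration.

```python
import math


def estimate_series(distinct_values):
    """Upper bound on time series for one metric.

    `distinct_values` maps each tag key to its number of distinct values.
    """
    return math.prod(distinct_values.values())


def risky_tags(distinct_values, limit=1000):
    """Return tag keys whose distinct-value count exceeds `limit` (arbitrary threshold)."""
    return sorted(key for key, count in distinct_values.items() if count > limit)


bounded = {"env": 3, "service": 20, "version": 5}
with_user_id = {**bounded, "user_id": 100_000}

print(estimate_series(bounded))       # 300
print(estimate_series(with_user_id))  # 30000000
print(risky_tags(with_user_id))       # ['user_id']
```

A single unbounded tag multiplies the whole product, which is why one user ID tag dominates the total.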

Unified Observability

Datadog correlates metrics, traces, and logs when they share the same env, service, and version tags:

  • Traces carry span tags that link to the matching metrics
  • Trace IDs are injected into logs so each log line links to its trace
  • Dashboards combine all data sources
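
As a minimal sketch of what log correlation looks like on the wire, the formatter below writes JSON log lines that carry dd.trace_id and dd.span_id attributes, the field names Datadog uses to link a log to a trace. In a real service the tracing library normally injects these IDs; here they are read from hypothetical trace_id and span_id attributes on the log record, and the DatadogJsonFormatter class name is an assumption made for this example.

```python
import json
import logging


class DatadogJsonFormatter(logging.Formatter):
    """Format records as JSON with trace correlation fields (illustrative only)."""

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            "status": record.levelname.lower(),
            "logger": record.name,
            # "0" means no active trace; real IDs would come from the tracer
            "dd.trace_id": str(getattr(record, "trace_id", "0")),
            "dd.span_id": str(getattr(record, "span_id", "0")),
        }
        return json.dumps(payload)


# Build a record by hand so the example runs without a tracer installed
record = logging.makeLogRecord({
    "name": "orders",
    "levelname": "INFO",
    "msg": "order placed",
    "trace_id": 123,
    "span_id": 456,
})
line = DatadogJsonFormatter().format(record)
print(line)
```

To use it with the standard logging module, attach the formatter to a handler with handler.setFormatter(DatadogJsonFormatter()).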

Best Practices

Start Simple

  1. Install Agent with basic configuration
  2. Enable automatic instrumentation
  3. Verify data in Datadog UI
  4. Add custom spans/metrics as needed

Progressive Enhancement

code
Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM

Key Instrumentation Points

  • HTTP entry/exit points
  • Database queries
  • External service calls
  • Message queue operations
  • Business-critical flows
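
For points that automatic instrumentation does not cover, a timing metric is often the first thing to add. The sketch below uses only the standard library: it formats a DogStatsD datagram (name:value|type|#tags) and sends it over UDP to the Agent's default DogStatsD port, 8125. The format_metric, send_udp, and timed names are hypothetical helpers written for this example; in practice the official datadog client library would usually handle this.

```python
import functools
import socket
import time


def format_metric(name, value, metric_type, tags=()):
    """Build a DogStatsD datagram such as 'db.query:12.5|h|#env:prod'."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_udp(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget send to a local Agent (assumed to listen on port 8125)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


def timed(name, tags=(), send=send_udp):
    """Decorator that reports the wrapped call's duration as a histogram in ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Report even when the call raises, so failures are still timed
                elapsed_ms = (time.perf_counter() - start) * 1000
                send(format_metric(name, f"{elapsed_ms:.3f}", "h", tags))
        return wrapper
    return decorator


print(format_metric("checkout.requests", 1, "c", ["env:production", "service:api-gateway"]))
```

The send parameter exists so the decorator can be exercised without a running Agent, for example by passing a list's append method in tests.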

Common Mistakes

  1. High-cardinality tags: Using user IDs or request IDs as metric tags can create millions of unique time series
  2. Missing log index quotas: Log volume spikes lead to unexpected bills
  3. Over-alerting: Creates alert fatigue; alert on symptoms, not causes
  4. Missing service tags: Prevents correlation between metrics, traces, and logs
  5. No sampling for high-volume traces: Ingesting every trace drives costs up sharply
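
To make the sampling point concrete, here is a generic sketch of deterministic head-based sampling: the trace ID is hashed and the trace is kept only if the hash falls below a threshold derived from the sample rate. Because the decision depends only on the ID, every service that sees the same trace makes the same choice. This illustrates the idea and is not Datadog's own implementation; the keep_trace name and the constants are assumptions, and in practice sampling rates are configured in the tracer or the Agent.

```python
KNUTH_FACTOR = 1111111111111111111  # large odd multiplier for multiplicative hashing
MAX_ID = 2**64                      # trace IDs are treated as 64-bit integers here


def keep_trace(trace_id, sample_rate):
    """Return True if the trace should be kept at the given rate (0.0 to 1.0)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    hashed = (trace_id * KNUTH_FACTOR) % MAX_ID
    return hashed < int(sample_rate * MAX_ID)


kept = sum(keep_trace(trace_id, 0.1) for trace_id in range(1, 10_001))
print(f"kept {kept} of 10000 traces at a 10% sample rate")
```

A trace kept at a low rate is also kept at any higher rate, so raising the rate only adds traces and never drops ones already retained.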

Navigation

For detailed implementation:

  • Agent Installation: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
  • APM Instrumentation: Python, Node.js, Go, Java instrumentation with code examples
  • Log Management: Pipelines, Grok parsing, standard attributes, archives
  • Custom Metrics: DogStatsD patterns, metric types, tagging best practices
  • Alerting: Monitor types, anomaly detection, alert hygiene
  • Cost Optimization: Metrics without Limits, sampling, index quotas
  • Kubernetes: DaemonSet, Cluster Agent, autodiscovery

Complementary Skills

When using this skill, consider these related skills (if deployed):

  • docker: Container instrumentation patterns
  • kubernetes: K8s-native monitoring patterns
  • python/nodejs/go: Language-specific APM setup
