
datadog-observability

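Full-stack observability with Datadog APM, logs, metrics, synthetics, and RUM. Use it when you need to implement monitoring, tracing, alerting, or cost optimization for production systems.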

SKILL.md
--- frontmatter
name: datadog-observability
description: Full-stack observability with Datadog APM, logs, metrics, synthetics, and RUM. Use when implementing monitoring, tracing, alerting, or cost optimization for production systems.
version: 1.0.0
category: platform
author: Claude MPM Team
license: MIT
progressive_disclosure:
  entry_point:
    summary: "Unified observability platform for APM, logs, metrics, synthetics, and RUM with 1000+ integrations."
    when_to_use: "When implementing production monitoring, distributed tracing, log aggregation, custom metrics, or cost optimization."
    quick_start: "1. Install Datadog Agent. 2. Enable APM with automatic instrumentation. 3. Configure log collection. 4. Set up alerts."
  references:
    - agent-installation.md
    - apm-instrumentation.md
    - log-management.md
    - custom-metrics.md
    - alerting.md
    - cost-optimization.md
    - kubernetes.md
context_limit: 800
tags:
  - observability
  - monitoring
  - apm
  - logging
  - metrics
  - tracing
  - datadog
  - alerting
requires_tools: []

Datadog Observability

Overview

Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.

Core Capabilities:

  • APM: Distributed tracing with automatic instrumentation for 8+ languages
  • Infrastructure: Host, container, and cloud service monitoring
  • Logs: Centralized collection with processing pipelines and configurable per-index retention
  • Metrics: Custom metrics via DogStatsD with cardinality management and 15-month retention
  • Synthetics: Proactive API and browser testing from 29+ global locations
  • RUM: Frontend performance with Core Web Vitals and session replay

When to Use This Skill

Activate when:

  • Setting up production monitoring and observability
  • Implementing distributed tracing across microservices
  • Configuring log aggregation and analysis pipelines
  • Creating custom metrics and dashboards
  • Setting up alerting and anomaly detection
  • Optimizing Datadog costs

Do not use when:

  • You are building on an open-source stack (use Prometheus/Grafana instead)
  • Cost is the primary concern and the budget is limited
  • You need maximum customization rather than a managed solution

Quick Start

1. Install Datadog Agent

Docker (simplest):

bash
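# Port 8126 and DD_APM_NON_LOCAL_TRAFFIC let tracers outside this container reach the Agent
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -e DD_APM_ENABLED=true \
  -e DD_APM_NON_LOCAL_TRAFFIC=true \
  -p 8126:8126/tcp \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7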

Kubernetes (Helm):

bash
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true

2. Instrument Your Application

Python:

python
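from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries.
# Call this before importing the libraries you want traced.
patch_all()

user_id = "12345"  # placeholder; take the real value from your request context

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here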

Node.js:

javascript
// Must be first import
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});

3. Verify in Datadog UI

  1. Go to Infrastructure > Host Map to verify that the Agent is reporting
  2. Go to APM > Services to see traced services
  3. Go to Logs > Search to verify log collection

Core Concepts

Tagging Strategy

Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.

Required Tags:

| Tag | Purpose | Example |
| --- | --- | --- |
| env | Environment | env:production |
| service | Service name | service:api-gateway |
| version | Deployment version | version:1.2.3 |
| team | Owning team | team:platform |
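
One way to keep these tags consistent is to derive them from a single source at startup. The sketch below reads the `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` environment variables used by Datadog's unified service tagging and returns a tag list. The function name `unified_tags` and the `team` argument are illustrative assumptions, not part of any Datadog library.

```python
import os


def unified_tags(environ=None, team=None):
    """Build the env/service/version (and optional team) tag list.

    `environ` defaults to os.environ; passing a dict makes the function testable.
    """
    environ = os.environ if environ is None else environ
    # Tag key -> environment variable that carries its value.
    mapping = {"env": "DD_ENV", "service": "DD_SERVICE", "version": "DD_VERSION"}
    tags = []
    missing = []
    for key, var in mapping.items():
        value = environ.get(var)
        if value:
            tags.append(f"{key}:{value}")
        else:
            missing.append(var)
    if missing:
        # Fail fast so a service never emits partially tagged telemetry.
        raise ValueError("missing required tag variables: " + ", ".join(missing))
    if team:
        tags.append(f"team:{team}")
    return tags
```

Failing at startup when a variable is missing is a design choice for this sketch; a service could instead fall back to a default such as `env:unknown`.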

Avoid High-Cardinality Tags:

  • User IDs, request IDs, timestamps
  • Pod IDs in Kubernetes
  • Build numbers, commit hashes
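
To see why these tags are costly, it can help to estimate how many time series a single metric could produce. The sketch below is a rough illustration: `worst_case_series` multiplies the number of distinct values per tag key, and `drop_high_cardinality` filters tags against a blocklist. Both function names and the key names in `HIGH_CARDINALITY_KEYS` are hypothetical.

```python
# Hypothetical tag keys to strip before emitting a metric.
HIGH_CARDINALITY_KEYS = frozenset(
    {"user_id", "request_id", "timestamp", "pod_id", "build_number", "commit_hash"}
)


def drop_high_cardinality(tags, blocked=HIGH_CARDINALITY_KEYS):
    """Return the 'key:value' tags whose key is not in the blocked set."""
    kept = []
    for tag in tags:
        key = tag.split(":", 1)[0]
        if key not in blocked:
            kept.append(tag)
    return kept


def worst_case_series(values_per_tag):
    """Upper bound on time series for one metric.

    `values_per_tag` maps each tag key to its number of distinct values;
    the bound is the product of those counts.
    """
    total = 1
    for count in values_per_tag.values():
        total *= count
    return total
```

With 3 environments, 20 services, and 5 versions the bound is 300 series. Adding a `user_id` tag with 100,000 distinct values raises it to 30,000,000.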

Unified Observability

Datadog correlates metrics, traces, and logs automatically:

  • Traces include span tags that link to metrics
  • Tracers inject trace IDs into logs for correlation
  • Dashboards combine all data sources
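
The log correlation above relies on each log record carrying the `dd.trace_id` and `dd.span_id` attributes. The following standard-library sketch shows the shape of such a record. In a real service the tracing library supplies the IDs; here they come from a hypothetical `current_trace` context variable, and the class names are illustrative.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active trace context (assumption for this sketch).
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record):
        ctx = current_trace.get() or {}
        record.dd_trace_id = str(ctx.get("trace_id", 0))
        record.dd_span_id = str(ctx.get("span_id", 0))
        return True  # never drop the record


class JsonFormatter(logging.Formatter):
    """Render records as JSON with the correlation attributes included."""

    def format(self, record):
        return json.dumps(
            {
                "message": record.getMessage(),
                "level": record.levelname,
                "dd.trace_id": getattr(record, "dd_trace_id", "0"),
                "dd.span_id": getattr(record, "dd_span_id", "0"),
            }
        )
```

Attach the filter and formatter to a handler, then set `current_trace` at the start of each request. Records logged outside a trace carry `"0"` for both IDs.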

Best Practices

Start Simple

  1. Install Agent with basic configuration
  2. Enable automatic instrumentation
  3. Verify data in Datadog UI
  4. Add custom spans/metrics as needed
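
For step 4, custom metrics reach the Agent as DogStatsD datagrams, normally over UDP port 8125. The sketch below builds a datagram by hand to show the wire format (`name:value|type|@rate|#tags`). A production service would more likely use an official client library; `format_metric` and `send_metric` are illustrative names, and the host and port defaults assume a local Agent.

```python
import socket


def format_metric(name, value, metric_type, tags=None, sample_rate=None):
    """Build a DogStatsD datagram such as 'page.views:1|c|#env:production'.

    metric_type is the DogStatsD type code, e.g. 'c' (count), 'g' (gauge),
    'h' (histogram), or 'ms' (timer).
    """
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None and sample_rate < 1:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; delivery is fire-and-forget."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))
```

A call might look like `send_metric(format_metric("checkout.completed", 1, "c", tags=["env:production", "service:api-gateway"]))`.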

Progressive Enhancement

code
Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM

Key Instrumentation Points

  • HTTP entry/exit points
  • Database queries
  • External service calls
  • Message queue operations
  • Business-critical flows

Common Mistakes

  1. High-cardinality tags: Using user IDs or request IDs as metric tags creates millions of unique time series, each counted as a custom metric
  2. Missing log index quotas: Leads to unexpected bills from log volume spikes
  3. Over-alerting: Creates alert fatigue; alert on symptoms, not causes
  4. Missing service tags: Prevents correlation between metrics, traces, and logs
  5. No sampling for high-volume traces: Ingesting every trace drives ingestion costs up sharply
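
To illustrate the sampling point, here is a minimal sketch of a deterministic, head-based sampling decision that always keeps error traces. It is a generic illustration, not Datadog's algorithm: the tracers and the Agent apply their own sampling through configuration. The function name `keep_trace` and the SHA-256 bucketing are assumptions made for this example.

```python
import hashlib


def keep_trace(trace_id, sample_rate, is_error=False):
    """Decide whether to keep a trace.

    The decision depends only on the trace ID, so every service that sees
    the same ID reaches the same answer. Error traces are always kept.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if is_error:
        return True
    digest = hashlib.sha256(str(trace_id).encode("utf-8")).digest()
    # Map the first 8 bytes to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64
```

At a rate of `0.1`, roughly one in ten trace IDs is kept, while a trace flagged with `is_error=True` is kept regardless of the rate.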

Navigation

For detailed implementation:

  • Agent Installation: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
  • APM Instrumentation: Python, Node.js, Go, Java instrumentation with code examples
  • Log Management: Pipelines, Grok parsing, standard attributes, archives
  • Custom Metrics: DogStatsD patterns, metric types, tagging best practices
  • Alerting: Monitor types, anomaly detection, alert hygiene
  • Cost Optimization: Metrics without Limits, sampling, index quotas
  • Kubernetes: DaemonSet, Cluster Agent, autodiscovery

Complementary Skills

When using this skill, consider these related skills (if deployed):

  • docker: Container instrumentation patterns
  • kubernetes: K8s-native monitoring patterns
  • python/nodejs/go: Language-specific APM setup

Resources

Official Documentation:

Cost Management: