Observability Management Skill

Template for Monitoring, Logging, and Alerting Specialists

Version: 1.0.0
Created: 2026-01-09

Skill Purpose

Implement comprehensive observability with monitoring, structured logging, distributed tracing, and proactive alerting to maintain system reliability and quickly diagnose issues.

When to Activate This Skill

Activate when:

  • Setting up application monitoring
  • Implementing structured logging
  • Creating alerting rules
  • Debugging production issues
  • Performance optimization
  • SLA/SLO tracking
  • Incident response preparation
  • Capacity planning

Delegation from main agent:

code
@agent:[AGENT_NAME] "Set up application performance monitoring"
@agent:[AGENT_NAME] "Implement structured logging with request tracing"
@agent:[AGENT_NAME] "Create alerting rules for critical errors"
@agent:[AGENT_NAME] "Add health check endpoints"

Core Capabilities

Metrics & Monitoring

  • Application performance metrics (APM)
  • Infrastructure monitoring (CPU, memory, disk)
  • Custom business metrics
  • Real-time dashboards
  • Historical trend analysis
  • SLO/SLA tracking

Logging

  • Structured logging (JSON format)
  • Centralized log aggregation
  • Log retention policies
  • Search and filtering
  • Correlation with traces
  • Security audit logs

Alerting

  • Threshold-based alerts
  • Anomaly detection
  • Alert routing and escalation
  • On-call scheduling
  • Incident management
  • Alert fatigue prevention

Tracing

  • Distributed request tracing
  • Performance bottleneck identification
  • Service dependency mapping
  • Error propagation tracking
  • Database query analysis

[TECH_STACK_SPECIFIC] Platform Solutions

DigitalOcean Monitoring (Built-in)

Enable App Platform Metrics:

yaml
# app.yaml
name: myapp-production

services:
  - name: web
    # ... other config ...

    # Health check endpoint
    health_check:
      http_path: /health
      initial_delay_seconds: 30
      period_seconds: 10
      timeout_seconds: 5
      success_threshold: 1
      failure_threshold: 3

    # Built-in metrics
    # - CPU usage
    # - Memory usage
    # - Request rate
    # - Response time
    # - Error rate

Access Metrics:

bash
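# Via CLI
doctl apps list-alerts <app-id>
# doctl does not document a `get-metrics` subcommand for apps;
# run `doctl apps --help` to see what your version supports.

# Via API (daily bandwidth metrics for one app)
curl -X GET \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  https://api.digitalocean.com/v2/apps/<app-id>/metrics/bandwidth_daily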

Rails Application Instrumentation

Health Check Endpoint:

ruby
# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  # raise: false avoids an error when the app defines no authenticate_user! callback
  skip_before_action :authenticate_user!, raise: false

  def show
    checks = {
      database: database_check,
      redis: redis_check,
      storage: storage_check
    }
    # Keep the timestamp out of `checks` so all? only inspects real check results
    status = checks.values.all? ? :ok : :service_unavailable

    render json: {
      status: status == :ok ? 'healthy' : 'unhealthy',
      checks: checks,
      timestamp: Time.current.iso8601
    }, status: status
  end

  private

  def database_check
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue StandardError => e
    Rails.logger.error("Database health check failed: #{e.message}")
    false
  end

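  def redis_check
    # Redis.current was removed in redis-rb 5.0, so build a client explicitly.
    # Assumes the connection URL is in REDIS_URL.
    redis = Redis.new(url: ENV['REDIS_URL'])
    redis.ping == 'PONG'
  rescue StandardError => e
    Rails.logger.error("Redis health check failed: #{e.message}")
    false
  ensure
    redis&.close
  end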

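  def storage_check
    # exist? returns false for a missing key without raising, so reaching the
    # next line means the storage service answered.
    ActiveStorage::Blob.service.exist?('health_check')
    true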
  rescue StandardError => e
    Rails.logger.error("Storage health check failed: #{e.message}")
    false
  end
end
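
The controller above assumes each dependency answers quickly. As a minimal sketch in plain Ruby (no Rails required), the same aggregation can be given a per-check time limit so one hung dependency cannot stall the endpoint. `HealthReport`, its `register` method, and the check names are hypothetical and exist only for illustration.

```ruby
require 'timeout'
require 'time'

# Hypothetical helper: runs each named check with a time limit and
# reports per-check results plus an overall status.
class HealthReport
  def initialize(timeout_seconds: 2)
    @timeout_seconds = timeout_seconds
    @checks = {}
  end

  # Register a check; the block should return a truthy value when healthy.
  def register(name, &block)
    @checks[name] = block
    self
  end

  def run
    results = @checks.transform_values { |check| run_check(check) }
    {
      status: results.values.all? ? 'healthy' : 'unhealthy',
      checks: results,
      timestamp: Time.now.utc.iso8601
    }
  end

  private

  # A raised error or a timeout both count as a failed check.
  def run_check(check)
    Timeout.timeout(@timeout_seconds) { check.call ? true : false }
  rescue StandardError
    false
  end
end

report = HealthReport.new(timeout_seconds: 0.2)
  .register(:database) { true }
  .register(:redis) { raise IOError, 'connection refused' }
  .register(:slow_dependency) { sleep 2 }
  .run

puts report[:status]
puts report[:checks]
```

In a Rails controller the blocks would wrap calls such as the `SELECT 1` query shown earlier; the stubs here only make the sketch runnable on its own.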

Structured Logging:

ruby
# config/environments/production.rb
require 'socket'

# Emit one JSON object per log line
config.log_formatter = proc do |severity, datetime, progname, msg|
  {
    timestamp: datetime.iso8601,
    severity: severity,
    message: msg,
    pid: Process.pid,
    hostname: Socket.gethostname
  }.to_json + "\n"
end

# Log to stdout (12-factor app). A logger assigned here does not pick up
# config.log_formatter automatically, so set it explicitly.
logger = ActiveSupport::Logger.new(STDOUT)
logger.formatter = config.log_formatter
config.logger = logger
config.log_level = :info

# config/initializers/lograge.rb
Rails.application.configure do
  config.lograge.enabled = true
  config.lograge.formatter = Lograge::Formatters::Json.new

  # user_id, request_id, and ip are not in the payload by default; add them
  # by overriding append_info_to_payload in ApplicationController.
  config.lograge.custom_options = lambda do |event|
    {
      user_id: event.payload[:user_id],
      request_id: event.payload[:request_id],
      ip: event.payload[:ip],
      params: event.payload[:params]&.except('controller', 'action'),
      exception: event.payload[:exception]&.first,
      exception_message: event.payload[:exception]&.last
    }
  end
end
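
To see what the JSON formatter produces without booting Rails, the sketch below applies the same idea to Ruby's standard `Logger`. The `build_json_logger` helper and the field names in the example messages are assumptions made for illustration; they are not part of Rails or Lograge.

```ruby
require 'json'
require 'logger'
require 'stringio'
require 'time'

# Build a Logger that emits one JSON object per line.
# Hash messages are merged into the entry; anything else goes under :message.
def build_json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, datetime, _progname, msg|
    entry = { timestamp: datetime.utc.iso8601(3), severity: severity }
    entry.merge!(msg.is_a?(Hash) ? msg : { message: msg.to_s })
    "#{JSON.generate(entry)}\n"
  end
  logger
end

output = StringIO.new
logger = build_json_logger(output)
logger.info('Server started')
logger.warn({ message: 'Slow query', duration_ms: 812, table: 'orders' })

# Each line parses independently, which is what log aggregators rely on.
lines = output.string.lines.map { |line| JSON.parse(line) }
puts lines
```

Accepting a Hash as the message keeps fields such as `duration_ms` as top-level keys, so they stay searchable after aggregation instead of being buried in a string.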

Request ID Tracking:

ruby
# app/controllers/application_controller.rb (requires the request_store gem)
class ApplicationController < ActionController::Base
  before_action :set_request_id

  private

  def set_request_id
    RequestStore.store[:request_id] = request.uuid
  end
end

# Use in logs. Logger#info takes a single message, so serialize the fields.
Rails.logger.info({
  message: 'Processing payment',
  request_id: RequestStore.store[:request_id],
  user_id: current_user.id,
  amount: params[:amount]
}.to_json)
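
For services outside Rails, request ID propagation can be sketched as a small Rack-style middleware. This is an illustration under stated assumptions: `RequestIdMiddleware` is a hypothetical class, the ID is kept in `Thread.current` (which is fiber-local in Ruby), and the app is a plain lambda rather than a real framework.

```ruby
require 'securerandom'

# Rack-style middleware: reuse an incoming X-Request-Id or generate one,
# expose it to code running in the same thread/fiber, and echo it back.
class RequestIdMiddleware
  HEADER = 'HTTP_X_REQUEST_ID'
  STORE_KEY = :request_id

  def self.current
    Thread.current[STORE_KEY]
  end

  def initialize(app)
    @app = app
  end

  def call(env)
    request_id = env[HEADER].to_s.strip
    request_id = SecureRandom.uuid if request_id.empty?
    Thread.current[STORE_KEY] = request_id

    status, headers, body = @app.call(env)
    [status, headers.merge('x-request-id' => request_id), body]
  ensure
    # Clear the value so it cannot leak into the next request on this thread.
    Thread.current[STORE_KEY] = nil
  end
end

app = lambda do |_env|
  [200, { 'content-type' => 'text/plain' }, ["handled #{RequestIdMiddleware.current}"]]
end
stack = RequestIdMiddleware.new(app)

_status, headers, body = stack.call({ 'HTTP_X_REQUEST_ID' => 'abc-123' })
puts headers['x-request-id']
puts body.first
```

Reusing the caller's ID when one is supplied lets a single identifier follow a request across services, which is what makes log lines from different systems joinable.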

Custom Metrics:

ruby
# app/models/concerns/metric_tracking.rb
module MetricTracking
  extend ActiveSupport::Concern

  included do
    after_create :track_create_metric
    after_update :track_update_metric
  end

  private

  def track_create_metric
    StatsD.increment("#{self.class.name.underscore}.created")
  end

  def track_update_metric
    StatsD.increment("#{self.class.name.underscore}.updated")
  end
end

# Usage in models
class Order < ApplicationRecord
  include MetricTracking

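  def complete!
    update!(status: :completed, completed_at: Time.current)

    # Emit metrics only after the update succeeds.
    # Assumes the statsd-instrument gem, whose timing call is StatsD.measure
    # and expects milliseconds.
    elapsed_ms = ((completed_at - created_at) * 1000).round
    StatsD.measure('order.completion_time', elapsed_ms)
    StatsD.increment('order.completed')
  end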
end

External APM Solutions

New Relic (Recommended)

ruby
# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml
common: &default_settings
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: <%= ENV['NEW_RELIC_APP_NAME'] || 'My Application' %>
  distributed_tracing:
    enabled: true
  application_logging:
    enabled: true
    forwarding:
      enabled: true

production:
  <<: *default_settings
  monitor_mode: true
  log_level: info

# Custom instrumentation
# Rails controller actions are instrumented automatically by the agent,
# so no explicit transaction tracer is needed here.
class OrdersController < ApplicationController
  def create
    # ... order creation logic

    NewRelic::Agent.record_custom_event('OrderCreated', {
      order_id: @order.id,
      total: @order.total.to_f,
      user_id: current_user.id
    })
  end
end

Datadog

ruby
# Gemfile
gem 'ddtrace'

# config/initializers/datadog.rb
Datadog.configure do |c|
  c.tracing.instrument :rails
  c.tracing.instrument :redis
  c.tracing.instrument :pg
  c.tracing.instrument :http

  c.env = Rails.env
  c.service = ENV['DD_SERVICE'] || 'myapp'
  c.version = ENV['GIT_COMMIT_SHA'] || '1.0.0'

  # ddtrace 1.x setting name (0.x used c.analytics_enabled)
  c.tracing.analytics.enabled = true
end

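# Custom metrics (requires the dogstatsd-ruby gem)
# Create one client at boot and reuse it rather than opening a socket per call.
STATSD = Datadog::Statsd.new('localhost', 8125)

# `payment` below comes from the calling context
STATSD.increment('orders.created', tags: ['env:production'])
STATSD.gauge('users.active', User.where('last_seen_at > ?', 5.minutes.ago).count)
STATSD.histogram('payment.amount', payment.amount_cents)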

Sentry (Error Tracking)

ruby
# Gemfile
gem 'sentry-ruby'
gem 'sentry-rails'

# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.breadcrumbs_logger = [:active_support_logger, :http_logger]
  config.traces_sample_rate = 0.1  # 10% of requests
  config.environment = Rails.env
  config.enabled_environments = %w[production staging]

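  config.before_send = lambda do |event, _hint|
    # Filter sensitive data. The request interface may be absent (for example
    # in background jobs) and its data is not always a Hash.
    data = event.request&.data
    data.delete('password') if data.is_a?(Hash)
    event
  end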
end

# Usage
begin
  risky_operation
rescue StandardError => e
  Sentry.capture_exception(e, extra: { user_id: current_user.id })
  raise
end

# Attach extra context to a single message
Sentry.with_scope do |scope|
  scope.set_context('payment', { amount: payment.amount, method: payment.method })
  Sentry.capture_message('Large payment processed', level: :warning)
end
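
The `before_send` hook above removes one top-level key. Payloads are often nested, so a recursive scrubber is a common companion. The following is a minimal sketch in plain Ruby; `scrub`, the key pattern, and the sample payload are illustrative assumptions, not part of the Sentry SDK.

```ruby
# Keys whose values should never leave the process (illustrative list).
SENSITIVE_KEY_PATTERN = /password|token|secret|authorization|api_key/i
FILTERED = '[FILTERED]'

# Return a new structure with the values of sensitive keys replaced.
# Hashes and arrays are walked recursively; the input is not modified.
def scrub(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, nested), result|
      result[key] = key.to_s.match?(SENSITIVE_KEY_PATTERN) ? FILTERED : scrub(nested)
    end
  when Array
    value.map { |item| scrub(item) }
  else
    value
  end
end

payload = {
  'email' => 'user@example.com',
  'password' => 'hunter2',
  'card' => { 'api_key' => 'sk_test_123', 'last4' => '4242' },
  'sessions' => [{ access_token: 'abc', device: 'ios' }]
}

scrubbed = scrub(payload)
puts scrubbed
```

Matching on key names rather than values is a deliberate simplification: it is cheap and predictable, but it will not catch a secret stored under an innocuous key.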

Log Aggregation

Papertrail (Simple, Affordable)

bash
# Add remote syslog destination
# DigitalOcean App Platform → Settings → Logs → Add destination

# Search logs via CLI
papertrail --min-time '5 minutes ago' 'error'
papertrail 'user_id:123' --json

Logtail (Modern Alternative)

ruby
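# Gemfile
gem 'logtail-rails'

# config/initializers/logtail.rb
# Logtail::Logger wraps a log device, not a bare token. Check the gem's README
# for the options your version expects.
if Rails.env.production?
  http_device = Logtail::LogDevices::HTTP.new(ENV['LOGTAIL_SOURCE_TOKEN'])
  Rails.logger = Logtail::Logger.new(http_device)
end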

# Structured logging
Rails.logger.info('Order created', {
  order_id: order.id,
  user_id: current_user.id,
  total: order.total,
  items_count: order.items.count
})

ELK Stack (Self-hosted)

yaml
# docker-compose.yml for ELK
version: '3.9'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      # Disables authentication and TLS: acceptable for local development only
      - xpack.security.enabled=false
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:

Alerting Configuration

DigitalOcean Alerts

bash
# Create alert via CLI
doctl monitoring alert create \
  --type v1/insights/droplet/cpu \
  --description "CPU above 80% for 5 minutes" \
  --compare GreaterThan \
  --value 80 \
  --window 5m \
  --entities droplet_id_1,droplet_id_2 \
  --emails alerts@example.com

# App Platform alerts (via UI)
# - High error rate (> 5% for 5 minutes)
# - High response time (> 1s for 5 minutes)
# - Deployment failures

PagerDuty Integration

ruby
# Gemfile
gem 'pagerduty'

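# config/initializers/pagerduty.rb
# Assumes version 3 or later of the pagerduty gem and an Events API v2
# integration key.
PAGERDUTY = Pagerduty.build(
  integration_key: ENV['PAGERDUTY_INTEGRATION_KEY'],
  api_version: 2
)

# Trigger incident
class CriticalErrorNotifier
  def self.notify(exception, context = {})
    PAGERDUTY.trigger(
      summary: exception.message,
      source: Socket.gethostname,
      severity: 'critical',
      custom_details: {
        exception: exception.class.name,
        # backtrace is nil for exceptions that were never raised
        backtrace: Array(exception.backtrace).first(5),
        context: context
      }
    )
  end
end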

# Usage
begin
  critical_operation
rescue StandardError => e
  CriticalErrorNotifier.notify(e, user_id: current_user.id)
  raise
end

Slack Notifications

ruby
# Gemfile
gem 'slack-notifier'

# config/initializers/slack.rb
SLACK_NOTIFIER = Slack::Notifier.new(
  ENV['SLACK_WEBHOOK_URL'],
  channel: '#alerts',
  username: 'Production Bot'
)

# app/services/slack_alert_service.rb
class SlackAlertService
  def self.error(message, details = {})
    SLACK_NOTIFIER.ping(
      text: message,
      attachments: [{
        color: 'danger',
        fields: details.map { |k, v| { title: k.to_s, value: v.to_s, short: true } },
        footer: Socket.gethostname,
        ts: Time.current.to_i
      }]
    )
  end

  def self.deployment(version)
    SLACK_NOTIFIER.ping(
      text: "Deployed version #{version} to production",
      attachments: [{
        color: 'good',
        footer: "Deployed at #{Time.current}"
      }]
    )
  end
end

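# Usage in error handler
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  rescue_from StandardError do |exception|
    SlackAlertService.error(
      "Unhandled exception: #{exception.message}",
      controller: params[:controller],
      action: params[:action],
      user_id: current_user&.id
    )
    # Re-raise so the normal error page and error tracker still see it
    raise exception
  end
end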
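
Posting to Slack or PagerDuty on every exception can flood a channel during an outage, which works against the alert fatigue prevention listed under Core Capabilities. One hedged way to limit repeats is a cooldown per alert key. `AlertThrottle` below is a hypothetical, in-process sketch; the injectable `clock` exists only so the example can simulate time passing.

```ruby
# Suppresses repeat alerts with the same key inside a cooldown window.
class AlertThrottle
  def initialize(cooldown_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @cooldown_seconds = cooldown_seconds
    @clock = clock
    @last_sent = {}
    @mutex = Mutex.new
  end

  # Yields and returns true only if no alert with this key was sent during
  # the cooldown window; otherwise returns false without yielding.
  def notify(key)
    now = @clock.call
    allowed = @mutex.synchronize do
      last = @last_sent[key]
      next false if last && (now - last) < @cooldown_seconds

      @last_sent[key] = now
      true
    end
    yield if allowed && block_given?
    allowed
  end
end

fake_time = 0.0
throttle = AlertThrottle.new(cooldown_seconds: 300, clock: -> { fake_time })
sent = []

throttle.notify('PaymentError') { sent << 'first' }
fake_time = 120.0
throttle.notify('PaymentError') { sent << 'duplicate within 5 minutes' }
fake_time = 301.0
throttle.notify('PaymentError') { sent << 'after cooldown' }

puts sent.inspect
```

Because the state lives in one process, each web worker would throttle independently. Sharing the timestamps through a store such as Redis would be needed for a fleet-wide limit.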

Dashboards

Application Dashboard (New Relic/Datadog)

code
Metrics to track:
- Request throughput (req/min)
- Average response time (ms)
- Error rate (%)
- Apdex score
- Database query time
- Cache hit rate
- Background job queue depth
- Active users

Infrastructure Dashboard

code
Metrics to track:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O (MB/s)
- Network traffic (MB/s)
- Database connections
- Redis memory usage
- Container restart count

Business Metrics Dashboard

code
Custom metrics:
- Orders per hour
- Revenue per hour
- User signups per day
- Conversion rate
- Average order value
- Payment success rate

Tools Required

MCP Servers

[MCP_TOOLS]

<!--
Populated during skill creation based on:
1. User's installed MCP servers
2. User's selection for this skill

Recommended for this skill (examples):
- APM/Monitoring services (New Relic, Datadog, Sentry)
- Log aggregation services (Papertrail, Logtail)
- Alerting services (PagerDuty, Slack)
- Metrics databases (Prometheus, InfluxDB)

Note: Skills work without MCP servers, but functionality may be limited
-->

APM & Monitoring

bash
# New Relic
# Sign up at newrelic.com
# Add gem and configure

# Datadog
# Sign up at datadoghq.com
# Install agent

# DigitalOcean Monitoring
doctl monitoring alert list

Log Management

bash
# Papertrail (the CLI ships in the `papertrail` gem)
gem install papertrail
papertrail --help

# Logtail
# Web-based interface

Alerting

bash
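# PagerDuty CLI (the `pd` command comes from the pagerduty-cli package;
# pdjs is a JavaScript API client, not a CLI)
npm install -g pagerduty-cli
pd incident:list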

# Slack
# Use webhooks (no CLI needed)

Quality Checklist

Monitoring Coverage

  • Application performance metrics tracked
  • Infrastructure metrics monitored
  • Database performance tracked
  • Background job metrics collected
  • External API latency monitored
  • Business KPIs dashboarded
  • Custom events for critical flows

Logging Best Practices

  • Structured logging (JSON format)
  • Request ID correlation
  • User context included
  • Sensitive data filtered (passwords, tokens)
  • Error stack traces captured
  • Log levels appropriate
  • Centralized log aggregation
  • Retention policy defined

Alerting Rules

  • Critical errors alert immediately
  • High error rate threshold set
  • Performance degradation alerts
  • Infrastructure alerts (CPU, memory, disk)
  • Database connection pool alerts
  • Background job failures
  • Alert routing configured
  • On-call schedule defined
  • Runbooks documented

Health Checks

  • HTTP health endpoint (/health)
  • Database connectivity check
  • Redis connectivity check
  • External service checks
  • Storage accessibility check
  • Health checks logged
  • Checks run on schedule

Observability Patterns

Golden Signals (SRE)

ruby
# 1. Latency - How long requests take
# 2. Traffic - How many requests
# 3. Errors - Rate of failed requests
# 4. Saturation - How full the service is

class MetricsMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Use a monotonic clock so wall-clock adjustments cannot skew durations
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, response = @app.call(env)
    duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

    # Latency (StatsD.measure is the statsd-instrument timing call, in milliseconds)
    StatsD.measure('http.request.duration', duration_ms, tags: ["status:#{status}"])

    # Traffic
    StatsD.increment('http.request.count', tags: ["status:#{status}"])

    # Errors
    StatsD.increment('http.request.errors') if status.to_i >= 500

    # Saturation tracked separately via infrastructure metrics

    [status, headers, response]
  end
end
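
The middleware above depends on a running StatsD client. To experiment with the same three signals locally, the sketch below swaps in an in-memory recorder. `MemoryMetrics` and `GoldenSignalsMiddleware` are hypothetical names, and the recorder only mimics the `increment` and `measure` calls used in this document.

```ruby
# In-memory stand-in for a StatsD client, for local experiments only.
class MemoryMetrics
  attr_reader :counters, :timings

  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
  end

  def increment(name)
    @counters[name] += 1
  end

  def measure(name, milliseconds)
    @timings[name] << milliseconds
  end
end

class GoldenSignalsMiddleware
  def initialize(app, metrics)
    @app = app
    @metrics = metrics
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      status, headers, body = @app.call(env)
    rescue StandardError
      # An unhandled exception reaches the client as a 500, so count it.
      record(500, started)
      raise
    end
    record(status.to_i, started)
    [status, headers, body]
  end

  private

  def record(status, started)
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
    @metrics.measure('http.request.duration', elapsed_ms)   # latency
    @metrics.increment('http.request.count')                # traffic
    @metrics.increment('http.request.errors') if status >= 500 # errors
  end
end

metrics = MemoryMetrics.new
responses = [200, 200, 503]
app = ->(_env) { [responses.shift, {}, ['ok']] }
stack = GoldenSignalsMiddleware.new(app, metrics)
3.times { stack.call({}) }

puts metrics.counters
```

Recording in the `rescue` branch matters: without it, requests that crash would vanish from both the traffic and error counts, which is exactly when the numbers are needed.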

RED Metrics (Request-oriented)

ruby
# Rate - Requests per second
# Errors - Number of failed requests
# Duration - Time per request

# Automatically tracked by most APM tools
# Custom implementation:
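class REDMetrics
  # duration_ms: request time in milliseconds
  def self.track(controller, action, duration_ms, status)
    tags = ["controller:#{controller}", "action:#{action}"]

    StatsD.increment('request.rate', tags: tags)
    # Count server-side failures only; 4xx responses are usually client errors
    StatsD.increment('request.errors', tags: tags) if status >= 500
    # StatsD.measure is the statsd-instrument timing call
    StatsD.measure('request.duration', duration_ms, tags: tags)
  end
end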

Distributed Tracing

ruby
# Using OpenTelemetry
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'myapp'
  c.use_all # Auto-instrument Rails, Redis, PostgreSQL, HTTP
end

# Custom spans
def process_order(order)
  tracer = OpenTelemetry.tracer_provider.tracer('order-processor')

  tracer.in_span('process_order') do |span|
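    span.set_attribute('order.id', order.id)
    # Attribute values must be strings, booleans, integers, or floats,
    # so convert a BigDecimal total before recording it
    span.set_attribute('order.total', order.total.to_f)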

    # Processing logic
    charge_payment(order)
    send_confirmation(order)
  end
end
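
Instrumentation libraries such as OpenTelemetry link spans across services by passing a W3C Trace Context `traceparent` header of the form `00-<trace-id>-<parent-id>-<flags>`. The sketch below parses and generates that header by hand purely to show what travels on the wire. `TraceContext` is a hypothetical class; in practice the SDK handles this.

```ruby
require 'securerandom'

# Minimal model of a W3C traceparent header (version 00 only).
class TraceContext
  TRACEPARENT = /\A00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})\z/

  attr_reader :trace_id, :span_id, :sampled

  def initialize(trace_id:, span_id:, sampled:)
    @trace_id = trace_id
    @span_id = span_id
    @sampled = sampled
  end

  # Parse an incoming header; returns nil when it is missing or malformed.
  def self.parse(header)
    match = TRACEPARENT.match(header.to_s)
    return nil unless match

    trace_id, span_id, flags = match.captures
    # All-zero IDs are invalid according to the specification.
    return nil if trace_id.match?(/\A0+\z/) || span_id.match?(/\A0+\z/)

    new(trace_id: trace_id, span_id: span_id, sampled: flags.to_i(16).anybits?(0x01))
  end

  # Start a new trace when no valid context arrived with the request.
  def self.start(sampled: true)
    new(trace_id: SecureRandom.hex(16), span_id: SecureRandom.hex(8), sampled: sampled)
  end

  # Same trace, new span ID: what an outgoing call to another service carries.
  def child
    self.class.new(trace_id: trace_id, span_id: SecureRandom.hex(8), sampled: sampled)
  end

  def to_header
    format('00-%s-%s-%02x', trace_id, span_id, sampled ? 1 : 0)
  end
end

incoming = TraceContext.parse('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
outgoing = incoming.child
puts outgoing.to_header
```

Keeping the trace ID constant while minting a new span ID per hop is what lets a tracing backend reassemble one request's path through several services.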

Performance Targets

Application

  • P95 response time < 500ms
  • P99 response time < 1s
  • Error rate < 0.1%
  • Uptime > 99.9%
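
To make these targets concrete, the sketch below computes P95 and P99 with the nearest-rank method and converts an uptime target into a downtime budget. The function names and the synthetic sample data are assumptions for illustration; APM tools typically report these figures directly and may use a different percentile method.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `percent` percent of all samples are less than or equal to it.
def percentile(samples, percent)
  raise ArgumentError, 'samples must not be empty' if samples.empty?

  sorted = samples.sort
  rank = (percent * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Compare observed latencies and errors against the application targets.
def evaluate_targets(durations_ms, error_count)
  p95 = percentile(durations_ms, 95)
  p99 = percentile(durations_ms, 99)
  error_rate = error_count.to_f / durations_ms.size * 100
  {
    p95_ms: p95, p95_ok: p95 < 500,
    p99_ms: p99, p99_ok: p99 < 1000,
    error_rate_percent: error_rate, error_rate_ok: error_rate < 0.1
  }
end

# Minutes of downtime an uptime target allows over a period.
def downtime_budget_minutes(uptime_percent, days)
  (days * 24 * 60 * (100 - uptime_percent) / 100.0).round(1)
end

# 100 synthetic samples: 93 fast requests, 5 slower ones, 2 outliers.
durations_ms = Array.new(93, 120) + Array.new(5, 450) + Array.new(2, 1800)

puts evaluate_targets(durations_ms, 0)
puts downtime_budget_minutes(99.9, 30)
```

With this data the P95 target passes while the P99 target fails, which shows why both are tracked: two slow requests in a hundred are invisible at P95. The 99.9% uptime target works out to 43.2 minutes of allowed downtime per 30 days.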

Infrastructure

  • CPU utilization < 70% (sustained)
  • Memory usage < 80%
  • Disk usage < 80%
  • Database connections < 80% of pool
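
The word "sustained" in the CPU target implies evaluating a window of samples rather than a single reading. A minimal sketch of that rule follows; `SustainedThreshold`, the window size, and the sample readings are hypothetical, and a monitoring platform would normally apply this logic for you through the alert's window setting.

```ruby
# Fires only when every sample in the trailing window breaches the threshold,
# so a single spike does not page anyone.
class SustainedThreshold
  def initialize(threshold:, window_size:)
    @threshold = threshold
    @window_size = window_size
    @samples = []
  end

  # Record a sample and return true when the breach is sustained.
  def observe(value)
    @samples << value
    @samples.shift while @samples.size > @window_size
    @samples.size == @window_size && @samples.all? { |sample| sample > @threshold }
  end
end

cpu_alert = SustainedThreshold.new(threshold: 70, window_size: 5)
readings = [65, 92, 68, 75, 78, 81, 74, 79, 66]
states = readings.map { |percent| cpu_alert.observe(percent) }

puts states.inspect
```

Here the 92% spike alone never fires the alert; it fires only on the eighth reading, once five consecutive samples exceed 70%.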

Incident Response

  • Time to detection (TTD) < 5 minutes
  • Time to resolution (TTR) < 30 minutes
  • Alert fatigue: < 5 false positives per week

Troubleshooting Guide

High Error Rate

bash
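# 1. Check error logs
tail -f log/production.log | grep ERROR
# When the app logs to stdout on App Platform, stream the logs instead:
doctl apps logs <app-id> --follow | grep ERROR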

# 2. Check Sentry/error tracker for patterns
# 3. Verify external service status
# 4. Check recent deployments
gh run list --limit 5

# 5. Review metrics for correlation

Slow Response Times

bash
# 1. Check APM for slow transactions
# 2. Review database query performance
# 3. Check for N+1 queries
# 4. Verify cache hit rates
# 5. Check external API latency

High Memory Usage

bash
# 1. Check for memory leaks
# 2. Review background job memory usage
# 3. Check for large object allocations
# 4. Verify garbage collection metrics
# 5. Consider scaling horizontally

Remember: Observability is not optional - it's the foundation of reliable systems. Invest early in comprehensive monitoring, logging, and alerting to enable fast detection and resolution of issues.