---
name: observability
type: guidance
applies_to:
  - Architect
  - Developer
mandatory: conditional
triggers:
  - dashboard
  - metrics
  - tracing
  - alerting
  - SLI
  - observability
references: []
summary: Standard SLIs, dashboard templates, alert conventions, and OpenTelemetry patterns for .NET services.
---

Observability Skill

Defines observability standards for .NET services including SLIs, dashboards, alerts, and instrumentation patterns.

Roles

  • Architect: Defines observability requirements in technical design
  • Developer: Implements per Architect's specifications

Standard SLIs

| SLI | Description | Target | Measurement |
|---|---|---|---|
| Latency p50 | Median response time | < 100ms | Histogram quantile |
| Latency p95 | 95th percentile response time | < 500ms | Histogram quantile |
| Latency p99 | 99th percentile response time | < 1000ms | Histogram quantile |
| Error Rate | 5xx responses / total requests | < 0.1% | Counter ratio |
| Availability | Successful health checks / total | > 99.9% | Uptime probe |
| Saturation | CPU/Memory utilization | < 80% | Resource metrics |
| Throughput | Requests per second | Service-specific | Counter rate |
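
To make the ratio-based targets concrete, here is a minimal sketch of the arithmetic behind the Error Rate and Availability rows. The class and method names (SliMath, ErrorRatePercent, AllowedDowntime) are hypothetical and not part of any library; in practice these values would normally be computed by the metrics backend rather than in service code.

```csharp
using System;

public static class SliMath
{
    // Error rate as a percentage: 5xx responses / total requests * 100.
    // Returns 0 when there was no traffic, to avoid dividing by zero.
    public static double ErrorRatePercent(long serverErrors, long totalRequests) =>
        totalRequests == 0 ? 0.0 : 100.0 * serverErrors / totalRequests;

    // Downtime that an availability target (given as a percentage, e.g. 99.9)
    // still allows within the given window.
    public static TimeSpan AllowedDowntime(double availabilityPercent, TimeSpan window)
    {
        double allowedFraction = (100.0 - availabilityPercent) / 100.0;
        // Round to whole ticks so floating-point noise does not shave off a tick.
        return TimeSpan.FromTicks((long)Math.Round(window.Ticks * allowedFraction));
    }
}
```

For example, a 99.9% availability target over a 30-day window (43,200 minutes) leaves about 43.2 minutes of allowed downtime, and 12 server errors out of 20,000 requests is an error rate of 0.06%, which is inside the 0.1% target.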

Alert Thresholds

Severity Levels

| Severity | Response | Examples |
|---|---|---|
| Critical | Immediate page | Service down, data loss risk, SLA breach |
| Warning | Review within hours | Degraded performance, approaching limits |
| Info | Review next business day | Anomalies, capacity planning signals |

Standard Thresholds

| Metric | Warning | Critical | For Duration (Warning / Critical) |
|---|---|---|---|
| Error rate | > 1% | > 5% | 5 min / 2 min |
| Latency p95 | > 1s | > 3s | 5 min / 2 min |
| Latency p99 | > 2s | > 5s | 5 min / 2 min |
| CPU usage | > 70% | > 90% | 10 min / 5 min |
| Memory usage | > 75% | > 90% | 10 min / 5 min |
| Queue depth | > 1000 | > 5000 | 5 min / 2 min |
| Queue age (oldest msg) | > 5 min | > 15 min | 5 min / 2 min |
| Health check failures | 1 failure | 3 consecutive | immediate / 1 min |
| Connection pool exhaustion | > 80% | > 95% | 5 min / 2 min |
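
The following sketch illustrates how a threshold combined with a "for duration" can be read: an alert fires only when the metric has stayed above the limit for the whole hold period. It assumes the two durations in the table apply to the warning and critical levels respectively. All type names here (Severity, ThresholdRule, AlertEvaluator) are hypothetical; real alert rules would normally live in the monitoring system, not in application code.

```csharp
using System;
using System.Collections.Generic;

public enum Severity { None, Warning, Critical }

// One row of the thresholds table: limits plus how long each must be exceeded.
public sealed record ThresholdRule(
    double Warning, double Critical, TimeSpan WarningFor, TimeSpan CriticalFor);

public static class AlertEvaluator
{
    // Samples must be ordered oldest to newest.
    public static Severity Evaluate(
        ThresholdRule rule, IReadOnlyList<(DateTimeOffset At, double Value)> samples)
    {
        if (samples.Count == 0) return Severity.None;
        if (HeldAbove(samples, rule.Critical, rule.CriticalFor)) return Severity.Critical;
        if (HeldAbove(samples, rule.Warning, rule.WarningFor)) return Severity.Warning;
        return Severity.None;
    }

    // True when the most recent samples all exceed the limit and that
    // unbroken run spans at least the hold duration.
    private static bool HeldAbove(
        IReadOnlyList<(DateTimeOffset At, double Value)> samples, double limit, TimeSpan hold)
    {
        var newest = samples[samples.Count - 1].At;
        for (int i = samples.Count - 1; i >= 0; i--)
        {
            if (samples[i].Value <= limit) return false;       // run is broken
            if (newest - samples[i].At >= hold) return true;   // run is long enough
        }
        return false; // not enough history to cover the hold duration
    }
}
```

With the error-rate row (warning above 1% for 5 min, critical above 5% for 2 min), a rate that has been above 5% for the last two minutes evaluates to Critical, while a single spike above 5% does not.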

Dashboard Templates

Service Health Dashboard

Required panels:

  1. Request Rate - req/s over time
  2. Error Rate - % errors with breakdown by status code
  3. Latency Histogram - p50, p95, p99 percentiles
  4. Active Connections - Current connection count
  5. Health Check Status - Liveness and readiness state
  6. Instance Count - Number of running replicas

API Performance Dashboard

Required panels:

  1. Endpoint Latency Breakdown - Latency by endpoint
  2. Top 10 Slowest Endpoints - Sorted by p95 latency
  3. Error Breakdown by Status Code - 4xx vs 5xx distribution
  4. Request Volume by Endpoint - Traffic distribution
  5. Request Duration Heatmap - Time vs latency visualization

Resource Usage Dashboard

Required panels:

  1. CPU Utilization - Per instance over time
  2. Memory Usage - Heap, working set, GC metrics
  3. GC Metrics - Gen0/Gen1/Gen2 collections, pause times
  4. Thread Pool - Worker threads, completion port threads
  5. Connection Pools - Database, HTTP client pool saturation
  6. Disk I/O - If applicable

OpenTelemetry Patterns

Use the MyOrganization.OpenTelemetry library; see its README for setup details.

Observability Triad

All services should inject these three interfaces for complete observability:

| Interface | Purpose | Usage |
|---|---|---|
| ILogger<T> | Structured logging | Log events with contextual data |
| IDistributedTracing | Distributed tracing | Create spans/activities for operations |
| IMeterFactory | Metrics | Create counters, histograms, gauges |

Constructor pattern:

```csharp
public MyService(
    ILogger<MyService> logger,
    IDistributedTracing distributedTracing,
    IMeterFactory meterFactory)
{
    // Store the three dependencies in fields for later use.
}
```

Registration

```csharp
// Program.cs
builder.ConfigureOpenTelemetry();
```

Configuration

```json
{
  "OpenTelemetry": {
    "Service": {
      "Name": "{ServiceName}",
      "Version": "1.0.0",
      "Namespace": "{Namespace}"
    },
    "Http": {
      "RecordException": true,
      "CaptureBody": true
    },
    "Sql": {
      "CaptureParameters": true
    }
  },
  "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317"
}
```

Middleware (Optional)

```csharp
app.UseRouting();
app.UseHttpBodyCapture(); // Must be after UseRouting
app.MapControllers();
```

Custom Metrics

Use IMeterFactory for metrics. Combined with ILogger<T> and IDistributedTracing, these form the observability triad.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

public class MyService : IDisposable
{
    private readonly ILogger<MyService> _logger;
    private readonly IDistributedTracing _tracer;
    private readonly Meter _meter;
    private readonly Counter<long> _itemsProcessed;
    private readonly Histogram<double> _processingDuration;

    public MyService(
        ILogger<MyService> logger,
        IDistributedTracing distributedTracing,
        IMeterFactory meterFactory)
    {
        _logger = logger;
        _tracer = distributedTracing;

        // One meter per service class, tagged with where it was created.
        _meter = meterFactory.Create(new MeterOptions(Startup.AssemblyName)
        {
            Version = Startup.AssemblyVersion,
            Tags = new TagList
            {
                { "code.namespace", GetType().Namespace },
                { "code.class", GetType().Name }
            }
        });

        _itemsProcessed = _meter.CreateCounter<long>(
            "items_processed",
            unit: "{item}",
            description: "Number of items processed");

        _processingDuration = _meter.CreateHistogram<double>(
            "processing_duration",
            unit: "ms",
            description: "Time to process an item");
    }

    public async Task ProcessAsync(Item item)
    {
        // Null-conditional calls below assume StartActivity can return null
        // when the operation is not sampled, as ActivitySource.StartActivity does.
        using var activity = _tracer.StartActivity("ProcessItem");
        var sw = Stopwatch.StartNew();

        try
        {
            _logger.LogDebug("Processing item {ItemId}", item.Id);
            await Task.CompletedTask; // Placeholder: await the real processing here
            _itemsProcessed.Add(1, new TagList { { "status", "success" } });
            activity?.SetStatus(ActivityStatusCode.Ok);
        }
        catch (Exception ex)
        {
            _itemsProcessed.Add(1, new TagList { { "status", "failure" } });
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            _logger.LogError(ex, "Failed to process item {ItemId}", item.Id);
            throw;
        }
        finally
        {
            // TotalMilliseconds keeps sub-millisecond precision for the histogram.
            _processingDuration.Record(sw.Elapsed.TotalMilliseconds);
        }
    }

    public void Dispose()
    {
        // Meters created through IMeterFactory are owned by the factory,
        // so this call is not strictly required.
        _meter.Dispose();
        GC.SuppressFinalize(this);
    }
}
```
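
One way to check that a counter like the one above emits the expected values and tags is to attach a MeterListener in a test. The sketch below uses only System.Diagnostics.Metrics from the base class library; the helper name MetricCapture and the meter name used in the usage note are hypothetical, and the helper only observes long-valued instruments.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class MetricCapture
{
    // Runs `emit` while listening to the named meter and returns every
    // long measurement recorded, with the value of its "status" tag if present.
    public static List<(string Instrument, long Value, string? Status)> Capture(
        string meterName, Action emit)
    {
        var captured = new List<(string Instrument, long Value, string? Status)>();
        using var listener = new MeterListener();

        // Subscribe only to instruments that belong to the meter under test.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName) l.EnableMeasurementEvents(instrument);
        };

        listener.SetMeasurementEventCallback<long>((instrument, value, tags, _) =>
        {
            string? status = null;
            foreach (var tag in tags)
            {
                if (tag.Key == "status") status = tag.Value?.ToString();
            }
            captured.Add((instrument.Name, value, status));
        });

        listener.Start();
        emit();
        return captured;
    }
}
```

In a test, create a Meter named, say, "Demo.Service" with an items_processed counter, call Capture("Demo.Service", ...) around code that adds to the counter, and assert on the returned instrument names, values, and status tags.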

Custom Traces

Use the IDistributedTracing interface for distributed tracing. It is injected through the constructor.

```csharp
using System.Diagnostics;

public class MyService
{
    private readonly ILogger<MyService> _logger;
    private readonly IDistributedTracing _tracer;

    public MyService(
        ILogger<MyService> logger,
        IDistributedTracing distributedTracing)
    {
        _logger = logger;
        _tracer = distributedTracing;
    }

    public async Task ProcessAsync(Item item)
    {
        // Null-conditional calls assume StartActivity can return null
        // when the operation is not sampled, as ActivitySource.StartActivity does.
        using var activity = _tracer.StartActivity("ProcessItem");
        activity?.SetTag("item.id", item.Id);
        activity?.SetTag("item.type", item.Type);

        try
        {
            _logger.LogInformation("Processing item {ItemId}", item.Id);
            await Task.CompletedTask; // Placeholder: await the real processing here
            activity?.SetStatus(ActivityStatusCode.Ok);
        }
        catch (Exception ex)
        {
            // Mark the span as failed and record what went wrong.
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.SetTag("exception.type", ex.GetType().FullName);
            activity?.SetTag("exception.message", ex.Message);
            _logger.LogError(ex, "Failed to process item {ItemId}", item.Id);
            throw;
        }
    }
}
```

Semantic Conventions

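Use OpenTelemetry semantic conventions for tag names. The names below follow the older attribute naming; newer releases of the conventions rename several of them (for example, http.method became http.request.method and http.status_code became http.response.status_code), so check which version your collector and backend expect: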

| Category | Convention | Example |
|---|---|---|
| HTTP | http.method, http.status_code, http.url | http.method=POST |
| Database | db.system, db.name, db.statement | db.system=mssql |
| Messaging | messaging.system, messaging.destination | messaging.system=rabbitmq |
| Exception | exception.type, exception.message | exception.type=InvalidOperationException |
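
As an illustration of applying the HTTP names from the table, the sketch below tags a server-side span using the built-in System.Diagnostics.ActivitySource rather than the organization's IDistributedTracing interface, whose exact API is not shown here. The class name TracingSketch, its methods, and the source name "Demo.Service" are hypothetical. The ListenTo helper exists only so the example produces activities without a full OpenTelemetry pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TracingSketch
{
    // Hypothetical source name; a real service would use its own.
    public static readonly ActivitySource Source = new("Demo.Service");

    // Starts a server span and tags it with convention names from the table.
    public static Activity? StartRequestSpan(string method, string url)
    {
        var activity = Source.StartActivity("HTTP " + method, ActivityKind.Server);
        // StartActivity returns null when no listener samples this source.
        activity?.SetTag("http.method", method);
        activity?.SetTag("http.url", url);
        return activity;
    }

    // Records the response status and ends the span; 5xx is treated as an error.
    public static void Complete(Activity? activity, int statusCode)
    {
        activity?.SetTag("http.status_code", statusCode);
        activity?.SetStatus(statusCode >= 500 ? ActivityStatusCode.Error : ActivityStatusCode.Ok);
        activity?.Dispose();
    }

    // Registers a listener that samples everything from the named source
    // and collects activities as they stop.
    public static ActivityListener ListenTo(string sourceName, List<Activity> stopped)
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = stopped.Add
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Keeping the tag names in one helper like this makes it easier to switch to the renamed attributes later without touching every call site.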

Architect Checklist

When defining observability requirements in technical design:

  1. Which SLIs matter for this service?
  2. What are the target values for each SLI?
  3. Which dashboards are required?
  4. What alert conditions and thresholds apply?
  5. What custom metrics are needed?
  6. What traces should be captured?
  7. What log levels and structured fields are required?

Developer Checklist

When implementing observability:

  1. OpenTelemetry configured via ConfigureOpenTelemetry()
  2. Service name and version set in configuration
  3. Custom metrics created per Architect's requirements
  4. Critical operations have traces with appropriate tags
  5. Structured logging with event IDs for significant operations
  6. Grafana dashboard JSON created
  7. Alert rules configured
  8. Runbook draft includes observability section