Observability Skill
Defines observability standards for .NET services including SLIs, dashboards, alerts, and instrumentation patterns.
Roles
- Architect: Defines observability requirements in the technical design
- Developer: Implements observability per the Architect's specifications
Standard SLIs
| SLI | Description | Target | Measurement |
|---|---|---|---|
| Latency p50 | Median response time | < 100ms | Histogram quantile |
| Latency p95 | 95th percentile response time | < 500ms | Histogram quantile |
| Latency p99 | 99th percentile response time | < 1000ms | Histogram quantile |
| Error Rate | 5xx responses / total requests | < 0.1% | Counter ratio |
| Availability | Successful health checks / total | > 99.9% | Uptime probe |
| Saturation | CPU/Memory utilization | < 80% | Resource metrics |
| Throughput | Requests per second | Service-specific | Counter rate |
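
The Measurement column assumes the metrics backend does the arithmetic. As a rough illustration of what the latency and error-rate SLIs compute, here is a minimal sketch over raw samples. `SliCalculator` and its method names are hypothetical and not part of any library named in this document. It uses a nearest-rank percentile, whereas a backend histogram quantile interpolates between bucket boundaries and will give approximate values.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only.
public static class SliCalculator
{
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(double[] latenciesMs, double p)
    {
        if (latenciesMs.Length == 0)
            throw new ArgumentException("No samples.", nameof(latenciesMs));

        var sorted = latenciesMs.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }

    // Error rate as a percentage: 5xx responses / total requests.
    public static double ErrorRatePercent(long serverErrors, long totalRequests) =>
        totalRequests == 0 ? 0.0 : 100.0 * serverErrors / totalRequests;
}
```

For 100 samples of 1 ms through 100 ms, `Percentile(samples, 95)` returns 95, which is inside the p95 target of < 500ms. `ErrorRatePercent(5, 1000)` returns 0.5, which misses the < 0.1% target.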
Alert Thresholds
Severity Levels
| Severity | Response | Examples |
|---|---|---|
| Critical | Immediate page | Service down, data loss risk, SLA breach |
| Warning | Review within hours | Degraded performance, approaching limits |
| Info | Review next business day | Anomalies, capacity planning signals |
Standard Thresholds
| Metric | Warning | Critical | For Duration (Warning / Critical) |
|---|---|---|---|
| Error rate | > 1% | > 5% | 5 min / 2 min |
| Latency p95 | > 1s | > 3s | 5 min / 2 min |
| Latency p99 | > 2s | > 5s | 5 min / 2 min |
| CPU usage | > 70% | > 90% | 10 min / 5 min |
| Memory usage | > 75% | > 90% | 10 min / 5 min |
| Queue depth | > 1000 | > 5000 | 5 min / 2 min |
| Queue age (oldest msg) | > 5 min | > 15 min | 5 min / 2 min |
| Health check failures | 1 failure | 3 consecutive | immediate / 1 min |
| Connection pool exhaustion | > 80% | > 95% | 5 min / 2 min |
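
The "For Duration" column means a threshold must be exceeded continuously for that long before the alert fires. In practice this logic would normally live in the alerting system's rule engine, not in service code. The sketch below only illustrates the semantics, and `Severity`, `ThresholdRule`, and `AlertEvaluator` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public enum Severity { None, Warning, Critical }

// Hypothetical rule shape mirroring one row of the thresholds table.
public sealed record ThresholdRule(
    double Warning, TimeSpan WarningFor,
    double Critical, TimeSpan CriticalFor);

public static class AlertEvaluator
{
    // Samples must be ordered oldest to newest. Critical is checked first
    // so the more severe level wins when both are breached.
    public static Severity Evaluate(
        ThresholdRule rule,
        IReadOnlyList<(DateTimeOffset At, double Value)> samples)
    {
        if (Breached(samples, rule.Critical, rule.CriticalFor)) return Severity.Critical;
        if (Breached(samples, rule.Warning, rule.WarningFor)) return Severity.Warning;
        return Severity.None;
    }

    // True when every trailing sample is above the threshold and those
    // samples span at least the required duration.
    private static bool Breached(
        IReadOnlyList<(DateTimeOffset At, double Value)> samples,
        double threshold,
        TimeSpan duration)
    {
        if (samples.Count == 0) return false;

        var newest = samples[samples.Count - 1].At;
        for (int i = samples.Count - 1; i >= 0; i--)
        {
            if (samples[i].Value <= threshold) return false;
            if (newest - samples[i].At >= duration) return true;
        }
        return false; // Not enough history above the threshold yet.
    }
}
```

With the error-rate row (`new ThresholdRule(1, TimeSpan.FromMinutes(5), 5, TimeSpan.FromMinutes(2))`), six one-minute samples at 2% evaluate to `Warning`. If the last three samples rise to 6%, the result becomes `Critical`.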
Dashboard Templates
Service Health Dashboard
Required panels:
- Request Rate - req/s over time
- Error Rate - % errors with breakdown by status code
- Latency Histogram - p50, p95, p99 percentiles
- Active Connections - Current connection count
- Health Check Status - Liveness and readiness state
- Instance Count - Number of running replicas
API Performance Dashboard
Required panels:
- Endpoint Latency Breakdown - Latency by endpoint
- Top 10 Slowest Endpoints - Sorted by p95 latency
- Error Breakdown by Status Code - 4xx vs 5xx distribution
- Request Volume by Endpoint - Traffic distribution
- Request Duration Heatmap - Time vs latency visualization
Resource Usage Dashboard
Required panels:
- CPU Utilization - Per instance over time
- Memory Usage - Heap, working set, GC metrics
- GC Metrics - Gen0/Gen1/Gen2 collections, pause times
- Thread Pool - Worker threads, completion port threads
- Connection Pools - Database, HTTP client pool saturation
- Disk I/O - If applicable
OpenTelemetry Patterns
Use the MyOrganization.OpenTelemetry library; see the README for details.
Observability Triad
All services should inject these three interfaces for complete observability:
| Interface | Purpose | Usage |
|---|---|---|
| ILogger<T> | Structured logging | Log events with contextual data |
| IDistributedTracing | Distributed tracing | Create spans/activities for operations |
| IMeterFactory | Metrics | Create counters, histograms, gauges |
Constructor pattern:
```csharp
public MyService(
    ILogger<MyService> logger,
    IDistributedTracing distributedTracing,
    IMeterFactory meterFactory)
{
    // Store the three dependencies in readonly fields.
}
```
Registration
```csharp
// Program.cs
builder.ConfigureOpenTelemetry();
```
Configuration
```json
{
  "OpenTelemetry": {
    "Service": {
      "Name": "{ServiceName}",
      "Version": "1.0.0",
      "Namespace": "{Namespace}"
    },
    "Http": {
      "RecordException": true,
      "CaptureBody": true
    },
    "Sql": {
      "CaptureParameters": true
    }
  },
  "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317"
}
```
Middleware (Optional)
```csharp
app.UseRouting();
app.UseHttpBodyCapture(); // Must be after UseRouting
app.MapControllers();
```
Custom Metrics
Use IMeterFactory to create metrics. Together with ILogger<T> and IDistributedTracing, it completes the observability triad.
```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

public class MyService : IDisposable
{
    private readonly ILogger<MyService> _logger;
    private readonly IDistributedTracing _tracer;
    private readonly Meter _meter;
    private readonly Counter<long> _itemsProcessed;
    private readonly Histogram<double> _processingDuration;

    public MyService(
        ILogger<MyService> logger,
        IDistributedTracing distributedTracing,
        IMeterFactory meterFactory)
    {
        _logger = logger;
        _tracer = distributedTracing;

        // One meter per service class, tagged with its code location.
        _meter = meterFactory.Create(new MeterOptions(Startup.AssemblyName)
        {
            Version = Startup.AssemblyVersion,
            Tags = new TagList
            {
                { "code.namespace", GetType().Namespace },
                { "code.class", GetType().Name }
            }
        });

        _itemsProcessed = _meter.CreateCounter<long>(
            "items_processed",
            unit: "{item}",
            description: "Number of items processed");

        _processingDuration = _meter.CreateHistogram<double>(
            "processing_duration",
            unit: "ms",
            description: "Time to process an item");
    }

    public async Task ProcessAsync(Item item)
    {
        // Null-conditional calls guard against StartActivity returning null,
        // as ActivitySource.StartActivity does when the span is not sampled.
        using var activity = _tracer.StartActivity("ProcessItem");
        var sw = Stopwatch.StartNew();
        try
        {
            _logger.LogDebug("Processing item {ItemId}", item.Id);

            // Process item (awaited work goes here)

            _itemsProcessed.Add(1, new TagList { { "status", "success" } });
            activity?.SetStatus(ActivityStatusCode.Ok);
        }
        catch (Exception ex)
        {
            _itemsProcessed.Add(1, new TagList { { "status", "failure" } });
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            _logger.LogError(ex, "Failed to process item {ItemId}", item.Id);
            throw;
        }
        finally
        {
            // TotalMilliseconds keeps sub-millisecond precision for the histogram.
            _processingDuration.Record(sw.Elapsed.TotalMilliseconds);
        }
    }

    public void Dispose()
    {
        _meter.Dispose();
        GC.SuppressFinalize(this);
    }
}
```
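
One way to check that a custom counter is actually emitted, for example in a unit test, is the `MeterListener` type from `System.Diagnostics.Metrics`. The sketch below is an assumption about how such a check could look, not part of MyOrganization.OpenTelemetry, and `MetricProbe` is a hypothetical helper name.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical test helper for illustration only.
public static class MetricProbe
{
    // Runs `act` and returns the sum of all long measurements recorded on
    // the named instrument of the named meter while `act` was running.
    // Not thread-safe: assumes measurements are recorded on the calling thread.
    public static long SumCounter(string meterName, string instrumentName, Action act)
    {
        long total = 0;

        using var listener = new MeterListener();

        // Subscribe only to the instrument under test.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName && instrument.Name == instrumentName)
                l.EnableMeasurementEvents(instrument);
        };

        listener.SetMeasurementEventCallback<long>(
            (instrument, measurement, tags, state) => total += measurement);

        listener.Start();
        act();
        return total;
    }
}
```

A test would pass the meter name and `"items_processed"`, invoke the service inside `act`, and assert on the returned sum.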
Custom Traces
Use the IDistributedTracing interface for distributed tracing; inject it through the constructor.
```csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;

public class MyService
{
    private readonly ILogger<MyService> _logger;
    private readonly IDistributedTracing _tracer;

    public MyService(
        ILogger<MyService> logger,
        IDistributedTracing distributedTracing)
    {
        _logger = logger;
        _tracer = distributedTracing;
    }

    public async Task ProcessAsync(Item item)
    {
        // Null-conditional calls guard against StartActivity returning null,
        // as ActivitySource.StartActivity does when the span is not sampled.
        using var activity = _tracer.StartActivity("ProcessItem");
        activity?.SetTag("item.id", item.Id);
        activity?.SetTag("item.type", item.Type);
        try
        {
            _logger.LogInformation("Processing item {ItemId}", item.Id);

            // Process item (awaited work goes here)

            activity?.SetStatus(ActivityStatusCode.Ok);
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.SetTag("exception.type", ex.GetType().FullName);
            activity?.SetTag("exception.message", ex.Message);
            _logger.LogError(ex, "Failed to process item {ItemId}", item.Id);
            throw;
        }
    }
}
```
Semantic Conventions
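Use OpenTelemetry semantic conventions for tag names. The names below come from an earlier revision of the conventions; later revisions rename several of them (for example, http.method became http.request.method and http.status_code became http.response.status_code), so use whichever set the collector and dashboards expect: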
| Category | Convention | Example |
|---|---|---|
| HTTP | http.method, http.status_code, http.url | http.method=POST |
| Database | db.system, db.name, db.statement | db.system=mssql |
| Messaging | messaging.system, messaging.destination | messaging.system=rabbitmq |
| Exception | exception.type, exception.message | exception.type=InvalidOperationException |
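
Keeping tag names in one helper makes it harder for spellings to drift between services. The following is a minimal sketch that applies the HTTP names from the table to a `System.Diagnostics.Activity`. `HttpSpanTags` is a hypothetical helper, and marking 5xx responses as errors is an assumption that mirrors the Error Rate SLI definition.

```csharp
using System.Diagnostics;

// Hypothetical helper for illustration only.
public static class HttpSpanTags
{
    // Applies the HTTP attribute names from the table to a span.
    // Accepts null because ActivitySource.StartActivity returns null
    // when no listener is sampling the source.
    public static void Apply(Activity? activity, string method, int statusCode, string url)
    {
        if (activity is null) return;

        activity.SetTag("http.method", method);
        activity.SetTag("http.status_code", statusCode);
        activity.SetTag("http.url", url);

        // Assumption: only server errors (5xx) mark the span as failed.
        if (statusCode >= 500)
            activity.SetStatus(ActivityStatusCode.Error);
    }
}
```

A call such as `HttpSpanTags.Apply(activity, "POST", 503, url)` sets the three tags and leaves the span with an `Error` status.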
Architect Checklist
When defining observability requirements in technical design:
- Which SLIs matter for this service?
- What are the target values for each SLI?
- Which dashboards are required?
- What alert conditions and thresholds apply?
- What custom metrics are needed?
- What traces should be captured?
- What log levels and structured fields are required?
Developer Checklist
When implementing observability:
- OpenTelemetry configured via ConfigureOpenTelemetry()
- Service name and version set in configuration
- Custom metrics created per Architect's requirements
- Critical operations have traces with appropriate tags
- Structured logging with event IDs for significant operations
- Grafana dashboard JSON created
- Alert rules configured
- Runbook draft includes observability section