Observability & Production Readiness Review
Scope Definition (MANDATORY)
Identify the operational boundary under review.
- Logging Infra: Structured logging implementation, log rotation, ingestion.
- Error Handling Centralization: Sentry/Datadog integration, global exception handlers.
- Stability Points: Health checks, retries, circuit breakers, timeout policies.
Audit Dimensions
1. Logging Strategy & Data Privacy
Evaluate the quality and safety of system logs.
- Structured Logging: Are logs emitted as JSON or as raw text? (Searchable logs require structure.)
- PII Leakage: Are emails, passwords, QR codes, or internal user IDs logged in plain text?
- Log Levels: Are DEBUG, INFO, WARN, and ERROR used correctly? Is production too noisy or too silent?
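As a reference point for the structured-logging and PII checks above, here is a minimal Python sketch of a JSON formatter that masks sensitive values before they reach the log stream. The names `JsonFormatter`, `format_event`, `SENSITIVE_KEYS`, and the `context` attribute are assumptions made for illustration, not part of any system under review, and the email pattern is deliberately simple.

```python
import json
import logging
import re

# Hypothetical pattern and key names; tune them to the data the system handles.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SENSITIVE_KEYS = {"password", "email", "qr_code"}


def redact(value):
    """Mask email addresses inside a string; leave other types untouched."""
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED]", value)
    return value


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": redact(record.getMessage()),
        }
        # Copy structured context, masking any key listed as sensitive.
        for key, value in getattr(record, "context", {}).items():
            payload[key] = "[REDACTED]" if key in SENSITIVE_KEYS else redact(value)
        return json.dumps(payload)


def format_event(level, message, **context):
    """Format one event without touching global logging configuration."""
    record = logging.LogRecord("app", level, "app.py", 0, message, None, None)
    record.context = context
    return JsonFormatter().format(record)


print(format_event(logging.WARNING, "login failed for alice@example.com",
                   request_id="req-1", password="hunter2"))
```

In a real service the formatter would be attached to a handler with `handler.setFormatter(JsonFormatter())`; the helper exists only to make the output easy to inspect.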
2. Global Error Centralization
- Exception Mapping: Does the system use a global handler to catch unhandled exceptions and promise rejections?
- Context Availability: Do error reports include request IDs, user IDs, and stack traces?
- Silent Failures: Are there try/catch blocks that swallow errors without logging or reporting them?
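The three checks above can be illustrated with a short Python sketch: a report builder that carries context, a process-wide hook for exceptions nothing caught, and a catch block that reports before falling back. The functions `build_error_report`, `report`, `install_global_handler`, and `parse_port` are hypothetical; a real system would forward the report to its error tracker (Sentry, Datadog, or similar) rather than only logging it.

```python
import logging
import sys
import traceback

logger = logging.getLogger("app.errors")


def build_error_report(exc, request_id=None):
    """Collect the fields a reviewer expects: type, message, request ID, stack trace."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "request_id": request_id,
        "stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def report(exc, request_id=None):
    """Hypothetical reporting hook: log the report and return it."""
    data = build_error_report(exc, request_id)
    logger.error("error reported", extra={"report": data})
    return data


def install_global_handler():
    """Route exceptions that no code caught through report()."""
    def handle(exc_type, exc, tb):
        report(exc)
    sys.excepthook = handle


def parse_port(raw, request_id=None):
    """Fall back to None on bad input, but report the error instead of swallowing it."""
    try:
        return int(raw)
    except ValueError as exc:
        report(exc, request_id)
        return None
```

The pattern to flag during review is the opposite of `parse_port`: an `except` (or `catch`) body that returns a default and leaves no trace of the failure.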
3. Resilience & Failure Containment
- Retry Policies: Do external calls (e.g., mailer, payment API) retry with exponential backoff?
- Timeouts: Are there blocking calls without timeouts that could hang the entire process?
- Health Checks: Are there endpoints that let the orchestrator (Kubernetes/Docker) detect dead processes?
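For the retry check, a minimal Python sketch of exponential backoff with jitter is shown below. The function `call_with_retry` and its parameters are assumptions for illustration; the wrapped operation is expected to enforce its own timeout (for example through the `timeout` argument of its HTTP client) so that a hung call surfaces as an error the wrapper can retry.

```python
import random
import time


def call_with_retry(operation, attempts=4, base_delay=0.5, max_delay=8.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(); on a retryable error wait, then try again.

    The wait before retry n is base_delay * 2**n, capped at max_delay and
    scaled by a random factor in [0.5, 1.0] so clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error instead of hiding it.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, which keeps non-transient failures (bad credentials, validation errors) from being retried pointlessly.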
4. Security Audit Trail
- Action Logging: Are security-sensitive actions (role changes, login attempts, manual QR validations) logged in an immutable audit trail?
- Traceability: Can a single request be traced from the frontend through all backend services using a correlation ID?
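One way to picture both checks together is an audit entry that carries the correlation ID and is chained to the previous entry by hash. This Python sketch uses hash chaining, which makes tampering detectable; it does not provide immutable storage by itself and does not detect removal of the most recent entries. All function and field names here are hypothetical.

```python
import hashlib
import json
import uuid


def new_correlation_id(incoming=None):
    """Reuse the ID supplied by the caller (e.g. a request header) or mint a new one."""
    return incoming or uuid.uuid4().hex


def _digest(body):
    """Hash an entry body with a stable key order."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_audit_entry(trail, action, actor, correlation_id):
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev_hash = trail[-1]["hash"] if trail else "0" * 64
    body = {"action": action, "actor": actor,
            "correlation_id": correlation_id, "prev_hash": prev_hash}
    trail.append({**body, "hash": _digest(body)})
    return trail[-1]


def verify_trail(trail):
    """Return True only if every entry's hash and link to its predecessor check out."""
    prev_hash = "0" * 64
    for entry in trail:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

Because every entry stores the same `correlation_id` that the request carried through the services, an auditor can join audit entries against application logs for that request.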
Severity Classification
- CRITICAL: Sensitive data leak (PII) in logs or total lack of unhandled exception tracking.
- HIGH: Silent failures on critical business logic or lack of audit trail for sensitive actions.
- MEDIUM: Unstructured logs or missing timeouts in external integrations.
- LOW: Inconsistent log levels or minor missing context in error reports.
What this skill does NOT review (Avoid overlap)
- Business Correctness: Does the logic work? (Use backend-code-review.)
- UI Structure: React/frontend architecture. (Use frontend-structure-review.)
- Token Security: JWT crypto. (Use token-expiration-check.)
- Data Layer: SQL/index performance. (Use data-layer-review.)
When to use/not use
- USE when: Preparing for production deployment or auditing system stability.
- DO NOT USE when: Reviewing frontend CSS or local business logic changes.
Mandatory Output Format
1. Observability Maturity Gaps
Ranked list of missing operational links.
2. Risk Assessment Table
| Operational Risk | Impact | Recommended Action |
|---|---|---|
| [PII Leak] | CRITICAL | ... |
| [No Retries] | MEDIUM | ... |
3. Stability Baseline
Assessment of timeouts, retries, and health checks coverage.
4. Hardening Roadmap
Immediate fixes for production readiness and long-term instrumentation plan.