Error Handling Patterns

Pattern Selection Guide

Pattern	When to Use
Exceptions	Unexpected failures, I/O errors, truly exceptional conditions
Result/Either types	Expected failures (validation, parsing), functional codebases
Sentinel errors	Go; predefined error values that callers match with `errors.Is()`
Error codes	Cross-boundary APIs, gRPC status codes
Option/Maybe	Nullable values where absence is normal, not an error
Panic/crash	Unrecoverable errors, programming bugs, violated invariants

Decision Framework

•Can the caller reasonably recover? -> Result type or checked exception
•Is this a programming bug? -> Panic/crash (fail fast)
•Is this crossing a system boundary? -> Error codes with metadata
•Is this just "no value"? -> Option type, not null
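
As a minimal sketch of the first and last questions above, assuming Python 3.9+ and no third-party library: an expected failure is returned as a value, and plain absence is returned as None. The names Ok, Err, parse_port, find_user, and USERS are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E


Result = Union[Ok[T], Err[E]]


def parse_port(raw: str) -> Result[int, str]:
    """Expected failure (bad input): return an Err instead of raising."""
    try:
        port = int(raw)
    except ValueError:
        return Err(f"not a number: {raw!r}")
    if not 1 <= port <= 65535:
        return Err(f"out of range: {port}")
    return Ok(port)


USERS = {1: "alice"}  # hypothetical in-memory data


def find_user(user_id: int) -> Optional[str]:
    """Absence is normal here, so return None rather than an error."""
    return USERS.get(user_id)
```

The caller is forced to look at the returned value before using it, which is the point of choosing a Result or Option type over an exception for failures that are part of normal operation.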

Error Categories

Recoverable (handle gracefully):

•Network timeouts, rate limits -> retry with backoff
•Invalid user input -> validation error with details
•Missing resources -> 404, fallback, or cache
•Repeated failures of a dependency -> circuit breaker

Unrecoverable (crash and restart):

•Out of memory, stack overflow
•Corrupted state, violated invariants
•Missing required configuration at startup

Universal Principles

1. Fail Fast and Fail Loud

•Validate inputs at system boundaries immediately
•Don't propagate bad data deep into business logic
•Startup: fail the service if illegal state, missing config, or missing secrets are encountered -- don't limp along in a broken state
•Requests: fail immediately if invalid arguments or unexpected state is encountered -- don't let bad data propagate
•This ties in with "make invalid states unrepresentable" (see workflow:code-quality): validate early at boundaries, convert to constrained types, and pass those types downstream so invalid states can't reach business logic
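
One way to sketch "validate at the boundary, then pass constrained types downstream" in Python is shown here. EmailAddress, handle_signup, and send_welcome are hypothetical names, and the email check is deliberately simplistic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailAddress:
    """A constrained type: instances only exist if validation passed."""

    value: str

    def __post_init__(self) -> None:
        # Deliberately simple check for illustration, not full RFC validation.
        local, sep, domain = self.value.partition("@")
        if not (local and sep and "." in domain):
            raise ValueError(f"invalid email address: {self.value!r}")


def handle_signup(raw_email: str) -> str:
    """Boundary: convert untrusted input into a constrained type immediately."""
    email = EmailAddress(raw_email.strip())  # raises here, not deep in the logic
    return send_welcome(email)


def send_welcome(email: EmailAddress) -> str:
    """Business logic: accepts only the constrained type, so no re-validation."""
    return f"welcome mail queued for {email.value}"
```

Because send_welcome takes an EmailAddress rather than a str, a type checker can flag any call path that skips validation.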

2. Handle at the Right Level

•Catch where you can meaningfully handle (retry, fallback, user message)
•Don't catch just to log and re-throw -- that creates duplicate logs
•Low-level code: propagate errors. High-level code: handle them.

3. Preserve Context

•Wrap errors with context: "failed to create user: <original error>"
•Include operation, inputs, and timestamp in error metadata
•Use error chaining (raise ... from e in Python, %w with fmt.Errorf in Go, the cause constructor argument in Java)
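
A small Python sketch of wrapping with context and chaining. UserCreationError, insert_row, and create_user are hypothetical; insert_row stands in for a real database call and always fails here.

```python
class UserCreationError(Exception):
    """Hypothetical domain error raised by the service layer."""


def insert_row(table: str, row: dict) -> None:
    """Stand-in for a low-level database call that fails."""
    raise ConnectionError("database unreachable")


def create_user(email: str) -> None:
    try:
        insert_row("users", {"email": email})
    except ConnectionError as e:
        # "from e" stores the original error as __cause__, so the traceback
        # shows both the high-level operation and the low-level reason.
        raise UserCreationError(f"failed to create user {email!r}") from e
```

Whoever catches UserCreationError can still reach the original ConnectionError through its __cause__ attribute.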

4. Error Hierarchy Design

code

ApplicationError (base)
  ├── ValidationError (400)
  ├── NotFoundError (404)
  ├── AuthorizationError (403)
  ├── ConflictError (409)
  └── ExternalServiceError (502)
        ├── service name
        └── original error

•Map error types to HTTP status codes at the API boundary
•Include machine-readable code field: "USER_NOT_FOUND", "RATE_LIMITED"
•Keep user-facing messages separate from developer details
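
The hierarchy and the boundary mapping could look roughly like this in Python 3.9+. It is a sketch, not a framework integration: the class attributes, the to_http helper, and the exact code strings are assumptions, and only part of the tree is shown.

```python
class ApplicationError(Exception):
    status = 500
    code = "INTERNAL_ERROR"

    def __init__(self, user_message: str, detail: str = "") -> None:
        super().__init__(detail or user_message)
        self.user_message = user_message  # safe to show to end users
        self.detail = detail              # developer detail, for logs only


class ValidationError(ApplicationError):
    status, code = 400, "VALIDATION_ERROR"


class NotFoundError(ApplicationError):
    status, code = 404, "NOT_FOUND"


class ExternalServiceError(ApplicationError):
    status, code = 502, "EXTERNAL_SERVICE_ERROR"

    def __init__(self, service: str, original: Exception) -> None:
        super().__init__("An upstream service is unavailable", f"{service}: {original}")
        self.service = service
        self.original = original


def to_http(error: Exception) -> tuple[int, dict]:
    """API boundary: translate any error into (status, JSON-serializable body)."""
    if isinstance(error, ApplicationError):
        return error.status, {"error": {"code": error.code, "message": error.user_message}}
    # Unknown errors are bugs: report a generic 500 and keep details out of the body.
    return 500, {"error": {"code": "INTERNAL_ERROR", "message": "Internal server error"}}
```

Only user_message reaches the response body; detail and original stay available for logging on the server side.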

5. Don't Swallow Errors

code

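import logging

logger = logging.getLogger(__name__)

# do_thing, SpecificError, and fallback_value are placeholders for your own code.

# BAD: catches everything and hides the failure
def bad():
    try:
        do_thing()
    except Exception:
        pass  # silent failure

# GOOD: catch only the expected error, log it, and return a fallback
def good():
    try:
        return do_thing()
    except SpecificError as e:
        logger.warning("Expected failure: %s", e)
        return fallback_value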

6. Log Appropriately

•Error: Unexpected failures requiring investigation
•Warning: Expected failures handled gracefully
•Don't log: Every caught exception -- log once, at the point where the error is actually handled, not at every layer it passes through

Resilience Patterns

Retry with Backoff

•Only retry transient errors (network, 503, 429)
•Never retry: 400, 401, 403, 404, 422
•Exponential backoff: delay * 2^attempt with jitter
•Start with a maximum of 3 attempts -- further attempts usually add latency with little gain in success rate
•Use tenacity (Python), p-retry (JS), or language-native constructs
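
For illustration, the rules above can be sketched with the standard library alone. TransientError and retry are hypothetical names; the sleep parameter exists only so the delay can be replaced in tests. In production a library such as tenacity is usually the better choice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 503, 429)."""


def retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying only TransientError with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle it
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Any exception other than TransientError propagates on the first attempt, which matches the "never retry" list: a 400 or 404 will not succeed on a second try.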

Circuit Breaker

States: CLOSED (normal) -> OPEN (failing, reject fast) -> HALF_OPEN (testing recovery) -> CLOSED on success, or back to OPEN on failure

Parameter	Starting Value
Failure threshold	5 consecutive failures
Open duration	60 seconds
Half-open success threshold	2 successes to close

•Apply per external dependency, not globally
•Monitor circuit state transitions as metrics
•Use libraries: pybreaker, opossum (JS), gobreaker
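
To make the state machine concrete, here is a minimal single-threaded sketch using the starting values from the table as defaults. CircuitBreaker and CircuitOpenError are hypothetical names, the clock parameter is an assumption added for testability, and the code is not thread-safe; the libraries listed above handle those concerns.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("rejected without calling the dependency")
            self.state, self.successes = "HALF_OPEN", 0  # allow trial calls
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Any failure while half-open, or too many while closed, opens it.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", self.clock()
            raise
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "CLOSED", 0
        else:
            self.failures = 0  # "consecutive" failures: a success resets the count
        return result
```

Creating one CircuitBreaker instance per external dependency keeps a failing service from tripping the breaker for healthy ones.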

Graceful Degradation

•Primary -> fallback -> cached value -> default
•Example: live price API -> cached price -> last known price -> "price unavailable"
•Log each fallback step for observability
•Never let a non-critical dependency take down the whole request
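
The fallback chain can be sketched as an ordered list of sources. first_available, live_price, and cached_price are hypothetical, and the choice to treat OSError (which covers timeouts and connection errors) and LookupError (cache misses) as "source unavailable" is an assumption for this example.

```python
import logging

logger = logging.getLogger(__name__)


def first_available(sources, default):
    """Try each (name, fetch) pair in order; return the first value that works."""
    for name, fetch in sources:
        try:
            return fetch()
        except (OSError, LookupError) as e:
            # Log every fallback step so degradation is visible in monitoring.
            logger.warning("price source %r failed, falling back: %s", name, e)
    return default


def live_price():
    raise TimeoutError("price API timed out")  # hypothetical failing primary


def cached_price():
    return 101.5  # hypothetical cache hit


price = first_available([("live", live_price), ("cache", cached_price)], default=None)
```

Only the listed exception types trigger a fallback; anything else is treated as a bug and propagates.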

Error Aggregation

•Collect all validation errors before returning (don't fail on first)
•Return all errors at once: { errors: [{field: "email", message: "invalid"}, ...] }
•Use AggregateError (JS) or collect into a list
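
In Python, collecting into a list might look like this. validate_signup and its two rules are hypothetical and mirror the fields used in the response example in this document.

```python
def validate_signup(payload: dict) -> list[dict]:
    """Run every check and return all failures, not just the first."""
    errors = []
    email = payload.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append({"field": "email", "message": "must be a valid email"})
    age = payload.get("age")
    if not isinstance(age, int) or age < 18:
        errors.append({"field": "age", "message": "must be >= 18"})
    return errors
```

An empty list means the payload passed; otherwise the whole list goes into the error response so the user can fix everything in one round trip.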

API Error Response Format

json

{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Request validation failed",
    "details": [
      {"field": "email", "message": "must be a valid email"},
      {"field": "age", "message": "must be >= 18"}
    ],
    "request_id": "req_abc123"
  }
}

•Always include request_id for debugging
•code is machine-readable, message is human-readable
•details array for multi-field validation errors
•Never expose stack traces or internal paths in production
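
A small helper that produces this envelope, as a sketch: error_response is a hypothetical name, and generating a request_id of the form req_ plus random hex when none is supplied is an assumption, since real systems usually take it from the incoming request or tracing middleware.

```python
import json
import uuid


def error_response(code, message, details=None, request_id=None):
    """Build the error envelope; never include stack traces or file paths."""
    return {
        "error": {
            "code": code,
            "message": message,
            "details": details or [],
            "request_id": request_id or f"req_{uuid.uuid4().hex[:12]}",
        }
    }


body = error_response(
    "VALIDATION_ERROR",
    "Request validation failed",
    details=[{"field": "email", "message": "must be a valid email"}],
    request_id="req_abc123",
)
payload = json.dumps(body)  # plain dicts and lists, so it serializes directly
```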