java-rest-clients-resilience

Intent

Make outbound HTTP calls predictable under partial failure by standardizing:

•timeout budgeting,
•retry policy (only when safe),
•circuit breaker + bulkhead isolation,
•idempotency and deduplication,
•observability and safe fallbacks.

When to use

•Upstream is flaky or slow; cascading failures appear
•Microservices latency debugging: “which dependency is causing p99?”
•You need safe retries and consistent client behavior across services
•Incident response: timeouts and retries are inconsistent across the codebase

Core principles

•Timeouts are not optional.
•Retries are a controlled tool, not a default.
•Prefer failing fast + graceful degradation over queueing forever.
•Idempotency must be explicit for any operation that might be retried.
•Isolation prevents blast radius (bulkheads + circuit breakers).

Step 0 — Dependency classification (required)

For each upstream endpoint, classify:

•Criticality: critical / important / best-effort
•Expected latency: p50/p95/p99
•Error modes: timeouts, 5xx, 429, connection resets
•Idempotency: safe to retry? (yes/no/only with idempotency key)
•Capacity constraints: rate limits, QPS caps
•Fallback option: cached data, default response, partial features
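
The classification is easier to enforce when it is captured as data that client configuration can be derived from. The following is a minimal sketch of that idea; every type and field name (`UpstreamProfile`, `Criticality`, `RetrySafety`, `Fallback`) is hypothetical, and the latency and QPS numbers are placeholders rather than recommendations.

```java
import java.time.Duration;
import java.util.Set;

// Hypothetical types: one profile per upstream endpoint, filled in during Step 0.
final class DependencyClassification {
    enum Criticality { CRITICAL, IMPORTANT, BEST_EFFORT }
    enum RetrySafety { YES, NO, ONLY_WITH_IDEMPOTENCY_KEY }
    enum Fallback { NONE, CACHED_DATA, DEFAULT_RESPONSE, PARTIAL_FEATURES }

    record UpstreamProfile(
            String endpoint,
            Criticality criticality,
            Duration p50, Duration p95, Duration p99,  // expected latency
            Set<String> errorModes,                     // e.g. "timeout", "5xx", "429"
            RetrySafety retrySafety,
            int maxQps,                                 // capacity constraint (placeholder unit)
            Fallback fallback) {

        /** A retry is allowed only if the classification says the call is safe to repeat. */
        boolean mayRetry(boolean hasIdempotencyKey) {
            return switch (retrySafety) {
                case YES -> true;
                case NO -> false;
                case ONLY_WITH_IDEMPOTENCY_KEY -> hasIdempotencyKey;
            };
        }
    }

    public static void main(String[] args) {
        // Illustrative values only.
        UpstreamProfile payments = new UpstreamProfile(
                "/v1/payments", Criticality.CRITICAL,
                Duration.ofMillis(80), Duration.ofMillis(200), Duration.ofMillis(450),
                Set.of("timeout", "5xx", "429"),
                RetrySafety.ONLY_WITH_IDEMPOTENCY_KEY, 200, Fallback.NONE);
        System.out.println("retry without key: " + payments.mayRetry(false));
        System.out.println("retry with key: " + payments.mayRetry(true));
    }
}
```

Keeping the retry decision on the profile means later steps (retry policy, idempotency keys) read from one source of truth instead of re-deciding per call site.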

Step 1 — Timeout budgeting (the backbone)

Set explicit budgets:

•connect timeout (TCP/TLS establishment)
•request timeout / overall deadline
•read timeout (if the client supports a separate one)
•per-attempt timeout vs total timeout

Rule:

•Total timeout must be less than the caller’s own deadline (its remaining budget).
•In multi-hop calls, the budget must shrink at each hop.

Recommended structure:

•totalDeadline = 800ms
•maxAttempts = 2
•perAttemptTimeout = 300ms (2 × 300ms = 600ms worst case across attempts)
•remainingBudget (about 200ms) reserved for backoff and local processing
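
One way to keep per-attempt timeouts inside the total deadline is to compute each attempt's timeout from the remaining budget. The sketch below assumes a hypothetical `TimeoutBudget` helper; the 800ms/300ms values come from the structure above, while the 100ms reserve is an illustrative choice.

```java
import java.time.Duration;

// Hypothetical helper: tracks a total deadline and derives per-attempt timeouts from it.
final class TimeoutBudget {
    private final long deadlineNanos;

    private TimeoutBudget(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    static TimeoutBudget startingNow(Duration totalDeadline) {
        return new TimeoutBudget(System.nanoTime() + totalDeadline.toNanos());
    }

    /** Time left before the total deadline; never negative. */
    Duration remaining() {
        long left = deadlineNanos - System.nanoTime();
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    /**
     * Timeout for the next attempt: the per-attempt cap, or less when the budget
     * (minus the reserve for local processing) is nearly spent. ZERO means "do not attempt".
     */
    static Duration attemptTimeout(Duration remaining, Duration perAttemptCap, Duration reserve) {
        Duration usable = remaining.minus(reserve);
        if (usable.isNegative() || usable.isZero()) {
            return Duration.ZERO;
        }
        return usable.compareTo(perAttemptCap) < 0 ? usable : perAttemptCap;
    }

    public static void main(String[] args) {
        TimeoutBudget budget = TimeoutBudget.startingNow(Duration.ofMillis(800));
        Duration cap = Duration.ofMillis(300);
        Duration reserve = Duration.ofMillis(100); // illustrative reserve for local work
        for (int attempt = 1; attempt <= 2; attempt++) {
            Duration timeout = attemptTimeout(budget.remaining(), cap, reserve);
            if (timeout.isZero()) {
                System.out.println("budget exhausted before attempt " + attempt);
                break;
            }
            System.out.println("attempt " + attempt + " timeout: " + timeout.toMillis() + "ms");
        }
    }
}
```

With the JDK's `java.net.http` client, the computed value would be passed to `HttpRequest.Builder.timeout(...)`. That method rejects zero or negative durations, so the zero case has to short-circuit before the request is built.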

Step 2 — Retry policy (safe-by-default)

Retry only when:

•the operation is idempotent, OR
•you have an idempotency key + server dedup support.

Never retry blindly on:

•client errors such as validation failures (4xx; 408 and 429 may be retried, depending on the upstream’s semantics)
•non-idempotent operations without idempotency keys

Backoff:

•exponential backoff with jitter
•respect Retry-After when provided (429/503 patterns)

Bound retries:

•max attempts (2–3 typical)
•max elapsed time (stop when budget is exhausted)
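
The rules above can be sketched as three small functions: classify, compute a delay, and stop when the budget cannot cover the delay. This is an illustration under assumptions, not a definitive policy: the class name, the retryable status set, and the 50ms/400ms backoff bounds are all hypothetical, and only the delta-seconds form of Retry-After is parsed.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a bounded, budget-aware retry decision.
final class RetryPolicySketch {
    // Assumed retryable statuses; 4xx other than 429 are deliberately absent.
    private static final Set<Integer> RETRYABLE_STATUS = Set.of(429, 502, 503, 504);

    /** retrySafe = operation is idempotent, or carries an idempotency key the server honors. */
    static boolean shouldRetry(int status, boolean retrySafe, int attempt, int maxAttempts) {
        return retrySafe && attempt < maxAttempts && RETRYABLE_STATUS.contains(status);
    }

    /** Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^(attempt-1))]. */
    static Duration backoff(int attempt, Duration base, Duration cap) {
        long exponential = base.toMillis() << Math.min(attempt - 1, 20);
        long ceiling = Math.min(cap.toMillis(), exponential);
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(ceiling + 1));
    }

    /** Parses Retry-After in delta-seconds form only; the HTTP-date form is not handled here. */
    static Optional<Duration> retryAfter(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        try {
            long seconds = Long.parseLong(headerValue.trim());
            return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /** Delay before the next attempt, or empty when the remaining budget cannot cover it. */
    static Optional<Duration> nextDelay(int attempt, String retryAfterHeader, Duration remainingBudget) {
        Duration delay = retryAfter(retryAfterHeader)
                .orElseGet(() -> backoff(attempt, Duration.ofMillis(50), Duration.ofMillis(400)));
        return delay.compareTo(remainingBudget) < 0 ? Optional.of(delay) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("retry 503 (safe, attempt 1 of 2): " + shouldRetry(503, true, 1, 2));
        System.out.println("retry 400 (safe, attempt 1 of 2): " + shouldRetry(400, true, 1, 2));
        System.out.println("delay with Retry-After=5 and 500ms left: "
                + nextDelay(1, "5", Duration.ofMillis(500)));
    }
}
```

Returning an empty delay when Retry-After exceeds the remaining budget turns "respect Retry-After" into "give up now" rather than "sleep past the deadline".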

Step 3 — Circuit breaker (stop the bleeding)

Use when:

•repeated failures indicate upstream is unhealthy
•retries would amplify load and worsen outage

Circuit breaker parameters (conceptual):

•sliding window size
•failure rate threshold
•slow call threshold
•open state duration
•half-open permitted calls

Behavior:

•Open: fail fast (or fallback)
•Half-open: probe carefully
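
To make the parameters concrete, here is a deliberately small count-based breaker. It is a teaching sketch and not the API of any library: `SimpleCircuitBreaker` and its constructor arguments are hypothetical, the slow call threshold is omitted, and a single successful probe closes the circuit, which is a simplification.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical minimal breaker: count-based sliding window, injected clock for testability.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // e.g. 0.5 = open at 50% failures
    private final long openMillis;             // open state duration
    private final int halfOpenPermits;         // probes allowed while half-open
    private final LongSupplier clockMillis;

    private final Deque<Boolean> window = new ArrayDeque<>();
    private State state = State.CLOSED;
    private long openedAt;
    private int probesLeft;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold, long openMillis,
                         int halfOpenPermits, LongSupplier clockMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.halfOpenPermits = halfOpenPermits;
        this.clockMillis = clockMillis;
    }

    /** Returns false when the call must fail fast (or go to a fallback). */
    synchronized boolean tryAcquire() {
        if (state == State.OPEN && clockMillis.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
            probesLeft = halfOpenPermits;
        }
        if (state == State.OPEN) {
            return false;
        }
        if (state == State.HALF_OPEN) {
            if (probesLeft == 0) {
                return false;
            }
            probesLeft--;
        }
        return true;
    }

    /** Records the outcome of a call that tryAcquire() admitted. */
    synchronized void record(boolean success) {
        if (state == State.OPEN) {
            return; // late result from a call admitted before the circuit opened
        }
        if (state == State.HALF_OPEN) {
            if (success) {
                state = State.CLOSED;
                window.clear();
            } else {
                trip();
            }
            return;
        }
        window.addLast(success);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() == windowSize) {
            long failures = window.stream().filter(ok -> !ok).count();
            if ((double) failures / windowSize >= failureRateThreshold) {
                trip();
            }
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clockMillis.getAsLong();
        window.clear();
    }

    synchronized State state() {
        return state;
    }
}
```

In production code a maintained library would normally replace this. The sketch only shows how window size, failure rate threshold, open duration, and half-open permits interact.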

Step 4 — Bulkhead isolation (contain blast radius)

Goal: one slow dependency must not exhaust all threads/CPU.

Choose bulkhead type:

•Semaphore bulkhead: limit concurrent calls (lightweight)
•Thread-pool bulkhead: isolate blocking calls in dedicated pool (heavier, strong isolation)

Guidelines:

•limit concurrency per upstream
•separate pools for “critical” vs “best-effort” dependencies
•set queue sizes carefully; prefer rejection over unbounded queues
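
Both bulkhead types can be sketched with JDK primitives. The class and method names below are hypothetical and the sizes are illustrative. The semaphore variant rejects immediately when no permit is free. The thread-pool variant uses a bounded queue with `AbortPolicy`, so overflow is rejected instead of queued without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical sketch: one instance per upstream.
final class SemaphoreBulkhead {
    private final Semaphore permits;
    private final AtomicLong rejected = new AtomicLong();

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise rejects at once via onRejected. */
    <T> T call(Supplier<T> action, Supplier<T> onRejected) {
        if (!permits.tryAcquire()) {
            rejected.incrementAndGet(); // would feed a bulkhead_rejected_total metric
            return onRejected.get();
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }

    long rejectedCount() {
        return rejected.get();
    }

    /**
     * Thread-pool variant: fixed threads, bounded queue, and AbortPolicy, so a full
     * queue throws RejectedExecutionException instead of growing without bound.
     */
    static ExecutorService boundedPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // The nested call runs while the only permit is held, so it is rejected.
        String result = bulkhead.call(
                () -> bulkhead.call(() -> "inner ran", () -> "inner rejected"),
                () -> "outer rejected");
        System.out.println(result + ", rejected=" + bulkhead.rejectedCount());
    }
}
```

A critical upstream and a best-effort upstream would each get their own instance (or their own pool), so saturation of one cannot consume the other's capacity.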

Step 5 — Idempotency keys (for safe retries on non-idempotent ops)

For POST/PATCH operations (and any PUT the server does not treat as idempotent) that create, charge, or otherwise cause side effects:

•generate Idempotency-Key (UUID) per logical request
•store/propagate it across retries
•server must deduplicate by (clientId, idempotencyKey) for a time window
•return the same result on duplicate requests

If you cannot implement server-side dedup:

•do NOT retry non-idempotent requests
•consider outbox pattern or explicit request tokens
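
Both halves of the contract can be sketched briefly. On the client, the key is generated once per logical request and the same value is sent on every attempt. On the server, the first request for a given (clientId, idempotencyKey) executes and duplicates receive the stored result. Everything named here is an assumption for illustration: the `Idempotency-Key` header spelling follows the text above, `payments.example` is a placeholder host, and the in-memory map stands in for a durable store with expiry, which this sketch does not implement.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of client-side key propagation and server-side deduplication.
final class IdempotencySketch {

    /** Client side: build the request with a caller-supplied key; reuse it on every retry. */
    static HttpRequest paymentRequest(URI uri, String idempotencyKey, String jsonBody) {
        return HttpRequest.newBuilder(uri)
                .header("Idempotency-Key", idempotencyKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** Server side: in-memory stand-in for a dedup store (no time window, no persistence). */
    static final class DedupStore<R> {
        private final Map<String, R> results = new ConcurrentHashMap<>();

        /** The operation runs at most once per (clientId, key); it must not return null. */
        R executeOnce(String clientId, String idempotencyKey, Supplier<R> operation) {
            return results.computeIfAbsent(clientId + ":" + idempotencyKey, k -> operation.get());
        }
    }

    public static void main(String[] args) {
        String key = UUID.randomUUID().toString(); // one key per logical request
        URI uri = URI.create("https://payments.example/v1/payments"); // placeholder host
        HttpRequest first = paymentRequest(uri, key, "{\"amount\":100}");
        HttpRequest retry = paymentRequest(uri, key, "{\"amount\":100}");
        System.out.println("same key on retry: "
                + first.headers().firstValue("Idempotency-Key")
                        .equals(retry.headers().firstValue("Idempotency-Key")));

        DedupStore<String> store = new DedupStore<>();
        String r1 = store.executeOnce("client-a", key, () -> "charge-1");
        String r2 = store.executeOnce("client-a", key, () -> "charge-2"); // duplicate: not executed
        System.out.println(r1 + " / " + r2);
    }
}
```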

Step 6 — Fallbacks and degradation

Fallback types:

•cached response (stale-while-revalidate)
•partial response (omit optional sections)
•default value with warning flag
•queue for asynchronous processing (if business allows)

Guardrails:

•fallback must be observable (metrics/logs)
•do not hide persistent outages silently
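
A cached-response fallback that stays observable might look like the sketch below. The names (`CachedFallback`, `Result`, `degraded`) are hypothetical, and the cache is a plain in-memory map with no expiry. The two things to note are that the result carries a warning flag and that every fallback increments a counter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical sketch: serve the last good value (or a default) when the upstream call fails.
final class CachedFallback<K, V> {
    /** degraded = true tells the caller the value did not come from a fresh upstream call. */
    record Result<T>(T value, boolean degraded) {}

    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final AtomicLong fallbackCount = new AtomicLong();
    private final V defaultValue;

    CachedFallback(V defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Assumes upstreamCall signals failure by throwing and never returns null. */
    Result<V> get(K key, Function<K, V> upstreamCall) {
        try {
            V fresh = upstreamCall.apply(key);
            lastGood.put(key, fresh);
            return new Result<>(fresh, false);
        } catch (RuntimeException e) {
            fallbackCount.incrementAndGet(); // keep the fallback visible in metrics
            V stale = lastGood.get(key);
            return new Result<>(stale != null ? stale : defaultValue, true);
        }
    }

    long fallbackCount() {
        return fallbackCount.get();
    }

    public static void main(String[] args) {
        CachedFallback<String, String> profiles = new CachedFallback<>("anonymous");
        System.out.println(profiles.get("42", id -> "Ada"));
        System.out.println(profiles.get("42", id -> { throw new IllegalStateException("upstream down"); }));
        System.out.println("fallbacks so far: " + profiles.fallbackCount());
    }
}
```

An alert on a sustained rise in the fallback counter is what keeps a persistent outage from hiding behind stale data.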

Step 7 — Observability (must-have)

Metrics:

•client_requests_total{upstream,method,status_class}
•client_latency_seconds (histogram/timer) per upstream
•retries_total and retry_exhausted_total
•circuit_state (open/half-open/closed)
•bulkhead_rejected_total

Tracing:

•add span per outbound call with:
- upstream name
- attempt count
- timeout values
- result (success/failure category)

Logs:

•structured, rate-limited logs on retry exhaustion and circuit opens
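
To show how the metric names and labels fit together, here is a dependency-free stand-in for a metrics registry. It is only a sketch: `ClientMetrics` is hypothetical, the series-name string format is an illustration, and a real service would normally record these through its existing metrics library instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory counters keyed by "name{labels}"; a stand-in for a real registry.
final class ClientMetrics {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Collapses a status code into its class, e.g. 503 becomes "5xx". */
    static String statusClass(int status) {
        return (status / 100) + "xx";
    }

    void recordRequest(String upstream, String method, int status) {
        increment("client_requests_total{upstream=" + upstream
                + ",method=" + method
                + ",status_class=" + statusClass(status) + "}");
    }

    void recordRetry(String upstream) {
        increment("retries_total{upstream=" + upstream + "}");
    }

    void recordRetryExhausted(String upstream) {
        increment("retry_exhausted_total{upstream=" + upstream + "}");
    }

    void increment(String series) {
        counters.computeIfAbsent(series, k -> new LongAdder()).increment();
    }

    long value(String series) {
        LongAdder adder = counters.get(series);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        ClientMetrics metrics = new ClientMetrics();
        metrics.recordRequest("payments", "POST", 503);
        metrics.recordRetry("payments");
        metrics.recordRequest("payments", "POST", 200);
        System.out.println(metrics.value(
                "client_requests_total{upstream=payments,method=POST,status_class=5xx}"));
    }
}
```

Labeling by status class rather than exact status code keeps label cardinality bounded.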

Output artifacts

A) Retry matrix (example template)

Fill this table per upstream endpoint:

•Endpoint: /v1/payments
- Method: POST
- Idempotent by HTTP semantics: NO
- Idempotency-Key available: YES
- Retry on: connect timeout, 502/503/504, 429 (respect Retry-After)
- Max attempts: 2
- Backoff: exponential + jitter
- Total deadline: 800ms
- Circuit breaker: enabled
- Bulkhead: enabled (critical pool)
•Endpoint: /v1/profile/{id}
- Method: GET
- Idempotent by HTTP semantics: YES
- Retry on: connect timeout, 502/503/504, 429
- Max attempts: 2
- Backoff: exponential + jitter
- Total deadline: 300ms
- Circuit breaker: enabled
- Bulkhead: semaphore bulkhead

B) Client configuration plan

•timeouts (connect + request deadline)
•retry rules and stop conditions
•circuit breaker thresholds
•bulkhead limits
•idempotency headers propagation

C) Verification tests

•unit tests for retry classification (which errors are retried)
•integration tests with a stub server (WireMock) simulating:
- timeouts
- 5xx bursts
- 429 + Retry-After
- slow calls
•load test to ensure retries don’t multiply traffic dangerously
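
The shape of such a test can be shown without extra dependencies. The sketch below uses the JDK's built-in `com.sun.net.httpserver.HttpServer` as a stand-in for WireMock: the stub answers 503 a fixed number of times and then 200, and the check counts how many requests the client actually sent. The class name, path, and timeout values are hypothetical, and the retry loop omits backoff to keep the run short.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical check: a 5xx burst must not produce more requests than maxAttempts.
final class StubServerRetryCheck {

    /** GET with a bounded retry on 503 (no backoff here, to keep the sketch fast). */
    static int getWithRetry(HttpClient client, URI uri, int maxAttempts)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofMillis(500))
                .GET()
                .build();
        int status = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            if (status != 503) {
                break;
            }
        }
        return status;
    }

    /** Returns {final status, requests seen by the stub}. */
    static int[] run(int failuresBeforeSuccess, int maxAttempts) {
        AtomicInteger hits = new AtomicInteger();
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0); // ephemeral port
            server.createContext("/v1/profile/42", exchange -> {
                int status = hits.incrementAndGet() <= failuresBeforeSuccess ? 503 : 200;
                exchange.sendResponseHeaders(status, -1); // -1 means no response body
                exchange.close();
            });
            server.start();
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/v1/profile/42");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .connectTimeout(Duration.ofMillis(200))
                    .build();
            int status = getWithRetry(client, uri, maxAttempts);
            return new int[] { status, hits.get() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        int[] recovered = run(1, 2); // one 503, then success
        int[] exhausted = run(3, 2); // burst longer than the retry bound
        System.out.println("recovered: status=" + recovered[0] + " requests=" + recovered[1]);
        System.out.println("exhausted: status=" + exhausted[0] + " requests=" + exhausted[1]);
    }
}
```

Asserting on the request count, not only the final status, is what catches a retry policy that multiplies traffic.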

Definition of Done (DoD)

•All outbound calls have explicit timeouts
•Retry policy is bounded, budget-aware, and safe-by-default
•Circuit breaker + bulkhead applied to critical upstreams
•Idempotency keys implemented for retried non-idempotent operations
•Metrics/traces/logs added for client behavior
•Tests cover failure modes and prevent regressions

Guardrails (What NOT to do)

•Do not retry non-idempotent calls without idempotency keys
•Do not set large timeouts “to make errors go away”
•Do not use unbounded queues for bulkheads
•Do not enable retries without a circuit breaker (risk: retry amplification and cascading failure)
•Do not ignore rate limits; respect Retry-After or implement client-side rate limiting

Cursor usage (recommended)

Attach:

•HTTP client wrapper (where requests are made)
•Resilience config (if exists)
•Production incident notes (timeouts, 5xx bursts, upstream SLO)

Prompt snippet:

“Use java-rest-clients-resilience. Produce a retry matrix and propose timeouts, retries, circuit breaker, and bulkhead settings. Ensure idempotency-key handling and add WireMock tests.”