Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

Use this skill when:

•Building fault-tolerant multi-agent systems
•Implementing LLM API integrations with proper error handling
•Designing distributed workflows that need graceful degradation
•Adding observability to failure scenarios
•Protecting systems from cascade failures

Core Patterns

1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" once calls to a service exceed a failure threshold, so callers fail fast instead of waiting on a dependency that is already down.

code

+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|              (probe success -> CLOSED, probe failure -> OPEN)     |
|                                                                   |
+-------------------------------------------------------------------+

Key Configuration:

•failure_threshold: Failures before opening (default: 5)
•recovery_timeout: Seconds before attempting recovery (default: 30)
•half_open_requests: Probes to allow in half-open (default: 1)
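
The states and settings above can be sketched as a small synchronous class. This is a minimal illustration, not the contents of scripts/circuit-breaker.py: the names CircuitBreaker, CircuitOpenError, and the injectable clock parameter are assumptions. half_open_requests is not modeled, because a single-threaded caller can only have one probe in flight.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0   # consecutive failures since the last success
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            self.state = State.HALF_OPEN  # timeout expired: allow one probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success (including a probe) closes the circuit.
        self._failures = 0
        self.state = State.CLOSED
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would wrap each outbound request in breaker.call(...) and catch CircuitOpenError to return its fallback. This sketch counts consecutive failures; a sliding-window failure rate is a common alternative.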

2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

code

+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 3        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+

Tier Configuration (OrchestKit):

Tier	Workers	Max Queue	Timeout	Use Case
1 (Critical)	5	10	300s	Synthesis, quality gate
2 (Standard)	3	5	120s	Content analysis agents
3 (Optional)	2	3	60s	Enrichment, caching
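
A semaphore-based bulkhead for the tier table above might look like the following asyncio sketch. The names TierConfig, Bulkhead, and BulkheadFullError are hypothetical, and two behaviors are assumptions rather than documented OrchestKit semantics: callers are rejected immediately once the queue is full, and the timeout covers execution only, not time spent waiting in the queue.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    workers: int     # concurrent calls allowed
    max_queue: int   # callers allowed to wait for a free worker
    timeout: float   # seconds allowed for one call once it starts


# Values copied from the tier table above.
TIERS = {
    1: TierConfig(workers=5, max_queue=10, timeout=300.0),
    2: TierConfig(workers=3, max_queue=5, timeout=120.0),
    3: TierConfig(workers=2, max_queue=3, timeout=60.0),
}


class BulkheadFullError(Exception):
    """Raised when every worker is busy and the queue is full."""


class Bulkhead:
    def __init__(self, config):
        self.config = config
        self._slots = asyncio.Semaphore(config.workers)
        self._waiting = 0  # callers currently queued for a slot

    async def run(self, coro_fn, *args):
        # Shed load instead of queueing without bound.
        if self._slots.locked() and self._waiting >= self.config.max_queue:
            raise BulkheadFullError("workers and queue are full")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await asyncio.wait_for(coro_fn(*args), self.config.timeout)
        finally:
            self._slots.release()
```

Create one Bulkhead per tier inside the running event loop and route each agent through its own tier, for example `await optional_tier.run(enrich, doc)`, so a saturated Tier 3 cannot take workers away from Tier 1.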

3. Retry Strategies (reference: retry-strategies.md)

Retries transient failures with exponential backoff and jitter, and skips errors that a retry cannot fix.

code

+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^attempt, max_delay) * jitter      |
|            (attempt counts from 0; base = 1s above)               |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
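
The formula in the diagram can be written out directly. In this sketch the function names, the 60-second max_delay default, and the injectable rng and sleep parameters are assumptions added for illustration; attempt is 0-based, which reproduces the 1s, 2s, 4s schedule shown above.

```python
import random
import time


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    capped = min(base * 2 ** attempt, max_delay)
    jitter = 0.5 + rng()  # uniform in [0.5, 1.5)
    return capped * jitter


def retry(func, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    """Call func until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

This version retries on every exception to keep the example short. A production handler would first check whether the error is retryable and give up immediately if it is not.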

Error Classification for Retries:

python

RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry only after truncating the input
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
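
Because the sets above mix status codes, exception classes, and error-code strings, a lookup helper has to handle each kind separately. The following sketch shows one way to do that. The classify function, its three return values, and the choice to treat unknown errors as non-retryable are assumptions, not part of the skill's scripts.

```python
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded"}
# Retrying these unchanged fails again; the request must be shortened first.
RETRY_AFTER_TRUNCATION = {"context_length_exceeded"}


def classify(error):
    """Return "retry", "retry_truncated", or "fail" for an error signal.

    `error` may be an HTTP status code (int), an exception instance,
    or a provider error-code string.
    """
    if isinstance(error, int):
        return "retry" if error in RETRYABLE_STATUS else "fail"
    if isinstance(error, BaseException):
        return "retry" if isinstance(error, RETRYABLE_EXCEPTIONS) else "fail"
    if error in RETRY_AFTER_TRUNCATION:
        return "retry_truncated"
    # Unknown codes fall through to "fail" so new error types are not retried blindly.
    return "retry" if error in RETRYABLE_CODES else "fail"
```

Defaulting to "fail" is the conservative choice: an unrecognized error surfaces quickly instead of consuming retry budget.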

4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

code

+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-5-20251101 (primary)                         |
|   2. gpt-5.2-mini (fallback)                                      |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
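
The chain in the diagram reduces to a loop over model callables followed by a cache lookup and a default. This is a provider-agnostic sketch: run_fallback_chain and its parameters are hypothetical, and each model is represented by a plain callable so no real client API is assumed.

```python
def run_fallback_chain(prompt, models, cache_lookup, default_response):
    """Try each model in order, then the cache, then a default response.

    models:       list of (name, callable) pairs; a callable raises on failure
    cache_lookup: callable returning a cached response or None on a miss
    Returns (response, source) so callers can tell when output is degraded.
    """
    for name, call_model in models:
        try:
            return call_model(prompt), name
        except Exception:
            continue  # real code would log the failure before moving on
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached, "cache"
    return default_response, "default"
```

Returning the source alongside the response lets the caller label degraded output, for example by attaching partial results when the source is "default".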

Token Budget Management:

code

+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
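
Strategies 2 and 4 from the list above (drop low-priority content, then fail fast) can be sketched as a pre-flight check. Everything named here is an assumption: the fit_to_budget helper, the reserved_output allowance, and especially the 4-characters-per-token estimate, which is only a rough heuristic. Summarization and request splitting are left out.

```python
class ContentTooLargeError(Exception):
    """Raised when required content alone exceeds the token budget."""


def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    # Use the provider's tokenizer when an exact count matters.
    return (len(text) + 3) // 4


def fit_to_budget(sections, context_limit=16_000, reserved_output=2_000):
    """Trim optional sections until the input fits the budget.

    sections: list of (text, required) pairs in prompt order.
    Returns the texts that were kept, in their original order.
    """
    budget = context_limit - reserved_output
    kept = list(sections)
    total = sum(estimate_tokens(text) for text, _ in kept)
    # Drop optional sections, last one first, until the input fits.
    for i in range(len(kept) - 1, -1, -1):
        if total <= budget:
            break
        text, required = kept[i]
        if not required:
            total -= estimate_tokens(text)
            del kept[i]
    if total > budget:
        raise ContentTooLargeError(
            f"content too large: {total} tokens > budget of {budget}")
    return [text for text, _ in kept]
```

Running the check before the API call turns a provider-side context-length error into a local decision that costs no tokens.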

Quick Reference

Pattern	When to Use	Key Benefit
Circuit Breaker	External service calls	Prevent cascade failures
Bulkhead	Multi-tenant/multi-agent	Isolate failures
Retry + Backoff	Transient failures	Automatic recovery
Fallback Chain	Critical operations	Graceful degradation
Token Budget	LLM calls	Cost control, prevent failures

OrchestKit Integration Points

•Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
•LLM Calls: All model invocations use fallback chain + retry logic
•External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
•Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

•references/circuit-breaker.md - Deep dive on circuit breaker pattern
•references/bulkhead-pattern.md - Bulkhead isolation strategies
•references/retry-strategies.md - Retry algorithms and error classification
•references/llm-resilience.md - LLM-specific patterns
•references/error-classification.md - How to categorize errors

Templates (Code Patterns)

•scripts/circuit-breaker.py - Ready-to-use circuit breaker class
•scripts/bulkhead.py - Semaphore-based bulkhead implementation
•scripts/retry-handler.py - Configurable retry decorator
•scripts/llm-fallback-chain.py - Multi-model fallback pattern
•scripts/token-budget.py - Token budget guard implementation

Examples

•examples/orchestkit-workflow-resilience.md - Full OrchestKit integration example

Checklists

•checklists/pre-deployment-resilience.md - Production readiness checklist
•checklists/circuit-breaker-setup.md - Circuit breaker configuration guide

2026 Best Practices

•Adaptive Thresholds: Use sliding windows, not fixed counters
•Observability First: Every circuit trip = alert + metric + trace
•Graceful Degradation: Always have a fallback, even if partial
•Health Endpoints: Separate health check from circuit state
•Chaos Testing: Regularly test failure scenarios in staging
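
As an illustration of the sliding-window idea in the first item, the sketch below tracks a failure rate over recent calls instead of a running count. The class name, the 60-second window, and the min_calls floor are assumptions chosen for the example.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls  # below this, there is too little data to judge
        self._clock = clock
        self._events = deque()      # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))

    def failure_rate(self):
        # Discard events that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if len(self._events) < self.min_calls:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)
```

A breaker built on this would trip when failure_rate() crosses a ratio such as 0.5, so old failures age out on their own and a low-traffic service is not tripped by a handful of errors.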

Related Skills

•observability-monitoring - Metrics and alerting for circuit breaker state changes
•caching-strategies - Cache as fallback layer in degradation scenarios
•error-handling-rfc9457 - Structured error responses for resilience failures
•background-jobs - Async processing with retry and failure handling

Key Decisions

Decision	Choice	Rationale
Circuit breaker recovery	Half-open probe	Gradual recovery, prevents immediate re-failure
Retry algorithm	Exponential backoff + jitter	Prevents thundering herd, respects rate limits
Bulkhead isolation	Semaphore-based tiers	Simple, efficient, prioritizes critical operations
LLM fallback	Model chain with cache	Graceful degradation, cost optimization, availability

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open

Solves:

•Prevent cascade failures when external services fail
•Automatically recover when services come back online
•Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier

Solves:

•Isolate failures to prevent entire system crashes
•Prioritize critical operations over optional ones
•Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd

Solves:

•Handle transient failures automatically
•Avoid overwhelming recovering services
•Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length

Solves:

•Handle LLM API rate limits gracefully
•Fall back to alternative models when primary fails
•Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification

Solves:

•Determine which errors should be retried
•Categorize errors by severity and recoverability
•Map HTTP status codes to resilience actions