resilience

Add retry, timeout, and circuit breaker patterns at the workflow level. Business functions stay clean.

SKILL.md
--- frontmatter
name: resilience
description: "Add retry, timeout, and circuit breaker patterns at the workflow level. Business functions stay clean."
version: 1.0.0
libraries: ["@jagreehal/workflow"]

Resilience Patterns

Core Principle

Resilience is a composition concern, not a business logic concern. Add retry/timeout at the workflow level, not inside functions.

code
Workflows
  -> step.retry() and step.withTimeout()
  -> (resilience here)

Business Functions
  -> fn(args, deps): Result<T, E>
  -> (no retry logic here)

Infrastructure
  -> pg, redis, http
  -> (just transport)

Required Behaviors

1. Retry at Workflow Level Only

NEVER add retry logic inside business functions:

typescript
// WRONG - Retry inside function
async function getUser(args, deps) {
  let attempts = 0;
  while (attempts < 3) {
    try {
      return await deps.db.findUser(args.userId);
    } catch { attempts++; }
  }
}

// CORRECT - Clean function, workflow handles retry
async function getUser(args, deps) {
  const user = await deps.db.findUser(args.userId);
  return user ? ok(user) : err('NOT_FOUND');
}

// Workflow adds resilience
const result = await workflow(async (step) => {
  const user = await step.retry(
    () => getUser({ userId }, deps),
    { attempts: 3, backoff: 'exponential' }
  );
  return user;
});

2. Never Double-Retry

Retry at ONE level only. Multiple layers create retry explosion:

code
3 (API) × 3 (Service) × 3 (DB Client) = 27 attempts!

In effect, you DDoS your own infrastructure.
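
The multiplication generalizes to any number of layers. A minimal sketch of the arithmetic; `worstCaseAttempts` is a hypothetical helper for illustration, not part of `@jagreehal/workflow`:

```typescript
// Hypothetical helper: worst-case number of calls that reach the innermost
// dependency when every layer retries independently.
function worstCaseAttempts(attemptsPerLayer: number[]): number {
  return attemptsPerLayer.reduce((total, attempts) => total * attempts, 1);
}

// API, service, and DB client each retry 3 times.
const layered = worstCaseAttempts([3, 3, 3]);

// Retry only at the workflow level; inner layers make a single attempt.
const singleLayer = worstCaseAttempts([3, 1, 1]);
```

Keeping retries at one level keeps the worst case linear in the configured attempts rather than exponential in the number of layers.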

3. Only Retry Transient Errors

| Error Type | Retry? | Why |
| --- | --- | --- |
| TIMEOUT | Yes | Transient |
| CONNECTION_ERROR | Yes | Network hiccup |
| RATE_LIMITED | Yes | Wait and retry |
| NOT_FOUND | NO | Resource doesn't exist |
| UNAUTHORIZED | NO | Credentials wrong |
| VALIDATION_FAILED | NO | Input invalid |

typescript
const data = await step.retry(
  () => fetchFromApi(),
  {
    attempts: 3,
    retryOn: (error) => {
      const retryable = ['TIMEOUT', 'CONNECTION_ERROR', 'RATE_LIMITED'];
      return retryable.includes(error);
    },
  }
);
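
If error codes are modelled as a string union, the retry policy can live in one place and be reused by every `retryOn` callback. A sketch, assuming errors are plain string codes as in the snippet above; `ErrorCode`, `RETRYABLE`, and `isRetryable` are illustrative names, not library exports:

```typescript
// Assumed error codes, taken from the table in this section.
type ErrorCode =
  | 'TIMEOUT'
  | 'CONNECTION_ERROR'
  | 'RATE_LIMITED'
  | 'NOT_FOUND'
  | 'UNAUTHORIZED'
  | 'VALIDATION_FAILED';

// Only transient failures are worth another attempt.
const RETRYABLE: ReadonlySet<ErrorCode> = new Set<ErrorCode>([
  'TIMEOUT',
  'CONNECTION_ERROR',
  'RATE_LIMITED',
]);

function isRetryable(error: ErrorCode): boolean {
  return RETRYABLE.has(error);
}
```

Under that assumption it could be passed directly, as in `retryOn: isRetryable`.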

4. Never Retry Non-Idempotent Writes

typescript
// DANGEROUS - May double-charge
await step.retry(() => chargeCard(amount), { attempts: 3 });

// SAFE - Read is idempotent
await step.retry(() => getUser(userId), { attempts: 3 });

// SAFE - With idempotency key
await step.retry(
  () => chargeCard(amount, { idempotencyKey }),
  { attempts: 3 }
);
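
Why the idempotency key makes the retry safe: the receiving side remembers the key and replays the stored result instead of charging again. A sketch with a hypothetical in-memory provider (`FakePaymentProvider` is invented for illustration; real payment APIs implement this server-side):

```typescript
type ChargeResult = { chargeId: string; amount: number };

// Hypothetical provider that deduplicates requests by idempotency key.
class FakePaymentProvider {
  private readonly seen = new Map<string, ChargeResult>();
  totalCharged = 0;

  charge(amount: number, idempotencyKey: string): ChargeResult {
    const previous = this.seen.get(idempotencyKey);
    if (previous) return previous; // replayed request: no second charge

    const result: ChargeResult = { chargeId: `ch_${this.seen.size + 1}`, amount };
    this.seen.set(idempotencyKey, result);
    this.totalCharged += amount;
    return result;
  }
}

const provider = new FakePaymentProvider();
const first = provider.charge(500, 'order-42');
const retried = provider.charge(500, 'order-42'); // simulated retry, same key
```

The key must be generated once per logical operation, outside the retried function, so every attempt sends the same value.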

5. Always Set Timeouts

Never let operations hang indefinitely:

typescript
const data = await step.withTimeout(
  () => slowOperation(),
  { ms: 2000 }
);

6. Always Use Jitter

Jitter prevents a thundering herd when multiple instances retry at the same moment:

typescript
// Without jitter - all instances retry at same time
// With jitter - spread out, infrastructure can recover

step.retry(() => fetchData(), {
  attempts: 3,
  backoff: 'exponential',
  jitter: true,  // ALWAYS enable in production
});
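
One common strategy is "full jitter": each delay is drawn uniformly between zero and the capped exponential delay. The sketch below illustrates that idea only; whether `jitter: true` in the library uses this exact formula is an assumption, and `jitteredDelay` is a hypothetical name:

```typescript
interface BackoffOptions {
  initialDelay: number;  // ms, base delay for the first retry
  maxDelay: number;      // ms, cap on any single delay
  random?: () => number; // injectable for tests; defaults to Math.random
}

// Exponential backoff with full jitter: a uniform delay in [0, cap).
// retryIndex is 0 for the first retry, 1 for the second, and so on.
function jitteredDelay(retryIndex: number, opts: BackoffOptions): number {
  const random = opts.random ?? Math.random;
  const cap = Math.min(opts.maxDelay, opts.initialDelay * 2 ** retryIndex);
  return Math.floor(random() * cap);
}

// Deterministic stand-in for Math.random so the result is reproducible.
const midpoint = () => 0.5;
const delays = [0, 1, 2, 10].map((i) =>
  jitteredDelay(i, { initialDelay: 100, maxDelay: 2000, random: midpoint })
);
```

With a real random source, two instances that fail at the same instant will almost never pick the same delay, which spreads the retry load.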

7. Combine Retry and Timeout

Each attempt gets its own timeout:

typescript
const data = await step.retry(
  () => fetchData(),
  {
    attempts: 3,
    timeout: { ms: 2000 },  // 2s per attempt
  }
);
// Worst case: 3 × 2s = 6s, plus any backoff delays between attempts
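
When backoff is configured, the waits between attempts add to the time spent inside attempts. A rough upper bound, assuming unjittered exponential backoff that doubles from `initialDelay` (the library's actual schedule may differ; `worstCaseMs` is a hypothetical helper):

```typescript
interface RetryBudget {
  attempts: number;
  timeoutMs: number;    // per-attempt timeout
  initialDelay: number; // ms, assumed to double after each failed attempt
  maxDelay: number;     // ms, cap on any single backoff delay
}

// Upper bound: every attempt times out and every backoff delay is taken in full.
function worstCaseMs(b: RetryBudget): number {
  let total = b.attempts * b.timeoutMs;
  // There is one wait between each pair of attempts, so attempts - 1 waits.
  for (let retry = 0; retry < b.attempts - 1; retry++) {
    total += Math.min(b.maxDelay, b.initialDelay * 2 ** retry);
  }
  return total;
}

// 3 × 2000 ms of attempts + 100 ms + 200 ms of backoff
const budget = worstCaseMs({
  attempts: 3,
  timeoutMs: 2000,
  initialDelay: 100,
  maxDelay: 2000,
});
```

A number like this is useful when the caller has its own deadline: the retry budget should fit inside it.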

Recommended Defaults

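| Operation | Attempts | Backoff | Initial Delay | Timeout |
| --- | --- | --- | --- | --- |
| DB read | 3 | exponential | 50ms | 5s |
| DB write | 1 | - | - | 10s |
| HTTP API | 3 | exponential | 100ms | 30s |
| Cache | 2 | fixed | 10ms | 500ms |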
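
These defaults could be kept as a shared constant so workflows do not repeat magic numbers. A sketch: the option names mirror those used in the `step.retry()` examples in this document, while the `RetryDefaults` type and `RETRY_DEFAULTS` object are illustrative assumptions, not library exports:

```typescript
// Illustrative shape; field names follow the step.retry() options shown in this document.
interface RetryDefaults {
  attempts: number;
  backoff?: 'exponential' | 'fixed';
  initialDelay?: number; // ms
  timeout: { ms: number };
}

type OperationKind = 'dbRead' | 'dbWrite' | 'httpApi' | 'cache';

const RETRY_DEFAULTS: Record<OperationKind, RetryDefaults> = {
  dbRead: { attempts: 3, backoff: 'exponential', initialDelay: 50, timeout: { ms: 5000 } },
  // Single attempt: writes are not retried by default.
  dbWrite: { attempts: 1, timeout: { ms: 10000 } },
  httpApi: { attempts: 3, backoff: 'exponential', initialDelay: 100, timeout: { ms: 30000 } },
  cache: { attempts: 2, backoff: 'fixed', initialDelay: 10, timeout: { ms: 500 } },
};
```

A workflow could then spread an entry into its options, for example `{ ...RETRY_DEFAULTS.httpApi, jitter: true }`.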

Full Example

typescript
import { createWorkflow, ok, err, type AsyncResult } from '@jagreehal/workflow';

// Clean business function
async function getUser(args, deps): AsyncResult<User, 'NOT_FOUND' | 'DB_ERROR'> {
  try {
    const user = await deps.db.findUser(args.userId);
    return user ? ok(user) : err('NOT_FOUND');
  } catch {
    return err('DB_ERROR');
  }
}

// Workflow adds resilience
const loadUser = createWorkflow({ getUser });

const result = await loadUser(async (step) => {
  const user = await step.retry(
    () => getUser({ userId }, deps),
    {
      attempts: 3,
      backoff: 'exponential',
      initialDelay: 100,
      maxDelay: 2000,
      jitter: true,
      timeout: { ms: 5000 },
    }
  );
  return user;
});

8. Retrying Multi-Step Operations

Sometimes you need to retry a multi-step operation. Use step.retry() to wrap the entire sequence:

typescript
const syncUserToProvider = createWorkflow({ findUser, syncUser, markSynced });

const result = await syncUserToProvider(async (step) => {
  // Retry the whole operation
  const user = await step.retry(
    async () => {
      const user = await step(() => findUser({ userId }, deps));
      await step(() => syncUser({ user }, deps));  // Must be idempotent!
      await step(() => markSynced({ userId }, deps));
      return user;
    },
    {
      attempts: 2,
      backoff: 'exponential',
    }
  );
  return user;
});

Important: The entire sequence must be idempotent. If syncUser is called twice, it should have the same effect as calling it once.
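
One way to get that property is to write the step as an upsert keyed by a stable identifier rather than as an append. A sketch with a hypothetical in-memory store and a simplified `syncUser` signature (no `args`/`deps`), for illustration only:

```typescript
type User = { id: string; email: string };

// Hypothetical provider-side store. Keying by user id makes the write an
// upsert, so repeating it leaves the same state as running it once.
const providerUsers = new Map<string, User>();

function syncUser(user: User): void {
  providerUsers.set(user.id, user); // overwrite, never append
}

const alice: User = { id: 'u1', email: 'alice@example.com' };
syncUser(alice);
syncUser(alice); // simulated second run after the whole sequence is retried
```

An append-style write (for example, pushing onto an array) would leave two records after the retry and would not be safe to wrap this way.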

9. Circuit Breakers

When a service is down, stop hammering it. Circuit breakers prevent cascading failures:

typescript
// Circuit breaker states
// CLOSED: Normal operation, requests go through
// OPEN: Service down, fail fast without trying
// HALF_OPEN: Testing if service recovered

step.retry() does not provide circuit breaking. For production systems whose dependencies fail frequently, consider a library such as opossum or cockatiel.

When to use circuit breakers:

  • External APIs that may be down for extended periods
  • Services with rate limits that trigger failures
  • Downstream dependencies in microservices

Don't use for:

  • Database calls (retries are usually the better fit)
  • Internal function calls
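
To make the three states concrete, here is a minimal state machine. It is a teaching sketch under stated assumptions (consecutive-failure threshold, fixed cool-down, injectable clock) and does not reflect the API of opossum or cockatiel:

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock, so tests need not sleep
  }

  // Returns the current state, moving OPEN -> HALF_OPEN once the cool-down has elapsed.
  getState(): BreakerState {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state;
  }

  // Callers check this before each request and fail fast when it is false.
  canRequest(): boolean {
    return this.getState() !== 'OPEN';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe in HALF_OPEN reopens immediately.
    if (this.getState() === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}

let clock = 0;
const breaker = new CircuitBreaker(2, 1000, () => clock);
breaker.recordFailure();
breaker.recordFailure(); // threshold reached: breaker opens
const blockedWhileOpen = !breaker.canRequest();
clock = 1000; // cool-down elapsed
const stateAfterCooldown = breaker.getState();
breaker.recordSuccess(); // probe succeeded
const stateAfterProbe = breaker.getState();
```

Production libraries add features this sketch omits, such as rolling failure windows and limiting concurrent probes in the half-open state.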

10. Handling Timeout Errors

Use helpers to detect and handle timeouts:

typescript
import { isStepTimeoutError, getStepTimeoutMeta } from '@jagreehal/workflow';

const result = await workflow(async (step) => {
  const data = await step.withTimeout(
    () => slowOperation(),
    { ms: 5000 }
  );
  return data;
});

if (!result.ok && isStepTimeoutError(result.error)) {
  const meta = getStepTimeoutMeta(result.error);
  deps.logger.warn('Operation timed out', {
    timeoutMs: meta?.timeoutMs,
    attempt: meta?.attempt,
  });
}

The Rules

| Failure Type | Where to Retry |
| --- | --- |
| Transport/network | Workflow level |
| Idempotent reads | Workflow level |
| Non-idempotent writes | NEVER (or with idempotency key) |
| Multi-step operation | Workflow level (if idempotent) |

  1. Retry at workflow level only
  2. Never double-retry across layers
  3. Only retry transient errors
  4. Never retry non-idempotent writes without idempotency key
  5. Always set timeouts
  6. Always use jitter in production
  7. Use circuit breakers for external services