Fault Tolerance

SKILL.md
--- frontmatter
skill_id: ARCH-FAULT-TOL
version: 1.0.0
last_updated: 2026-01-04
applies_to: [Class A, Class B, Class C]
jurisdiction: [Global]
prerequisites: [ARCH-SAFETY-CLASS]

Fault Tolerance Design

Purpose

Provide patterns for detecting, containing, and recovering from faults in medical device software, scaled to safety class.

When to Apply

  • Safety-critical control loops, sensing/actuation, communication paths.
  • Watchdogs, redundancy, health monitoring, self-test.
  • Power, memory, and comms error handling.

Requirements (testable)

  1. Fault Detection: Implement monitoring for critical resources (tasks, sensors, comms) with thresholds and alarms. Rationale: early detection.
  2. Graceful Degradation: Define degraded modes or safe state when partial functionality fails. Rationale: bounded failure.
  3. Redundancy Strategy: For Class C functions, evaluate redundancy (sensing, computation, or communication) with voter/consistency checks, and record the decision and its rationale. Rationale: resilience.
  4. Watchdog Use: Configure hardware/software watchdogs with bounded servicing windows; service only after critical checks pass. Rationale: recover from hangs.
  5. Self-Test/BIST: Run self-tests at startup and periodically for critical components; handle failures deterministically. Rationale: latent fault detection.
  6. Error Propagation Control: Sanitize/contain errors at boundaries; avoid cascading faults. Rationale: containment.
  7. Logging & Alarms: Log and, where required, annunciate safety-relevant faults; ensure tamper-evident logs for post-incident analysis. Rationale: traceability.
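
Requirements 1 and 2 can be illustrated with a heartbeat monitor that maps missed deadlines to an operating mode. This is a minimal sketch, not a reference design: the resource names, the timeout values, and the `monitor_*` functions are hypothetical and would come from the device's own risk analysis.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical monitored resources; a real system would list its own. */
enum { MON_CONTROL_TASK, MON_PRESSURE_SENSOR, MON_HOST_LINK, MON_COUNT };

typedef enum { MODE_NORMAL, MODE_DEGRADED, MODE_SAFE } system_mode_t;

typedef struct {
    uint32_t last_seen_ms;     /* time of the most recent heartbeat */
    uint32_t timeout_ms;       /* maximum tolerated silence (assumed value) */
    bool     safety_critical;  /* true: loss forces the safe state */
} monitor_t;

static monitor_t monitors[MON_COUNT] = {
    [MON_CONTROL_TASK]    = { 0u, 50u,   true  },
    [MON_PRESSURE_SENSOR] = { 0u, 200u,  true  },
    [MON_HOST_LINK]       = { 0u, 1000u, false },
};

/* Called by each resource whenever it makes progress. */
void monitor_heartbeat(int id, uint32_t now_ms) {
    if (id >= 0 && id < MON_COUNT) {
        monitors[id].last_seen_ms = now_ms;
    }
}

/* Called periodically by the safety monitor.
 * Unsigned subtraction keeps the comparison valid across tick wraparound. */
system_mode_t monitor_evaluate(uint32_t now_ms) {
    system_mode_t mode = MODE_NORMAL;
    for (int i = 0; i < MON_COUNT; i++) {
        uint32_t silent_ms = now_ms - monitors[i].last_seen_ms;
        if (silent_ms > monitors[i].timeout_ms) {
            if (monitors[i].safety_critical) {
                return MODE_SAFE;      /* critical loss: safe state wins */
            }
            mode = MODE_DEGRADED;      /* non-critical loss: degrade only */
        }
    }
    return mode;
}
```

Keeping the thresholds in one table makes them easy to review and to store with the release artifacts. Whether a given loss maps to a degraded mode or to the safe state is a risk-management decision, not something the code can decide.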

Recommended Practices

  • Use majority voting or reasonableness checks instead of blind trust in single sensors.
  • Employ brownout/power-fail detection to enter safe state gracefully.
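  • On an RTOS, run a dedicated safety monitor task at a higher priority than all non-critical tasks.
  • Debounce fault signals to reduce false positives, but cap the debounce with a timeout so that a persistent or intermittent fault is still reported within a bounded time.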
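
The debounce-with-timeout practice above might look like the following sketch. The counts, the 500 ms cap, and the `debounce_*` names are assumptions chosen for illustration. The fault latches after three consecutive bad samples, or when a suspect condition has not cleared within the cap, which catches a flapping signal that never produces three bad samples in a row.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_BAD_COUNT   3u    /* consecutive bad samples that latch (assumed) */
#define DEBOUNCE_CLEAR_COUNT 5u    /* consecutive good samples that clear suspicion */
#define DEBOUNCE_CAP_MS      500u  /* upper bound on unresolved suspicion (assumed) */

typedef struct {
    uint8_t  bad_streak;        /* consecutive bad samples seen */
    uint8_t  good_streak;       /* consecutive good samples while suspect */
    bool     suspect;           /* a bad sample was seen and not yet cleared */
    uint32_t suspect_since_ms;  /* time of the first bad sample */
    bool     latched;           /* fault reported; stays set until reset */
} debounce_t;

/* Feed one sample; returns true once the fault is latched. */
bool debounce_update(debounce_t *d, bool sample_bad, uint32_t now_ms) {
    if (d->latched) {
        return true;
    }
    if (sample_bad) {
        d->good_streak = 0u;
        d->bad_streak++;
        if (!d->suspect) {
            d->suspect = true;
            d->suspect_since_ms = now_ms;
        }
    } else {
        d->bad_streak = 0u;
        if (d->suspect && ++d->good_streak >= DEBOUNCE_CLEAR_COUNT) {
            d->suspect = false;   /* signal has been clean long enough */
            d->good_streak = 0u;
        }
    }
    /* Latch on a solid fault, or when suspicion outlives the cap. */
    if (d->bad_streak >= DEBOUNCE_BAD_COUNT ||
        (d->suspect && (now_ms - d->suspect_since_ms) >= DEBOUNCE_CAP_MS)) {
        d->latched = true;
    }
    return d->latched;
}
```

A zero-initialized `debounce_t` is a valid starting state. The latch is deliberately sticky here; clearing it would be a separate, explicit recovery step.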

Patterns

Watchdog servicing with checks:

c
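// REQ-FT-WD-01; TEST-FT-03
#include <stdbool.h>

// Provided elsewhere in the firmware.
bool critical_tasks_healthy(void);
bool comms_alive(void);
void wdt_kick(void);

// Call periodically from the safety monitor, after the health checks have run.
void service_watchdog(void) {
    if (critical_tasks_healthy() && comms_alive()) {
        wdt_kick();
    }
    // Otherwise do not kick: the watchdog expires and resets into safe boot.
}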

Sensor plausibility check:

c
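// REQ-FT-SNS-02; TEST-FT-07
#include <stdbool.h>

// Accept only readings inside the expected range of 0 to 300 kPa.
// NaN fails both comparisons, so it is rejected as well.
bool validate_pressure(float p_kpa) {
    return (p_kpa >= 0.0f && p_kpa <= 300.0f);
}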
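
A range check alone accepts a reading that jumps implausibly fast while staying in range. The sketch below adds a rate-of-change limit on top of the same 0 to 300 kPa range. The step limit, the `pressure_history_t` type, and the function name are hypothetical; the real limit depends on the sensor and the sampling period.

```c
#include <stdbool.h>

#define PRESSURE_MIN_KPA      0.0f
#define PRESSURE_MAX_KPA      300.0f
#define PRESSURE_MAX_STEP_KPA 20.0f  /* assumed max change between samples */

typedef struct {
    float last_kpa;  /* last accepted reading */
    bool  has_last;  /* false until the first reading is accepted */
} pressure_history_t;

/* Returns true if the reading is in range and did not jump too far
 * from the last accepted reading. Rejected readings leave the history
 * unchanged. */
bool pressure_plausible(pressure_history_t *h, float p_kpa) {
    /* Written so that NaN fails the test and is rejected. */
    if (!(p_kpa >= PRESSURE_MIN_KPA && p_kpa <= PRESSURE_MAX_KPA)) {
        return false;
    }
    if (h->has_last) {
        float step = p_kpa - h->last_kpa;
        if (step < 0.0f) {
            step = -step;
        }
        if (step > PRESSURE_MAX_STEP_KPA) {
            return false;
        }
    }
    h->last_kpa = p_kpa;
    h->has_last = true;
    return true;
}
```

Because rejected samples do not update the history, a genuine step change would be rejected repeatedly. A caller would normally count consecutive rejections and raise a sensor fault once a limit is reached, rather than silently dropping samples forever.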

Redundant reading consistency check (two sensors):

c
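// REQ-FT-RED-01; TEST-FT-10
#include <math.h>

void alarm_sensor_disagree(void);
void enter_safe_state(void);

// Returns the mean of two redundant readings, or NAN if they differ by
// more than 2.0 or either one is NaN. Callers must check with isnan().
float fused_temp(float a, float b) {
    if (!(fabsf(a - b) <= 2.0f)) {
        alarm_sensor_disagree();
        enter_safe_state();
        return NAN;  // never pass an average of disagreeing sensors on
    }
    return (a + b) * 0.5f;
}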
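
Two sensors can only detect disagreement; they cannot tell which one is wrong. With a third sensor, a median-based voter can tolerate one outlier. The following is a sketch under stated assumptions: `vote_temp` and `vote_result_t` are hypothetical names, the 2.0 tolerance is reused from the two-sensor example, and every input is assumed to have passed a range check already (so no NaN reaches the voter).

```c
#include <stdbool.h>

#define TEMP_AGREE_TOL 2.0f  /* same tolerance as the two-sensor check */

typedef struct {
    float value;  /* median of the three readings */
    bool  valid;  /* true if at least two readings agree with the median */
} vote_result_t;

static float abs_diff(float x, float y) {
    float d = x - y;
    return (d < 0.0f) ? -d : d;
}

/* Median of three: clamp c into the interval spanned by a and b. */
static float median3(float a, float b, float c) {
    float lo = (a < b) ? a : b;
    float hi = (a < b) ? b : a;
    if (c < lo) {
        return lo;
    }
    if (c > hi) {
        return hi;
    }
    return c;
}

/* Inputs are assumed to be range-checked, non-NaN readings. */
vote_result_t vote_temp(float a, float b, float c) {
    const float readings[3] = { a, b, c };
    vote_result_t result;
    int agree = 0;

    result.value = median3(a, b, c);
    for (int i = 0; i < 3; i++) {
        if (abs_diff(readings[i], result.value) <= TEMP_AGREE_TOL) {
            agree++;
        }
    }
    /* The median always agrees with itself, so a majority needs one more. */
    result.valid = (agree >= 2);
    return result;
}
```

When `valid` is false, no two sensors agree and the caller would raise the alarm and enter the safe state, as in the two-sensor pattern. A single outlier leaves `valid` true, but it is still worth logging so that the degraded redundancy is visible.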

Anti-Patterns (risks)

  • Servicing the watchdog unconditionally in the main loop -> risk: hides deadlocks and hung tasks.
  • Single-point sensors without plausibility checks -> risk: unsafe outputs.
  • Logging faults without annunciation where required -> risk: latent hazards.
  • No degraded mode or safe fallback -> risk: uncontrolled failure behavior.

Verification Checklist

  • Fault monitors implemented for critical resources with thresholds/timeouts.
  • Watchdog configuration reviewed; serviced only after health checks.
  • Degraded modes or safe state defined and reachable on fault.
  • Redundancy/plausibility checks implemented for critical sensors/paths.
  • Self-tests executed at startup/periodically; failures handled deterministically.
  • Errors contained at boundaries; no unchecked propagation.
  • Faults logged and annunciated as applicable; integrity of logs maintained.
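
For the log-integrity item, one possible approach is to chain each fault record to its predecessor with a checksum, so that an edited, reordered, or missing record breaks the chain at verification time. The sketch below uses CRC-32 purely for illustration; all names, the seed, and the record layout are hypothetical. A CRC detects accidental corruption and naive edits only. Resistance to deliberate tampering needs a keyed MAC or a digital signature in its place.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 16u
#define CHAIN_SEED   0x5AFE5AFEu  /* hypothetical starting value */

typedef struct {
    uint32_t seq;
    uint32_t timestamp_ms;
    uint16_t fault_code;
    uint32_t chain;  /* checksum over this record, seeded by the previous one */
} fault_record_t;

typedef struct {
    fault_record_t records[LOG_CAPACITY];
    size_t count;
} fault_log_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), continuing from crc. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Serialize fields explicitly so struct padding never enters the checksum. */
static uint32_t chain_value(uint32_t prev, const fault_record_t *r) {
    const uint8_t bytes[10] = {
        (uint8_t)(r->seq),          (uint8_t)(r->seq >> 8),
        (uint8_t)(r->seq >> 16),    (uint8_t)(r->seq >> 24),
        (uint8_t)(r->timestamp_ms), (uint8_t)(r->timestamp_ms >> 8),
        (uint8_t)(r->timestamp_ms >> 16), (uint8_t)(r->timestamp_ms >> 24),
        (uint8_t)(r->fault_code),   (uint8_t)(r->fault_code >> 8)
    };
    return crc32_update(prev, bytes, sizeof bytes);
}

bool fault_log_append(fault_log_t *log, uint32_t timestamp_ms, uint16_t fault_code) {
    if (log->count >= LOG_CAPACITY) {
        return false;  /* full: the caller decides how to handle overflow */
    }
    fault_record_t *r = &log->records[log->count];
    uint32_t prev = (log->count == 0u) ? CHAIN_SEED
                                       : log->records[log->count - 1u].chain;
    r->seq = (uint32_t)log->count;
    r->timestamp_ms = timestamp_ms;
    r->fault_code = fault_code;
    r->chain = chain_value(prev, r);
    log->count++;
    return true;
}

/* Returns true if every record matches its sequence number and chain value. */
bool fault_log_verify(const fault_log_t *log) {
    uint32_t prev = CHAIN_SEED;
    for (size_t i = 0; i < log->count; i++) {
        const fault_record_t *r = &log->records[i];
        if (r->seq != (uint32_t)i || r->chain != chain_value(prev, r)) {
            return false;
        }
        prev = r->chain;
    }
    return true;
}
```

Truncation from the end of the log is not detected by the chain alone. Storing the latest chain value or record count in a separate protected location is one way to cover that case.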

Traceability

  • Link each REQ-FT-### requirement to its hazards and risk controls, and map it to the tests (TEST-FT-###) that verify it.
  • Store watchdog and fault monitor configuration with release artifacts.

References

  • IEC 62304 design/implementation expectations (fault control).
  • ISO 14971 for risk-driven fault handling.
  • IEC 60601-1 (power/brownout considerations; informative).

Changelog

  • 1.0.0 (2026-01-04): Initial fault tolerance patterns with watchdog, redundancy, and safe fallback guidance.

Audit History

  • 2026-01-04: Audit performed. Verified:
    • Fault tolerance patterns technically accurate
    • IEC 60601-1 reference appropriate as informative for power/brownout considerations
    • Watchdog and redundancy patterns follow industry best practices