AgentSkillsCN

fault-tree-analysis

运用布尔逻辑门,系统性地识别并分析系统故障的成因,开展故障树分析(FTA)。这是一种自上而下的演绎方法,广泛应用于安全与可靠性工程领域。当您需要分析系统故障、评估安全关键型设计、计算故障概率、识别最小割集、评估冗余有效性,或当用户提及“故障树”、“FTA”、“系统故障分析”、“最小割集”、“安全分析”、“故障概率”、“AND/OR门”,或需要追溯从顶层事件到基本事件的故障路径时,可选用此技能。支持定性结构分析与定量概率计算。

SKILL.md
--- frontmatter
name: fault-tree-analysis
description: Conduct Fault Tree Analysis (FTA) to systematically identify and analyze causes of system failures using Boolean logic gates. Top-down deductive method for safety and reliability engineering. Use when analyzing system failures, evaluating safety-critical designs, calculating failure probabilities, identifying minimal cut sets, assessing redundancy effectiveness, or when user mentions "fault tree", "FTA", "system failure analysis", "minimal cut sets", "safety analysis", "failure probability", "AND/OR gates", or needs to trace failure pathways from top event to basic events. Supports qualitative structure analysis and quantitative probability calculations.
tools: Read, Bash
model: sonnet
color: red
field: quality
expertise: expert

Fault Tree Analysis (FTA)

Conduct systematic Fault Tree Analysis using a structured, Q&A-based approach with Boolean logic gates, minimal cut set identification, and optional probability calculations.

Overview

Fault Tree Analysis is a top-down, deductive failure analysis method that maps how combinations of lower-level events (basic events) lead to an undesired system-level event (top event). Uses Boolean logic gates (AND, OR) to represent relationships between events.

Key Principle: One fault tree analyzes one specific undesired event. Start at the top (what failed?) and work down (what caused it?).

Analysis Types:

  • Qualitative: Identify failure pathways, minimal cut sets, single points of failure
  • Quantitative: Calculate failure probabilities using component failure data

Workflow

Phase 1: System Definition & Scope

Collect from user:

  1. What system or process is being analyzed?
  2. What are the system boundaries (what's in scope vs. out of scope)?
  3. What are the operating conditions and assumptions?
  4. What documentation exists (schematics, P&IDs, operating procedures)?
  5. What is the purpose of this analysis (design review, incident investigation, safety case)?

Outputs:

  • System description with boundaries
  • Operating mode(s) under analysis
  • List of assumptions and exclusions

Phase 2: Top Event Definition

Collect from user:

  1. What is the single undesired outcome to analyze?
  2. How is this event defined (what state constitutes "failure")?
  3. What is the severity/criticality of this event?
  4. What is the mission time or exposure period?

Quality Gate - Top Event Must Be:

  • Single, specific, unambiguous event
  • Clearly defined failure state (not vague)
  • At appropriate system level (not too high or too low)
  • Observable or detectable

Good Example: "Pump fails to deliver required flow rate (>100 GPM) during normal operation" Poor Example: "System doesn't work" (too vague)

Phase 3: Fault Tree Construction

Build the tree iteratively from top to bottom:

For each event (starting with top event):

  1. Identify immediate causes: "What events could directly cause this?"
  2. Determine gate type:
    • OR gate: ANY one cause is sufficient (independent causes)
    • AND gate: ALL causes required simultaneously (redundancy/barriers)
  3. Classify event type:
    • Intermediate event (rectangle): Requires further development
    • Basic event (circle): Component failure, terminal point
    • Undeveloped event (diamond): Insufficient data or out of scope
    • House event (house symbol): Normal occurrence, switch on/off
    • External event (house): Environmental or expected condition
  4. Continue developing until all branches terminate in basic/undeveloped events

Stopping Criteria for Branch Development:

  • Component-level failure reached (basic event)
  • Out of scope (undeveloped event)
  • Normal expected condition (house event)
  • Insufficient information available

Critical Rules:

  • Each event must have clear, unambiguous description
  • No redundant events (same failure in multiple places)
  • No "miracles" (events that cannot physically occur)
  • Consistent naming conventions throughout

Phase 4: Qualitative Analysis

Identify Minimal Cut Sets (MCS): Minimal cut sets are the smallest combinations of basic events that cause the top event.

  • Order 1 MCS (single events): Most critical - single points of failure
  • Order 2 MCS (pairs): Critical for redundant systems
  • Higher order MCS: Less critical, require multiple failures

Analysis Tasks:

  1. List all minimal cut sets by order
  2. Identify single points of failure (Order 1)
  3. Assess common cause failure potential
  4. Evaluate effectiveness of redundancy

Run python scripts/calculate_fta.py --qualitative for automated MCS extraction.

Phase 5: Quantitative Analysis (Optional)

If failure probability data is available:

Collect failure data for each basic event:

  • Failure rate (λ) or probability (P)
  • Mission time or exposure period
  • Data source (field data, handbook, estimate)
  • Confidence level

Calculations:

  • OR gate: P(output) ≈ P(A) + P(B) - P(A)×P(B) ≈ P(A) + P(B) for small probabilities
  • AND gate: P(output) = P(A) × P(B) (for independent events)

Calculate:

  1. Probability of each minimal cut set
  2. Top event probability (sum of MCS probabilities with adjustments for overlapping events)
  3. Importance measures (Fussell-Vesely, Birnbaum)

Run python scripts/calculate_fta.py --quantitative with probability data.

Phase 6: Common Cause Failure Analysis

Identify potential common causes across basic events:

  • Environmental (temperature, humidity, EMI)
  • Manufacturing (batch defects, supplier issues)
  • Maintenance (common procedures, same personnel)
  • Design (same components, shared software)
  • Human error (operator mistakes, procedure gaps)

For AND gates (redundant systems): Common cause failures can defeat redundancy. Apply beta-factor model if quantifying:

  • P(CCF) = β × P(independent failure)
  • Typical β values: 1-10% depending on diversity measures

Phase 7: Documentation & Reporting

Generate professional outputs:

  • python scripts/generate_diagram.py - SVG fault tree diagram
  • python scripts/generate_report.py - Comprehensive HTML report

Symbols Reference

SymbolNameDescription
RectangleIntermediate EventFault resulting from combination of inputs; requires gate
CircleBasic EventComponent failure; terminal event with probability data
DiamondUndeveloped EventNot further developed (out of scope or insufficient data)
HouseHouse EventExpected occurrence; can be set TRUE/FALSE
Flat OR gateOR GateOutput if ANY input occurs
Flat AND gateAND GateOutput if ALL inputs occur
TriangleTransferConnects to another tree section

Quality Scoring

Each analysis scored on six dimensions (see references/quality-rubric.md):

DimensionWeightDescription
System Definition15%Clear boundaries, assumptions, operating conditions
Top Event Clarity15%Specific, unambiguous, appropriate level
Tree Completeness25%All pathways developed, no gaps, consistent logic
Minimal Cut Sets20%Correctly identified, analyzed for SPOFs
Quantification15%Accurate calculations, appropriate data sources
Actionability10%Identifies design improvements, risk mitigations

Scoring Scale: Each dimension rated 1-5 (Inadequate to Excellent) Overall Score: Weighted average × 20 = 0-100 points Passing Threshold: 70 points minimum

Run python scripts/score_analysis.py to calculate scores.

Common Pitfalls

See references/common-pitfalls.md for:

  • Incorrect gate selection (AND vs OR confusion)
  • Top event too vague or at wrong level
  • Missing common cause failures
  • Incomplete branch development
  • Ignoring human factors
  • Double-counting events

Examples

See references/examples.md for worked examples:

  • Pump system failure
  • Control system loss of function
  • Safety interlock bypass
  • Manufacturing equipment hazard

Integration with Other Tools

  • FMEA/FMECA: Bottom-up complements top-down FTA; use FMEA to identify basic events
  • 5 Whys: Use for detailed investigation of specific failure pathways
  • Fishbone Diagram: Brainstorm potential causes before structuring in FTA
  • Reliability Block Diagram: Alternative view of system reliability
  • Event Tree Analysis: Use FTA for initiating event probabilities

When to Use FTA

Good candidates:

  • Safety-critical system design review
  • Accident/incident investigation
  • Regulatory compliance demonstration
  • Redundancy effectiveness evaluation
  • System failure probability estimation

Consider alternatives when:

  • Need to catalog ALL failure modes (use FMEA)
  • Analyzing success paths (use Success Tree/RBD)
  • Time-sequential dependencies critical (use Event Tree)