Transform vague "AI should do X" requests into properly structured user stories with AI-specific requirements. The goal is a story that developers can estimate, sprint planners can commit to, and governance can approve.

Core principle: An AI story isn't ready for sprint until you can answer: "What accuracy is acceptable? What happens on low confidence? What data exists to train/evaluate this?"

Refinement Status Categories

dot

digraph status {
    rankdir=LR;
    node [shape=box];

    raw [label="Raw Idea"];
    needs_feas [label="NEEDS_FEASIBILITY\n(Send to feasibility tester)"];
    needs_clar [label="NEEDS_CLARIFICATION\n(Questions block sprint)"];
    blocked [label="BLOCKED\n(Prerequisites not met)"];
    ready [label="READY_FOR_SPRINT\n(Can commit)"];

    raw -> needs_feas [label="uncertain if LLM can do this"];
    raw -> needs_clar [label="unclear requirements"];
    raw -> blocked [label="missing data/integration"];
    raw -> ready [label="all clear"];
}

| Status | Meaning | Action |
| --- | --- | --- |
| READY_FOR_SPRINT | All acceptance criteria clear, data exists, integrations accessible | Can include in sprint planning |
| NEEDS_CLARIFICATION | Open questions block commitment | List questions, get answers first |
| NEEDS_FEASIBILITY | Uncertain if LLM can achieve required accuracy | Run through prompt-feasibility-tester |
| BLOCKED | Prerequisites not met | List blockers, cannot plan until resolved |
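
The status categories can be read as a small decision function. The sketch below is one possible encoding, not a prescribed implementation: the `StoryFacts` flags are hypothetical names, and the precedence used when several conditions hold at once (blockers first, then feasibility, then clarity) is an assumption the source does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    READY_FOR_SPRINT = "READY_FOR_SPRINT"
    NEEDS_CLARIFICATION = "NEEDS_CLARIFICATION"
    NEEDS_FEASIBILITY = "NEEDS_FEASIBILITY"
    BLOCKED = "BLOCKED"


@dataclass
class StoryFacts:
    # Hypothetical flags; the source does not define these names.
    prerequisites_met: bool   # data exists, integrations accessible
    feasibility_known: bool   # confident an LLM can reach the required accuracy
    requirements_clear: bool  # no open questions block commitment


def refinement_status(facts: StoryFacts) -> Status:
    # Assumed precedence: hard blockers, then feasibility, then clarity.
    if not facts.prerequisites_met:
        return Status.BLOCKED
    if not facts.feasibility_known:
        return Status.NEEDS_FEASIBILITY
    if not facts.requirements_clear:
        return Status.NEEDS_CLARIFICATION
    return Status.READY_FOR_SPRINT
```

A story with data and integrations in place but an untested accuracy target would come back as `Status.NEEDS_FEASIBILITY` under this ordering.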

Output Format

yaml

story:
  id: "[PROJECT-XXX]"
  title: "[Concise capability name]"

  user_story:
    role: "[Who benefits]"
    capability: "[What they get]"
    value: "[Why it matters - specific, measurable]"

  acceptance_criteria:
    accuracy:
      - criterion: "[Specific metric]"
        threshold: "[Number]"
        measurement: "[How verified]"

    confidence_handling:
      high:
        threshold: "[≥ X.XX]"
        action: "[What happens]"
      medium:
        threshold: "[X.XX - Y.YY]"
        action: "[What happens]"
      low:
        threshold: "[< Z.ZZ]"
        action: "[What happens]"

    human_oversight:
      - "[Specific oversight requirement]"

    fallback_behavior:
      api_failure: "[What happens]"
      low_confidence: "[What happens]"
      unexpected_input: "[What happens]"
      edge_cases: "[What happens]"

  technical_requirements:
    input:
      format: "[What goes in]"
      constraints: "[Size limits, types]"
    output:
      format: "[What comes out]"
      schema: "[Structure]"
    latency: "[P95 requirement]"
    volume: "[Daily/hourly throughput]"
    integrations:
      - system: "[System name]"
        access: "[Read/Write]"
        status: "[Available/Needed]"

  data_requirements:
    training_data:
      source: "[Where from]"
      quantity: "[How much]"
      availability: "[Exists/Must create]"
      timeline: "[Weeks to obtain]"
    evaluation_set:
      size: "[Number of items]"
      ground_truth: "[Who provides labels]"
      availability: "[Exists/Must create]"
    ongoing:
      refresh_frequency: "[How often]"
      feedback_loop: "[How corrections flow back]"

  definition_of_done:
    metrics:
      - "[Specific metric with threshold]"
    testing:
      - "[Required test type]"
    governance:
      - "[Required approval]"

  refinement_status: "[READY_FOR_SPRINT | NEEDS_CLARIFICATION | NEEDS_FEASIBILITY | BLOCKED]"

  # If not READY_FOR_SPRINT:
  open_questions:
    - "[Question that must be answered]"
  blockers:
    - description: "[Thing that must be resolved]"
      timeline: "[Estimated time to unblock]"

  dependencies:
    must_exist_before_sprint:
      - "[Hard prerequisite]"
    can_parallel:
      - "[Can develop in parallel]"

  estimated_effort: "[Story points or T-shirt size]"
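
A filled-in template can be sanity-checked before it reaches sprint planning. The following is a minimal sketch, assuming the content under `story:` has already been loaded into a Python dict (for example by a YAML parser). The function name `story_problems` and the choice of which sections count as required are assumptions made for illustration.

```python
REQUIRED_SECTIONS = (
    "user_story",
    "acceptance_criteria",
    "technical_requirements",
    "data_requirements",
    "definition_of_done",
    "refinement_status",
)

VALID_STATUSES = {
    "READY_FOR_SPRINT",
    "NEEDS_CLARIFICATION",
    "NEEDS_FEASIBILITY",
    "BLOCKED",
}


def story_problems(story: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks fine."""
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in story
    ]
    if "refinement_status" in story:
        status = story["refinement_status"]
        if status not in VALID_STATUSES:
            problems.append(f"unknown refinement_status: {status!r}")
        elif status != "READY_FOR_SPRINT":
            # The template asks for open questions or blockers whenever
            # the story is not ready for sprint.
            if not (story.get("open_questions") or story.get("blockers")):
                problems.append(
                    f"{status} requires open_questions or blockers"
                )
    return problems
```

This checks only the shape of the story. It does not judge whether thresholds or data sources are realistic.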

AI-Specific Acceptance Criteria

1. Accuracy Requirements

yaml

accuracy:
  - criterion: "Classification accuracy"
    threshold: "≥ 98%"
    measurement: "On 1,000-item evaluation set"
  - criterion: "False positive rate"
    threshold: "≤ 0.5%"
    measurement: "Human audit of sample"
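
The two example criteria can be checked mechanically once an evaluation set exists. The sketch below assumes a binary classifier, computes both metrics from the same labeled set (the example above measures the false positive rate by human audit instead), and defines false positive rate as false positives divided by all actual negatives. The function name `evaluate` and its parameters are hypothetical.

```python
def evaluate(
    predictions: list[bool],
    labels: list[bool],
    min_accuracy: float = 0.98,  # "≥ 98%" from the example criteria
    max_fpr: float = 0.005,      # "≤ 0.5%" from the example criteria
) -> dict:
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")

    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    false_positives = sum(p and not y for p, y in pairs)
    negatives = sum(not y for y in labels)

    accuracy = correct / len(labels)
    # With no actual negatives, the rate is undefined; report 0.0 here.
    fpr = false_positives / negatives if negatives else 0.0

    return {
        "accuracy": accuracy,
        "false_positive_rate": fpr,
        "passed": accuracy >= min_accuracy and fpr <= max_fpr,
    }
```

A story can pass the accuracy threshold and still fail on false positives, which is why the criteria list both.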

2. Confidence Handling (Required)

3. Human Oversight Model

4. Fallback Behavior

yaml

fallback_behavior:
  api_failure: "Queue for retry, alert after 3 failures"
  low_confidence: "Route to human queue"
  unexpected_input: "Reject with clear error, log for review"
  edge_cases: "Flag for senior review"
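
The `api_failure` rule ("alert after 3 failures") can be sketched as a bounded retry loop. This is a simplification: it retries inline rather than through a queue, and `call_model` and `alert` are hypothetical callables supplied by the caller, not part of any real client library.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # "alert after 3 failures"


def call_with_fallback(
    call_model: Callable[[str], str],
    item: str,
    alert: Callable[[str], None],
) -> Optional[str]:
    """Try the model up to MAX_ATTEMPTS times; alert and return None on failure."""
    last_error: Optional[Exception] = None
    for _ in range(MAX_ATTEMPTS):
        try:
            return call_model(item)
        except Exception as exc:  # a real client should catch its own error type
            last_error = exc
    alert(f"{MAX_ATTEMPTS} failed attempts for {item!r}: {last_error}")
    return None
```

A `None` result signals the caller to apply the next fallback, for example routing the item to the human queue.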

Data Requirements Checklist

Refinement Questions

Accuracy

Confidence Handling

Human Oversight

Data

Integration

Governance

Common Mistakes

| Mistake | Why It's Wrong | Do This Instead |
| --- | --- | --- |
| "AI should do X" | Vague value proposition | State a specific, measurable outcome |
| "Accuracy should be high" | No threshold | "≥ 98% on evaluation set" |
| "We'll review edge cases" | No structure | Define confidence tiers |
| "Data is available" | Unverified | Name the source, quantity, and timeline |
| "Ready for sprint" | No checklist behind the claim | Assign a refinement status and list open questions |
| Binary confidence | Only high/low, no middle path | Three tiers, each with an action |
| No fallback | Assumes success | Handle every failure mode |

| Confidence Tier | Threshold | Typical Actions |
| --- | --- | --- |
| HIGH | ≥ 0.95 | Auto-process, audit sample |
| MEDIUM | 0.80 to < 0.95 | Route to human review |
| LOW | < 0.80 | Escalate, don't auto-process |
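
The tiers above map directly onto a routing function. The sketch below is illustrative: the return labels (`auto_process`, `human_review`, `escalate`) are hypothetical names, and any score from 0.80 up to, but not including, 0.95 is treated as MEDIUM so that no value falls between tiers.

```python
HIGH_THRESHOLD = 0.95
MEDIUM_THRESHOLD = 0.80


def route(confidence: float) -> str:
    """Map a model confidence score in [0, 1] to a handling path."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= HIGH_THRESHOLD:
        return "auto_process"   # HIGH: auto-process, audit a sample
    if confidence >= MEDIUM_THRESHOLD:
        return "human_review"   # MEDIUM: route to human review
    return "escalate"           # LOW: escalate, never auto-process
```

Keeping the thresholds as named constants makes them easy to tune per story once evaluation results are available.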

| Data Type | Must Answer |
| --- | --- |
| Training data | Source? Quantity? Exists or must create? Timeline? |
| Evaluation set | Size? Ground truth provider? Labeled? |
| Ongoing maintenance | Refresh frequency? Feedback mechanism? |

