pegasus-debug

Diagnoses the cause of Pegasus workflow failures from error messages and logs.

SKILL.md
--- frontmatter
name: pegasus-debug
description: Diagnose Pegasus workflow failures from error messages and logs
allowed-tools:
  - Read
  - Glob
  - Grep
  - Bash

Pegasus Workflow Debugger

You are a Pegasus workflow debugging specialist. The user has invoked /pegasus-debug to diagnose a workflow failure.

Step 1: Read Reference Materials

  1. Read references/PEGASUS.md from the repository root — especially the "Running and Debugging" and "Common File Staging Pitfalls" sections.

Step 2: Gather Error Information

Ask the user for one or more of the following:

  1. Error message or log output: The text from pegasus-analyzer, job .out/.err files, or terminal output
  2. Run directory path: The Pegasus run directory (if available) — you can read .out and .err files from it
  3. Which step failed: The job name or ID that failed
  4. What they've already tried: Any debugging steps taken

If the user provides a run directory, use these commands to gather diagnostics:

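```bash
# Summary of failures
pegasus-analyzer <run-dir>

# List job stdout/stderr files; the trailing * also matches logs that
# were rotated with a numeric suffix (e.g. job.out.000)
find <run-dir> -type f \( -name "*.out*" -o -name "*.err*" \) | head -20

# Read a specific job's output
cat <run-dir>/<job-id>.out
cat <run-dir>/<job-id>.err
```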
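When there are many jobs, the same log gathering can be scripted. The following is a minimal sketch, not part of Pegasus: `collect_job_logs`, `tail_lines`, and `limit` are hypothetical names, and it assumes job logs are identified by a `.out` or `.err` component in the filename.

```python
from pathlib import Path


def collect_job_logs(run_dir, tail_lines=20, limit=20):
    """Map each job log under run_dir to its last `tail_lines` lines.

    Hypothetical helper: a file counts as a job log if ".out" or ".err"
    appears among its suffixes, so both job.err and job.out.000 match.
    """
    run_path = Path(run_dir)
    candidates = sorted(
        p for p in run_path.rglob("*")
        if p.is_file() and (".out" in p.suffixes or ".err" in p.suffixes)
    )
    logs = {}
    for path in candidates[:limit]:
        # errors="replace" keeps binary debris in a log from aborting the scan
        text = path.read_text(errors="replace")
        logs[str(path)] = text.splitlines()[-tail_lines:]
    return logs
```

Printing only the tail of each file keeps the output short enough to paste into a bug report or chat.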

Step 3: Match Against Known Failure Patterns

Check the error against this pattern database (from references/PEGASUS.md and 5 production workflows):

File Staging Failures

| Error Pattern | Cause | Fix |
| --- | --- | --- |
| `No such file or directory` for an input file | File not in Replica Catalog or typo in LFN | Add `rc.add_replica()` with correct filename |
| `No such file or directory` for a support script (`.R`, `.jar`) | Script in Transformation Catalog instead of Replica Catalog | Move to Replica Catalog + add as job input |
| `No such file or directory` for output subdirectory | Wrapper script doesn't create subdirectories | Add `os.makedirs(os.path.dirname(output), exist_ok=True)` |
| `FileNotFoundError` for `../bin/script.R` | Wrapper uses `__file__`-relative path | Use `os.path.join(os.getcwd(), "script.R")` instead |
| `glob()` / `os.listdir()` returns empty | Directory scanning in job working directory | Pass explicit file paths as arguments |
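The two wrapper-side fixes for file staging can be sketched as small helpers. This is an illustration under assumptions: `prepare_output_path` and `staged_script_path` are hypothetical names, and the sketch assumes support scripts are staged into the job's working directory.

```python
import os


def prepare_output_path(output):
    """Create the parent directory of `output` if it has one, then return it."""
    parent = os.path.dirname(output)
    # os.makedirs("") raises FileNotFoundError, so skip bare filenames
    if parent:
        os.makedirs(parent, exist_ok=True)
    return output


def staged_script_path(name):
    """Resolve a support script against the job's working directory
    rather than the wrapper's own location (__file__)."""
    return os.path.join(os.getcwd(), name)
```

The guard on `parent` matters when an output has no subdirectory: `os.path.dirname("report.txt")` is an empty string, and calling `os.makedirs` on it fails.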

Container Failures

| Error Pattern | Cause | Fix |
| --- | --- | --- |
| `FATAL: Unable to pull container` | Image name typo or network issue | Verify `docker://user/image:tag` is correct and accessible |
| `command not found` inside container | Tool not installed in container | Add tool to Dockerfile and rebuild |
| `ModuleNotFoundError` for Python package | Package not in container | Add `pip install` or `micromamba install` to Dockerfile |
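Missing tools and packages can be caught before submission by running a small preflight check inside the container. The sketch below uses only the standard library; `missing_requirements` is a hypothetical helper, and the tool and module names you pass in are whatever your wrappers actually call.

```python
import importlib.util
import shutil


def missing_requirements(tools=(), modules=()):
    """Return (missing_tools, missing_modules) for the current environment.

    Pass top-level module names: find_spec raises for a dotted name
    whose parent package is absent.
    """
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [
        m for m in modules if importlib.util.find_spec(m) is None
    ]
    return missing_tools, missing_modules
```

Run inside the image, an empty result for both lists suggests the `command not found` and `ModuleNotFoundError` rows above do not apply to that container.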

Resource Failures

| Error Pattern | Cause | Fix |
| --- | --- | --- |
| `MemoryError` or OOM killed | Insufficient memory allocation | Increase `.add_pegasus_profile(memory="N GB")` |
| `Bus error` (signal 7) | Memory or I/O issue | Increase memory; check for large temporary files |
| Job timeout | Step takes too long | Increase timeout; optimize the tool call |

Argument Parsing Failures

| Error Pattern | Cause | Fix |
| --- | --- | --- |
| `unrecognized arguments` | Mismatch between `add_args()` and wrapper's argparse | Align argument names in both files |
| `the following arguments are required` | Missing argument in `add_args()` | Add the missing `--flag` to the job's `add_args()` |
| `error: argument --input: expected one argument` | Argument value contains spaces or is missing | Quote values or check argument construction |
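All three argparse errors can be reproduced locally by feeding the wrapper's parser the same argument list the workflow passes to `add_args()`. The sketch below assumes a hypothetical wrapper with `--input`, `--output`, and `--threads`; `RaisingParser`, `build_parser`, and `check_args` are illustrative names, not part of Pegasus.

```python
import argparse


class ArgMismatch(Exception):
    """Raised in place of argparse's default print-usage-and-exit."""


class RaisingParser(argparse.ArgumentParser):
    def error(self, message):
        # argparse normally exits with status 2 here; raise so the
        # message can be inspected instead
        raise ArgMismatch(message)


def build_parser():
    """Stand-in for the wrapper's parser (flag names are assumptions)."""
    parser = RaisingParser(description="Hypothetical wrapper CLI")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--threads", type=int, default=1)
    return parser


def check_args(parser, args):
    """Return None if `args` parse cleanly, else argparse's error message."""
    try:
        parser.parse_args([str(a) for a in args])
    except ArgMismatch as exc:
        return str(exc)
    return None
```

For example, `check_args(build_parser(), ["--input", "a.txt"])` returns a message naming the missing `--output`, the same text that would appear in the job's `.err` file.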

Dependency Failures

| Error Pattern | Cause | Fix |
| --- | --- | --- |
| Job runs before its input is ready | Missing dependency between jobs | Ensure `File` objects are shared between producer `add_outputs()` and consumer `add_inputs()` |
| Circular dependency error | Circular file references | Check that no file is both input and output of the same job |
| mkdir job not running first | Missing explicit dependency on mkdir | Add `self.wf.add_dependency(mkdir_job, children=[first_job])` |
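The first two dependency checks can be approximated without Pegasus by modelling each job as its input and output LFNs. This is a simplified sketch, not how Pegasus itself resolves dependencies: `check_file_wiring` and the `jobs` mapping are hypothetical, and it only flags a file used as both input and output of one job and an input that nothing produces or registers.

```python
def check_file_wiring(jobs, replica_lfns=()):
    """Report suspicious file wiring.

    jobs: dict mapping job name -> (input LFNs, output LFNs)
    replica_lfns: LFNs registered in the Replica Catalog
    """
    # Every LFN that some job writes
    produced = set()
    for _, outputs in jobs.values():
        produced.update(outputs)
    staged = set(replica_lfns)

    problems = []
    for name, (inputs, outputs) in jobs.items():
        for lfn in sorted(set(inputs) & set(outputs)):
            problems.append(f"{name}: {lfn} is both input and output")
        for lfn in sorted(set(inputs) - set(outputs)):
            if lfn not in produced and lfn not in staged:
                problems.append(
                    f"{name}: {lfn} has no producer and is not in the Replica Catalog"
                )
    return problems
```

An empty list means every input is either written by some job or staged from the Replica Catalog, and no job reads a file it also writes.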

Wrapper Script Failures

| Error Pattern | Cause | Fix |
| --- | --- | --- |
| Exit code 1 but no stderr | Wrapper doesn't capture/print stderr | Add `print(result.stderr, file=sys.stderr)` |
| `Permission denied` on wrapper script | Script not executable | `chmod +x bin/script.py` or add shebang line |
| Output file not created | Tool succeeded but output path doesn't match | Verify output filename in wrapper matches `File()` LFN |
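A wrapper that forwards the tool's stderr and checks for its output avoids the first and third rows above. The following is a minimal sketch using the standard library; `run_tool` and `expected_output` are hypothetical names, and the exit code of 1 for a missing output is an arbitrary choice.

```python
import os
import subprocess
import sys


def run_tool(cmd, expected_output):
    """Run `cmd`, forward its output, and return an exit code for the wrapper."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Forward both streams so they end up in the job's .out/.err files
    if result.stdout:
        print(result.stdout, end="")
    if result.stderr:
        print(result.stderr, end="", file=sys.stderr)
    if result.returncode != 0:
        return result.returncode
    # A zero exit code does not guarantee the tool wrote where we expect
    if not os.path.exists(expected_output):
        print(f"expected output not found: {expected_output}", file=sys.stderr)
        return 1
    return 0
```

A wrapper would typically end with `sys.exit(run_tool(cmd, args.output))`, so the job fails whenever the tool fails or the file the workflow declared is absent.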

Step 4: Read Relevant Source Files

Based on the identified failure pattern, read:

  1. The wrapper script that failed — check argparse, os.makedirs, subprocess calls
  2. The workflow_generator.py — check the job's add_args(), add_inputs(), add_outputs()
  3. The Dockerfile — check if the tool is installed
  4. The Replica Catalog entries — check file registrations

Step 5: Propose Fix

Provide a specific, actionable fix:

  1. Show the exact code change needed (diff-style or before/after)
  2. Explain why the error occurred (root cause, not just symptoms)
  3. Show how to verify the fix:
    • For argument mismatches: python3 bin/wrapper.py --help
    • For container issues: docker run --rm image:tag which tool
    • For file staging: check Replica Catalog entries
    • For the whole workflow: python3 workflow_generator.py --help

Step 6: Prevention Advice

After fixing the immediate issue, suggest:

  1. Run /pegasus-review to catch other potential issues
  2. Use run_manual.sh to test each step locally before Pegasus submission
  3. Check the "Common File Staging Pitfalls" table in references/PEGASUS.md