Debugging Workflow Failures in Terra

This skill teaches you how to efficiently debug failed Terra workflow submissions without exhausting your context window.

Efficient Debugging Sequence

Follow this sequence to minimize context usage while maximizing diagnostic information:

get_submission_status(workspace_namespace, workspace_name, submission_id)

This returns the submission's status_summary and its workflows array.

What to look for: Identify the IDs of failed workflows from the status_summary and workflows array.
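
As a rough sketch of this first triage step, the snippet below pulls failed workflow IDs out of a status response. The response shape (the `workflowId` and `status` keys inside `workflows`) and the helper name `failed_workflow_ids` are assumptions for illustration, not the tool's documented schema.

```python
def failed_workflow_ids(status: dict) -> list[str]:
    """Return the IDs of workflows whose status is "Failed".

    Assumes each entry in status["workflows"] is a dict with
    hypothetical "workflowId" and "status" keys.
    """
    return [
        wf["workflowId"]
        for wf in status.get("workflows", [])
        if wf.get("status") == "Failed"
    ]


# Hypothetical response, shaped only for this example.
example_status = {
    "status_summary": {"Succeeded": 1, "Failed": 2},
    "workflows": [
        {"workflowId": "wf-1", "status": "Succeeded"},
        {"workflowId": "wf-2", "status": "Failed"},
        {"workflowId": "wf-3", "status": "Failed"},
    ],
}

print(failed_workflow_ids(example_status))
```

Keeping only the IDs, rather than the whole response, is what keeps the later steps small.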

get_job_metadata(workspace_namespace, workspace_name, submission_id, workflow_id, mode="summary")

The summary mode returns a context-efficient view that includes a failed_tasks array.

What to look for: The failed_tasks array contains task names, shard indices, error messages, and log URLs.
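
One possible way to condense the failed_tasks array into a line per task is sketched below. The key names (`task_name`, `shard_index`, `error_message`, `stderr_url`) are guesses based on the description above, and `summarize_failed_tasks` is a hypothetical helper, so adjust both to the real response.

```python
def summarize_failed_tasks(metadata: dict, max_error_chars: int = 200) -> list[str]:
    """Build a short one-line summary for each failed task.

    Key names are assumed, not taken from the tool's schema.
    """
    lines = []
    for task in metadata.get("failed_tasks", []):
        shard = task.get("shard_index")
        # Show the shard index only for scattered tasks.
        name = task["task_name"] if shard is None else f'{task["task_name"]}[{shard}]'
        # Cap the error text so one noisy task cannot flood the summary.
        error = task.get("error_message", "")[:max_error_chars]
        lines.append(f"{name}: {error} (stderr: {task.get('stderr_url', 'n/a')})")
    return lines


# Hypothetical metadata, shaped only for this example.
example_metadata = {
    "failed_tasks": [
        {
            "task_name": "align",
            "shard_index": 2,
            "error_message": "exit code 137",
            "stderr_url": "gs://bucket/align/shard-2/stderr",
        }
    ]
}

for line in summarize_failed_tasks(example_metadata):
    print(line)
```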

get_workflow_logs(workspace_namespace, workspace_name, submission_id, workflow_id, fetch_content=False)

Returns GCS URLs for stderr/stdout without fetching content. Use this to identify which logs exist before deciding what to fetch.

Only fetch logs for tasks you need to debug:

get_workflow_logs(workspace_namespace, workspace_name, submission_id, workflow_id, fetch_content=True)

Logs are automatically truncated (first 5K + last 20K chars) to preserve error messages while staying context-efficient.
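
To make the truncation rule concrete, here is a minimal sketch of head-plus-tail truncation using the 5K and 20K figures above. It illustrates the idea only and is not the tool's actual implementation; the function name and the wording of the omission marker are made up.

```python
def truncate_log(text: str, head: int = 5_000, tail: int = 20_000) -> str:
    """Keep the first `head` and last `tail` characters of a log.

    Logs short enough to fit are returned unchanged.
    """
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    return f"{text[:head]}\n... [{omitted} chars omitted] ...\n{text[-tail:]}"


# A 75,000-character log keeps its start and end; the middle is dropped.
long_log = "A" * 5_000 + "B" * 50_000 + "C" * 20_000
truncated = truncate_log(long_log)
```

The tail gets the larger share because error messages and stack traces usually appear at the end of a log.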

If stderr logs don't explain the failure, check for infrastructure issues:

get_batch_job_status(workspace_namespace, workspace_name, submission_id, workflow_id, task_name)

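Signs you need this tool: the stderr logs do not explain the failure.

What this tool detects: out-of-memory (OOM) kills, Docker pull failures and rate limits, and preemption.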

• Symptoms: Exit code 137, "Killed" in logs, or OOM detected by get_batch_job_status
• Fix: Increase memory in the WDL runtime section, for example memory: "32 GB"

• Symptoms: get_batch_job_status shows docker_pull_failure or docker_pull_rate_limit
• Fix: Use authenticated pulls, switch to a private registry, or wait and retry

• Symptoms: get_batch_job_status shows a preemption issue
• Note: The task will be retried automatically. Consider non-preemptible VMs for time-sensitive work.
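
The symptom-to-cause mapping above can be sketched as a small classifier. This is illustrative only: `classify_failure` is a hypothetical helper, and the flag names `"oom"` and `"preemption"` are assumed stand-ins for however get_batch_job_status reports those conditions. Only `docker_pull_failure` and `docker_pull_rate_limit` come from the text.

```python
from typing import Optional


def classify_failure(exit_code: Optional[int], log_text: str, batch_flags: set) -> str:
    """Map observed symptoms to one of the failure patterns above.

    batch_flags is a set of issue names; "oom" and "preemption"
    are assumed names, not confirmed tool output.
    """
    if exit_code == 137 or "Killed" in log_text or "oom" in batch_flags:
        return "out_of_memory"  # fix: raise `memory` in the WDL runtime section
    if batch_flags & {"docker_pull_failure", "docker_pull_rate_limit"}:
        return "docker_pull"  # fix: authenticated pulls, private registry, or retry
    if "preemption" in batch_flags:
        return "preemption"  # retried automatically
    return "unknown"


print(classify_failure(137, "", set()))
print(classify_failure(1, "", {"docker_pull_rate_limit"}))
```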

When a submission has many failures, group them by pattern:
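
One way to do this grouping, sketched under assumptions: normalize each error message by masking numbers and GCS paths, then bucket tasks by task name and normalized message. The key names (`task_name`, `shard_index`, `error_message`) and both helper functions are hypothetical.

```python
import re
from collections import defaultdict


def normalize(message: str) -> str:
    """Mask details that vary between shards so similar errors compare equal."""
    message = re.sub(r"gs://\S+", "<gcs-path>", message)
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def group_failures(failed_tasks: list) -> dict:
    """Group shard indices by (task name, normalized error message)."""
    groups = defaultdict(list)
    for task in failed_tasks:
        key = (task["task_name"], normalize(task["error_message"]))
        groups[key].append(task.get("shard_index"))
    return dict(groups)


# Hypothetical failures, shaped only for this example.
example_failures = [
    {"task_name": "align", "shard_index": 0, "error_message": "Shard 0 failed: exit code 137"},
    {"task_name": "align", "shard_index": 3, "error_message": "Shard 3 failed: exit code 137"},
    {"task_name": "call", "shard_index": None, "error_message": "File gs://bucket/a.bam not found"},
]

for (task_name, pattern), shards in group_failures(example_failures).items():
    print(f"{task_name}: {pattern} (shards: {shards})")
```

Once failures are grouped, it is usually enough to fetch logs for one representative task per group.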

Never do this:

Do this instead: