Fix Buildkite CI
Overview
Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.
Target Selection
Resolve triage target with this precedence:
- •If user provides a Buildkite build URL, use that build directly.
- •Else if user specifies a branch and/or a pipeline (for example
pull-request,main-cron), use the specified scope. - •Else default to the current git branch and inspect the checks for the PR associated with that branch.
Workflow
- •Identify the failing Buildkite build(s).
- •Retrieve build JSON and list failed jobs.
- •Pull job logs and extract the first concrete failure signal.
- •Inspect artifacts when top-level logs are truncated.
- •Map failure to root cause and apply a focused fix.
- •Verify locally where feasible and summarize evidence.
Use bk CLI first. If auth is unavailable, use public Buildkite JSON/log/artifact endpoints via curl.
For exact commands and endpoint patterns, read references/buildkite-ci-triage.md.
Step 1: Identify Failing Buildkite Checks
When no explicit target is given, find the PR for the current branch first, then run gh pr checks <PR_NUMBER> to find failing checks and capture Buildkite URLs (.../builds/<N>).
If user specifies a branch/pipeline, list and filter builds with bk build list using those parameters.
If user provides a Buildkite build URL, skip discovery and start from that build number.
Step 2: Pull Build JSON and Failed Jobs
Fetch builds/<N>.json, then list failed jobs by non-zero exit_status.
Capture at least:
- •pipeline
- •build number
- •job id
- •job name
- •exit status
Step 3: Extract the Concrete Failure
Fetch each failed job log and search for high-signal patterns:
- •
query result mismatch - •
[Diff] (-expected|+actual) - •
query is expected to fail with error: - •panic/assertion lines
- •deterministic simulation error markers
- •OOM/timeout/cancellation markers
Stop once you have one concrete failing file/case and mismatch.
Step 4: Fall Back to Artifacts
If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:
- •
risedev-logs.zip - •
risedev-logs/nodetype-*.log
Extract and search artifact logs for the exact mismatch.
Step 5: Apply Focused Fixes
Prefer minimal fixes tied to evidence:
- •SQLLogicTest mismatch: update expected sections in the correct
.slt/.slt.partfile only when query output change is intentional. - •Wrong runtime behavior: fix source code and keep tests as-is.
- •Flaky/cancellation-only signal (
143): treat as infra/cancel unless corroborated by product errors.
Avoid broad "retry and hope" actions without root-cause evidence.
Step 6: Verify and Report
Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.
Always report:
- •failing check/build/job identifiers
- •failing file/test/case
- •exact mismatch/error evidence
- •applied fix (files changed)
- •verification status and remaining risk
Buildkite-Specific Heuristics
- •Exit code
105: often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch. - •Exit code
4: common in simulation/recovery steps; inspect uploaded simulation logs. - •Exit code
143: usually cancellation/termination, not a deterministic product regression. - •
raw_log_urlmay be null in JSON; use explicit job log endpoints by job id. - •Prefer JSON endpoints plus
jq; avoid scraping large HTML pages.