Fix Buildkite CI

Overview

Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.

Target Selection

Resolve triage target with this precedence:

•If user provides a Buildkite build URL, use that build directly.
•Else if user specifies a branch and/or a pipeline (for example pull-request, main-cron), use the specified scope.
•Else default to the current git branch and inspect the checks for the PR associated with that branch.

Workflow

•Identify the failing Buildkite build(s).
•Retrieve build JSON and list failed jobs.
•Pull job logs and extract the first concrete failure signal.
•Inspect artifacts when top-level logs are truncated.
•Map failure to root cause and apply a focused fix.
•Verify locally where feasible and summarize evidence.

Use bk CLI first. If auth is unavailable, use public Buildkite JSON/log/artifact endpoints via curl.

For exact commands and endpoint patterns, read references/buildkite-ci-triage.md.

Step 1: Identify Failing Buildkite Checks

When no explicit target is given, find the PR for the current branch first, then run gh pr checks <PR_NUMBER> to find failing checks and capture Buildkite URLs (.../builds/<N>).

If user specifies a branch/pipeline, list and filter builds with bk build list using those parameters. If user provides a Buildkite build URL, skip discovery and start from that build number.

Step 2: Pull Build JSON and Failed Jobs

Fetch builds/<N>.json, then list failed jobs by non-zero exit_status.

Capture at least:

•pipeline
•build number
•job id
•job name
•exit status

Step 3: Extract the Concrete Failure

Fetch each failed job log and search for high-signal patterns:

•query result mismatch
•[Diff] (-expected|+actual)
•query is expected to fail with error:
•panic/assertion lines
•deterministic simulation error markers
•OOM/timeout/cancellation markers

Stop once you have one concrete failing file/case and mismatch.

Step 4: Fall Back to Artifacts

If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:

•risedev-logs.zip
•risedev-logs/nodetype-*.log

Extract and search artifact logs for the exact mismatch.

Step 5: Apply Focused Fixes

Prefer minimal fixes tied to evidence:

•SQLLogicTest mismatch: update expected sections in the correct .slt/.slt.part file only when query output change is intentional.
•Wrong runtime behavior: fix source code and keep tests as-is.
•Flaky/cancellation-only signal (143): treat as infra/cancel unless corroborated by product errors.

Avoid broad "retry and hope" actions without root-cause evidence.

Step 6: Verify and Report

Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.

Always report:

•failing check/build/job identifiers
•failing file/test/case
•exact mismatch/error evidence
•applied fix (files changed)
•verification status and remaining risk

Buildkite-Specific Heuristics

•Exit code 105: often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch.
•Exit code 4: common in simulation/recovery steps; inspect uploaded simulation logs.
•Exit code 143: usually cancellation/termination, not a deterministic product regression.
•raw_log_url may be null in JSON; use explicit job log endpoints by job id.
•Prefer JSON endpoints plus jq; avoid scraping large HTML pages.