AgentSkillsCN

bisect-triage

通过二分法提交记录,对失败的 FBOSS 测试进行故障排除,并深入探究代码变更。适用于测试中断且需要定位导致故障的提交记录时使用。

SKILL.md
--- frontmatter
name: bisect-triage
description: Triage a failing FBOSS test by bisecting commits and investigating code changes. Use when a test is broken and you need to find the commit that caused the failure.
argument-hint: "[test-regex] [bad-rev] [good-rev]"
disable-model-invocation: true
user-invocable: true

FBOSS Test Failure Triage

Triage a test failure by running two parallel threads:

  1. Bisect — Use bisect2 to find the exact breaking commit
  2. Code investigation — Search recent commits and test code to identify likely culprits

Output Format

IMPORTANT: Always present the bisect2 command first at the top of your output so the user can kick it off immediately while you continue investigating. The bisect takes a long time to run, so getting it started early saves significant time.

Structure your output as:

  1. Bisect command (first thing shown, ready to copy-paste)
  2. Code investigation findings (while bisect runs)
  3. Log analysis (if log file provided)
  4. Full report with culprit analysis, failure modes, and next steps

Information Gathering

Before starting, collect the following from the user. Ask for any missing pieces:

  • Test regex: Pattern matching the failing test name(s) (for --regex)
  • Bad revision: Commit hash where the test is known to fail
  • Good revision: Commit hash where the test last passed
  • Netcastle team: e.g., bcm_agent_test, sai_agent_test (default: bcm_agent_test)
  • Test config: e.g., tomahawk3_alpm/6.5.26_6.5.26
  • Buck mode: e.g., opt-asan (default: opt-asan)
  • Netcastle node name: The conveyor node name if available (for context)
  • Log file: If the user has a netcastle log file, read it for failure details (see Log Analysis below)

If the user provides arguments via the slash command, parse them as: /bisect-triage <test-regex> <bad-rev> <good-rev>

Thread 1: Bisect

Present this command to the user FIRST before doing any investigation.

bisect2 Tool

The bisect2 tool is at:

code
buck2 run fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper bisect2

CRITICAL: Must be run from fbsource/ root (NOT fbcode/). The --bisect_dirs flag uses repo-root-relative paths (e.g., fbcode/fboss), and the internal hg log command silently returns zero commits if the path doesn't exist relative to CWD.

Required positional args:

  • GOOD_COMMIT — commit hash where the test passes
  • BAD_COMMIT — commit hash where the test fails

Key options:

OptionDescription
--netcastle_cmd "<cmd>"Netcastle command string to use as the test oracle
--job NN-ary parallel bisect (N>1 requires --use_sandcastle)
--use_sandcastleRun builds/tests remotely via Sandcastle
--retry NRetry UNKNOWN results (Sandcastle only)
--bisect_dirs "dir1 dir2"Scope commit history to specific directories (must start with fbcode/)

Validation rules:

  • Exactly one of --fbpkg, --buck_target, or --netcastle_cmd must be specified
  • --job > 1 requires --use_sandcastle

Assembling the bisect2 command

Construct the command like this:

bash
cd ~/fbsource  # MUST be at repo root
buck2 run fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper bisect2 \
  <GOOD_COMMIT> <BAD_COMMIT> \
  --use_sandcastle \
  --netcastle_cmd "netcastle --team <TEAM> --jobs 12 --test-config <TEST_CONFIG> --buck-mode <BUCK_MODE> --regex '<TEST_REGEX>'" \
  --bisect_dirs "fbcode/fboss"

Suggest --job 5 for faster parallel bisect when using Sandcastle.

Quick Verification (skip full bisect)

If code investigation has identified a likely culprit, determine which type of breakage it is:

Type 1: True regression — tests that previously passed now fail. Use bisect2 with just the suspect and parent:

bash
cd ~/fbsource  # MUST be at repo root

# Get the parent of the suspect commit (use full hash, not short)
sl log --rev 'parents(<SUSPECT_COMMIT>)' -T '{node}\n'

# Run bisect2 with the narrow 2-commit range
buck2 run fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper bisect2 \
  <PARENT_COMMIT> <SUSPECT_COMMIT> \
  --use_sandcastle \
  --netcastle_cmd "netcastle --team <TEAM> --jobs 12 --test-config <TEST_CONFIG> --buck-mode <BUCK_MODE> --regex '<TEST_REGEX>' --skip-filtering-by-test-state --run-disabled"

Type 2: Newly enabled tests that are broken on arrival — a commit enables tests on a new platform but they fail. Bisecting against the parent won't work because the tests didn't exist before. Instead, just reproduce the failure at the current revision:

bash
netcastle --team <TEAM> --jobs 12 \
  --test-config <TEST_CONFIG> --buck-mode <BUCK_MODE> \
  --basset-query <BASSET_QUERY> \
  --regex '<TEST_REGEX>' \
  --skip-filtering-by-test-state --run-disabled

The proof for Type 2 is:

  1. The test enablement code was added in the suspect commit (read the diff to confirm)
  2. The failure reproduces at the current revision
  3. Optionally, the commit's own test plan already shows failures

Netcastle arguments reference

These optional netcastle arguments may be relevant depending on the use case:

ArgumentPurpose
--update-agent-fbpkg-info <pkg>Logs the agent fbpkg version to Scuba (metadata only, no behavior change)
--run-disabledInclude tests disabled by Test Console
--basset-query <query>Target specific lab devices, e.g., fboss.kernel.legacy/asic=tomahawk3
--skip-filtering-by-test-stateSkip Test Console state filtering, run all tests
--regex <pattern>Filter tests by regex (maps to --gtest_filter for GTest). Multiple --regex flags act as AND
--purpose <purpose>Test purpose context, e.g., continuous (for conveyor runs)

Conveyor pipeline runs typically use all of these flags together:

bash
netcastle --team bcm_agent_test --purpose continuous --jobs 12 \
  --test-config <CONFIG> --buck-mode opt-asan \
  --update-agent-fbpkg-info <FBPKG_ID> \
  --run-disabled --basset-query <BASSET_QUERY> \
  --skip-filtering-by-test-state

Thread 2: Code Investigation

While bisect runs, investigate in parallel:

  1. List commits in the range touching fboss:

    bash
    sl log --rev '<good>::<bad> & file("fbcode/fboss/**")' -T '{node|short} {date|isodate} {desc|firstline}\n' -l 100
    
  2. Search for keyword-relevant commits using keywords from the failing test names (e.g., ecmp, spillover, warmboot, resource):

    bash
    sl log --rev '<good>::<bad> & file("fbcode/fboss/**")' -T '{node|short} {date|isodate} {desc|firstline}\n' --keyword <keyword>
    
  3. Read the failing test source code to understand what the test verifies. Use search_files MCP tool to find test class definitions.

  4. Check if the test code itself was modified in the commit range.

  5. Look at the diff of suspicious commits:

    bash
    sl diff --rev 'parents(<commit>) & ancestors(<commit>)' --rev <commit>
    
  6. Check the Phabricator diff for the suspicious commit — look at the test plan to see if the author noticed failures.

Known Bad / Unsupported Test Config

Check whether the failing tests are already marked as known bad or unsupported in configerator. The config files are in the user's configerator checkout (typically ~/configerator/).

Config file locations

TeamConfig File
bcm_agent_test~/configerator/source/neteng/netcastle/test_config/bcm_agent_test.cinc
sai_agent_test~/configerator/source/neteng/netcastle/test_config/sai_agent_test.cinc

Structure

Each config file defines:

  • KnownBadTestPattern — tests known to fail (skipped in results, regex matched against test name)
  • UnsupportedTestPattern — tests not supported on a platform (never run)

Tests are organized per platform config. For example in bcm_agent_test.cinc:

python
bcm_common_known_bad_tests = [...]          # Shared across all BCM platforms
bcm_th3_known_bad_tests = [...]             # TH3-specific known bads
bcm_th4_known_bad_tests = [...]             # TH4-specific known bads

bcm_agent_test_known_bad_tests_map = {
    "tomahawk3_alpm/6.5.26_6.5.26": bcm_common_known_bad_tests + bcm_th3_known_bad_tests,
    "tomahawk4_alpm/6.5.26_6.5.26": bcm_common_known_bad_tests + bcm_th4_known_bad_tests,
}

For sai_agent_test.cinc, the structure is similar but with per-SDK and per-ASIC-family lists.

How to check

Search the config for the failing test name:

bash
grep -i '<TEST_CLASS_NAME>' ~/configerator/source/neteng/netcastle/test_config/<TEAM>.cinc

If the test is not listed, it's expected to pass. If it's failing and not in known bad, it's a real regression that needs to be fixed or added to known bad.

Adding to known bad

To add a test to known bad, edit the appropriate list in the .cinc file. Use a regex pattern that matches the test name:

python
bcm_th3_known_bad_tests = [
    KnownBadTestPattern(test_name_regex="{}$".format(test_group))
    for test_group in [
        # ... existing entries ...
        "AgentEcmpSpilloverTest.*",  # TODO(author): fix TH3 spillover tests, D92924806
    ]
]

Then submit the configerator change via the normal diff workflow. Always include a TODO with the author and context so it gets fixed rather than forgotten.

Log Analysis

When the user provides a netcastle log file, analyze it for failure clues. The log is typically large (1MB+), so use Grep rather than reading the entire file.

Key patterns to search for in logs

1. Find all FAILED and PASSED tests:

code
grep FAILED <logfile>
grep PASSED <logfile>

2. Identify the failure mode — check which type of failure:

  • exit-code exited 1 → Test assertion failure (GTest EXPECT/ASSERT)
  • signal killed ABRT → Process crash, usually a LOG(FATAL) or ASan error
  • timeout → Test hung or exceeded time limit

3. Find assertion details:

code
grep -A 5 "Check failed\|EXPECT_EQ\|ASSERT_EQ\|Expected equality" <logfile>
grep -B 5 "ABORTING" <logfile>

4. Check for ASan (AddressSanitizer) errors:

code
grep -A 15 "ERROR: AddressSanitizer" <logfile>
grep "heap-buffer-overflow\|use-after-free\|stack-buffer-overflow" <logfile>

5. Find crash stack traces — look for LOG(FATAL) abort paths:

code
grep -A 10 "SwSwitchInitializer::initThread\|folly::LogStreamVoidify" <logfile>

6. Extract the runner command (shows exact flags used):

code
grep "Runner CMD:" <logfile>

7. Get the fbcode hash (confirms the revision under test):

code
grep "Fbcode hash:" <logfile>

8. Check for core dumps (useful for further debugging):

code
grep -A 2 "Core dumps:" <logfile>

9. Find everpaste links for full stdout/stderr (log file only shows last N lines):

code
grep "everpaste.*handle=" <logfile>

Common failure patterns

PatternTypical Cause
warm_boot fails but cold_boot passes for same testWarm boot initialization or state replay issue
All tests in a test class failTest class setup/config issue (e.g., wrong switching mode, wrong resource limits)
SwSwitchInitializer::initThread in crash stackFatal error during switch warm boot initialization
ASan ABORTING with no ASan errorUsually a LOG(FATAL) / CHECK() failure, not a memory error — the ASan legend in output is from the binary being built with ASan, not from an ASan detection
Tests that were previously not running now failNewly enabled tests that weren't validated on this platform

Reporting

Present findings as:

  1. Most likely culprit commit with diff number, title, author, and evidence
  2. What the commit changed and why it likely causes the failure
  3. Failure mode — assertion failure vs crash vs timeout, with key error messages
  4. Cold boot vs warm boot pattern — which boot modes are affected
  5. The bisect2 command ready to run for confirmation
  6. Suggested next steps (talk to author, revert, fix forward, etc.)