FBOSS Test Failure Triage

Triage a test failure by running two parallel threads:

•Bisect — Use bisect2 to find the exact breaking commit
•Code investigation — Search recent commits and test code to identify likely culprits

Output Format

IMPORTANT: Always present the bisect2 command at the very top of your output so the user can start it immediately while you continue investigating. Bisect runs are slow, so starting early lets the bisect and the code investigation overlap.

Structure your output as:

•Bisect command (first thing shown, ready to copy-paste)
•Code investigation findings (while bisect runs)
•Log analysis (if log file provided)
•Full report with culprit analysis, failure modes, and next steps

Information Gathering

Before starting, collect the following from the user. Ask for any missing pieces:

•Test regex: Pattern matching the failing test name(s) (for --regex)
•Bad revision: Commit hash where the test is known to fail
•Good revision: Commit hash where the test last passed
•Netcastle team: e.g., bcm_agent_test, sai_agent_test (default: bcm_agent_test)
•Test config: e.g., tomahawk3_alpm/6.5.26_6.5.26
•Buck mode: e.g., opt-asan (default: opt-asan)
•Netcastle node name: The conveyor node name if available (for context)
•Log file: If the user has a netcastle log file, read it for failure details (see Log Analysis below)

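If the user provides arguments via the slash command, parse them as: /bisect-triage <test-regex> <bad-rev> <good-rev>. Note that the slash command takes the bad revision before the good one, whereas bisect2 takes GOOD_COMMIT before BAD_COMMIT, so swap the order when assembling the bisect2 command.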
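The parsing step above can be sketched as follows. This is a minimal illustration, not the actual slash-command implementation: the `TriageInputs` class and `parse_slash_command` function are hypothetical names, and only the two defaults stated in the list above are filled in.

```python
import shlex
from dataclasses import dataclass


@dataclass
class TriageInputs:
    """Hypothetical container for the inputs listed above."""
    test_regex: str
    bad_rev: str
    good_rev: str
    team: str = "bcm_agent_test"  # default stated in the list above
    buck_mode: str = "opt-asan"   # default stated in the list above


def parse_slash_command(text):
    """Parse '/bisect-triage <test-regex> <bad-rev> <good-rev>'."""
    # shlex.split honors shell-style quoting, so a quoted regex stays one token.
    parts = shlex.split(text)
    if len(parts) != 4 or parts[0] != "/bisect-triage":
        raise ValueError(
            "expected: /bisect-triage <test-regex> <bad-rev> <good-rev>"
        )
    _, test_regex, bad_rev, good_rev = parts
    return TriageInputs(test_regex, bad_rev, good_rev)
```

Anything the slash command does not carry (test config, node name, log file) would still need to be requested from the user.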

Thread 1: Bisect

Present this command to the user FIRST before doing any investigation.

bisect2 Tool

The bisect2 tool is at:

code

buck2 run fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper bisect2

CRITICAL: Must be run from fbsource/ root (NOT fbcode/). The --bisect_dirs flag uses repo-root-relative paths (e.g., fbcode/fboss), and the internal hg log command silently returns zero commits if the path doesn't exist relative to CWD.

Required positional args:

•GOOD_COMMIT — commit hash where the test passes
•BAD_COMMIT — commit hash where the test fails

Key options:

Option	Description
`--netcastle_cmd "<cmd>"`	Netcastle command string to use as the test oracle
`--job N`	N-ary parallel bisect (N>1 requires `--use_sandcastle`)
`--use_sandcastle`	Run builds/tests remotely via Sandcastle
`--retry N`	Retry UNKNOWN results (Sandcastle only)
`--bisect_dirs "dir1 dir2"`	Scope commit history to specific directories (must start with `fbcode/`)

Validation rules:

•Exactly one of --fbpkg, --buck_target, or --netcastle_cmd must be specified
•--job > 1 requires --use_sandcastle
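As a rough pre-flight check, the two validation rules above could be expressed as shown below. The function name and its keyword arguments are hypothetical; bisect2 performs its own validation, and this sketch only mirrors the rules as stated here.

```python
def validate_bisect2_options(fbpkg=None, buck_target=None, netcastle_cmd=None,
                             job=1, use_sandcastle=False):
    """Return a list of rule violations; an empty list means the options look valid."""
    errors = []
    # Rule 1: exactly one test oracle must be chosen.
    oracles = [o for o in (fbpkg, buck_target, netcastle_cmd) if o]
    if len(oracles) != 1:
        errors.append(
            "exactly one of --fbpkg, --buck_target, --netcastle_cmd is required"
        )
    # Rule 2: parallel (N-ary) bisect only works remotely.
    if job > 1 and not use_sandcastle:
        errors.append("--job > 1 requires --use_sandcastle")
    return errors
```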

Assembling the bisect2 command

Construct the command like this:

bash

cd ~/fbsource  # MUST be at repo root
buck2 run fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper bisect2 \
  <GOOD_COMMIT> <BAD_COMMIT> \
  --use_sandcastle \
  --netcastle_cmd "netcastle --team <TEAM> --jobs 12 --test-config <TEST_CONFIG> --buck-mode <BUCK_MODE> --regex '<TEST_REGEX>'" \
  --bisect_dirs "fbcode/fboss"

Suggest --job 5 for faster parallel bisect when using Sandcastle.
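One way to assemble the command programmatically is sketched below. The helper name `build_bisect2_command` is hypothetical; the flags and the target path are the ones shown above. It uses `shlex.join` so that the nested netcastle command survives shell quoting, which means the output is quoted differently from the hand-written form but should parse to the same arguments in a POSIX shell.

```python
import shlex


def build_bisect2_command(good, bad, team, test_config, buck_mode, test_regex,
                          job=None, bisect_dirs="fbcode/fboss"):
    """Build the bisect2 command line as one copy-pasteable string."""
    # Inner command: the netcastle invocation used as the test oracle.
    netcastle_argv = [
        "netcastle", "--team", team, "--jobs", "12",
        "--test-config", test_config, "--buck-mode", buck_mode,
        "--regex", test_regex,
    ]
    # Outer command: bisect2 takes GOOD before BAD.
    argv = [
        "buck2", "run",
        "fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper", "bisect2",
        good, bad,
        "--use_sandcastle",
        "--netcastle_cmd", shlex.join(netcastle_argv),
        "--bisect_dirs", bisect_dirs,
    ]
    if job is not None:
        argv += ["--job", str(job)]
    return shlex.join(argv)
```

The returned string still has to be run from the fbsource root, as noted above.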

Quick Verification (skip full bisect)

If code investigation has identified a likely culprit, determine which type of breakage it is:

Type 1: True regression — tests that previously passed now fail. Use bisect2 with just the suspect and parent:

bash

cd ~/fbsource  # MUST be at repo root

# Get the parent of the suspect commit (use full hash, not short)
sl log --rev 'parents(<SUSPECT_COMMIT>)' -T '{node}\n'

# Run bisect2 with the narrow 2-commit range
buck2 run fbcode//fboss/util/facebook/devdbgtest:fboss-dev-helper bisect2 \
  <PARENT_COMMIT> <SUSPECT_COMMIT> \
  --use_sandcastle \
  --netcastle_cmd "netcastle --team <TEAM> --jobs 12 --test-config <TEST_CONFIG> --buck-mode <BUCK_MODE> --regex '<TEST_REGEX>' --skip-filtering-by-test-state --run-disabled"

Type 2: Newly enabled tests that are broken on arrival. A commit enables tests on a new platform, but they fail there. Bisecting against the parent won't work because the tests did not run on that platform before the commit, so there is no passing baseline. Instead, just reproduce the failure at the current revision:

bash

netcastle --team <TEAM> --jobs 12 \
  --test-config <TEST_CONFIG> --buck-mode <BUCK_MODE> \
  --basset-query <BASSET_QUERY> \
  --regex '<TEST_REGEX>' \
  --skip-filtering-by-test-state --run-disabled

The proof for Type 2 is:

•The test enablement code was added in the suspect commit (read the diff to confirm)
•The failure reproduces at the current revision
•Optionally, the commit's own test plan already shows failures

Netcastle arguments reference

These optional netcastle arguments may be relevant depending on the use case:

Argument	Purpose
`--update-agent-fbpkg-info <pkg>`	Logs the agent fbpkg version to Scuba (metadata only, no behavior change)
`--run-disabled`	Include tests disabled by Test Console
`--basset-query <query>`	Target specific lab devices, e.g., `fboss.kernel.legacy/asic=tomahawk3`
`--skip-filtering-by-test-state`	Skip Test Console state filtering, run all tests
`--regex <pattern>`	Filter tests by regex (maps to `--gtest_filter` for GTest). Multiple `--regex` flags act as AND
`--purpose <purpose>`	Test purpose context, e.g., `continuous` (for conveyor runs)

Conveyor pipeline runs typically combine these flags (all of the above except --regex):

bash

netcastle --team bcm_agent_test --purpose continuous --jobs 12 \
  --test-config <CONFIG> --buck-mode opt-asan \
  --update-agent-fbpkg-info <FBPKG_ID> \
  --run-disabled --basset-query <BASSET_QUERY> \
  --skip-filtering-by-test-state

Thread 2: Code Investigation

While bisect runs, investigate in parallel:

•List commits in the range touching fboss:

bash

sl log --rev '<good>::<bad> & file("fbcode/fboss/**")' -T '{node|short} {date|isodate} {desc|firstline}\n' -l 100

•Search for keyword-relevant commits using keywords from the failing test names (e.g., ecmp, spillover, warmboot, resource):

bash

sl log --rev '<good>::<bad> & file("fbcode/fboss/**")' -T '{node|short} {date|isodate} {desc|firstline}\n' --keyword <keyword>

•Read the failing test source code to understand what the test verifies. Use the search_files MCP tool to find test class definitions.
•Check whether the test code itself was modified in the commit range.
•Look at the diff of suspicious commits:

bash

sl diff --rev 'parents(<commit>) & ancestors(<commit>)' --rev <commit>

•Check the Phabricator diff for the suspicious commit and look at the test plan to see whether the author noticed failures.
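For the keyword search step above, candidate keywords can be derived mechanically from a CamelCase test name. The sketch below is illustrative only: the function name and the list of generic words to drop are assumptions, and the resulting keywords should still be reviewed by hand before searching.

```python
import re

# Words too generic to be useful search terms (an illustrative, assumed list).
GENERIC_WORDS = {"agent", "test", "tests", "hw", "verify"}


def keywords_from_test_name(test_name):
    """Split a CamelCase GTest name into lowercase keyword candidates."""
    # Matches capitalized words, runs of capitals (acronyms), and lowercase runs.
    words = re.findall(r"[A-Z][a-z0-9]+|[A-Z]+(?![a-z])|[a-z0-9]+", test_name)
    seen, keywords = set(), []
    for word in (w.lower() for w in words):
        if word not in GENERIC_WORDS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords
```

Each returned keyword can then be passed to a separate `sl log ... --keyword <keyword>` invocation.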

Known Bad / Unsupported Test Config

Check whether the failing tests are already marked as known bad or unsupported in configerator. The config files are in the user's configerator checkout (typically ~/configerator/).

Config file locations

Team	Config File
`bcm_agent_test`	`~/configerator/source/neteng/netcastle/test_config/bcm_agent_test.cinc`
`sai_agent_test`	`~/configerator/source/neteng/netcastle/test_config/sai_agent_test.cinc`

Structure

Each config file defines:

•KnownBadTestPattern — tests known to fail (skipped in results, regex matched against test name)
•UnsupportedTestPattern — tests not supported on a platform (never run)

Tests are organized per platform config. For example in bcm_agent_test.cinc:

python

bcm_common_known_bad_tests = [...]          # Shared across all BCM platforms
bcm_th3_known_bad_tests = [...]             # TH3-specific known bads
bcm_th4_known_bad_tests = [...]             # TH4-specific known bads

bcm_agent_test_known_bad_tests_map = {
    "tomahawk3_alpm/6.5.26_6.5.26": bcm_common_known_bad_tests + bcm_th3_known_bad_tests,
    "tomahawk4_alpm/6.5.26_6.5.26": bcm_common_known_bad_tests + bcm_th4_known_bad_tests,
}

For sai_agent_test.cinc, the structure is similar but with per-SDK and per-ASIC-family lists.

How to check

Search the config for the failing test name:

bash

grep -i '<TEST_CLASS_NAME>' ~/configerator/source/neteng/netcastle/test_config/<TEAM>.cinc

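If the test is not listed, it is expected to pass, so a failure is a real regression that must either be fixed or added to known bad. Because entries are regexes matched against the test name, also search for broader patterns (for example, a shared prefix of the test class name) that could cover the test without naming it exactly.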

Adding to known bad

To add a test to known bad, edit the appropriate list in the .cinc file. Use a regex pattern that matches the test name:

python

bcm_th3_known_bad_tests = [
    KnownBadTestPattern(test_name_regex="{}$".format(test_group))
    for test_group in [
        # ... existing entries ...
        "AgentEcmpSpilloverTest.*",  # TODO(author): fix TH3 spillover tests, D92924806
    ]
]

Then submit the configerator change via the normal diff workflow. Always include a TODO with the author and context so it gets fixed rather than forgotten.
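To see which known-bad entries would cover a given test, the patterns can be evaluated as regexes rather than searched as plain text. The sketch below assumes the `"{}$".format(test_group)` convention shown above and assumes matching is anchored at the start of the test name (`re.match`); the actual netcastle matcher may behave differently. The function name and the example groups other than `AgentEcmpSpilloverTest.*` are hypothetical.

```python
import re


def known_bad_matches(test_name, test_groups):
    """Return the known-bad groups whose '<group>$' regex matches test_name."""
    return [
        group for group in test_groups
        if re.match("{}$".format(group), test_name)
    ]


# Hypothetical entries: the second one covers the test without naming its class exactly.
groups = ["AgentEcmpSpilloverTest.*", "Agent.*SpilloverTest.*", "AgentFooTest.*"]
covered_by = known_bad_matches("AgentEcmpSpilloverTest.verifySpillover", groups)
```

An empty result suggests the test is not covered by any known-bad entry in the list that was checked.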

Log Analysis

When the user provides a netcastle log file, analyze it for failure clues. The log is typically large (1MB+), so use Grep rather than reading the entire file.

Key patterns to search for in logs

1. Find all FAILED and PASSED tests:

code

grep FAILED <logfile>
grep PASSED <logfile>

2. Identify the failure mode — check which type of failure:

•exit-code exited 1 → Test assertion failure (GTest EXPECT/ASSERT)
•signal killed ABRT → Process crash, usually a LOG(FATAL) or ASan error
•timeout → Test hung or exceeded time limit

3. Find assertion details:

code

grep -A 5 "Check failed\|EXPECT_EQ\|ASSERT_EQ\|Expected equality" <logfile>
grep -B 5 "ABORTING" <logfile>

4. Check for ASan (AddressSanitizer) errors:

code

grep -A 15 "ERROR: AddressSanitizer" <logfile>
grep "heap-buffer-overflow\|use-after-free\|stack-buffer-overflow" <logfile>

5. Find crash stack traces — look for LOG(FATAL) abort paths:

code

grep -A 10 "SwSwitchInitializer::initThread\|folly::LogStreamVoidify" <logfile>

6. Extract the runner command (shows exact flags used):

code

grep "Runner CMD:" <logfile>

7. Get the fbcode hash (confirms the revision under test):

code

grep "Fbcode hash:" <logfile>

8. Check for core dumps (useful for further debugging):

code

grep -A 2 "Core dumps:" <logfile>

9. Find everpaste links for full stdout/stderr (log file only shows last N lines):

code

grep "everpaste.*handle=" <logfile>
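The searches above can also be combined into a single streaming pass over the log, which avoids loading a large file into memory. This is a rough sketch: the marker strings are the ones listed above, but the exact log line layout is an assumption, and the plain substring check for `timeout` is coarse and may over-count. The function name and summary keys are hypothetical.

```python
import re
from collections import Counter

# Failure-mode markers taken from step 2 above.
FAILURE_MARKERS = {
    "exit-code exited 1": "assertion failure",
    "signal killed ABRT": "crash",
    "timeout": "timeout",
}


def summarize_log(lines):
    """Scan log lines once and collect the triage fields described above."""
    summary = {
        "failed": 0, "passed": 0, "failure_modes": Counter(),
        "runner_cmd": None, "fbcode_hash": None,
        "everpaste": [], "asan_error": False,
    }
    for line in lines:
        line = line.rstrip("\n")
        if "FAILED" in line:
            summary["failed"] += 1
        if "PASSED" in line:
            summary["passed"] += 1
        for marker, mode in FAILURE_MARKERS.items():
            if marker in line:
                summary["failure_modes"][mode] += 1
        # Keep only the first occurrence of the single-valued fields.
        if "Runner CMD:" in line and summary["runner_cmd"] is None:
            summary["runner_cmd"] = line.split("Runner CMD:", 1)[1].strip()
        if "Fbcode hash:" in line and summary["fbcode_hash"] is None:
            summary["fbcode_hash"] = line.split("Fbcode hash:", 1)[1].strip()
        if re.search(r"everpaste.*handle=", line):
            summary["everpaste"].append(line.strip())
        if "ERROR: AddressSanitizer" in line:
            summary["asan_error"] = True
    return summary
```

A file object can be passed directly, for example `with open(path, errors="replace") as f: summary = summarize_log(f)`, since iterating a file yields one line at a time.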

Common failure patterns

Pattern	Typical Cause
warm_boot fails but cold_boot passes for same test	Warm boot initialization or state replay issue
All tests in a test class fail	Test class setup/config issue (e.g., wrong switching mode, wrong resource limits)
`SwSwitchInitializer::initThread` in crash stack	Fatal error during switch warm boot initialization
ASan ABORTING with no ASan error	Usually a `LOG(FATAL)` / `CHECK()` failure, not a memory error — the ASan legend in output is from the binary being built with ASan, not from an ASan detection
Tests that were previously not running now fail	Newly enabled tests that weren't validated on this platform
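The first pattern in the table (warm boot fails while cold boot passes) can be detected mechanically once results are grouped per test. The sketch below assumes results have already been extracted into `(test_name, boot_mode, passed)` tuples; that input shape and the function name are hypothetical, and how boot mode appears in a real netcastle log is not specified here.

```python
from collections import defaultdict


def warm_boot_only_failures(results):
    """Return test names that pass on cold_boot but fail on warm_boot.

    results: iterable of (test_name, boot_mode, passed) tuples.
    """
    by_test = defaultdict(dict)
    for test_name, boot_mode, passed in results:
        by_test[test_name][boot_mode] = passed
    return sorted(
        name for name, modes in by_test.items()
        if modes.get("cold_boot") is True and modes.get("warm_boot") is False
    )
```

Tests returned by this check point toward warm boot initialization or state replay rather than the feature logic itself.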

Reporting

Present findings as:

•Most likely culprit commit with diff number, title, author, and evidence
•What the commit changed and why it likely causes the failure
•Failure mode — assertion failure vs crash vs timeout, with key error messages
•Cold boot vs warm boot pattern — which boot modes are affected
•The bisect2 command ready to run for confirmation
•Suggested next steps (talk to author, revert, fix forward, etc.)