Safe Refactoring for Scientific Code

Overview

Change code structure without changing behavior. Zero tolerance for behavioral changes during refactoring.

Core principle: Establish a baseline, refactor, then verify that results match it. Snapshots must match exactly; golden-regression values may differ only by floating-point noise (< 1e-14).

Announce at start: "I'm using the safe-refactoring skill to restructure this code."

When to Use This Skill

Use for:

•Improving code readability without changing logic
•Extracting reusable functions
•Renaming variables/functions for clarity
•Reorganizing code structure
•Performance optimization (without changing numerical behavior)

Don't use for:

•Changing behavior or algorithms (use scientific-tdd instead)
•Adding new features (use scientific-tdd instead)
•Fixing bugs (use scientific-tdd or fix directly with tests)

Process Checklist

Copy to TodoWrite:

code

Safe Refactoring Progress:
- [ ] Run full test suite (establish baseline)
- [ ] Run snapshot tests (establish baseline)
- [ ] Capture coverage report
- [ ] Perform refactoring
- [ ] Run full test suite (must match baseline exactly)
- [ ] Run snapshot tests (must match baseline exactly)
- [ ] Compare coverage (should stay same or improve)
- [ ] Run quality checks (ruff + black)
- [ ] Verify no numerical differences
- [ ] Commit refactoring

Strict Rules

ZERO tolerance for:

•Any test that passed before and fails after
•Any test that failed before and passes after (suggests test was broken)
•Any snapshot differences (floating-point noise included)
•Decreased test coverage
•Any behavioral changes

If any of these occur: Revert and investigate why.

Detailed Steps

Step 1: Run Full Test Suite (Baseline)

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest -v 2>&1 | tee /tmp/baseline_tests.txt

Record:

•Total tests: grep "passed" /tmp/baseline_tests.txt
•Any failures (if refactoring existing code with known issues)
•Test execution time

Expected: All tests pass (or document any known failures)
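To record the totals in a form that can be compared later, the summary line can be parsed rather than read by eye. A minimal sketch, assuming pytest's usual final line of the form "==== 427 passed, 3 skipped in 81.20s ===="; the helper name parse_pytest_summary is hypothetical and not part of this project:

```python
import re

# Matches count/label pairs such as "427 passed" or "2 failed" in pytest's
# final summary line. The label list is an assumption about common outcomes.
_COUNT_RE = re.compile(
    r"(\d+) (passed|failed|skipped|xfailed|xpassed|errors?|deselected)"
)


def parse_pytest_summary(log_text: str) -> dict[str, int]:
    """Return outcome counts from the last summary line found in a pytest log."""
    counts: dict[str, int] = {}
    for line in log_text.splitlines():
        # Summary lines are framed by "=" and contain " in <elapsed time>".
        if line.startswith("=") and " in " in line:
            found = {}
            for number, label in _COUNT_RE.findall(line):
                # Normalize "errors" to "error" so both spellings share a key.
                key = "error" if label.startswith("error") else label
                found[key] = int(number)
            if found:
                counts = found  # keep the last matching line
    return counts
```

Usage: call parse_pytest_summary(open("/tmp/baseline_tests.txt").read()) and keep the resulting dict next to the baseline log.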

Step 2: Run Snapshot Tests (Baseline)

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest -m snapshot -v 2>&1 | tee /tmp/baseline_snapshots.txt

CRITICAL: Snapshots must match exactly after refactoring.

Expected: All snapshot tests pass

Step 3: Capture Coverage Report

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest --cov=non_local_detector --cov-report=term --cov-report=json:coverage_baseline.json

Record: Coverage percentage for files being refactored

Why: Coverage should not decrease during refactoring (ideally improves)

Step 4: Perform Refactoring

Refactoring techniques:

•Extract function:

python

# Before
def complex_function():
    # ... 50 lines of code
    result = x * 2 + y
    # ... more code
    return final_result

# After
def complex_function():
    # ... code
    result = _calculate_intermediate(x, y)
    # ... code
    return final_result

def _calculate_intermediate(x, y):
    return x * 2 + y

•Rename for clarity:

python

# Before
def f(x):
    return x * 2

# After
def calculate_doubled_value(value):
    return value * 2

•Reorganize structure:

python

# Before: All in one file

# After: Separated into modules
# - core_logic.py
# - utilities.py
# - validation.py

•Optimize performance (numerically equivalent):

python

# Before
for i in range(n):
    result[i] = f(x[i])

# After (JAX; assumes `import jax` and that f is traceable by JAX)
result = jax.vmap(f)(x)

During refactoring:

•Make small, incremental changes
•Test after each change if possible
•Keep numerical operations identical
•Maintain exact same algorithms
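Why "keep numerical operations identical" matters: floating-point addition is not associative, so merely regrouping a sum can change the result. A small illustration using plain Python floats (the function names are hypothetical):

```python
def original_order(a: float, b: float, c: float) -> float:
    """Sum as written in the original code: (a + b) + c."""
    return (a + b) + c


def regrouped_order(a: float, b: float, c: float) -> float:
    """Mathematically equal regrouping: a + (b + c)."""
    return a + (b + c)


before = original_order(0.1, 0.2, 0.3)
after = regrouped_order(0.1, 0.2, 0.3)

# The two results differ in the last bit even though the algebra is the same.
print(before == after)  # False
```

A difference this small stays below the 1e-14 golden-regression threshold but still breaks an exact snapshot comparison, so avoid regrouping arithmetic during a refactor.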

Step 5: Run Full Test Suite (Verify Match)

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest -v 2>&1 | tee /tmp/refactored_tests.txt

Compare to baseline:

bash

# The final summary line includes wall-clock time, so that line is expected to differ
diff /tmp/baseline_tests.txt /tmp/refactored_tests.txt

MUST verify:

•Same number of tests run
•Same tests pass
•Same tests fail (if any)
•Similar execution time (within 20%)

If differences:

•Any new test failures: REVERT IMMEDIATELY
•Any new test passes: Investigate (test was broken?)
•Different test count: Investigate (tests missing or duplicated?)
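The per-test comparison above can be sketched as a script instead of a raw diff. This assumes the -v output format "<node id> <OUTCOME> [ 12%]" and node ids without spaces; parse_outcomes and compare_outcomes are hypothetical helper names:

```python
import re

# One line per test in verbose mode, e.g. "tests/t.py::test_a PASSED [ 50%]".
_RESULT_RE = re.compile(
    r"^(\S+::\S+)\s+(PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)\b"
)


def parse_outcomes(log_text: str) -> dict[str, str]:
    """Map each test node id to its outcome, as printed by `pytest -v`."""
    outcomes: dict[str, str] = {}
    for line in log_text.splitlines():
        match = _RESULT_RE.match(line)
        if match:
            outcomes[match.group(1)] = match.group(2)
    return outcomes


def compare_outcomes(baseline_log: str, refactored_log: str) -> dict:
    """Report tests that disappeared, appeared, or changed status."""
    before = parse_outcomes(baseline_log)
    after = parse_outcomes(refactored_log)
    shared = sorted(before.keys() & after.keys())
    return {
        "missing": sorted(before.keys() - after.keys()),
        "added": sorted(after.keys() - before.keys()),
        "changed": {
            test: (before[test], after[test])
            for test in shared
            if before[test] != after[test]
        },
    }
```

All three entries should be empty for a pure refactoring. Any entry in "changed" is a reason to revert or investigate, in either direction.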

Step 6: Run Snapshot Tests (Verify Match)

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest -m snapshot -v 2>&1 | tee /tmp/refactored_snapshots.txt

CRITICAL: Must match baseline EXACTLY.

Expected: All snapshot tests pass, no differences

If snapshot differences:

•DO NOT UPDATE SNAPSHOTS
•Investigate why behavior changed
•This is NOT a refactoring if behavior changed
•Revert and reconsider approach
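What "EXACTLY" means for floating-point output: the stored bits are the same, not just the printed digits. A sketch using only the standard library; it is not the snapshot mechanism this project uses, and the helper names are hypothetical:

```python
import struct


def float_bits(value: float) -> bytes:
    """Return the 8-byte IEEE 754 double representation of a float."""
    return struct.pack("<d", value)


def bit_identical(expected, actual) -> bool:
    """True only if both sequences have equal length and identical bits."""
    expected = list(expected)
    actual = list(actual)
    if len(expected) != len(actual):
        return False
    return all(float_bits(e) == float_bits(a) for e, a in zip(expected, actual))


# Values that print similarly can still differ in their last bit.
print(bit_identical([0.3], [0.1 + 0.2]))  # False
```

Comparing bits also distinguishes 0.0 from -0.0, which an ordinary == comparison treats as equal.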

Step 7: Compare Coverage

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest --cov=non_local_detector --cov-report=term --cov-report=json:coverage_refactored.json

Compare:

bash

# If you have jq installed
jq '.totals.percent_covered' coverage_baseline.json
jq '.totals.percent_covered' coverage_refactored.json

Expected:

•Coverage stays same or improves
•Never decreases

If coverage decreased:

•Some code paths no longer tested
•Investigate and fix or revert
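The totals can hide a drop in one file that is offset by a gain in another. A sketch of a per-file comparison, assuming the coverage.py JSON report layout with a top-level "files" mapping whose entries carry "summary" and "percent_covered"; the function names are hypothetical:

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read a coverage JSON report written via --cov-report=json:<path>."""
    return json.loads(Path(path).read_text())


def coverage_by_file(report: dict) -> dict[str, float]:
    """Map each measured file to its percent covered."""
    return {
        file_path: info["summary"]["percent_covered"]
        for file_path, info in report["files"].items()
    }


def coverage_regressions(baseline: dict, refactored: dict) -> dict:
    """Return files present in both reports whose coverage decreased."""
    before = coverage_by_file(baseline)
    after = coverage_by_file(refactored)
    return {
        file_path: (before[file_path], after[file_path])
        for file_path in sorted(before.keys() & after.keys())
        if after[file_path] < before[file_path]
    }
```

Usage: coverage_regressions(load_report("coverage_baseline.json"), load_report("coverage_refactored.json")). Files that were moved or renamed during the refactoring appear in only one report and are not paired here, so check those by hand.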

Step 8: Run Quality Checks

bash

/Users/edeno/miniconda3/envs/non_local_detector/bin/ruff check src/
/Users/edeno/miniconda3/envs/non_local_detector/bin/ruff format src/
/Users/edeno/miniconda3/envs/non_local_detector/bin/black src/

Expected: All checks pass

Fix any issues: Refactoring is a good opportunity to improve code quality

Step 9: Verify No Numerical Differences

For mathematical code, verify numerical equivalence:

bash

# Run golden regression
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
  src/non_local_detector/tests/test_golden_regression.py -v

Expected: Exact match (or differences < 1e-14)

If differences > 1e-14:

•This is NOT a pure refactoring
•Behavior has changed
•Use numerical-validation skill instead
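The 1e-14 threshold can be checked directly on stored outputs. A minimal sketch over plain sequences of finite floats; the golden regression test itself may compare differently, and the names here are hypothetical:

```python
TOLERANCE = 1e-14  # threshold stated in this step


def max_abs_diff(expected, actual) -> float:
    """Largest absolute elementwise difference between two sequences.

    Assumes finite values; NaN or infinity would need separate handling.
    """
    expected = list(expected)
    actual = list(actual)
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    return max((abs(e - a) for e, a in zip(expected, actual)), default=0.0)


def within_tolerance(expected, actual, tol: float = TOLERANCE) -> bool:
    """True if every element differs by less than tol."""
    return max_abs_diff(expected, actual) < tol
```

This is an absolute tolerance, so it is strictest for outputs of large magnitude and lenient for values much smaller than 1e-14.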

Step 10: Commit Refactoring

Only commit if ALL checks pass:

bash

git add <refactored_files> <test_files>
git commit -m "refactor: improve <component> code structure

- Extract <function> for reusability
- Rename <variable> for clarity
- Reorganize <module> structure

No behavioral changes:
- All tests pass (N tests)
- Snapshots unchanged
- Coverage: X% → Y%

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>"

Performance Optimization Refactoring

When optimizing for performance:

•Capture performance baseline:

bash

pytest --durations=10 > /tmp/baseline_durations.txt

•Make the optimization
•Verify numerical equivalence (use numerical-validation skill)

•Measure performance improvement:

bash

pytest --durations=10 > /tmp/optimized_durations.txt

•Document improvement:

code

Optimization: Use JAX vmap instead of for loop
Speedup: 3.2x (450ms → 140ms)
Numerical difference: < 1e-14 (verified)
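The note above can be generated from measured timings so the speedup figure is computed rather than typed. A small sketch; format_optimization_note is a hypothetical helper, and the timings are the ones from the example:

```python
def format_optimization_note(
    description: str, before_ms: float, after_ms: float, diff_bound: float
) -> str:
    """Build the three-line optimization note from measured values."""
    if after_ms <= 0:
        raise ValueError("after_ms must be positive")
    speedup = before_ms / after_ms  # how many times faster the new code is
    return (
        f"Optimization: {description}\n"
        f"Speedup: {speedup:.1f}x ({before_ms:.0f}ms → {after_ms:.0f}ms)\n"
        f"Numerical difference: < {diff_bound:g} (verified)"
    )


note = format_optimization_note("Use JAX vmap instead of for loop", 450, 140, 1e-14)
print(note)
```

Only report the numerical bound after the equivalence check has actually been run.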

Integration with Other Skills

•Before refactoring: Consider whether the change actually requires new behavior (if so, use scientific-tdd instead)
•With numerical-validation: If refactoring mathematical code, use numerical-validation to verify equivalence
•With jax skill: When optimizing JAX code, use jax skill for best practices

Example Workflow

Task: Extract position decoding logic into reusable function

code

1. Baseline:
   - Run pytest: 427 passed, 0 failed
   - Run snapshots: 15 passed, 0 failed
   - Coverage: 69%

2. Refactor:
   - Extract _decode_position_from_posterior() function
   - Update 3 call sites to use new function
   - No logic changes, just extraction

3. Verify:
   - Run pytest: 427 passed, 0 failed ✓
   - Run snapshots: 15 passed, 0 failed ✓
   - Coverage: 69% (unchanged) ✓

4. Quality:
   - Ruff: All checks pass ✓
   - Black: Formatted ✓

5. Commit:
   "refactor: extract position decoding into reusable function"

Red Flags

STOP and revert if:

•Any test changes status (pass → fail or fail → pass)
•Any snapshot differences appear
•Coverage decreases
•Numerical differences > 1e-14
•You're tempted to update snapshots
•You're adding new logic (use scientific-tdd instead)

Safe to proceed if:

•All tests match baseline exactly
•No snapshot changes
•Coverage same or better
•Code quality improves
•No new functionality added
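The measurable red flags can be gathered into a single gate that runs before committing. A sketch under stated assumptions: RunSummary and red_flags are hypothetical names, and the fields are expected to be filled in from the logs and reports produced in the steps above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Results of one full verification run (baseline or refactored)."""

    outcomes: dict  # test node id -> outcome string, e.g. "PASSED"
    snapshot_diffs: int  # number of snapshot mismatches reported
    coverage_percent: float
    max_numerical_diff: float  # largest difference in golden regression


def red_flags(
    baseline: RunSummary, refactored: RunSummary, tol: float = 1e-14
) -> list[str]:
    """Return the reasons to stop and revert; empty means safe to proceed."""
    flags = []
    if baseline.outcomes != refactored.outcomes:
        flags.append("test outcomes differ from baseline")
    if refactored.snapshot_diffs > 0:
        flags.append("snapshot differences appeared")
    if refactored.coverage_percent < baseline.coverage_percent:
        flags.append("coverage decreased")
    if refactored.max_numerical_diff > tol:
        flags.append("numerical differences exceed tolerance")
    return flags
```

The last two red flags in the list (temptation to update snapshots, adding new logic) are judgment calls and cannot be checked this way.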

Common Mistakes

"It's just a small behavioral change"

•No such thing in refactoring
•Any behavioral change = not refactoring
•Use scientific-tdd for behavioral changes

"I'll update the snapshots since the new output is better"

•That's not refactoring, it's changing behavior
•Refactoring = zero snapshot changes
•Use scientific-tdd if output should change

"Tests are slow, I'll skip them"

•Never skip tests during refactoring
•Tests are your safety net
•Without tests, you can't verify it's a refactoring

"Coverage went down but the code is better"

•Better code shouldn't lose coverage
•Investigate why coverage decreased
•Fix or revert