MaxFuse Parameter Scaling - Research Notes
Experiment Overview
| Item | Details |
|---|---|
| Date | 2025-01-20 |
| Goal | Adapt MaxFuse parameters for 59-marker PhenoCycler dataset (originally tuned for 26-marker CODEX) |
| Environment | Python 3.10, MaxFuse local package, scanpy 1.9+ |
| Status | Success |
Context
MaxFuse was originally configured for a 26-marker CODEX dataset with ~1,284 RNA cells and ~1.7M protein cells. When migrating to a 59-marker PhenoCycler dataset, the default parameters caused CCA overfitting because:
- •More protein features (59 vs 26) means more dimensions for CCA to fit
- •Same number of RNA cells (~1,284) means higher risk of fitting noise
- •CCA with too many components finds trivially perfect correlations by fitting noise
Critical Issue: CCA Overfitting
The most dangerous parameter is cca_components. With 25 CCA components and 59 protein features, CCA can find nearly perfect correlations that don't reflect real biological signal.
Rule of thumb: cca_components = min(n_shared - 1, sqrt(n_prot_active) + 1, hard_cap)
For 59 markers with 26 shared features:
- •
n_prot_active = 59 - 26 = 33 - •
sqrt(33) + 1 ≈ 7 - •Use 7 CCA components, not 25
Verified Parameters
Parameter Scaling Table
| Parameter | 26 Markers | 59 Markers | 19 Markers | Scaling Rule |
|---|---|---|---|---|
| CCA Components | 25 | 7 | 10 | min(n_shared - 1, 10) for small panels |
| SVD for CCA (RNA) | 40 | 50 | 50 | RNA has many features |
| SVD for CCA (Protein) | None | 35 | 15 | min(15, n_prot - 1) |
| SVD for Graphs (Protein) | 15 | 30 | 15 | min(15, n_prot - 1) |
| SVD for Initial Pivots | 25/20 | 20/18 | 15/15 | min(15, n_shared - 1) |
| Metacell Size | 2 | 2 | 1 | Disable for <25 features |
Critical rule: All SVD components MUST be < n_features - 1.
Small Panel (<25 markers) Considerations
When protein_active == protein_shared (all markers are shared):
- •The sqrt rule for CCA components breaks (
n_prot_active = 0) - •Use simpler rule:
cca_components = min(n_shared - 1, 10) - •Metacell size of 2 provides minimal benefit - use
metacell_size=1(disabled)
Implementation Code
# In notebook 2_integration.ipynb
# 1. Graph construction SVD
svd_comp1_graph = min(50, n_rna_features - 1) # RNA: increased from 40
svd_comp2_graph = min(30, n_prot_features - 1) # Protein: increased from 15
# 2. Initial pivots (shared features only - be conservative)
svd_shared1 = min(20, n_shared - 1) # Reduced from 25
svd_shared2 = min(18, n_shared - 1) # Reduced from 20
# 3. CCA components - CRITICAL
n_prot_active = n_prot_features - n_shared
cca_components = min(
n_shared - 1, # Can't exceed shared features
int(np.sqrt(n_prot_active)) + 1, # sqrt rule for overfitting prevention
12 # Hard cap
)
cca_components = max(5, cca_components) # At least 5
# 4. SVD before CCA
svd_cca_rna = min(50, n_rna_features - 1)
svd_cca_prot = min(35, n_prot_features - 1) # NEW: was None (keep all)
# 5. Propagate - match CCA settings
fusor.propagate(
svd_components1=min(50, n_rna_features - 1),
svd_components2=min(35, n_prot_features - 1),
wt1=0.7, wt2=0.7
)
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Using 25 CCA components with 59 markers | CCA found perfect correlations by fitting noise | Always use sqrt rule for CCA components |
| Keeping SVD=None for protein before CCA | Too many dimensions caused unstable CCA | Limit protein SVD to ~60% of features |
| Same initial pivot SVD (25/20) | Overfitting on shared features in weak linkage | Reduce SVD for initial matching |
Key Insights
- •
CCA components scale with sqrt, not linearly: Doubling protein features does NOT mean doubling CCA components. Use
sqrt(n_active) + 1. - •
Protein SVD limits are essential: With more protein features, you MUST limit SVD before CCA to prevent overfitting. ~60% of features is a good target.
- •
Graph SVD can be more generous: Graph construction is robust to dimensionality. Scale roughly with feature count.
- •
Initial pivots need conservatism: The first matching uses only shared features (~26). Use fewer SVD components to avoid fitting noise.
- •
Check canonical correlations: If the first few canonical correlations are all >0.95, you're overfitting. Reduce CCA components until you see meaningful decay.
Diagnostic Checks
After running with new parameters:
# Check 1: Canonical correlations should decay smoothly
# Good: [0.92, 0.85, 0.78, 0.65, 0.52, ...]
# Bad: [0.99, 0.98, 0.97, 0.96, ...] (overfitting)
# Check 2: Distribution alignment
print(f"RNA: mean={rna.mean():.3f}, std={rna.std():.3f}")
print(f"Prot: mean={prot.mean():.3f}, std={prot.std():.3f}")
# Both should have similar scales (mean≈0, std≈1)
# Check 3: Matching quality
# Score distribution should be centered, not all near 1.0
References
- •MaxFuse paper: Cross-modal data integration methods
- •CCA overfitting: Rule of thumb is n_components < sqrt(min(n1, n2))
- •Integration notebook:
notebooks/2_integration.ipynb