MS Metabolomics Data Processing

Complete workflow from raw MS data to metabolite identification. Load only the reference file relevant to your current task.

Workflow Decision Tree

START: What stage are you at?

├─ PRE-ACQUISITION: Sample normalization needed?
│  ├─ Cell cultures → Normalize by cell count or protein content
│  ├─ Urine → Dilute to constant osmolality or creatinine
│  ├─ Blood/Plasma → Use fixed volume (no normalization needed)
│  └─ Tissue → Normalize by wet weight or protein content

├─ DATA ACQUISITION: QC strategy?
│  ├─ Insert pooled QC samples every 5-10 injections
│  ├─ Randomize sample order within batches
│  └─ Include blanks and internal standards

├─ SIGNAL PROCESSING: Raw data to feature table?
│  ├─ Peak detection → XCMS, MZmine, MS-DIAL
│  ├─ Alignment → RT warping + m/z matching
│  └─ Gap filling → Re-integration at missing features

├─ NORMALIZATION: Which method?
│  ├─ Simple → TSN, MSTUS
│  ├─ Reference-based → PQN
│  └─ QC-based → LOESS, QC-RSC, QC-SVRC

├─ SCALING: How to transform data?
│  ├─ Equal importance → Unit variance (UV) scaling
│  ├─ Reduce impact of large peaks → Pareto scaling
│  └─ Reduce heteroscedasticity → Log transformation

├─ MULTIVARIATE ANALYSIS: Which method?
│  ├─ Exploratory → PCA
│  ├─ Discrimination → PLS-DA, OPLS-DA
│  └─ Biomarker selection → S-plot, VIP scores

└─ IDENTIFICATION: What level?
   ├─ Level 1 → Authentic standard (same RT, MS, MS/MS)
   ├─ Level 2 → Library match (MS/MS spectral match)
   ├─ Level 3 → Putative class (characteristic fragments)
   └─ Level 4 → Unknown (unidentified)
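The acquisition-stage advice in the tree (randomized order, pooled QC every 5-10 injections) can be sketched as a small run-order generator. This is a minimal illustration only: the function name `build_run_order`, the `"QC"` label, and the choice to open and close the batch with a QC injection are assumptions, not part of the workflow described above.

```python
import random


def build_run_order(sample_ids, qc_every=5, n_leading_qc=1, seed=0):
    """Return an injection list with shuffled samples and periodic pooled QCs.

    qc_every     -- insert one pooled QC after this many study samples (5-10 is typical)
    n_leading_qc -- QC injections placed before the first sample (assumption)
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)  # randomize sample order within the batch

    order = ["QC"] * n_leading_qc
    for i, sample in enumerate(shuffled, start=1):
        order.append(sample)
        if i % qc_every == 0:
            order.append("QC")
    if order[-1] != "QC":
        order.append("QC")  # close the batch with a QC (assumption)
    return order


# Hypothetical batch of 12 samples with a QC after every 5 injections
run_order = build_run_order([f"S{i:02d}" for i in range(1, 13)], qc_every=5)
print(run_order)
```

Blanks and internal-standard injections are left out here; they would be added to the same list according to the lab's own protocol.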

When to Use What

Signal Processing (pre-acquisition + raw data to feature table)

•Pre-acquisition normalization: cell count, creatinine, osmolality, or fixed volume, depending on sample type.
•Peak detection: pyopenms FeatureFindingMetabo pipeline. Mass accuracy 5-20 ppm; minimum S/N threshold between 3 and 10.
•RT alignment: LOESS warping or Obiwarp DTW. Tolerance 0.1-0.5 min.
•Gap filling: targeted re-integration at the expected m/z and RT of missing features.

Details: references/signal-processing.md
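The m/z and RT tolerances above come together when a detected peak is matched to an aligned feature list, for example during gap filling. The following is a minimal NumPy sketch of that matching step, not the pyopenms pipeline itself. The function names, default tolerances, and example values are assumptions chosen for illustration.

```python
import numpy as np


def ppm_error(mz_observed, mz_reference):
    """Signed mass error in parts per million."""
    return (mz_observed - mz_reference) / mz_reference * 1e6


def match_feature(mz, rt, ref_mz, ref_rt, ppm_tol=10.0, rt_tol=0.2):
    """Return the index of the reference feature within both tolerances
    that is closest in RT, or None if nothing matches."""
    ref_mz = np.asarray(ref_mz, dtype=float)
    ref_rt = np.asarray(ref_rt, dtype=float)
    within = (np.abs(ppm_error(mz, ref_mz)) <= ppm_tol) & (np.abs(rt - ref_rt) <= rt_tol)
    candidates = np.flatnonzero(within)
    if candidates.size == 0:
        return None
    best = candidates[np.argmin(np.abs(rt - ref_rt[candidates]))]
    return int(best)


# Hypothetical aligned feature list (m/z, RT in minutes)
ref_mz = [180.0634, 180.0700, 255.2330]
ref_rt = [2.10, 2.15, 7.80]

hit = match_feature(180.0640, 2.18, ref_mz, ref_rt)   # about 3.3 ppm from feature 0
miss = match_feature(255.2330, 8.50, ref_mz, ref_rt)  # RT differs by 0.7 min
print(hit, miss)
```

Using a relative (ppm) window for m/z and an absolute window for RT mirrors how the tolerances are stated above; the RT window should be tightened or widened to match the quality of the alignment.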

Post-Acquisition Normalization

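•TSN: total signal normalization. Simple, but sensitive to dominant peaks.
•MSTUS: MS total useful signal. Normalizes to the summed intensity of features detected in all samples; more robust than TSN.
•PQN: probabilistic quotient normalization. Divides each sample by the median of its feature-wise quotients against a reference spectrum; robust to dilution differences even when some metabolites change strongly.
•QC-LOESS: drift correction using a LOESS fit to pooled QC samples across injection order.
•Missing values: remove features with >30% missing values, then impute with half the minimum (min/2) or kNN.
•Batch effects: QC-based correction, randomization, ComBat.

Details: references/normalization.md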
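As a rough illustration of the simpler methods above, the sketch below implements TSN, PQN, and the filter-then-impute step with NumPy. It assumes a samples-by-features intensity matrix, uses the median spectrum as the PQN reference (a pooled-QC median is a common alternative), and the function names are hypothetical.

```python
import numpy as np


def tsn(X):
    """Total signal normalization: scale every sample (row) to the mean total signal."""
    totals = X.sum(axis=1, keepdims=True)
    return X / totals * totals.mean()


def pqn(X, reference=None):
    """Probabilistic quotient normalization on a samples x features matrix."""
    if reference is None:
        reference = np.median(X, axis=0)  # assumption: median spectrum as reference
    quotients = X / reference
    factors = np.median(quotients, axis=1, keepdims=True)  # one dilution factor per sample
    return X / factors


def filter_and_impute(X, max_missing=0.30):
    """Drop features with more than max_missing NaN fraction, then fill NaN with min/2."""
    keep = np.isnan(X).mean(axis=0) <= max_missing
    X_kept = X[:, keep].copy()
    half_min = np.nanmin(X_kept, axis=0) / 2.0
    rows, cols = np.where(np.isnan(X_kept))
    X_kept[rows, cols] = half_min[cols]
    return X_kept, keep


# Synthetic example: one metabolite profile measured at three dilutions
base = np.array([100.0, 50.0, 10.0, 5.0])
dilution = np.array([1.0, 0.5, 2.0])
X = base * dilution[:, None]
X_tsn = tsn(X)
X_pqn = pqn(X)  # the three rows become identical once dilution is removed

# Feature 0 has 25% missing (kept and imputed); feature 1 has 50% missing (dropped)
X_missing = np.array([[4.0, np.nan], [2.0, np.nan], [np.nan, 3.0], [6.0, 1.0]])
X_filled, kept = filter_and_impute(X_missing)
```

PQN is often applied after a total-signal step in practice; the two are shown separately here so each can be inspected on its own.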

Data Scaling

Method	Formula	Effect	Best For
Centering	x - mean	Centers data	All methods
UV (Autoscaling)	(x - mean) / std	Equal importance	When all features matter equally
Pareto	(x - mean) / sqrt(std)	Reduce dominant peaks	Balanced importance
Log transform	log(x + 1)	Reduce heteroscedasticity	Skewed distributions
Range scaling	(x - min) / (max - min)	0-1 range	Neural networks
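The formulas in the table translate directly to column-wise NumPy operations. This sketch assumes a samples-by-features matrix, the sample standard deviation (ddof=1), and the natural logarithm for the log transform; the table does not specify either choice, so treat both as assumptions.

```python
import numpy as np


def center(X):
    return X - X.mean(axis=0)


def uv_scale(X):
    """Unit variance (autoscaling): (x - mean) / std."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pareto_scale(X):
    """Pareto scaling: (x - mean) / sqrt(std)."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))


def log_transform(X):
    """log(x + 1), natural log (assumption); requires non-negative intensities."""
    return np.log1p(X)


def range_scale(X):
    """(x - min) / (max - min), mapping each feature to the 0-1 range."""
    x_min = X.min(axis=0)
    return (X - x_min) / (X.max(axis=0) - x_min)


# Hypothetical intensities: one low-abundance and one high-abundance feature
X = np.array([[1.0, 100.0], [2.0, 400.0], [4.0, 900.0], [5.0, 1600.0]])
X_uv = uv_scale(X)
X_pareto = pareto_scale(X)
X_range = range_scale(X)
```

After UV scaling every feature has standard deviation 1, whereas after Pareto scaling a feature's standard deviation equals the square root of its original value, so large peaks keep some of their weight.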

Multivariate Analysis

•PCA: sklearn.decomposition.PCA. Unsupervised exploration; Hotelling's T2 for outlier detection.
•PLS-DA: PLSRegression + LabelEncoder. 7-fold CV plus permutation testing (n >= 100 permutations). Q2 > 0.5 required.
•VIP scores: Variable Importance in Projection. VIP > 1 as an initial biomarker filter.
•S-plot: covariance vs. correlation plot. High |p(cov)| combined with high |p(corr)| indicates reliable biomarkers.
•Validation: permutation testing (p < 0.05), R2Y-Q2 gap < 0.3, FDR correction for biomarker selection.

Details: references/multivariate-analysis.md

Cross-reference: ../chemometrics-shared/references/validation-strategies.md

Metabolite Identification

•MSI levels: Level 1 (authentic standard) through Level 4 (unknown).
•Spectral matching: cosine similarity on normalized MS/MS peaks, mz_tolerance=0.02, minimum 3 matched peaks.
•Databases: HMDB, METLIN, MassBank, LipidMaps, KEGG, MoNA.

Details: references/metabolite-identification.md
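A minimal sketch of the spectral matching step, assuming spectra are given as lists of (m/z, intensity) pairs and that peaks are paired greedily by descending intensity product. The function name and the example spectra are hypothetical; the tolerance and minimum peak count follow the values stated above.

```python
import numpy as np


def cosine_match(query, library, mz_tolerance=0.02, min_matched=3):
    """Return (cosine score, number of matched peaks) for two MS/MS spectra.

    The score is 0.0 when fewer than min_matched peak pairs are found.
    """
    q = np.asarray(query, dtype=float)
    lib = np.asarray(library, dtype=float)
    # Normalize intensities so that identical spectra score exactly 1
    q_int = q[:, 1] / np.linalg.norm(q[:, 1])
    l_int = lib[:, 1] / np.linalg.norm(lib[:, 1])

    # All peak pairs within the m/z tolerance, strongest products first
    pairs = [
        (q_int[i] * l_int[j], i, j)
        for i in range(len(q))
        for j in range(len(lib))
        if abs(q[i, 0] - lib[j, 0]) <= mz_tolerance
    ]
    pairs.sort(reverse=True)

    used_q, used_l = set(), set()
    score, n_matched = 0.0, 0
    for product, i, j in pairs:
        if i in used_q or j in used_l:
            continue  # each peak may be matched only once
        used_q.add(i)
        used_l.add(j)
        score += product
        n_matched += 1

    if n_matched < min_matched:
        return 0.0, n_matched
    return float(score), n_matched


# Hypothetical query and library spectra
query = [(85.030, 30.0), (127.040, 100.0), (145.050, 60.0), (163.060, 20.0)]
library = [(85.029, 28.0), (127.039, 100.0), (145.051, 55.0), (180.060, 10.0)]

score, n_matched = cosine_match(query, library)
self_score, _ = cosine_match(query, query)
sparse_score, sparse_n = cosine_match(query[:2], library)  # only 2 peaks can match
```

A high score against a library spectrum supports a Level 2 annotation at most; Level 1 still requires comparison with an authentic standard under the same conditions.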

Software Tools

Task	Open Source	Commercial
Peak detection	XCMS, MZmine, MS-DIAL	Compound Discoverer
Alignment	XCMS, OpenMS	Progenesis QI
Statistics	MetaboAnalyst, scikit-learn	SIMCA, Progenesis
Identification	MS-FINDER, SIRIUS	mzCloud, Lipid Search
Pathway analysis	MetaboAnalyst, KEGG	IPA, MetaCore

References

•Boccard, J., & Rudaz, S. (2018). Extracting Knowledge from MS Clinical Metabolomic Data: Processing and Analysis Strategies. Chimia, 72(3), 160-167. doi:10.2533/chimia.2018.160
•Dunn, W. B., et al. (2011). Procedures for large-scale metabolic profiling. Nature Protocols, 6(7), 1060-1083.
•Sumner, L. W., et al. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3(3), 211-221.
•Trygg, J., & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119-128.
