AgentSkillsCN

chemometrics-ms-metabolomics

为代谢组学 MS 数据的处理与分析提供专家指导。覆盖从样本归一化到数据处理、多元分析与代谢物鉴定的完整工作流程。适用于处理 LC-MS 或 GC-MS 代谢组学原始数据时使用;进行峰检测、对齐与归一化,应用多元分析(PCA、PLS-DA、OPLS-DA),从 MS/MS 谱图中识别代谢物,或设计临床代谢组学研究时使用。

SKILL.md
--- frontmatter
name: chemometrics-ms-metabolomics
description: >-
  Expert guidance for processing and analyzing mass spectrometry (MS) based metabolomics data.
  Covers the complete workflow from sample normalization through data processing, multivariate
  analysis, and metabolite identification. Use when processing LC-MS or GC-MS metabolomics raw
  data, performing peak detection, alignment, and normalization, applying multivariate analysis
  (PCA, PLS-DA, OPLS-DA), identifying metabolites from MS/MS spectra, or designing clinical
  metabolomics studies.
license: MIT
author: Alban Ott
based-on: Boccard & Rudaz 2018 - Extracting Knowledge from MS Clinical Metabolomic Data

MS Metabolomics Data Processing

Complete workflow from raw MS data to metabolite identification. Load only the reference file relevant to your current task.

Workflow Decision Tree

code
START: What stage are you at?

├─ PRE-ACQUISITION: Sample normalization needed?
│  ├─ Cell cultures → Normalize by cell count or protein content
│  ├─ Urine → Dilute to constant osmolality or creatinine
│  ├─ Blood/Plasma → Use fixed volume (no normalization needed)
│  └─ Tissue → Normalize by wet weight or protein content

├─ DATA ACQUISITION: QC strategy?
│  ├─ Insert pooled QC samples every 5-10 injections
│  ├─ Randomize sample order within batches
│  └─ Include blanks and internal standards

├─ SIGNAL PROCESSING: Raw data to feature table?
│  ├─ Peak detection → XCMS, MZmine, MS-DIAL
│  ├─ Alignment → RT warping + m/z matching
│  └─ Gap filling → Re-integration at missing features

├─ NORMALIZATION: Which method?
│  ├─ Simple → TSN, MSTUS
│  ├─ Reference-based → PQN
│  └─ QC-based → LOESS, QC-RSC, QC-SVRC

├─ SCALING: How to transform data?
│  ├─ Equal importance → Unit variance (UV) scaling
│  ├─ Reduce impact of large peaks → Pareto scaling
│  └─ Reduce heteroscedasticity → Log transformation

├─ MULTIVARIATE ANALYSIS: Which method?
│  ├─ Exploratory → PCA
│  ├─ Discrimination → PLS-DA, OPLS-DA
│  └─ Biomarker selection → S-plot, VIP scores

└─ IDENTIFICATION: What level?
   ├─ Level 1 → Authentic standard (same RT, MS, MS/MS)
   ├─ Level 2 → Library match (MS/MS spectral match)
   ├─ Level 3 → Putative class (characteristic fragments)
   └─ Level 4 → Unknown (unidentified)

When to Use What

Signal Processing (pre-acquisition + raw data to feature table)

Pre-acquisition normalization: Cell count, creatinine, osmolality, fixed volume -- depends on sample type. Peak detection: pyopenms FeatureFindingMetabo pipeline. Mass accuracy 5-20 ppm, S/N > 3-10. RT alignment: LOESS warping or Obiwarp DTW. Tolerance 0.1-0.5 min. Gap filling: Targeted re-integration at expected m/z + RT for missing features. Details: references/signal-processing.md

Post-Acquisition Normalization

TSN: Total signal normalization. Simple but affected by dominant peaks. MSTUS: Median ratio to geometric mean reference. More robust than TSN. PQN: Probabilistic quotient normalization. Best for biological variation. QC-LOESS: Drift correction using pooled QC samples with LOESS smoothing. Missing values: Filter >30% missing, then min/2 or kNN imputation. Batch effects: QC-based correction, randomization, ComBat. Details: references/normalization.md

Data Scaling

MethodFormulaEffectBest For
Centeringx - meanCenters dataAll methods
UV (Autoscaling)(x - mean) / stdEqual importanceWhen all features matter equally
Pareto(x - mean) / sqrt(std)Reduce dominant peaksBalanced importance
Log transformlog(x + 1)Reduce heteroscedasticitySkewed distributions
Range scaling(x - min) / (max - min)0-1 rangeNeural networks

Multivariate Analysis

PCA: sklearn.decomposition.PCA. Hotelling's T2 for outlier detection. Unsupervised exploration. PLS-DA: PLSRegression + LabelEncoder. 7-fold CV + permutation (n>=100). Q2>0.5 required. VIP scores: Variable Importance in Projection. VIP > 1 as initial biomarker filter. S-plot: Covariance vs. correlation plot. High |p(cov)| + high |p(corr)| = reliable biomarkers. Validation: Permutation testing (p < 0.05), R2Y-Q2 gap < 0.3, FDR correction for biomarker selection. Details: references/multivariate-analysis.md

Cross-reference: ../chemometrics-shared/references/validation-strategies.md

Metabolite Identification

MSI levels: Level 1 (authentic standard) through Level 4 (unknown). Spectral matching: Cosine similarity on normalized MS/MS peaks, mz_tolerance=0.02, min 3 matched peaks. Databases: HMDB, METLIN, MassBank, LipidMaps, KEGG, MoNA. Details: references/metabolite-identification.md

Software Tools

TaskOpen SourceCommercial
Peak detectionXCMS, MZmine, MS-DIALCompound Discoverer
AlignmentXCMS, OpenMSProgenesis QI
StatisticsMetaboAnalyst, scikit-learnSIMCA, Progenesis
IdentificationMS-FINDER, SIRIUSmzCloud, Lipid Search
Pathway analysisMetaboAnalyst, KEGGIPA, MetaCore

See Also

  • chemometrics-shared: Cross-validation strategies, performance metrics, overfitting prevention
  • chemometrics-validation: Model validation best practices for analytical chemistry
  • chemometrics-ml-selection: Machine learning method selection for chemometrics
  • chemometrics-hybrid-modeling: Hybrid mechanistic-ML models

References

  • Boccard, J., & Rudaz, S. (2018). Extracting Knowledge from MS Clinical Metabolomic Data: Processing and Analysis Strategies. Chimia, 72(3), 160-167. doi:10.2533/chimia.2018.160
  • Dunn, W. B., et al. (2011). Procedures for large-scale metabolic profiling. Nature Protocols, 6(7), 1060-1083.
  • Sumner, L. W., et al. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3(3), 211-221.
  • Trygg, J., & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119-128.