AgentSkillsCN

chemometrics-validation

为化学计量学模型的验证提供最佳实践,包括交叉验证策略、针对光谱学与分析化学的专用性能指标、小数据集的处理方法,以及过拟合的预防措施。

SKILL.md
--- frontmatter
name: chemometrics-validation
description: Best practices for validating chemometric models including cross-validation strategies, performance metrics specific to spectroscopy and analytical chemistry, handling small datasets, and preventing overfitting.
license: MIT
metadata:
  skill-author: Alban Ott
  based-on: Trinh et al. 2021 - Machine Learning in Chemical Product Engineering

Chemometrics Model Validation

Overview

Proper validation is crucial in chemometrics because models operate in regulated environments with small, high-dimensional datasets where overfitting is a constant risk. Wrong predictions can affect product quality and safety, so rigorous validation underpins every credible calibration.

When to Use This Skill

Use this skill when designing validation strategies, choosing cross-validation schemes, selecting performance metrics, dealing with small datasets (n < 100), preparing models for regulatory submission, or writing methods sections for publications.

Quick Reference

Cross-Validation Decision Tree

code
What is your sample size?

├─ n < 20: Leave-One-Out Cross-Validation (LOOCV)
│          • Use with caution (high variance)
│          • Consider repeated random splits instead
│
├─ 20 ≤ n < 50: Leave-One-Out or 5-Fold CV
│               • LOOCV for very small data
│               • 5-Fold with multiple repetitions
│
├─ 50 ≤ n < 200: 5-Fold or 10-Fold CV
│                • 10-Fold is standard
│                • Repeat 3-10 times for stability
│
└─ n ≥ 200: 10-Fold CV or Hold-Out Test Set
            • Can afford 70/30 or 80/20 train/test split
            • Use 10-Fold CV for model selection
            • Report performance on held-out test set

Special considerations:

• Time series data? → Use TimeSeriesSplit (no future leakage)
• Batches/groups? → Use GroupKFold (keep groups together)
• Imbalanced classes? → Use StratifiedKFold (preserve class ratios)
• Spatial data? → Use spatial cross-validation (geographic splits)

Performance Metrics Quick Guide

Regression:

  • Primary: RMSEP (Root Mean Square Error of Prediction)
  • Supporting: R², RPD, Bias, SEP

Classification:

  • Primary: Sensitivity, Specificity, F1-score
  • Supporting: Accuracy, Confusion Matrix, ROC AUC

Cross-Validation Strategies

For detailed strategies with code examples, see: ../chemometrics-shared/references/validation-strategies.md

Covers: Train/Test Split, K-Fold CV, LOOCV, Monte Carlo CV, Time Series CV, Group CV, Repeated CV, Nested CV.

Performance Metrics

For detailed metric definitions, code, and interpretation guidance, see: ../chemometrics-shared/references/performance-metrics.md

Covers: RMSEP, R², RPD, Bias/SEP, complete regression report, Confusion Matrix, Sensitivity/Specificity, F1-Score, ROC AUC, complete classification report.

Preventing Overfitting

For detection techniques, learning/validation curves, and prevention strategies, see: ../chemometrics-shared/references/overfitting-prevention.md

Covers: Train-test gap detection, learning curves, validation curves, regularization, permutation testing, early stopping.

Small Dataset Strategies

For sample-size-aware model selection and validation approaches, see: ../chemometrics-shared/references/sample-size-guidance.md

Covers: Decision tree by sample size, rules of thumb for components and features, bootstrap validation, data augmentation, transfer learning.

Reporting Standards

For publication-ready reporting templates and checklists, see: ../chemometrics-shared/references/reporting-standards.md

Covers: Minimum requirements for publications, example methods section template, reproducibility guidelines.

See Also

  • chemometrics-ml-selection — Choosing appropriate ML methods
  • chemometrics-preprocessing — Spectral preprocessing techniques
  • chemometrics-shared — Shared reference library for all chemometrics skills

References

  • Trinh et al. (2021). Machine Learning in Chemical Product Engineering. Processes, 9(8), 1456.
  • Saeys et al. (2005). Potential for On-Site Analysis of Hog Manure. Biosystems Engineering, 91(4), 393-402.
  • Brereton (2015). Chemometrics for Pattern Recognition. Wiley.
  • Williams & Norris (2001). Near-Infrared Technology in the Agricultural and Food Industries (2nd ed.). AACC International.