ML Training Validation Scripts
Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
- •Validating training datasets before fine-tuning
- •Checking model checkpoints and outputs
- •Testing end-to-end training pipelines
- •Verifying system dependencies and GPU availability
- •Debugging training failures or data issues
- •Ensuring data format compliance
- •Validating model configurations
- •Testing inference pipelines
Key Resources:
- •
scripts/validate-data.sh- Comprehensive dataset validation - •
scripts/validate-model.sh- Model checkpoint and config validation - •
scripts/test-pipeline.sh- End-to-end pipeline testing - •
scripts/check-dependencies.sh- System and package dependency checking - •
templates/test-config.yaml- Pipeline test configuration template - •
templates/validation-schema.json- Data validation schema template - •
examples/data-validation-example.md- Dataset validation workflow - •
examples/pipeline-testing-example.md- Complete pipeline testing guide
Quick Start
Validate Training Data
bash scripts/validate-data.sh \ --data-path ./data/train.jsonl \ --format jsonl \ --schema templates/validation-schema.json
Validate Model Checkpoint
bash scripts/validate-model.sh \ --model-path ./checkpoints/epoch-3 \ --framework pytorch \ --check-weights
Test Complete Pipeline
bash scripts/test-pipeline.sh \ --config templates/test-config.yaml \ --data ./data/sample.jsonl \ --verbose
Check System Dependencies
bash scripts/check-dependencies.sh \ --framework pytorch \ --gpu-required \ --min-vram 16
Validation Scripts
1. Data Validation (validate-data.sh)
Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:
bash scripts/validate-data.sh [OPTIONS]
Options:
- •
--data-path PATH- Path to dataset file or directory (required) - •
--format FORMAT- Data format: jsonl, csv, parquet, arrow (default: jsonl) - •
--schema PATH- Path to validation schema JSON file - •
--sample-size N- Number of samples to validate (default: all) - •
--check-duplicates- Check for duplicate entries - •
--check-null- Check for null/missing values - •
--check-length- Validate text length constraints - •
--check-tokens- Validate tokenization compatibility - •
--tokenizer MODEL- Tokenizer to use for token validation (default: gpt2) - •
--max-length N- Maximum sequence length (default: 2048) - •
--output REPORT- Output validation report path (default: validation-report.json)
Validation Checks:
- •Format Compliance: Verify file format and structure
- •Schema Validation: Check against JSON schema if provided
- •Data Types: Validate field types (string, int, float, etc.)
- •Required Fields: Ensure all required fields are present
- •Value Ranges: Check numeric values within expected ranges
- •Text Quality: Detect empty strings, excessive whitespace, encoding issues
- •Duplicates: Identify duplicate entries (exact or near-duplicate)
- •Token Counts: Verify sequences fit within model context length
- •Label Distribution: Check class balance for classification tasks
- •Missing Values: Detect and report null/missing values
Example Output:
{
"status": "PASS",
"total_samples": 10000,
"valid_samples": 9987,
"invalid_samples": 13,
"validation_errors": [
{
"sample_id": 42,
"field": "text",
"error": "Exceeds max token length: 2150 > 2048"
},
{
"sample_id": 156,
"field": "label",
"error": "Invalid label value: 'unknwn' (typo)"
}
],
"statistics": {
"avg_text_length": 487,
"avg_token_count": 128,
"label_distribution": {
"positive": 4892,
"negative": 4895,
"neutral": 213
}
},
"recommendations": [
"Remove or fix 13 invalid samples before training",
"Label distribution is imbalanced - consider class weighting"
]
}
Exit Codes:
- •
0- All validations passed - •
1- Validation errors found - •
2- Script error (invalid arguments, file not found)
2. Model Validation (validate-model.sh)
Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:
bash scripts/validate-model.sh [OPTIONS]
Options:
- •
--model-path PATH- Path to model checkpoint directory (required) - •
--framework FRAMEWORK- Framework: pytorch, tensorflow, jax (default: pytorch) - •
--check-weights- Verify model weights are loadable - •
--check-config- Validate model configuration file - •
--check-tokenizer- Verify tokenizer files are present - •
--check-inference- Test inference with sample input - •
--sample-input TEXT- Sample text for inference test - •
--expected-output TEXT- Expected output for verification - •
--output REPORT- Output validation report path
Validation Checks:
- •File Structure: Verify required files are present
- •Config Validation: Check model_config.json for correctness
- •Weight Integrity: Load and verify model weights
- •Tokenizer Files: Ensure tokenizer files exist and are loadable
- •Model Architecture: Validate architecture matches config
- •Memory Requirements: Estimate GPU/CPU memory needed
- •Inference Test: Run sample inference if requested
- •Quantization: Verify quantized models load correctly
- •LoRA/PEFT: Validate adapter weights and configuration
Example Output:
{
"status": "PASS",
"model_path": "./checkpoints/llama-7b-finetuned",
"framework": "pytorch",
"checks": {
"file_structure": "PASS",
"config": "PASS",
"weights": "PASS",
"tokenizer": "PASS",
"inference": "PASS"
},
"model_info": {
"architecture": "LlamaForCausalLM",
"parameters": "7.2B",
"precision": "float16",
"lora_enabled": true,
"lora_rank": 16
},
"memory_estimate": {
"model_size_gb": 13.5,
"inference_vram_gb": 16.2,
"training_vram_gb": 24.8
},
"inference_test": {
"input": "Hello, world!",
"output": "Hello, world! How can I help you today?",
"latency_ms": 142
}
}
3. Pipeline Testing (test-pipeline.sh)
Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:
bash scripts/test-pipeline.sh [OPTIONS]
Options:
- •
--config PATH- Pipeline test configuration file (required) - •
--data PATH- Sample data for testing - •
--steps STEPS- Comma-separated steps to test (default: all) - •
--verbose- Enable detailed output - •
--output REPORT- Output test report path - •
--fail-fast- Stop on first failure - •
--cleanup- Clean up temporary files after test
Pipeline Steps:
- •data_loading: Test data loading and preprocessing
- •tokenization: Verify tokenization pipeline
- •model_loading: Load model and verify initialization
- •training_step: Run single training step
- •validation_step: Run single validation step
- •checkpoint_save: Test checkpoint saving
- •checkpoint_load: Test checkpoint loading
- •inference: Test inference pipeline
- •metrics: Verify metrics calculation
Test Configuration (test-config.yaml):
pipeline:
name: llama-7b-finetuning
framework: pytorch
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
load_in_8bit: false
lora:
enabled: true
r: 16
alpha: 32
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
Example Output:
Pipeline Test Report ==================== Pipeline: llama-7b-finetuning Started: 2025-11-01 12:34:56 Duration: 127 seconds Test Results: ✓ data_loading (2.3s) - Loaded 10 samples successfully ✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens ✓ model_loading (8.7s) - Model loaded, 7.2B parameters ✓ training_step (15.4s) - Training step completed, loss: 1.847 ✓ validation_step (12.1s) - Validation step completed, loss: 1.923 ✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test ✓ checkpoint_load (6.8s) - Checkpoint loaded successfully ✓ inference (2.9s) - Inference completed, latency: 142ms ✓ metrics (0.4s) - Metrics calculated correctly Overall: PASS (8/8 tests passed) Performance Metrics: - Total time: 127s - GPU memory used: 15.2 GB - CPU memory used: 8.4 GB Recommendations: - Pipeline is ready for full training - Consider increasing batch_size to improve throughput
4. Dependency Checking (check-dependencies.sh)
Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:
bash scripts/check-dependencies.sh [OPTIONS]
Options:
- •
--framework FRAMEWORK- ML framework: pytorch, tensorflow, jax (default: pytorch) - •
--gpu-required- Require GPU availability - •
--min-vram GB- Minimum GPU VRAM required (GB) - •
--cuda-version VERSION- Required CUDA version - •
--packages FILE- Path to requirements.txt or packages list - •
--platform PLATFORM- Platform: modal, lambda, runpod, local (default: local) - •
--fix- Attempt to install missing packages - •
--output REPORT- Output dependency report path
Dependency Checks:
- •Python Version: Verify compatible Python version
- •ML Framework: Check PyTorch/TensorFlow/JAX installation
- •CUDA/cuDNN: Verify CUDA toolkit and cuDNN
- •GPU Availability: Detect and validate GPUs
- •VRAM: Check GPU memory capacity
- •System Packages: Verify system-level dependencies
- •Python Packages: Check required pip packages
- •Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
- •Storage: Check available disk space
- •Network: Test internet connectivity for model downloads
Example Output:
{
"status": "PASS",
"platform": "modal",
"checks": {
"python": {
"status": "PASS",
"version": "3.10.12",
"required": ">=3.9"
},
"pytorch": {
"status": "PASS",
"version": "2.1.0",
"cuda_available": true,
"cuda_version": "12.1"
},
"gpu": {
"status": "PASS",
"count": 1,
"type": "NVIDIA A100",
"vram_gb": 40,
"driver_version": "535.129.03"
},
"packages": {
"status": "PASS",
"installed": 42,
"missing": 0,
"outdated": 3
},
"storage": {
"status": "PASS",
"available_gb": 128,
"required_gb": 50
}
},
"recommendations": [
"Update transformers to latest version (4.36.0 -> 4.37.2)",
"Consider upgrading to PyTorch 2.2.0 for better performance"
]
}
Templates
Validation Schema Template (templates/validation-schema.json)
Defines expected data structure and validation rules for training datasets.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["text", "label"],
"properties": {
"text": {
"type": "string",
"minLength": 10,
"maxLength": 5000,
"description": "Input text for training"
},
"label": {
"type": "string",
"enum": ["positive", "negative", "neutral"],
"description": "Classification label"
},
"metadata": {
"type": "object",
"properties": {
"source": {
"type": "string"
},
"timestamp": {
"type": "string",
"format": "date-time"
}
}
}
},
"validation_rules": {
"max_token_length": 2048,
"tokenizer": "meta-llama/Llama-2-7b-hf",
"check_duplicates": true,
"min_label_count": 100,
"max_label_imbalance_ratio": 10.0
}
}
Customization:
- •Modify
requiredfields for your use case - •Add custom properties and validation rules
- •Set appropriate length constraints
- •Define label enums for classification
- •Configure tokenizer and max length
Test Configuration Template (templates/test-config.yaml)
Defines pipeline testing parameters and expected behaviors.
pipeline:
name: test-pipeline
framework: pytorch
platform: modal # modal, lambda, runpod, local
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl # jsonl, csv, parquet
sample_size: 10 # Number of samples to use for testing
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
quantization: null # null, 8bit, 4bit
lora:
enabled: true
r: 16
alpha: 32
dropout: 0.05
target_modules: ["q_proj", "v_proj"]
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
warmup_steps: 0
eval_steps: 2
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
fail_fast: true
cleanup: true
validation:
metrics: ["loss", "perplexity"]
check_gradients: true
check_memory_leak: true
gpu:
required: true
min_vram_gb: 16
allow_cpu_fallback: false
Integration with ML Training Workflow
Pre-Training Validation
# 1. Check system dependencies bash scripts/check-dependencies.sh \ --framework pytorch \ --gpu-required \ --min-vram 16 # 2. Validate training data bash scripts/validate-data.sh \ --data-path ./data/train.jsonl \ --schema templates/validation-schema.json \ --check-duplicates \ --check-tokens # 3. Test pipeline with sample data bash scripts/test-pipeline.sh \ --config templates/test-config.yaml \ --verbose
Post-Training Validation
# 1. Validate model checkpoint bash scripts/validate-model.sh \ --model-path ./checkpoints/final \ --framework pytorch \ --check-weights \ --check-inference # 2. Test inference pipeline bash scripts/test-pipeline.sh \ --config templates/test-config.yaml \ --steps inference,metrics
CI/CD Integration
# .github/workflows/validate-training.yml
- name: Validate Training Data
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema ./validation-schema.json
- name: Test Training Pipeline
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
--config ./test-config.yaml \
--fail-fast
Error Handling and Debugging
Common Validation Failures
Data Validation Failures:
- •Token length exceeded → Truncate or filter long sequences
- •Missing required fields → Fix data preprocessing
- •Duplicate entries → Deduplicate dataset
- •Invalid labels → Clean label data
Model Validation Failures:
- •Missing config files → Ensure complete checkpoint
- •Weight loading errors → Check framework compatibility
- •Tokenizer mismatch → Verify tokenizer version
Pipeline Test Failures:
- •Out of memory → Reduce batch size, enable gradient checkpointing
- •CUDA errors → Check GPU drivers, CUDA version
- •Slow performance → Profile and optimize data loading
Debug Mode
All scripts support verbose debugging:
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
Outputs:
- •Detailed progress logs
- •Stack traces for errors
- •Performance metrics
- •Memory usage statistics
Best Practices
- •Always validate before training - Catch data issues early
- •Use schema validation - Enforce data quality standards
- •Test with sample data first - Verify pipeline before full training
- •Check dependencies on new platforms - Avoid runtime failures
- •Validate checkpoints - Ensure model integrity before deployment
- •Monitor validation metrics - Track data quality over time
- •Automate validation in CI/CD - Prevent bad data from entering pipeline
- •Keep validation schemas updated - Reflect current data requirements
Performance Tips
- •Use
--sample-sizefor quick validation of large datasets - •Run dependency checks once per environment, cache results
- •Parallelize validation with GNU parallel for multi-file datasets
- •Use
--fail-fastin CI/CD to save time - •Cache tokenizers to avoid re-downloading
Troubleshooting
Script Not Found:
# Ensure you're in the correct directory cd /path/to/ml-training/skills/validation-scripts bash scripts/validate-data.sh --help
Permission Denied:
# Make scripts executable chmod +x scripts/*.sh
Missing Dependencies:
# Install required tools pip install jsonschema pandas pyarrow sudo apt-get install jq bc
Plugin: ml-training Version: 1.0.0 Supported Frameworks: PyTorch, TensorFlow, JAX Platforms: Modal, Lambda Labs, RunPod, Local