ML Configuration System

A dataclass-based hierarchical configuration system for ML experiments with integrated logging, checkpointing, and TensorBoard support.

Overview

This skill documents patterns for creating reproducible ML experiment configurations using Python dataclasses. The system provides:

  • Hierarchical composition: Modular sub-configs aggregated into experiment configs
  • Factory pattern: Registry-based config instantiation
  • Immutable updates: dataclasses.replace() returns a modified copy rather than mutating the original sub-config
  • JSON serialization: Config snapshots saved for reproducibility
  • Integrated utilities: Console logging, checkpointing, and TensorBoard metrics

Design Patterns

1. Dataclass Hierarchy

Define modular sub-configs for each concern, then aggregate in a base config:

python
from dataclasses import dataclass, field, asdict, replace

@dataclass
class Device:
    """Device configuration."""
    type: str = "auto"
    mixed_precision: bool = False

@dataclass
class Model:
    """Model architecture configuration."""
    d_model: int = 64
    num_layers: int = 1
    dropout: float = 0.0

@dataclass
class Hyperparameters:
    """Training hyperparameters."""
    num_epochs: int = 30
    batch_size: int = 32
    learning_rate: float = 0.001

@dataclass
class BaseConfig:
    """Base configuration - all experiments inherit from this."""
    name: str = "base"
    seed: int = 42
    device: Device = field(default_factory=Device)
    model: Model = field(default_factory=Model)
    hyperparameters: Hyperparameters = field(default_factory=Hyperparameters)
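
The reason for field(default_factory=...) can be seen in a trimmed-down sketch of the hierarchy above (only Model is kept here for brevity). Each BaseConfig instance gets its own sub-config object; a bare Model() default would instead be shared by every instance, and Python 3.11 and later reject such a default at class-definition time.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Model:
    d_model: int = 64
    num_layers: int = 1
    dropout: float = 0.0


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42
    model: Model = field(default_factory=Model)


a = BaseConfig()
b = BaseConfig()

# default_factory builds a fresh Model per instance, so the two configs
# do not share a sub-config object.
assert a.model is not b.model
assert a.model == b.model  # dataclasses compare by field values

# Nested fields are read with ordinary attribute access, and replace()
# produces a modified copy without touching the original.
wider = replace(a, model=replace(a.model, d_model=128))
print(wider.model.d_model, a.model.d_model)  # 128 64
```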

2. Experiment Configs with __post_init__

Create experiment-specific configs by inheriting from BaseConfig and overriding defaults in __post_init__:

python
@dataclass
class MyExperimentConfig(BaseConfig):
    """Custom experiment configuration."""
    name: str = "my_experiment"
    description: str = "Description of the experiment"

    def __post_init__(self):
        # Override model defaults
        self.model = replace(self.model,
            d_model=128,
            num_layers=3,
            dropout=0.1
        )
        # Override hyperparameters
        self.hyperparameters = replace(self.hyperparameters,
            num_epochs=50,
            batch_size=64
        )
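
One consequence of this pattern is worth seeing in a small sketch (classes trimmed to two fields; the values are illustrative). Because __post_init__ runs after the generated __init__, it also rewrites values the caller passed explicitly, and it runs again when replace() is called on the experiment config itself. A per-run tweak therefore has to be applied to the sub-config after construction.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Model:
    d_model: int = 64
    num_layers: int = 1


@dataclass
class BaseConfig:
    name: str = "base"
    model: Model = field(default_factory=Model)


@dataclass
class MyExperimentConfig(BaseConfig):
    name: str = "my_experiment"

    def __post_init__(self):
        # Runs after __init__, so it overrides caller-supplied values too.
        self.model = replace(self.model, d_model=128, num_layers=3)


# The explicit d_model=512 is overwritten by __post_init__.
custom_cfg = MyExperimentConfig(model=Model(d_model=512))
print(custom_cfg.model.d_model)  # 128, not 512

# replace() on the whole config re-runs __init__ and __post_init__,
# so it is overwritten in the same way.
replaced_cfg = replace(custom_cfg, model=Model(d_model=512))
print(replaced_cfg.model.d_model)  # 128

# Assigning a modified sub-config after construction does stick.
cfg = MyExperimentConfig()
cfg.model = replace(cfg.model, d_model=512)
print(cfg.model)  # Model(d_model=512, num_layers=3)
```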

3. Factory Pattern with Registry

Use a registry dict to map config names to classes:

python
CONFIGS = {
    "base": BaseConfig,
    "my_experiment": MyExperimentConfig,
}

def get_config(name: str = "base") -> BaseConfig:
    """Get an experiment configuration by name."""
    if name not in CONFIGS:
        raise ValueError(f"Unknown config: {name}. Available: {list(CONFIGS.keys())}")
    return CONFIGS[name]()
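
As an optional variation, registration can be moved next to each class definition with a decorator. This is a sketch only: register_config and SmallConfig are hypothetical names that do not appear in the templates, and BaseConfig is trimmed to two fields.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Type


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42


CONFIGS: Dict[str, Type[BaseConfig]] = {"base": BaseConfig}


def register_config(name: str) -> Callable[[Type[BaseConfig]], Type[BaseConfig]]:
    """Hypothetical decorator that adds a config class to the registry."""
    def decorator(cls: Type[BaseConfig]) -> Type[BaseConfig]:
        if name in CONFIGS:
            raise ValueError(f"Config name already registered: {name}")
        CONFIGS[name] = cls
        return cls
    return decorator


# Decorators apply bottom-up, so the class is already a dataclass
# by the time it is registered.
@register_config("small")
@dataclass
class SmallConfig(BaseConfig):
    name: str = "small"


def get_config(name: str = "base") -> BaseConfig:
    """Get an experiment configuration by name."""
    if name not in CONFIGS:
        raise ValueError(f"Unknown config: {name}. Available: {list(CONFIGS.keys())}")
    return CONFIGS[name]()


cfg = get_config("small")
print(type(cfg).__name__, cfg.seed)  # SmallConfig 42
```

Rejecting duplicate names is a design choice in this sketch; it prevents a later module from silently replacing an existing experiment.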

4. JSON Serialization

Add a to_dict() method for JSON-compatible serialization:

python
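from typing import Any, Dict

# Method of BaseConfig (indent it into the class body).
def to_dict(self) -> Dict[str, Any]:
    """Convert config to dictionary for JSON serialization."""
    d = asdict(self)
    # asdict() keeps tuples as tuples; convert them to lists so the dict
    # matches what json.load() returns. Assumes Hyperparameters defines
    # an optimizer_betas tuple field, which the excerpt above does not show.
    d['hyperparameters']['optimizer_betas'] = list(d['hyperparameters']['optimizer_betas'])
    return d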
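
A minimal sketch of writing and reading back a config snapshot follows. The save_config_snapshot helper, the config.json file name, and the optimizer_betas default are assumptions made for illustration; the dataclasses are trimmed to the fields needed here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Tuple


@dataclass
class Hyperparameters:
    learning_rate: float = 0.001
    optimizer_betas: Tuple[float, float] = (0.9, 0.999)  # assumed field


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42
    hyperparameters: Hyperparameters = field(default_factory=Hyperparameters)


def save_config_snapshot(cfg: BaseConfig, run_dir: Path) -> Path:
    """Write the config to run_dir/config.json (hypothetical file name)."""
    path = run_dir / "config.json"
    path.write_text(json.dumps(asdict(cfg), indent=2, sort_keys=True))
    return path


with tempfile.TemporaryDirectory() as tmp:
    snapshot = save_config_snapshot(BaseConfig(), Path(tmp))
    loaded = json.loads(snapshot.read_text())

# JSON has no tuple type, so the betas come back as a list.
print(loaded["hyperparameters"]["optimizer_betas"])  # [0.9, 0.999]
```

Sorting keys keeps snapshots from different runs easy to diff.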

Configuration Components

The system includes these standard sub-configs:

| Component | Purpose | Key Fields |
|---|---|---|
| Device | Hardware settings | type, mixed_precision |
| Output | Output directory structure | base_dir, save_config, subdirs |
| Console | Console logging | enabled, filename, tee_to_console |
| Checkpointing | Model saving | save_best, save_last, metric, mode |
| TensorBoard | Metrics logging | enabled, per_batch_*, per_epoch_* |
| Model | Architecture params | d_model, num_layers, num_heads |
| Hyperparameters | Training params | num_epochs, batch_size, learning_rate |

Supporting Utilities

ConfigLoader

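Loads configs and creates run directories. The utility examples in this section read cfg with dict-style access (cfg.get('console', {}), cfg['name']), which suggests the loader returns the to_dict() form of the config rather than the dataclass instance: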

python
loader = ConfigLoader()
cfg = loader.get_experiment_config('my_experiment')
run_dir = loader.create_run_directory(cfg)
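
The internals of create_run_directory are not shown here, so the following is only a sketch of what such a function might do with the Output fields from the component table (base_dir, save_config, subdirs). The timestamped directory name, the field defaults, and the config.json file name are all assumptions.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime
from pathlib import Path
from typing import List


@dataclass
class Output:
    # Field names follow the component table; defaults are assumed.
    base_dir: str = "runs"
    save_config: bool = True
    subdirs: List[str] = field(default_factory=lambda: ["checkpoints", "tensorboard"])


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42
    output: Output = field(default_factory=Output)


def create_run_directory(cfg: BaseConfig) -> Path:
    """Create <base_dir>/<name>_<timestamp>/ with its subdirectories."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(cfg.output.base_dir) / f"{cfg.name}_{stamp}"
    # exist_ok=False: fail loudly rather than reuse an earlier run's directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    for sub in cfg.output.subdirs:
        (run_dir / sub).mkdir()
    if cfg.output.save_config:
        (run_dir / "config.json").write_text(json.dumps(asdict(cfg), indent=2))
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    cfg = BaseConfig(output=Output(base_dir=tmp))
    run_dir = create_run_directory(cfg)
    created = sorted(p.name for p in run_dir.iterdir())

print(created)  # ['checkpoints', 'config.json', 'tensorboard']
```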

CheckpointManager

Manages model checkpointing during training:

python
ckpt_mgr = CheckpointManager(cfg, run_dir, model)

# Call at end of each epoch
ckpt_mgr.step(epoch, {'dev_accuracy': 0.95})

# Call at end of training (metrics is the final epoch's metrics dict)
ckpt_mgr.save_final(epoch, metrics)
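
How step() decides when to save is not shown above. One plausible reading of the metric and mode fields from the component table is sketched below; BestMetricTracker is a hypothetical name, the field types and defaults are assumptions, and the actual model saving is left out so the sketch runs without a deep learning framework.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Checkpointing:
    # Field names follow the component table; types and defaults are assumed.
    save_best: bool = True
    save_last: bool = True
    metric: str = "dev_accuracy"
    mode: str = "max"  # "max" for accuracy-like metrics, "min" for losses


class BestMetricTracker:
    """Decides whether an epoch's metrics beat the best seen so far."""

    def __init__(self, cfg: Checkpointing):
        if cfg.mode not in ("max", "min"):
            raise ValueError(f"mode must be 'max' or 'min', got {cfg.mode!r}")
        self.cfg = cfg
        self.best_value: Optional[float] = None
        self.best_epoch: Optional[int] = None

    def step(self, epoch: int, metrics: Dict[str, float]) -> bool:
        """Return True when this epoch should be saved as the new best."""
        value = metrics[self.cfg.metric]
        improved = (
            self.best_value is None
            or (self.cfg.mode == "max" and value > self.best_value)
            or (self.cfg.mode == "min" and value < self.best_value)
        )
        if improved:
            self.best_value, self.best_epoch = value, epoch
        return improved and self.cfg.save_best


tracker = BestMetricTracker(Checkpointing())
decisions = [
    tracker.step(epoch, {"dev_accuracy": acc})
    for epoch, acc in enumerate([0.80, 0.78, 0.91])
]
print(decisions, tracker.best_epoch)  # [True, False, True] 2
```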

ConsoleLogger

Captures stdout/stderr to log files:

python
with ConsoleLogger(config=cfg.get('console', {}), run_dir=run_dir):
    # All print statements captured to console.log
    train_model(...)
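
The template file list mentions a TeeStream class alongside ConsoleLogger. The idea behind teeing can be sketched with the standard library alone; the class below is an illustrative stand-in, not the template's implementation, and it handles stdout only.

```python
import sys
import tempfile
from contextlib import redirect_stdout
from pathlib import Path


class TeeStream:
    """Minimal write-only stream that forwards each write to several targets."""

    def __init__(self, *targets):
        self.targets = targets

    def write(self, text: str) -> int:
        for target in self.targets:
            target.write(text)
        return len(text)

    def flush(self) -> None:
        for target in self.targets:
            target.flush()


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "console.log"
    with open(log_path, "w") as log_file:
        # sys.stdout is captured before the redirect, so output still
        # reaches the real console while a copy goes to the log file.
        with redirect_stdout(TeeStream(sys.stdout, log_file)):
            print("epoch 0: loss=0.50")
    captured = log_path.read_text()

print(repr(captured))  # 'epoch 0: loss=0.50\n'
```

Capturing stderr as well would follow the same shape using contextlib.redirect_stderr.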

MetricsLogger

TensorBoard integration for metrics:

python
logger = MetricsLogger(cfg.get('tensorboard', {}), 'experiment_name', output_dir=run_dir)

# Per-batch logging
logger.log_batch(loss=0.5, gradient_norm=1.2)

# Per-epoch logging
logger.log_epoch(epoch, train_loss=0.3, dev_accuracy=0.95)

# Final inference outputs
logger.log_inference(attention_maps=attn, predictions=preds)

logger.close()
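
The mapping from these calls to TensorBoard scalars might look like the sketch below. It assumes a writer object exposing add_scalar(tag, scalar_value, global_step) and close(), which is the shape of torch.utils.tensorboard.SummaryWriter; a recording stand-in is used here so the example runs without PyTorch. The class names and the "batch/" and "epoch/" tag prefixes are assumptions.

```python
from typing import List, Tuple


class RecordingWriter:
    """Stand-in for a TensorBoard writer; records calls instead of writing event files."""

    def __init__(self) -> None:
        self.scalars: List[Tuple[str, float, int]] = []

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None:
        self.scalars.append((tag, scalar_value, global_step))

    def close(self) -> None:
        pass


class MetricsLoggerSketch:
    """Hypothetical logger: per-batch and per-epoch scalars under separate tags."""

    def __init__(self, writer) -> None:
        self.writer = writer
        self.global_step = 0  # counts batches across all epochs

    def log_batch(self, **metrics: float) -> None:
        for name, value in metrics.items():
            self.writer.add_scalar(f"batch/{name}", value, self.global_step)
        self.global_step += 1

    def log_epoch(self, epoch: int, **metrics: float) -> None:
        for name, value in metrics.items():
            self.writer.add_scalar(f"epoch/{name}", value, epoch)

    def close(self) -> None:
        self.writer.close()


writer = RecordingWriter()
sketch_logger = MetricsLoggerSketch(writer)
sketch_logger.log_batch(loss=0.5, gradient_norm=1.2)
sketch_logger.log_batch(loss=0.4, gradient_norm=1.1)
sketch_logger.log_epoch(0, train_loss=0.3, dev_accuracy=0.95)
sketch_logger.close()

print(len(writer.scalars))  # 6
```

Keeping a separate batch counter means per-batch curves stay continuous across epochs, while per-epoch curves use the epoch index as their x-axis.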

Usage Examples

Creating a New Experiment Config

python
@dataclass
class LargeModelConfig(BaseConfig):
    name: str = "large_model"
    description: str = "Larger model with more capacity"

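    def __post_init__(self):
        # d_internal, num_heads, and scheduler_enabled are not in the trimmed
        # dataclasses shown under Design Patterns; replace() raises TypeError for
        # unknown field names, so this assumes the fuller template definitions.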
        self.model = replace(self.model,
            d_model=256,
            d_internal=256,
            num_layers=4,
            num_heads=8,
            dropout=0.1
        )
        self.hyperparameters = replace(self.hyperparameters,
            num_epochs=100,
            batch_size=128,
            scheduler_enabled=True
        )

# Register the config
CONFIGS["large_model"] = LargeModelConfig

Full Training Loop Integration

python
# Load config
loader = ConfigLoader()
cfg = loader.get_experiment_config('my_experiment')
run_dir = loader.create_run_directory(cfg)

# Initialize utilities (model, train_loader, and dev_loader are assumed to exist)
console_logger = ConsoleLogger(cfg.get('console', {}), run_dir)
metrics_logger = MetricsLogger(cfg.get('tensorboard', {}), cfg['name'], output_dir=run_dir)
ckpt_mgr = CheckpointManager(cfg, run_dir, model)

# Training loop
for epoch in range(cfg['hyperparameters']['num_epochs']):
    train_loss = train_epoch(model, train_loader)
    dev_acc = evaluate(model, dev_loader)

    metrics_logger.log_epoch(epoch, train_loss=train_loss, dev_accuracy=dev_acc)
    ckpt_mgr.step(epoch, {'dev_accuracy': dev_acc})

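# Cleanup (save_final takes the final metrics, as in the CheckpointManager example)
ckpt_mgr.save_final(epoch, {'dev_accuracy': dev_acc})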
metrics_logger.close()
console_logger.close()

Template Files

Reference implementations are provided in this skill directory:

  • config_template.py - Dataclass definitions and BaseConfig pattern
  • config_loader_template.py - ConfigLoader and CheckpointManager classes
  • console_logger_template.py - TeeStream and ConsoleLogger classes
  • metrics_logger_template.py - MetricsLogger with TensorBoard integration