ML Configuration System
A dataclass-based hierarchical configuration system for ML experiments with integrated logging, checkpointing, and TensorBoard support.
Overview
This skill documents patterns for creating reproducible ML experiment configurations using Python dataclasses. The system provides:
- Hierarchical composition: Modular sub-configs aggregated into experiment configs
- Factory pattern: Registry-based config instantiation
- Immutable updates: Using dataclasses.replace() for safe field modifications
- JSON serialization: Config snapshots saved for reproducibility
- Integrated utilities: Console logging, checkpointing, and TensorBoard metrics
Design Patterns
1. Dataclass Hierarchy
Define modular sub-configs for each concern, then aggregate in a base config:
from dataclasses import dataclass, field, asdict, replace

@dataclass
class Device:
    """Device configuration."""
    type: str = "auto"
    mixed_precision: bool = False

@dataclass
class Model:
    """Model architecture configuration."""
    d_model: int = 64
    num_layers: int = 1
    dropout: float = 0.0

@dataclass
class Hyperparameters:
    """Training hyperparameters."""
    num_epochs: int = 30
    batch_size: int = 32
    learning_rate: float = 0.001

@dataclass
class BaseConfig:
    """Base configuration - all experiments inherit from this."""
    name: str = "base"
    seed: int = 42
    device: Device = field(default_factory=Device)
    model: Model = field(default_factory=Model)
    hyperparameters: Hyperparameters = field(default_factory=Hyperparameters)
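As a quick check of how the hierarchy behaves, the sketch below rebuilds cut-down versions of Model and BaseConfig and shows two properties: default_factory gives each config its own sub-config instance, and dataclasses.replace() returns a modified copy without touching the original. The shortened class bodies are for illustration only.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Model:
    d_model: int = 64
    num_layers: int = 1
    dropout: float = 0.0


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42
    model: Model = field(default_factory=Model)


a = BaseConfig()
b = BaseConfig()

# default_factory builds a fresh Model per config, so instances share no state.
a.model.d_model = 256
assert b.model.d_model == 64

# replace() returns a new object; the original sub-config is left unchanged.
wider = replace(b.model, d_model=128)
assert wider.d_model == 128 and b.model.d_model == 64

# Nested updates compose: build the new sub-config, then swap it into the parent.
c = replace(b, name="wider", model=wider)
print(c)
```

Writing `model: Model = Model()` instead of using default_factory is rejected by the dataclass machinery on recent Python versions, because a regular dataclass instance is unhashable and is treated as a mutable default.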
2. Experiment Configs with __post_init__
Create experiment-specific configs by inheriting from BaseConfig and overriding defaults in __post_init__:
@dataclass
class MyExperimentConfig(BaseConfig):
    """Custom experiment configuration."""
    name: str = "my_experiment"
    description: str = "Description of the experiment"

    def __post_init__(self):
        # Override model defaults
        self.model = replace(self.model,
            d_model=128,
            num_layers=3,
            dropout=0.1
        )
        # Override hyperparameters
        self.hyperparameters = replace(self.hyperparameters,
            num_epochs=50,
            batch_size=64
        )
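One consequence of this pattern is that __post_init__ runs on every construction, so a model passed explicitly to the constructor is rewritten as well. The sketch below contrasts it with an alternative, not taken from the template files, that moves the experiment defaults into default_factory so that caller-supplied values survive. PostInitConfig and FactoryConfig are hypothetical names.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Model:
    d_model: int = 64
    num_layers: int = 1


@dataclass
class BaseConfig:
    name: str = "base"
    model: Model = field(default_factory=Model)


@dataclass
class PostInitConfig(BaseConfig):
    """Pattern from the text: overrides are applied after __init__."""
    name: str = "post_init"

    def __post_init__(self):
        self.model = replace(self.model, d_model=128, num_layers=3)


@dataclass
class FactoryConfig(BaseConfig):
    """Hypothetical alternative: overrides live in the field default."""
    name: str = "factory"
    model: Model = field(default_factory=lambda: Model(d_model=128, num_layers=3))


custom = Model(d_model=512, num_layers=6)

overridden = PostInitConfig(model=custom)  # __post_init__ rewrites the passed model
kept = FactoryConfig(model=custom)         # the passed model is used as given

print(overridden.model.d_model, kept.model.d_model)
```

Both classes produce the same defaults when called with no arguments; they differ only in whether an explicit argument wins.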
3. Factory Pattern with Registry
Use a registry dict to map config names to classes:
CONFIGS = {
    "base": BaseConfig,
    "my_experiment": MyExperimentConfig,
}

def get_config(name: str = "base") -> BaseConfig:
    """Get an experiment configuration by name."""
    if name not in CONFIGS:
        raise ValueError(f"Unknown config: {name}. Available: {list(CONFIGS.keys())}")
    return CONFIGS[name]()
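The registry can also be filled by a decorator, so that each config registers itself where it is defined. This is a sketch of that variant under simplified class definitions; register_config is a hypothetical helper and is not part of the template files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Type


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42


CONFIGS: Dict[str, Type[BaseConfig]] = {"base": BaseConfig}


def register_config(name: str) -> Callable[[Type[BaseConfig]], Type[BaseConfig]]:
    """Hypothetical decorator that adds a config class to CONFIGS under `name`."""
    def decorator(cls: Type[BaseConfig]) -> Type[BaseConfig]:
        if name in CONFIGS:
            raise ValueError(f"Config {name!r} is already registered")
        CONFIGS[name] = cls
        return cls
    return decorator


@register_config("my_experiment")
@dataclass
class MyExperimentConfig(BaseConfig):
    name: str = "my_experiment"


def get_config(name: str = "base") -> BaseConfig:
    """Same lookup as in the text: instantiate the registered class."""
    if name not in CONFIGS:
        raise ValueError(f"Unknown config: {name}. Available: {list(CONFIGS.keys())}")
    return CONFIGS[name]()


cfg = get_config("my_experiment")
print(cfg)
```

Note that get_config returns a fresh instance on every call, so mutating one returned config does not affect later lookups.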
4. JSON Serialization
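Add a to_dict() method to BaseConfig for JSON-compatible serialization. The example assumes the full Hyperparameters dataclass has a tuple field named optimizer_betas, which the abbreviated definition above omits:
from typing import Any, Dict

# Defined as a method of BaseConfig
def to_dict(self) -> Dict[str, Any]:
    """Convert config to dictionary for JSON serialization."""
    d = asdict(self)
    # asdict() keeps tuples as tuples; convert them to lists so the dict
    # matches what json.loads() returns for the saved snapshot
    d['hyperparameters']['optimizer_betas'] = list(d['hyperparameters']['optimizer_betas'])
    return d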
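To illustrate the reproducibility claim, the sketch below serializes a config with to_dict(), parses the JSON back, and rebuilds an equal config object. The from_dict classmethod is a hypothetical inverse added for this example, and the (0.9, 0.999) default for optimizer_betas is an assumed value.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Hyperparameters:
    num_epochs: int = 30
    learning_rate: float = 0.001
    optimizer_betas: Tuple[float, float] = (0.9, 0.999)  # assumed default


@dataclass
class BaseConfig:
    name: str = "base"
    seed: int = 42
    hyperparameters: Hyperparameters = field(default_factory=Hyperparameters)

    def to_dict(self) -> Dict[str, Any]:
        d = asdict(self)
        d["hyperparameters"]["optimizer_betas"] = list(d["hyperparameters"]["optimizer_betas"])
        return d

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "BaseConfig":
        # Hypothetical inverse of to_dict(): rebuild the nested dataclass
        # and turn the JSON array back into a tuple.
        hp = dict(d["hyperparameters"])
        hp["optimizer_betas"] = tuple(hp["optimizer_betas"])
        return cls(name=d["name"], seed=d["seed"], hyperparameters=Hyperparameters(**hp))


cfg = BaseConfig(name="snapshot_demo")
text = json.dumps(cfg.to_dict(), indent=2, sort_keys=True)  # what a config snapshot file would hold
restored = BaseConfig.from_dict(json.loads(text))
print(restored == cfg)
```

In a real run the string would be written into the run directory, for example with pathlib.Path.write_text, rather than kept in memory.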
Configuration Components
The system includes these standard sub-configs:
| Component | Purpose | Key Fields |
|---|---|---|
| Device | Hardware settings | type, mixed_precision |
| Output | Output directory structure | base_dir, save_config, subdirs |
| Console | Console logging | enabled, filename, tee_to_console |
| Checkpointing | Model saving | save_best, save_last, metric, mode |
| TensorBoard | Metrics logging | enabled, per_batch_*, per_epoch_* |
| Model | Architecture params | d_model, num_layers, num_heads |
| Hyperparameters | Training params | num_epochs, batch_size, learning_rate |
Supporting Utilities
ConfigLoader
Loads configs and creates run directories:
loader = ConfigLoader()
cfg = loader.get_experiment_config('my_experiment')
run_dir = loader.create_run_directory(cfg)
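The template's ConfigLoader is not reproduced here. As a rough idea of what creating a run directory can involve, the sketch below builds a timestamped directory from the Output fields listed in the table (base_dir, save_config, subdirs) and writes a config snapshot. The standalone function, the directory naming scheme, the config.json filename, and the dict-shaped config are all assumptions.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any, Dict


def create_run_directory(cfg: Dict[str, Any]) -> Path:
    """Create <base_dir>/<name>_<timestamp>/ plus the configured subdirectories."""
    out = cfg["output"]
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # naming scheme is an assumption
    run_dir = Path(out["base_dir"]) / f"{cfg['name']}_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an earlier run
    for sub in out.get("subdirs", []):
        (run_dir / sub).mkdir()
    if out.get("save_config", True):
        # Snapshot the exact config used for this run.
        (run_dir / "config.json").write_text(json.dumps(cfg, indent=2))
    return run_dir


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    cfg = {
        "name": "my_experiment",
        "output": {"base_dir": tmp, "save_config": True, "subdirs": ["checkpoints", "tensorboard"]},
    }
    run_dir = create_run_directory(cfg)
    created = sorted(p.name for p in run_dir.iterdir())
    saved = json.loads((run_dir / "config.json").read_text())

print(created)
```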
CheckpointManager
Manages model checkpointing during training:
ckpt_mgr = CheckpointManager(cfg, run_dir, model)
# Call at end of each epoch
metrics = {'dev_accuracy': 0.95}
ckpt_mgr.step(epoch, metrics)
# Call at end of training
ckpt_mgr.save_final(epoch, metrics)
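As an illustration of the bookkeeping that the Checkpointing fields save_best, metric, and mode imply, the sketch below tracks the best metric value seen so far and reports when a new best arrives. BestTracker is a hypothetical name, the field types and defaults are assumed, and no model weights are written; the actual CheckpointManager may work differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Checkpointing:
    save_best: bool = True
    save_last: bool = True
    metric: str = "dev_accuracy"
    mode: str = "max"  # "max" for accuracy-like metrics, "min" for losses


class BestTracker:
    """Hypothetical helper: decides whether an epoch should be saved as 'best'."""

    def __init__(self, cfg: Checkpointing):
        if cfg.mode not in ("min", "max"):
            raise ValueError(f"mode must be 'min' or 'max', got {cfg.mode!r}")
        self.cfg = cfg
        self.best_value: Optional[float] = None
        self.best_epoch: Optional[int] = None

    def step(self, epoch: int, metrics: Dict[str, float]) -> bool:
        """Record the epoch's metric; return True if a best checkpoint is due."""
        value = metrics[self.cfg.metric]
        improved = (
            self.best_value is None
            or (self.cfg.mode == "max" and value > self.best_value)
            or (self.cfg.mode == "min" and value < self.best_value)
        )
        if improved:
            self.best_value, self.best_epoch = value, epoch
        return improved and self.cfg.save_best


tracker = BestTracker(Checkpointing())
history = [tracker.step(e, {"dev_accuracy": acc}) for e, acc in enumerate([0.80, 0.90, 0.85])]
print(history, tracker.best_epoch)
```

A full manager would call its model-saving routine wherever step returns True, and write a last checkpoint at the end of training when save_last is set.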
ConsoleLogger
Captures stdout/stderr to log files:
with ConsoleLogger(config=cfg.get('console', {}), run_dir=run_dir):
    # All print statements captured to console.log
    train_model(...)
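The tee_to_console option suggests output is duplicated to both the terminal and the log file. A minimal sketch of that idea follows, using the standard library's contextlib.redirect_stdout. This TeeStream is written for illustration and may differ from the class in console_logger_template.py; two in-memory buffers stand in for the terminal and the log file.

```python
import contextlib
import io


class TeeStream:
    """Write-through stream that forwards text to several underlying streams."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text: str) -> int:
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self) -> None:
        for stream in self.streams:
            stream.flush()


console_copy = io.StringIO()  # stands in for the real terminal
log_buffer = io.StringIO()    # stands in for an open console.log file

# While the block is active, print() writes to the tee instead of sys.stdout.
with contextlib.redirect_stdout(TeeStream(console_copy, log_buffer)):
    print("epoch 0: loss=0.5")

captured = log_buffer.getvalue()
```

In real use the first argument would be sys.stdout and the second an open file in the run directory. Capturing stderr works the same way with contextlib.redirect_stderr.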
MetricsLogger
TensorBoard integration for metrics:
logger = MetricsLogger(cfg.get('tensorboard', {}), 'experiment_name', output_dir=run_dir)
# Per-batch logging
logger.log_batch(loss=0.5, gradient_norm=1.2)
# Per-epoch logging
logger.log_epoch(epoch, train_loss=0.3, dev_accuracy=0.95)
# Final inference outputs
logger.log_inference(attention_maps=attn, predictions=preds)
logger.close()
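The sketch below shows one way the per-batch and per-epoch calls could map onto TensorBoard scalars. It relies only on the writer having an add_scalar(tag, scalar_value, global_step) method, which torch.utils.tensorboard.SummaryWriter provides; a recording stand-in is used here so the example runs without PyTorch. SimpleMetricsLogger, RecordingWriter, and the "batch/" and "epoch/" tag prefixes are assumptions, not the template's implementation.

```python
from typing import Any, Dict, List, Optional, Tuple


class RecordingWriter:
    """Stand-in for a TensorBoard writer: stores calls instead of writing event files."""

    def __init__(self) -> None:
        self.scalars: List[Tuple[str, float, Optional[int]]] = []

    def add_scalar(self, tag: str, scalar_value: float, global_step: Optional[int] = None) -> None:
        self.scalars.append((tag, scalar_value, global_step))

    def close(self) -> None:
        pass


class SimpleMetricsLogger:
    """Hypothetical logger that forwards keyword metrics to writer.add_scalar."""

    def __init__(self, config: Dict[str, Any], writer: Any) -> None:
        self.enabled = config.get("enabled", True)
        self.writer = writer
        self.batch_step = 0  # global batch counter used as the x-axis

    def log_batch(self, **metrics: float) -> None:
        if self.enabled:
            for name, value in metrics.items():
                self.writer.add_scalar(f"batch/{name}", value, self.batch_step)
        self.batch_step += 1

    def log_epoch(self, epoch: int, **metrics: float) -> None:
        if self.enabled:
            for name, value in metrics.items():
                self.writer.add_scalar(f"epoch/{name}", value, epoch)

    def close(self) -> None:
        self.writer.close()


writer = RecordingWriter()
metrics_logger = SimpleMetricsLogger({"enabled": True}, writer)
metrics_logger.log_batch(loss=0.5, gradient_norm=1.2)
metrics_logger.log_batch(loss=0.4)
metrics_logger.log_epoch(0, train_loss=0.3, dev_accuracy=0.95)
metrics_logger.close()
```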
Usage Examples
Creating a New Experiment Config
@dataclass
class LargeModelConfig(BaseConfig):
    name: str = "large_model"
    description: str = "Larger model with more capacity"

    def __post_init__(self):
        # d_internal and num_heads must be fields of the full Model dataclass;
        # replace() raises TypeError for names the dataclass does not define
        self.model = replace(self.model,
            d_model=256,
            d_internal=256,
            num_layers=4,
            num_heads=8,
            dropout=0.1
        )
        # scheduler_enabled must likewise be a field of Hyperparameters
        self.hyperparameters = replace(self.hyperparameters,
            num_epochs=100,
            batch_size=128,
            scheduler_enabled=True
        )

# Register the config
CONFIGS["large_model"] = LargeModelConfig
Full Training Loop Integration
# Load config
loader = ConfigLoader()
cfg = loader.get_experiment_config('my_experiment')
run_dir = loader.create_run_directory(cfg)
# Initialize utilities
console_logger = ConsoleLogger(cfg.get('console', {}), run_dir)
metrics_logger = MetricsLogger(cfg.get('tensorboard', {}), cfg['name'], output_dir=run_dir)
ckpt_mgr = CheckpointManager(cfg, run_dir, model)
# Training loop
for epoch in range(cfg['hyperparameters']['num_epochs']):
    train_loss = train_epoch(model, train_loader)
    dev_acc = evaluate(model, dev_loader)
    metrics_logger.log_epoch(epoch, train_loss=train_loss, dev_accuracy=dev_acc)
    ckpt_mgr.step(epoch, {'dev_accuracy': dev_acc})
# Cleanup
ckpt_mgr.save_final(epoch)
metrics_logger.close()
console_logger.close()
Template Files
Reference implementations are provided in this skill directory:
- config_template.py - Dataclass definitions and BaseConfig pattern
- config_loader_template.py - ConfigLoader and CheckpointManager classes
- console_logger_template.py - TeeStream and ConsoleLogger classes
- metrics_logger_template.py - MetricsLogger with TensorBoard integration