Prepare Dataset
Load, validate, and preprocess datasets for machine learning model training, including normalization, splitting, and augmentation.
When to Use
- Setting up data pipelines for training
- Normalizing and cleaning raw data
- Splitting into train/validation/test sets
- Applying data augmentation
Quick Reference
python
# Dataset preparation pipeline (skeleton; method bodies are placeholders)
from typing import Tuple

from numpy import ndarray


class DatasetLoader:
    def load(self, path: str) -> Tuple[ndarray, ndarray]:
        # Load raw data
        pass

    def normalize(self, data: ndarray) -> ndarray:
        # Normalize to [0, 1] or standardize
        pass

    def split(self, data: ndarray, ratios: Tuple[float, float, float]):
        # Split into train/val/test
        pass

    def augment(self, data: ndarray) -> ndarray:
        # Apply transformations if needed
        pass
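As a rough illustration of how the `normalize` and `split` stubs might be filled in, here is a minimal NumPy sketch. The function names `min_max_normalize` and `split_dataset`, the `seed` parameter, and the choice to shuffle before splitting are assumptions made for this example and are not prescribed by the skill.

```python
from typing import Tuple

import numpy as np


def min_max_normalize(data: np.ndarray) -> np.ndarray:
    """Scale each column to [0, 1]; constant columns map to 0 (assumed policy)."""
    data = np.asarray(data, dtype=np.float64)
    lo = data.min(axis=0)
    span = data.max(axis=0) - lo
    # Replace zero spans with 1 so constant columns do not divide by zero.
    safe_span = np.where(span == 0, 1.0, span)
    return (data - lo) / safe_span


def split_dataset(
    data: np.ndarray,
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
    seed: int = 0,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Shuffle rows, then cut them into train/val/test by the given ratios."""
    if not np.isclose(sum(ratios), 1.0):
        raise ValueError("ratios must sum to 1")
    n = len(data)
    indices = np.random.default_rng(seed).permutation(n)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    train = data[indices[:n_train]]
    val = data[indices[n_train:n_train + n_val]]
    test = data[indices[n_train + n_val:]]  # remainder goes to the test set
    return train, val, test
```

In practice the normalization statistics would usually be computed on the training split only and then reused for validation and test data, so that no information leaks from the held-out sets.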
Workflow
- Load raw data: Read dataset from file (CSV, HDF5, NumPy)
- Validate data: Check shape, dtype, missing values
- Preprocess: Normalize, standardize, encode categorical features
- Split sets: Create train/validation/test splits
- Augment data: Apply transformations if needed (rotation, flip, etc.)
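The validation, encoding, and augmentation steps above could be sketched as follows. This is only one possible shape for them: the helper names `validate_array`, `one_hot_encode`, and `augment_with_flips` are hypothetical, the image batch layout `(N, H, W)` is an assumption, and only a horizontal flip is shown as an augmentation.

```python
from typing import Dict, Tuple

import numpy as np


def validate_array(data: np.ndarray, expected_ndim: int = 2) -> Dict[str, object]:
    """Collect basic validation facts without modifying the data."""
    is_float = np.issubdtype(data.dtype, np.floating)
    # NaN can only occur in floating-point arrays.
    missing = int(np.isnan(data).sum()) if is_float else 0
    return {
        "shape": data.shape,
        "dtype": str(data.dtype),
        "ndim_ok": data.ndim == expected_ndim,
        "missing_values": missing,
    }


def one_hot_encode(labels: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Encode a 1-D array of categorical labels as one-hot rows."""
    classes, inverse = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i.
    return classes, np.eye(len(classes))[inverse]


def augment_with_flips(images: np.ndarray) -> np.ndarray:
    """Append a horizontally flipped copy of each image; assumes (N, H, W)."""
    flipped = np.flip(images, axis=2)
    return np.concatenate([images, flipped], axis=0)
```

Augmentation is normally applied to the training split only, after the split has been made, so that transformed copies of a sample never end up in the validation or test sets.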
Output Format
Dataset preparation report:
- Raw data shape and statistics
- Data validation results (missing values, outliers)
- Preprocessing applied (normalization, encoding)
- Train/val/test split sizes
- Final dataset shape and statistics
- Augmentation transformations applied
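One way such a report might be assembled is sketched below. The functions `summarize` and `build_report`, their parameters, and the exact line wording are illustrative assumptions. Outlier detection is left out because the skill does not say which method to use.

```python
from typing import Dict, List, Tuple

import numpy as np


def summarize(data: np.ndarray) -> Dict[str, object]:
    """Shape plus NaN-aware summary statistics for a numeric array."""
    return {
        "shape": tuple(data.shape),
        "mean": float(np.nanmean(data)),
        "std": float(np.nanstd(data)),
        "min": float(np.nanmin(data)),
        "max": float(np.nanmax(data)),
    }


def build_report(
    raw: np.ndarray,
    final: np.ndarray,
    split_sizes: Tuple[int, int, int],
    preprocessing: List[str],
    augmentations: List[str],
) -> str:
    """Render the report items as plain-text lines."""
    train_n, val_n, test_n = split_sizes
    lines = [
        "Dataset preparation report:",
        f"- Raw data: {summarize(raw)}",
        f"- Missing values: {int(np.isnan(raw).sum())}",
        f"- Preprocessing applied: {', '.join(preprocessing) or 'none'}",
        f"- Split sizes: train={train_n}, val={val_n}, test={test_n}",
        f"- Final data: {summarize(final)}",
        f"- Augmentations applied: {', '.join(augmentations) or 'none'}",
    ]
    return "\n".join(lines)
```

Calling `print(build_report(raw, final, (70, 15, 15), ["min-max normalization"], ["horizontal flip"]))` would emit one line per report item in the order listed above.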
References
- See extract-hyperparameters skill for data preprocessing config
- See evaluate-model skill for test set evaluation
- See /notes/review/mojo-ml-patterns.md for Mojo data loading