Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

•Write concise, technical responses with accurate Python examples
•Prioritize reproducibility in machine learning workflows
•Use functional programming for data pipelines
•Use object-oriented programming for custom estimators
•Prefer vectorized operations over explicit loops
•Follow PEP 8 style guidelines
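
As an illustration of the object-oriented and vectorized guidelines above, here is a minimal sketch of a custom transformer. The class name `QuantileClipper` and its `lower`/`upper` parameters are hypothetical and chosen only for this example; the base classes and `check_is_fitted` are standard scikit-learn utilities.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.01, upper=0.99):
        # Store constructor arguments unchanged so get_params() and clone() work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Vectorized: one quantile call per bound covers every column at once.
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, ["lower_bounds_", "upper_bounds_"])
        X = np.asarray(X, dtype=float)
        # Broadcasting clips all columns without an explicit loop.
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[0, 0] = 50.0  # inject an obvious outlier
clipped = QuantileClipper().fit_transform(X_demo)
```

Learned state lives in attributes with a trailing underscore and is set only in `fit()`, which is the convention scikit-learn uses to tell fitted from unfitted estimators.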

Machine Learning Workflow

Data Preparation

•Always split data into train/validation/test sets before fitting any preprocessing step
•Use train_test_split() with random_state for reproducibility
•Stratify splits for imbalanced classification: stratify=y
•Keep test set completely separate until final evaluation
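
The splitting rules above might look like the following sketch. The synthetic dataset and the 60/20/20 proportions are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 90% / 10%) used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Step 1: hold out 20% as the final test set, preserving class proportions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remainder into train and validation.
# 25% of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)
```

`train_test_split` produces two partitions per call, so a three-way split takes two calls.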

Feature Engineering

•Scale features appropriately for distance-based algorithms
•Use StandardScaler for normally distributed features
•Use MinMaxScaler for bounded features
•Use RobustScaler for data with outliers
•Encode categorical features with OneHotEncoder or OrdinalEncoder; reserve LabelEncoder for the target y
•Handle missing values: SimpleImputer, KNNImputer
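
A small sketch of the imputation, scaling, and encoding steps above, fitted on training data and then applied to unseen data. The toy arrays are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy numeric column with a missing value and an outlier (100.0).
X_train_num = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])
X_test_num = np.array([[3.0], [np.nan]])

# Fit the imputer on training data only, then reuse it on the test data.
imputer = SimpleImputer(strategy="median").fit(X_train_num)
X_train_imp = imputer.transform(X_train_num)
X_test_imp = imputer.transform(X_test_num)

# RobustScaler centers on the median and scales by the interquartile range,
# so the outlier has little influence on the learned statistics.
scaler = RobustScaler().fit(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)

# One-hot encode a categorical feature; unseen categories become all zeros.
encoder = OneHotEncoder(handle_unknown="ignore").fit([["red"], ["green"], ["red"]])
encoded_unknown = encoder.transform([["blue"]]).toarray()
```

Each transformer is fitted once on the training data and only calls `transform()` on the test data, which is the behaviour a Pipeline automates.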

Pipelines

•Always use Pipeline to chain preprocessing and modeling
•Pipelines prevent data leakage by fitting transformers only on training data
•Pipelines make code cleaner and more reproducible
•Pipelines simplify deployment and serialization because preprocessing and model are one object

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```

Column Transformers

•Use ColumnTransformer for different preprocessing per feature type
•Combine numeric and categorical preprocessing in single pipeline
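
A minimal sketch of per-type preprocessing combined in a single pipeline. The DataFrame, its column names, and the choice of LogisticRegression are assumptions for illustration; pandas is assumed to be installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset with numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51, 38, 29, 44],
    "income": [40_000, 52_000, 88_000, 61_000, None, 73_000, 45_000, 67_000],
    "city": ["NY", "SF", "NY", "LA", "SF", "LA", "NY", "SF"],
})
y = [0, 0, 1, 0, 1, 1, 0, 1]

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with the mode, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(df, y)
predictions = model.predict(df)
```

Because the ColumnTransformer sits inside the Pipeline, cross-validation and hyperparameter search refit every preprocessing step on each training fold.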

Model Selection and Tuning

Cross-Validation

•Use cross-validation for reliable performance estimates
•cross_val_score() for quick evaluation
•cross_validate() for multiple metrics
•Use an appropriate CV strategy:
  •KFold for regression
  •StratifiedKFold for classification
  •TimeSeriesSplit for temporal data
  •GroupKFold for grouped data
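
The cross-validation helpers above could be used as in this sketch. The synthetic data, the model, and the chosen metrics are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    TimeSeriesSplit,
    cross_val_score,
    cross_validate,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Quick single-metric evaluation: one score per fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Several metrics at once; keys are prefixed with "test_".
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1", "roc_auc"])

# For temporal data, TimeSeriesSplit only ever trains on earlier rows.
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
```

Reporting `scores.mean()` together with `scores.std()` gives a rough sense of the variability across folds.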

Hyperparameter Tuning

•Use GridSearchCV for exhaustive search
•Use RandomizedSearchCV for large parameter spaces
•Always tune on training/validation data, never test data
•Set n_jobs=-1 for parallel processing
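
A sketch of both search strategies run on training data only. The parameter ranges, the `f1` scoring choice, and the synthetic data are assumptions for illustration, not recommended defaults.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Exhaustive search: every combination in the grid is evaluated.
# Parameters of pipeline steps are addressed as "<step>__<parameter>".
grid = GridSearchCV(
    pipeline,
    param_grid={
        "classifier__n_estimators": [25, 50],
        "classifier__max_depth": [3, None],
    },
    cv=3,
    scoring="f1",
    n_jobs=-1,
)
grid.fit(X_train, y_train)

# Randomized search: sample a fixed number of candidates from distributions.
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "classifier__n_estimators": randint(20, 60),
        "classifier__max_depth": [3, 5, None],
    },
    n_iter=4,
    cv=3,
    scoring="f1",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
```

The test split is created but never passed to either search; it stays untouched until the final evaluation.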

Model Evaluation

Classification Metrics

•Use appropriate metrics for your problem:
  •accuracy_score for balanced classes
  •precision_score, recall_score, f1_score for imbalanced classes
  •roc_auc_score for ranking ability
•Use classification_report() for comprehensive overview
•Examine confusion_matrix() for error analysis
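
The classification metrics above might be computed together as follows. The dataset and the LogisticRegression model are placeholders chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_val)               # hard labels for threshold metrics
y_score = clf.predict_proba(X_val)[:, 1]  # probabilities for ranking metrics

metrics = {
    "precision": precision_score(y_val, y_pred, zero_division=0),
    "recall": recall_score(y_val, y_pred, zero_division=0),
    "f1": f1_score(y_val, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_val, y_score),  # needs scores, not labels
}

cm = confusion_matrix(y_val, y_pred)  # rows: true class, columns: predicted class
report = classification_report(y_val, y_pred, zero_division=0)
```

Note that `roc_auc_score` takes predicted scores or probabilities, while precision, recall, and F1 take hard class labels.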

Regression Metrics

•mean_squared_error (MSE) for general use
•mean_absolute_error (MAE) for interpretability
•r2_score for the proportion of variance explained (coefficient of determination)
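
A small worked example of the three regression metrics on hand-picked values, so the arithmetic can be checked by hand. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])
# Errors (pred - true): -1, 0, 1, 3

mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 3) / 4 = 1.25
r2 = r2_score(y_true, y_pred)              # 1 - 11 / 20 = 0.45
```

The single error of 3 contributes 9 of the 11 squared-error units, which shows why MSE is more sensitive to large errors than MAE.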

Evaluation Best Practices

•Report confidence intervals, not just point estimates
•Use multiple metrics to understand model behavior
•Compare against meaningful baselines
•Evaluate on held-out test set only once, at the end
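
One way to follow the baseline and confidence-interval advice is sketched below. The percentile bootstrap over the test set is an assumption of this example (other interval methods exist), as are the dataset and models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_f1 = f1_score(y_test, baseline.predict(X_test), zero_division=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
model_f1 = f1_score(y_test, y_pred)

# Percentile bootstrap: resample test rows with replacement and rescore.
rng = np.random.default_rng(1)
boot_scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), size=len(y_test))
    boot_scores.append(f1_score(y_test[idx], y_pred[idx], zero_division=0))
ci_low, ci_high = np.percentile(boot_scores, [2.5, 97.5])
```

The model is fitted and scored on the test set once; the bootstrap only resamples the existing predictions, so it does not count as repeated test-set evaluation.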

Handling Imbalanced Data

•Use stratified splitting and cross-validation
•Consider class weights: class_weight='balanced'
•Use metrics that reflect minority-class performance (F1, area under the precision-recall curve via average_precision_score) rather than accuracy
•Adjust decision threshold based on business needs
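
A sketch of class weighting combined with threshold selection. The recall target of 0.80 stands in for a hypothetical business requirement, and the threshold is chosen on validation data rather than the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7
)

# class_weight="balanced" reweights samples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# Area under the precision-recall curve, summarized as average precision.
ap = average_precision_score(y_val, scores)

# Hypothetical requirement: recall of at least 0.80 on the positive class.
precision, recall, thresholds = precision_recall_curve(y_val, scores)
# precision and recall have one more entry than thresholds; drop the last point.
meets_target = recall[:-1] >= 0.80
# The highest qualifying threshold keeps precision as high as the target allows.
chosen_threshold = thresholds[meets_target].max()
y_pred = (scores >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, y_pred)
```

`predict()` uses a fixed 0.5 cutoff for probabilistic classifiers, so applying a custom threshold means comparing `predict_proba()` output against it explicitly, as done here.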

Feature Selection

•Use SelectKBest with statistical tests
•Use RFE (Recursive Feature Elimination)
•Use model-based selection: SelectFromModel
•Examine feature importances from tree-based models
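
The three selection approaches above could be compared as in this sketch. The synthetic data, `k=5`, and the `"median"` threshold are assumptions for illustration. In a real workflow each selector would normally be a Pipeline step so that it is refitted inside every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=3
)

# Univariate selection: keep the 5 features with the highest ANOVA F-scores.
kbest = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Recursive feature elimination: repeatedly drop the weakest coefficient.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Model-based selection: keep features at or above the median importance.
forest = RandomForestClassifier(n_estimators=100, random_state=3)
from_model = SelectFromModel(forest, threshold="median").fit(X, y)

# Impurity-based importances of the fitted forest (they sum to 1).
importances = from_model.estimator_.feature_importances_
X_selected = from_model.transform(X)
```

`get_support()` on any of these selectors returns a boolean mask over the original columns, which helps when mapping results back to feature names.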

Model Persistence

•Use joblib for saving and loading models
•Save entire pipelines, not just models
•Version control model artifacts
•Document model metadata
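
A minimal sketch of saving a whole pipeline alongside a metadata file. The file names, the temporary directory, and the metadata fields are hypothetical choices for this example.

```python
import json
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
]).fit(X, y)

# Hypothetical artifact location; a real project would use a versioned path.
artifact_dir = Path(tempfile.mkdtemp())
model_path = artifact_dir / "pipeline.joblib"

# Persist the full pipeline so preprocessing travels with the model.
joblib.dump(pipeline, model_path)

# Record what is needed to reload and interpret the artifact later.
metadata = {
    "sklearn_version": sklearn.__version__,
    "n_features": int(X.shape[1]),
    "classifier_params": pipeline.named_steps["classifier"].get_params(),
}
(artifact_dir / "metadata.json").write_text(json.dumps(metadata, default=str, indent=2))

restored = joblib.load(model_path)
```

Recording the scikit-learn version matters because pickled estimators are not guaranteed to load correctly under a different library version. Only load joblib files from sources you trust, since unpickling can execute arbitrary code.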

Performance Optimization

•Use n_jobs=-1 for parallel processing where available
•Consider warm_start=True for iterative training
•Use sparse matrices for high-dimensional sparse data
•Consider incremental learning with partial_fit() for large data
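
Incremental learning with `partial_fit()` might look like the sketch below. The in-memory array stands in for data that would really be streamed from disk, and the batch size and SGDClassifier are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in for a dataset too large to fit in memory at once.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
classes = np.unique(y)  # all labels must be declared on the first partial_fit call

scaler = StandardScaler()
clf = SGDClassifier(random_state=0)

batch_size = 500
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Update the running mean and variance, then scale the current batch.
    scaler.partial_fit(X_batch)
    clf.partial_fit(scaler.transform(X_batch), y_batch, classes=classes)

# Training accuracy only; a real evaluation needs held-out data.
accuracy = clf.score(scaler.transform(X), y)
```

Early batches are scaled with statistics from only the data seen so far. If that matters, a first pass can fit the scaler over all batches before the classifier is trained.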

Key Conventions

•Import from submodules: from sklearn.ensemble import RandomForestClassifier
•Set random_state for reproducibility
•Use pipelines to prevent data leakage
•Document model choices and hyperparameters