Scikit-learn Best Practices
Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.
Code Style and Structure
- •Write concise, technical responses with accurate Python examples
- •Prioritize reproducibility in machine learning workflows
- •Use functional programming for data pipelines
- •Use object-oriented programming for custom estimators
- •Prefer vectorized operations over explicit loops
- •Follow PEP 8 style guidelines
Machine Learning Workflow
Data Preparation
- •Always split data before any preprocessing: train/validation/test
- •Use train_test_split() with random_state for reproducibility
- •Stratify splits for imbalanced classification: stratify=y
- •Keep test set completely separate until final evaluation
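The splitting advice above can be sketched as two successive calls to train_test_split(). This is a minimal illustration on synthetic data; the 60/20/20 proportions and the seed value are assumptions, not recommendations from the guide.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 90% / 10%), used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Step 1: carve off the test set (20% of all rows).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remainder into train (60% overall) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying both calls keeps the minority-class share nearly identical across the three subsets.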
Feature Engineering
- •Scale features appropriately for distance-based algorithms
- •Use StandardScaler for normally distributed features
- •Use MinMaxScaler for bounded features
- •Use RobustScaler for data with outliers
- •Encode categorical variables: OneHotEncoder, OrdinalEncoder; use LabelEncoder for target labels only, not input features
- •Handle missing values: SimpleImputer, KNNImputer
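As a small sketch of the transformers listed above, the example below imputes and scales a numeric column that contains an outlier, then one-hot encodes a categorical column. The toy arrays are made up for illustration. Every transformer is fitted on the training portion only and then applied to the test portion.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy numeric feature with a missing value and an outlier (100.0).
X_train_num = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])
X_test_num = np.array([[2.5], [np.nan]])

imputer = SimpleImputer(strategy="median")
scaler = RobustScaler()  # centers on the median, scales by the interquartile range

# Fit on training data only, then reuse the fitted statistics on test data.
X_train_scaled = scaler.fit_transform(imputer.fit_transform(X_train_num))
X_test_scaled = scaler.transform(imputer.transform(X_test_num))

# Toy categorical feature; "purple" never appears in training.
colors_train = np.array([["red"], ["green"], ["blue"]], dtype=object)
colors_test = np.array([["green"], ["purple"]], dtype=object)

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(colors_train)
encoded_test = encoder.transform(colors_test).toarray()
```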
Pipelines
- •Always use Pipeline to chain preprocessing and modeling
- •Prevents data leakage by fitting transformers only on training data
- •Makes code cleaner and more reproducible
- •Enables easy deployment and serialization
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Each step is a (name, estimator) pair; the last step is the model.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```
Column Transformers
- •Use ColumnTransformer for different preprocessing per feature type
- •Combine numeric and categorical preprocessing in a single pipeline
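One way to combine per-type preprocessing is sketched below. It assumes pandas is available; the column names (age, income, city) and the toy rows are hypothetical and exist only to make the example runnable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; one income value is missing.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40_000, 52_000, None, 88_000, 61_000, 45_000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
})
y = [0, 0, 1, 1, 1, 0]

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric columns: impute, then scale. Categorical columns: one-hot encode.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Preprocessing and model live in one estimator, so fit() handles both.
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])
model.fit(df, y)

# Two scaled numeric columns plus one indicator column per city.
n_columns = model.named_steps["preprocess"].transform(df).shape[1]
```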
Model Selection and Tuning
Cross-Validation
- •Use cross-validation for reliable performance estimates
- •cross_val_score() for quick evaluation
- •cross_validate() for multiple metrics
- •Use appropriate CV strategy:
  - •KFold for regression
  - •StratifiedKFold for classification
  - •TimeSeriesSplit for temporal data
  - •GroupKFold for grouped data
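A minimal sketch of the cross-validation advice above, assuming a binary classification task on synthetic data. Passing a whole pipeline to cross_validate() means the scaler is refitted inside every fold, which keeps validation rows out of the preprocessing statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())

# Stratified folds preserve the class ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate() accepts several metrics and returns one array per metric.
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])

mean_accuracy = results["test_accuracy"].mean()
std_accuracy = results["test_accuracy"].std()
print(f"accuracy: {mean_accuracy:.3f} +/- {std_accuracy:.3f}")
```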
Hyperparameter Tuning
- •Use GridSearchCV for exhaustive search
- •Use RandomizedSearchCV for large parameter spaces
- •Always tune on training/validation data, never test data
- •Set n_jobs=-1 for parallel processing
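The tuning guidance above might look like the following sketch. The search space, n_iter, and cv values are arbitrary choices for illustration. The search sees only the training split, and the test split is left untouched.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Sample 5 candidate settings instead of enumerating the full grid.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 60),  # integers in [20, 60)
        "max_depth": [3, 5, None],
    },
    n_iter=5,
    cv=3,
    n_jobs=-1,        # use all available cores
    random_state=0,   # makes the sampled candidates reproducible
)

# Internal cross-validation runs on the training split only.
search.fit(X_train, y_train)
best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

GridSearchCV takes a param_grid of explicit value lists in place of param_distributions and tries every combination.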
Model Evaluation
Classification Metrics
- •Use appropriate metrics for your problem:
  - •accuracy_score for balanced classes
  - •precision_score, recall_score, f1_score for imbalanced classes
  - •roc_auc_score for ranking ability
- •Use classification_report() for comprehensive overview
- •Examine confusion_matrix() for error analysis
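To show how these metrics relate to each other, here is a small hand-checkable sketch. The labels and scores are invented, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented labels and predicted scores for ten samples.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4])

# Hard predictions at an assumed threshold of 0.5.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC AUC is computed from the scores, not from the thresholded predictions.
auc = roc_auc_score(y_true, y_score)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(classification_report(y_true, y_pred))
print(cm)
```

The confusion matrix shows one false positive and one false negative, which is why precision and recall come out equal here.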
Regression Metrics
- •mean_squared_error (MSE) for general use
- •mean_absolute_error (MAE) for interpretability
- •r2_score for explained variance
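The regression metrics above can be compared on a tiny invented example. RMSE is derived with NumPy here so the sketch does not depend on which scikit-learn version is installed.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented targets and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors more heavily
mae = mean_absolute_error(y_true, y_pred)  # same units as the target
rmse = np.sqrt(mse)                        # MSE brought back to target units
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.3f} MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```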
Evaluation Best Practices
- •Report confidence intervals, not just point estimates
- •Use multiple metrics to understand model behavior
- •Compare against meaningful baselines
- •Evaluate on held-out test set only once, at the end
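As one possible way to follow the practices above, the sketch below compares a model against a majority-class baseline and attaches a percentile bootstrap interval to the test accuracy. The bootstrap is a swapped-in technique chosen for illustration; the guide does not prescribe how to compute the interval. The dataset, the 1000 resamples, and the 95% level are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
y_pred = model.predict(X_test)
model_acc = accuracy_score(y_test, y_pred)

# Percentile bootstrap: resample test rows with replacement and rescore.
rng = np.random.default_rng(0)
boot_scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), size=len(y_test))
    boot_scores.append(accuracy_score(y_test[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(boot_scores, [2.5, 97.5])

print(f"baseline={baseline_acc:.3f} model={model_acc:.3f} "
      f"95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```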
Handling Imbalanced Data
- •Use stratified splitting and cross-validation
- •Consider class weights: class_weight='balanced'
- •Use appropriate metrics (F1, AUC-PR, not accuracy)
- •Adjust decision threshold based on business needs
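A rough sketch of these points on synthetic data with about 5% positives. The alternative threshold of 0.3 is purely hypothetical. In practice the threshold would be chosen on validation data rather than on the test split used here for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives, used only for illustration.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Average precision summarizes the precision-recall curve (AUC-PR).
auc_pr = average_precision_score(y_test, proba)

# Lowering the threshold trades precision for recall.
recall_default = recall_score(y_test, proba >= 0.5)
recall_low = recall_score(y_test, proba >= 0.3)  # hypothetical threshold

print(f"AUC-PR={auc_pr:.3f} recall@0.5={recall_default:.3f} "
      f"recall@0.3={recall_low:.3f}")
```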
Feature Selection
- •Use SelectKBest with statistical tests
- •Use RFE (Recursive Feature Elimination)
- •Use model-based selection: SelectFromModel
- •Examine feature importances from tree-based models
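The three selection approaches above are sketched side by side below. The value of k, the estimators, and the "median" threshold are illustrative assumptions. The selectors are fitted on the full synthetic dataset only to keep the example short; in a real workflow they would sit inside a Pipeline so that selection happens on training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Univariate selection: keep the 5 features with the highest ANOVA F-score.
X_kbest = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Recursive elimination: repeatedly drop the weakest feature by coefficient size.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Model-based selection: keep features whose importance reaches the median.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
sfm = SelectFromModel(forest, threshold="median").fit(X, y)
X_model = sfm.transform(X)

# The fitted forest also exposes per-feature importances directly.
importances = sfm.estimator_.feature_importances_

print(X_kbest.shape, int(rfe.support_.sum()), X_model.shape)
```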
Model Persistence
- •Use joblib for saving and loading models
- •Save entire pipelines, not just models
- •Version control model artifacts
- •Document model metadata
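Saving a whole pipeline could look like the sketch below. The file name pipeline.joblib and the temporary directory are placeholders. Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
]).fit(X, y)

# Persist the fitted pipeline so preprocessing and model travel together.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.joblib"  # placeholder file name
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
predictions_match = np.array_equal(pipeline.predict(X), restored.predict(X))
```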
Performance Optimization
- •Use n_jobs=-1 for parallel processing where available
- •Consider warm_start=True for iterative training
- •Use sparse matrices for high-dimensional sparse data
- •Consider incremental learning with partial_fit() for large data
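Incremental learning with partial_fit() can be sketched as follows, assuming an estimator that supports it (SGDClassifier here). The batches are sliced from an in-memory array for brevity; with genuinely large data they would be read from disk or a stream. The batch size of 200 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to fit at once.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# All class labels must be declared up front, because a single batch
# may not contain every class.
classes = np.unique(y)

model = SGDClassifier(random_state=0)
batch_size = 200
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    # Each call updates the existing weights instead of refitting from scratch.
    model.partial_fit(X[batch], y[batch], classes=classes)

train_accuracy = model.score(X, y)
```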
Key Conventions
- •Import from submodules: from sklearn.ensemble import RandomForestClassifier
- •Set random_state for reproducibility
- •Use pipelines to prevent data leakage
- •Document model choices and hyperparameters