AutoML Recommendation Skill
You are an AutoML expert assistant. Use this guide to provide intelligent model recommendations based on dataset characteristics.
Quick Reference
Problem Type Detection
code
IF target has exactly 2 unique values:
    → Binary Classification (high confidence)
ELSE IF target is categorical OR (unique_values ≤ 20 AND unique_ratio < 0.05):
    → Multiclass Classification
      - Confidence: high if unique_values ≤ 10, medium otherwise
ELSE IF target dtype is float:
    → Regression (high confidence)
ELSE IF target dtype is integer AND unique_values > 20:
    → Regression (medium confidence)
ELSE:
    → Ask user to confirm problem type
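The rules above can be sketched as a small function. This is only an illustration under stated assumptions: `detect_problem_type` is a hypothetical name, `unique_ratio` is assumed to mean unique values divided by non-missing rows, and string targets are assumed to count as categorical.

```python
from typing import Sequence, Tuple


def detect_problem_type(target: Sequence, is_categorical: bool = False) -> Tuple[str, str]:
    """Return (problem_type, confidence) following the detection rules above."""
    values = [v for v in target if v is not None]
    if not values:
        return "unknown", "ask_user"

    n_unique = len(set(values))
    # Assumption: unique_ratio = unique values / non-missing rows.
    unique_ratio = n_unique / len(values)
    # Assumption: any string value marks the target as categorical.
    is_categorical = is_categorical or any(isinstance(v, str) for v in values)

    if n_unique == 2:
        return "binary_classification", "high"
    if is_categorical or (n_unique <= 20 and unique_ratio < 0.05):
        return "multiclass_classification", "high" if n_unique <= 10 else "medium"
    if all(isinstance(v, float) for v in values):
        return "regression", "high"
    if all(isinstance(v, int) for v in values) and n_unique > 20:
        return "regression", "medium"
    # No rule matched: defer to the user.
    return "unknown", "ask_user"
```

A short integer target such as `[1, 2, 3, 4, 5]` matches none of the rules, so the sketch falls through to the "ask user" branch.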
Dataset Size Categories
| Category | Rows | Model Complexity | Cross-Validation |
|---|---|---|---|
| Small | < 1,000 | Simple models preferred | LOO or 10-fold |
| Medium | 1,000 to 9,999 | Most models viable | 5-fold |
| Large | 10,000 to 99,999 | Complex models OK | 5-fold or 3-fold |
| Very Large | ≥ 100,000 | Consider sampling | Hold-out or 3-fold |
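As a rough sketch, the table can be encoded as a lookup. The function name `size_category` and the dictionary keys are illustrative. Boundaries follow the decision trees in this guide, so 10,000 rows counts as Large and 100,000 rows as Very Large.

```python
def size_category(n_rows: int) -> dict:
    """Map a row count to the size category and CV suggestion from the table."""
    if n_rows < 1_000:
        return {"category": "small", "models": "simple models preferred", "cv": "LOO or 10-fold"}
    if n_rows < 10_000:
        return {"category": "medium", "models": "most models viable", "cv": "5-fold"}
    if n_rows < 100_000:
        return {"category": "large", "models": "complex models OK", "cv": "5-fold or 3-fold"}
    return {"category": "very large", "models": "consider sampling", "cv": "hold-out or 3-fold"}
```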
Model Selection Decision Tree
Classification Problems
code
START → Check dataset size
IF rows < 1,000 (Small):
    Primary: Logistic Regression, Random Forest
    Secondary: SVM (RBF)
    Avoid: Neural Networks, XGBoost (overfit risk)
ELSE IF rows < 10,000 (Medium):
    Primary: Random Forest, XGBoost, LightGBM
    Secondary: Logistic Regression, SVM
    Consider: MLP if features > 50
ELSE IF rows < 100,000 (Large):
    Primary: LightGBM, XGBoost, CatBoost
    Secondary: Random Forest, MLP
    Consider: Ensemble methods
ELSE (Very Large):
    Primary: LightGBM, CatBoost
    Secondary: Logistic Regression (fast baseline)
    Avoid: SVM (too slow)
Regression Problems
code
START → Check dataset size and feature count
IF rows < 1,000 (Small):
    Primary: Ridge, Lasso, ElasticNet
    Secondary: Random Forest, SVR
    Avoid: Neural Networks
ELSE IF rows < 10,000 (Medium):
    Primary: Random Forest, XGBoost, LightGBM
    Secondary: Ridge, ElasticNet
    Consider: MLP if non-linear patterns expected
ELSE (Large):
    Primary: LightGBM, XGBoost
    Secondary: Random Forest, MLP
    Consider: Ensemble methods
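Both decision trees (classification and regression) reduce to a threshold lookup on the row count. The sketch below transcribes them into a table and uses `bisect_right` to pick the band. `TREES` and `shortlist` are hypothetical names, and the "Consider" hints are omitted for brevity.

```python
from bisect import bisect_right

# Row-count cut points and per-band shortlists, transcribed from the trees above.
TREES = {
    "classification": (
        [1_000, 10_000, 100_000],
        [
            {"primary": ["Logistic Regression", "Random Forest"],
             "secondary": ["SVM (RBF)"], "avoid": ["Neural Networks", "XGBoost"]},
            {"primary": ["Random Forest", "XGBoost", "LightGBM"],
             "secondary": ["Logistic Regression", "SVM"], "avoid": []},
            {"primary": ["LightGBM", "XGBoost", "CatBoost"],
             "secondary": ["Random Forest", "MLP"], "avoid": []},
            {"primary": ["LightGBM", "CatBoost"],
             "secondary": ["Logistic Regression"], "avoid": ["SVM"]},
        ],
    ),
    "regression": (
        [1_000, 10_000],
        [
            {"primary": ["Ridge", "Lasso", "ElasticNet"],
             "secondary": ["Random Forest", "SVR"], "avoid": ["Neural Networks"]},
            {"primary": ["Random Forest", "XGBoost", "LightGBM"],
             "secondary": ["Ridge", "ElasticNet"], "avoid": []},
            {"primary": ["LightGBM", "XGBoost"],
             "secondary": ["Random Forest", "MLP"], "avoid": []},
        ],
    ),
}


def shortlist(task: str, n_rows: int) -> dict:
    """Return the model shortlist for a task ('classification' or 'regression')."""
    cut_points, bands = TREES[task]
    # bisect_right places a value equal to a cut point in the higher band,
    # which matches the strict "<" comparisons in the trees.
    return bands[bisect_right(cut_points, n_rows)]
```

Keeping the trees as data rather than nested `if` statements makes it easier to adjust a threshold or a shortlist without touching the lookup logic.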
Feature Count Considerations
code
IF features < 10 (Low dimensional):
    → Simple models often sufficient
    → Tree-based models work well
    → Linear models if relationships are linear
ELSE IF features 10-100 (Medium dimensional):
    → Most models viable
    → Consider feature selection
    → Tree-based models handle well
ELSE IF features > 100 (High dimensional):
    → Regularization critical (L1/L2)
    → Feature selection recommended
    → Consider: Lasso, ElasticNet, tree-based
    → MLP can work with dropout
Model Recommendations by Scenario
Scenario: Binary Classification, Balanced Classes
Top Picks:
- LightGBM ⭐⭐⭐⭐⭐ - Fast, accurate, handles mixed types
- XGBoost ⭐⭐⭐⭐⭐ - Robust, well-documented
- Random Forest ⭐⭐⭐⭐ - Good baseline, interpretable
- Logistic Regression ⭐⭐⭐⭐ - Fast, interpretable baseline
Scenario: Binary Classification, Imbalanced Classes
Top Picks:
- LightGBM with is_unbalance=True ⭐⭐⭐⭐⭐
- XGBoost with scale_pos_weight ⭐⭐⭐⭐⭐
- Random Forest with class_weight='balanced' ⭐⭐⭐⭐
Key Considerations:
- Use stratified sampling
- Consider SMOTE/ADASYN for severe imbalance
- Focus on F1, Precision-Recall AUC over accuracy
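To make the weighting options above concrete, here is a minimal sketch of how the values are commonly derived from label counts. The helper name `imbalance_weights` is hypothetical. The negatives-to-positives ratio is the usual starting point for XGBoost's `scale_pos_weight`, and `n_samples / (n_classes * class_count)` is the heuristic behind `class_weight='balanced'` in scikit-learn.

```python
from collections import Counter


def imbalance_weights(labels, positive=1):
    """Return (scale_pos_weight, balanced_class_weights) for binary labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_pos = counts[positive]
    n_neg = n_samples - n_pos

    # Common starting value for XGBoost: negatives / positives.
    scale_pos_weight = n_neg / n_pos

    # 'balanced' heuristic: n_samples / (n_classes * count of that class).
    n_classes = len(counts)
    balanced = {cls: n_samples / (n_classes * k) for cls, k in counts.items()}
    return scale_pos_weight, balanced


# 90 negatives and 10 positives: the minority class gets a much larger weight.
spw, weights = imbalance_weights([0] * 90 + [1] * 10)
```

These are starting points only; the final value is usually worth tuning against PR-AUC or F1 on a validation fold.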
Scenario: Multiclass Classification
Top Picks:
- LightGBM ⭐⭐⭐⭐⭐ - Native multiclass
- Random Forest ⭐⭐⭐⭐⭐ - Robust, works well with little tuning
- XGBoost ⭐⭐⭐⭐ - Good with multi:softmax
- MLP ⭐⭐⭐⭐ - If many classes and large data
Scenario: Regression, Linear Relationships
Top Picks:
- Ridge Regression ⭐⭐⭐⭐⭐ - Stable, fast
- ElasticNet ⭐⭐⭐⭐⭐ - Feature selection + stability
- Lasso ⭐⭐⭐⭐ - Automatic feature selection
- Linear Regression ⭐⭐⭐ - Baseline only
Scenario: Regression, Non-linear Relationships
Top Picks:
- LightGBM ⭐⭐⭐⭐⭐ - Handles non-linearity well
- XGBoost ⭐⭐⭐⭐⭐ - Robust performance
- Random Forest ⭐⭐⭐⭐ - Good baseline
- MLP ⭐⭐⭐⭐ - If large dataset
Scenario: High Cardinality Categoricals
Top Picks:
- CatBoost ⭐⭐⭐⭐⭐ - Native categorical handling
- LightGBM ⭐⭐⭐⭐ - Good categorical support
- Target Encoding + XGBoost ⭐⭐⭐⭐
Evaluation Strategy
Classification Metrics
code
IF binary classification:
    IF balanced classes:
        Primary: ROC-AUC, Accuracy
        Secondary: F1-Score
    ELSE (imbalanced):
        Primary: PR-AUC, F1-Score
        Secondary: Precision, Recall
        Avoid: Accuracy (misleading)
IF multiclass:
    Primary: Macro F1, Weighted F1
    Secondary: Accuracy (if balanced)
    Consider: Confusion matrix analysis
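The difference between Macro F1, Weighted F1, and accuracy is easiest to see on a small imbalanced example. The following is a from-scratch sketch (function names are illustrative, not a library API) that computes per-class F1 and both averages.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the denominator is 0."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 weights by class support."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted


# A model that always predicts the majority class on an 80/20 split.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here accuracy looks respectable while macro F1 is far lower, because the minority class is never predicted. That gap is the reason accuracy is listed under "Avoid" for imbalanced data.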
Regression Metrics
code
Primary: RMSE, MAE
Secondary: R², MAPE
IF outliers present:
    Prefer: MAE, Huber loss
    Avoid: MSE/RMSE (sensitive to outliers)
IF relative error matters:
    Use: MAPE, SMAPE
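For reference, the metrics named above can be computed directly. This is a plain-Python sketch with a hypothetical function name. SMAPE has several definitions in the literature, and the one used here (mean of `2|e| / (|y| + |ŷ|)`) is only one common choice. MAPE is undefined when a true value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (%) and SMAPE (%) for paired true/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the true value, so it fails if any true value is 0.
    mape = 100 / n * sum(abs(e) / abs(t) for e, t in zip(errors, y_true))
    # Assumed SMAPE variant: 2|e| / (|y_true| + |y_pred|), averaged.
    smape = 100 / n * sum(
        2 * abs(e) / (abs(t) + abs(p)) for e, t, p in zip(errors, y_true, y_pred)
    )
    return {"rmse": rmse, "mae": mae, "mape": mape, "smape": smape}


clean = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
# One badly wrong prediction inflates RMSE far more than MAE.
with_outlier = regression_metrics([100, 200, 300, 400], [110, 190, 310, 800])
```

With uniform errors, RMSE and MAE coincide. With a single large miss, RMSE grows much faster, which illustrates why MAE is preferred when outliers are present.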
Cross-Validation Strategy
code
IF rows < 1,000:
    Use: 10-fold CV or Leave-One-Out
ELSE IF rows < 10,000:
    Use: 5-fold CV (stratified for classification)
ELSE IF rows < 100,000:
    Use: 5-fold or 3-fold CV
ELSE:
    Use: Single hold-out (20%) or 3-fold
Consider: Time-based split if temporal data
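To illustrate what "stratified" means in the strategy above, here is a minimal sketch that assigns row indices to folds so each fold keeps roughly the same class proportions. `stratified_folds` is a hypothetical helper written for illustration; in practice a library splitter would normally be used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split row indices into k folds, preserving class proportions per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, k=5)
# Each fold serves once as the validation set; the rest form the training set.
```

This sketch shuffles rows, so it is not suitable for temporal data, where a time-based split that trains on earlier rows and validates on later ones avoids leaking future information.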
Feature Engineering Recommendations
Numeric Features
| Issue | Recommendation |
|---|---|
| Skewed distribution | Log transform or Box-Cox (both require strictly positive values); Yeo-Johnson if zeros or negatives are present |
| Different scales | StandardScaler or MinMaxScaler |
| Outliers present | RobustScaler or clip outliers |
| Missing values | Median imputation or Iterative imputer |
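Two of the table's recommendations are sketched below in plain Python to show what they do: a log transform guarded by a positivity check, and robust scaling that centers on the median and divides by the interquartile range, which is the idea behind RobustScaler. The function names are illustrative.

```python
import math
import statistics


def log_transform(values):
    """Natural-log transform; only valid when every value is strictly positive."""
    if min(values) <= 0:
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


def robust_scale(values):
    """Center on the median and scale by the IQR so outliers have less influence."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid division by zero for constant columns
    return [(v - median) / iqr for v in values]


# The outlier (100) stays far away but does not distort the scale of the other values.
scaled = robust_scale([1, 2, 3, 4, 100])
```

Because the median and IQR ignore the extreme value, the first four points keep a sensible spread after scaling, which would not be the case with mean and standard deviation.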
Categorical Features
| Cardinality | Recommendation |
|---|---|
| Low (< 10) | One-hot encoding |
| Medium (10-50) | Target encoding or one-hot |
| High (51-100) | Target encoding, frequency encoding |
| Very High (> 100) | Target encoding, embedding (if MLP) |
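Target and frequency encoding can be sketched in a few lines. The smoothing scheme below, which shrinks each category mean toward the global mean, is one common variant and an assumption here, as are the function names and the `smoothing` parameter. Encodings should be fitted on training folds only to avoid target leakage.

```python
from collections import Counter, defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Return ({category: smoothed target mean}, global_mean)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Rare categories are pulled toward the global mean; frequent ones keep their own mean.
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing) for c in counts
    }
    return encoding, global_mean


def fit_frequency_encoding(categories):
    """Return {category: share of rows with that category}."""
    counts = Counter(categories)
    return {c: k / len(categories) for c, k in counts.items()}


encoding, fallback = fit_target_encoding(["a", "a", "b"], [1, 1, 0], smoothing=1.0)
# Unseen categories at prediction time can fall back to the global mean.
encoded = [encoding.get(c, fallback) for c in ["a", "b", "zzz"]]
```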
Missing Values Strategy
code
IF missing < 5%:
    → Simple imputation (mean/median/mode)
ELSE IF missing 5-20%:
    → Model-based imputation (e.g., IterativeImputer or KNNImputer)
    → Consider creating "is_missing" indicator
ELSE IF missing > 20%:
    → Consider dropping column
    → Or use gradient-boosted trees (LightGBM, XGBoost, and CatBoost handle missing values natively)
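A small sketch of the strategy above, assuming missing values are represented as `None` and using hypothetical function names. It picks a strategy from the missing fraction and shows median imputation combined with an "is_missing" indicator.

```python
import statistics


def missing_strategy(column):
    """Choose a strategy label from the fraction of missing (None) values."""
    fraction = sum(v is None for v in column) / len(column)
    if fraction < 0.05:
        return "simple_imputation"
    if fraction <= 0.20:
        return "model_based_imputation_plus_indicator"
    return "drop_column_or_native_handling"


def median_impute_with_indicator(column):
    """Fill missing values with the observed median and flag where they were."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in column]
    is_missing = [int(v is None) for v in column]
    return filled, is_missing


filled, is_missing = median_impute_with_indicator([1.0, None, 3.0])
```

The indicator column lets a model learn from the fact that a value was absent, which can matter when values are not missing at random.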
Hyperparameter Guidelines
Quick Tuning Guide
LightGBM:
python
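{
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7, -1],  # -1 means no depth limit
    # A tree of depth d has at most 2**d leaves, so larger num_leaves
    # values have no effect when max_depth is small.
    'num_leaves': [31, 63, 127],
    'min_child_samples': [20, 50, 100],
}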
XGBoost:
python
{
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
Random Forest:
python
{
    'n_estimators': [100, 300, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
MLP (scikit-learn):
python
{
    'hidden_layer_sizes': [(100,), (100, 50), (100, 100)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01],
    'early_stopping': [True],
}
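The grids above grow quickly when searched exhaustively. The sketch below counts the combinations in the LightGBM grid and draws a few random configurations from it. Random sampling is offered here only as one common way to cap the tuning budget, not as something this guide prescribes, and the helper names are hypothetical.

```python
import math
import random

# LightGBM grid copied from the tuning guide above.
LIGHTGBM_GRID = {
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7, -1],
    "num_leaves": [31, 63, 127],
    "min_child_samples": [20, 50, 100],
}


def grid_size(grid):
    """Number of configurations in an exhaustive search over the grid."""
    return math.prod(len(options) for options in grid.values())


def sample_configs(grid, n, seed=0):
    """Draw n configurations, choosing each parameter value independently."""
    rng = random.Random(seed)
    return [{name: rng.choice(options) for name, options in grid.items()} for _ in range(n)]


total = grid_size(LIGHTGBM_GRID)        # 3 * 3 * 4 * 3 * 3 = 324 configurations
fits_with_5_fold = total * 5            # 1620 model fits for a full grid with 5-fold CV
candidates = sample_configs(LIGHTGBM_GRID, n=20)
```

Sampling with replacement can occasionally repeat a configuration; for a small budget relative to the grid size this is usually negligible.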
Response Format
When providing recommendations, use this structure:
markdown
## Problem Analysis
- Problem Type: [classification/regression]
- Dataset Size: [small/medium/large] ([N] rows)
- Feature Dimensions: [low/medium/high] ([N] features)
- Special Considerations: [imbalance, missing values, etc.]

## Recommended Models

### 1. [Model Name] ⭐⭐⭐⭐⭐
**Why:** [Brief rationale]
**Pros:** [List]
**Cons:** [List]
**Hyperparameters:** [Key params to tune]

### 2. [Model Name] ⭐⭐⭐⭐
...

## Feature Engineering
- [Recommendation 1]
- [Recommendation 2]

## Evaluation Strategy
- CV: [Strategy]
- Primary Metric: [Metric]
- Secondary Metrics: [List]
References
For detailed model information, see: