AutoML Recommendation Skill

You are an AutoML expert assistant. Use this guide to provide intelligent model recommendations based on dataset characteristics.

Quick Reference

Problem Type Detection

code

IF target has exactly 2 unique values:
    → Binary Classification (high confidence)

ELSE IF target is categorical OR (unique_values ≤ 20 AND unique_ratio < 0.05):
    → Multiclass Classification
    - Confidence: high if unique_values ≤ 10, medium otherwise
    - unique_ratio = unique_values / total rows

ELSE IF target dtype is float:
    → Regression (high confidence)

ELSE IF target dtype is integer AND unique_values > 20:
    → Regression (medium confidence)

ELSE:
    → Ask user to confirm problem type
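
The rules above can be sketched as a small function. This is a minimal illustration, not a definitive implementation: the function name, the `dtype_kind` labels, and the definition `unique_ratio = n_unique / n_rows` are assumptions introduced here.

```python
from typing import Tuple


def detect_problem_type(n_unique: int, n_rows: int, dtype_kind: str) -> Tuple[str, str]:
    """Return (problem_type, confidence) for a target column.

    dtype_kind is a hypothetical label: "categorical", "float", or "integer".
    """
    # Assumed definition: share of distinct values among all rows.
    unique_ratio = n_unique / n_rows if n_rows else 0.0

    if n_unique == 2:
        return "binary_classification", "high"
    if dtype_kind == "categorical" or (n_unique <= 20 and unique_ratio < 0.05):
        confidence = "high" if n_unique <= 10 else "medium"
        return "multiclass_classification", confidence
    if dtype_kind == "float":
        return "regression", "high"
    if dtype_kind == "integer" and n_unique > 20:
        return "regression", "medium"
    # No rule matched: the caller should ask the user to confirm.
    return "unknown", "ask_user"


print(detect_problem_type(n_unique=15, n_rows=100, dtype_kind="integer"))
```

The last call falls through every rule (15 distinct integer values in 100 rows is neither clearly categorical nor clearly continuous), which is the case where the user should be asked.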

Dataset Size Categories

Category	Rows	Model Complexity	Cross-Validation
Small	< 1,000	Simple models preferred	LOO or 10-fold
Medium	1,000 - 10,000	Most models viable	5-fold
Large	10,000 - 100,000	Complex models OK	5-fold or 3-fold
Very Large	> 100,000	Consider sampling	Hold-out or 3-fold

Model Selection Decision Tree

Classification Problems

code

START → Check dataset size

IF rows < 1,000 (Small):
    Primary: Logistic Regression, Random Forest
    Secondary: SVM (RBF)
    Avoid: Neural Networks, XGBoost (overfit risk)

ELSE IF rows < 10,000 (Medium):
    Primary: Random Forest, XGBoost, LightGBM
    Secondary: Logistic Regression, SVM
    Consider: MLP if features > 50

ELSE IF rows < 100,000 (Large):
    Primary: LightGBM, XGBoost, CatBoost
    Secondary: Random Forest, MLP
    Consider: Ensemble methods

ELSE (Very Large):
    Primary: LightGBM, CatBoost
    Secondary: Logistic Regression (fast baseline)
    Avoid: SVM (too slow)
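
The classification tree above can be encoded as a lookup table keyed by row-count thresholds. The sketch below is one possible encoding; `TIERS` and `shortlist_classifiers` are hypothetical names, and the model names are plain strings rather than library objects.

```python
import math
from typing import Dict, List, Tuple

# (exclusive upper bound on rows, picks for that size category)
TIERS: List[Tuple[float, Dict[str, List[str]]]] = [
    (1_000, {"primary": ["Logistic Regression", "Random Forest"],
             "secondary": ["SVM (RBF)"],
             "avoid": ["Neural Networks", "XGBoost"]}),
    (10_000, {"primary": ["Random Forest", "XGBoost", "LightGBM"],
              "secondary": ["Logistic Regression", "SVM"]}),
    (100_000, {"primary": ["LightGBM", "XGBoost", "CatBoost"],
               "secondary": ["Random Forest", "MLP"],
               "consider": ["Ensemble methods"]}),
    (math.inf, {"primary": ["LightGBM", "CatBoost"],
                "secondary": ["Logistic Regression"],
                "avoid": ["SVM"]}),
]


def shortlist_classifiers(n_rows: int, n_features: int = 0) -> Dict[str, List[str]]:
    """Return the model shortlist for a classification dataset of this size."""
    for upper_bound, picks in TIERS:
        if n_rows < upper_bound:
            # Copy the lists so callers cannot mutate the shared table.
            shortlist = {role: list(models) for role, models in picks.items()}
            break
    # Medium datasets only: MLP is worth considering with more than 50 features.
    if 1_000 <= n_rows < 10_000 and n_features > 50:
        shortlist.setdefault("consider", []).append("MLP")
    return shortlist


print(shortlist_classifiers(n_rows=5_000, n_features=80))
```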

Regression Problems

code

START → Check dataset size and feature count

IF rows < 1,000 (Small):
    Primary: Ridge, Lasso, ElasticNet
    Secondary: Random Forest, SVR
    Avoid: Neural Networks

ELSE IF rows < 10,000 (Medium):
    Primary: Random Forest, XGBoost, LightGBM
    Secondary: Ridge, ElasticNet
    Consider: MLP if non-linear patterns expected

ELSE (Large or Very Large):
    Primary: LightGBM, XGBoost
    Secondary: Random Forest, MLP
    Consider: Ensemble methods

Feature Count Considerations

code

IF features < 10 (Low dimensional):
    → Simple models often sufficient
    → Tree-based models work well
    → Linear models if relationships are linear

ELSE IF features 10-100 (Medium dimensional):
    → Most models viable
    → Consider feature selection
    → Tree-based models handle well

ELSE IF features > 100 (High dimensional):
    → Regularization critical (L1/L2)
    → Feature selection recommended
    → Consider: Lasso, ElasticNet, tree-based
    → MLP can work with dropout or strong L2 regularization (scikit-learn's MLP exposes only alpha, an L2 penalty)
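
For the high-dimensional case, L1-based feature selection might look like the sketch below. It assumes scikit-learn is available and uses synthetic data; the `alpha` value is an arbitrary illustration and would need tuning on real data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data: 200 features, only 10 of which carry signal.
X, y = make_regression(
    n_samples=300, n_features=200, n_informative=10, noise=5.0, random_state=0
)

# Lasso shrinks many coefficients to exactly zero; SelectFromModel keeps the
# features whose coefficient stayed non-zero. Scaling first puts all features
# on the same footing for the L1 penalty.
selector = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=1.0)))
X_selected = selector.fit_transform(X, y)

n_selected = X_selected.shape[1]
print(f"kept {n_selected} of {X.shape[1]} features")
```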

Model Recommendations by Scenario

Scenario: Binary Classification, Balanced Classes

Top Picks:

•LightGBM ⭐⭐⭐⭐⭐ - Fast, accurate, handles mixed types
•XGBoost ⭐⭐⭐⭐⭐ - Robust, well-documented
•Random Forest ⭐⭐⭐⭐ - Good baseline, exposes feature importances
•Logistic Regression ⭐⭐⭐⭐ - Fast, interpretable baseline

Scenario: Binary Classification, Imbalanced Classes

Top Picks:

•LightGBM with is_unbalance=True ⭐⭐⭐⭐⭐
•XGBoost with scale_pos_weight ⭐⭐⭐⭐⭐
•Random Forest with class_weight='balanced' ⭐⭐⭐⭐

Key Considerations:

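•Use stratified splits so every fold keeps the class ratio
•Consider SMOTE/ADASYN for severe imbalance (resample the training folds only, never the validation data)
•Focus on F1 and Precision-Recall AUC rather than accuracy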
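
The weighting options above depend on class counts. As a rough sketch, the helpers below compute the negative-to-positive ratio typically passed as XGBoost's `scale_pos_weight`, and the `n_samples / (n_classes * count)` heuristic that scikit-learn uses for `class_weight='balanced'`. The helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def scale_pos_weight(labels: Sequence[int], positive: int = 1) -> float:
    """Ratio of negative to positive examples in a binary target."""
    n_pos = Counter(labels)[positive]
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return (len(labels) - n_pos) / n_pos


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-class weight n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10  # 10% positive class
print(scale_pos_weight(labels))
print(balanced_class_weights(labels))
```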

Scenario: Multiclass Classification

Top Picks:

•LightGBM ⭐⭐⭐⭐⭐ - Native multiclass
•Random Forest ⭐⭐⭐⭐⭐ - Robust, strong defaults, little tuning needed
•XGBoost ⭐⭐⭐⭐ - Use objective multi:softmax (or multi:softprob for class probabilities)
•MLP ⭐⭐⭐⭐ - If many classes and large data

Scenario: Regression, Linear Relationships

Top Picks:

•Ridge Regression ⭐⭐⭐⭐⭐ - Stable, fast
•ElasticNet ⭐⭐⭐⭐⭐ - Feature selection + stability
•Lasso ⭐⭐⭐⭐ - Automatic feature selection
•Linear Regression ⭐⭐⭐ - Baseline only

Scenario: Regression, Non-linear Relationships

Top Picks:

•LightGBM ⭐⭐⭐⭐⭐ - Handles non-linearity well
•XGBoost ⭐⭐⭐⭐⭐ - Robust performance
•Random Forest ⭐⭐⭐⭐ - Good baseline
•MLP ⭐⭐⭐⭐ - If large dataset

Scenario: High Cardinality Categoricals

Top Picks:

•CatBoost ⭐⭐⭐⭐⭐ - Native categorical handling
•LightGBM ⭐⭐⭐⭐ - Good categorical support
•Target Encoding + XGBoost ⭐⭐⭐⭐

Evaluation Strategy

Classification Metrics

code

IF binary classification:
    IF balanced classes:
        Primary: ROC-AUC, Accuracy
        Secondary: F1-Score
    ELSE (imbalanced):
        Primary: PR-AUC, F1-Score
        Secondary: Precision, Recall
        Avoid: Accuracy (misleading)

IF multiclass:
    Primary: Macro F1, Weighted F1
    Secondary: Accuracy (if balanced)
    Consider: Confusion matrix analysis
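
The metrics named above are all available in scikit-learn. The sketch below assumes scikit-learn is installed and uses average precision as the PR-AUC summary (one common choice among several); the 0.5 threshold and the helper name `binary_metrics` are illustrative assumptions.

```python
from typing import Dict, Sequence

from sklearn.metrics import average_precision_score, f1_score, roc_auc_score


def binary_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """ROC-AUC and PR-AUC use the raw scores; F1 needs hard predictions."""
    y_pred = [int(score >= threshold) for score in y_score]
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "pr_auc": average_precision_score(y_true, y_score),
        "f1": f1_score(y_true, y_pred),
    }


binary = binary_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Multiclass: macro F1 averages per-class F1 equally, weighted F1 by class support.
y_true_mc, y_pred_mc = [0, 1, 2, 2], [0, 1, 1, 2]
macro_f1 = f1_score(y_true_mc, y_pred_mc, average="macro")
weighted_f1 = f1_score(y_true_mc, y_pred_mc, average="weighted")

print(binary, macro_f1, weighted_f1)
```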

Regression Metrics

code

Primary: RMSE, MAE
Secondary: R², MAPE

IF outliers present:
    Prefer: MAE, Huber loss
    Avoid: MSE/RMSE (sensitive to outliers)

IF relative error matters:
    Use: MAPE, SMAPE (MAPE is undefined when the target contains zeros)
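
For reference, the four error metrics can be written out in plain Python. This is a sketch under one assumption worth flagging: SMAPE has several definitions in circulation, and the variant used here is `2|e| / (|y| + |ŷ|)` averaged over samples.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, MAPE, and SMAPE (as fractions, not percentages)."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    # Squaring makes RMSE more sensitive to large errors (outliers) than MAE.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the true value, so it fails if any target is zero.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    smape = sum(
        2 * abs(e) / (abs(t) + abs(p)) for e, t, p in zip(errors, y_true, y_pred)
    ) / n
    return {"mae": mae, "rmse": rmse, "mape": mape, "smape": smape}


metrics = regression_metrics([100.0, 200.0], [110.0, 180.0])
print(metrics)
```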

Cross-Validation Strategy

code

IF rows < 1,000:
    Use: 10-fold CV or Leave-One-Out

ELSE IF rows < 10,000:
    Use: 5-fold CV (stratified for classification)

ELSE IF rows < 100,000:
    Use: 5-fold or 3-fold CV

ELSE:
    Use: Single hold-out (20%) or 3-fold

IF temporal data (any size):
    Use: Time-based split (train on the past, validate on the future)
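
One way to turn these rules into a scikit-learn splitter is sketched below. It assumes scikit-learn is installed, picks the first listed option in each size band (so Leave-One-Out and single hold-out are left out), and treats a temporal flag as overriding the size rule at every size, which is an assumption of this sketch. `choose_cv` is a hypothetical helper.

```python
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit


def choose_cv(n_rows: int, is_classification: bool, is_temporal: bool = False, seed: int = 42):
    """Pick a cross-validation splitter from dataset size and problem type."""
    if is_temporal:
        # Ordered splits: each fold trains on earlier rows, validates on later ones.
        return TimeSeriesSplit(n_splits=5)

    if n_rows < 1_000:
        n_splits = 10
    elif n_rows < 100_000:
        n_splits = 5
    else:
        n_splits = 3

    # Stratification keeps class proportions stable across folds.
    splitter = StratifiedKFold if is_classification else KFold
    return splitter(n_splits=n_splits, shuffle=True, random_state=seed)


cv = choose_cv(n_rows=500, is_classification=True)
print(type(cv).__name__, cv.get_n_splits())
```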

Feature Engineering Recommendations

Numeric Features

Issue	Recommendation
Skewed distribution	Log or Box-Cox transform (strictly positive values); Yeo-Johnson otherwise
Different scales	StandardScaler or MinMaxScaler
Outliers present	RobustScaler or clip outliers
Missing values	Median imputation or Iterative imputer
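
Two of the rows above (missing values, outliers) combine naturally into a scikit-learn pipeline. A minimal sketch, assuming scikit-learn and NumPy are installed and using a tiny made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Median imputation followed by outlier-resistant scaling
# (RobustScaler centers on the median and scales by the interquartile range).
numeric_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RobustScaler(),
)

# Illustrative data: one missing value, one extreme value in the first column.
X = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 180.0],
    [100.0, 220.0],
])
X_clean = numeric_pipeline.fit_transform(X)
print(X_clean)
```

Fitting the pipeline on training data only, then calling `transform` on validation data, keeps the imputation and scaling statistics from leaking across the split.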

Categorical Features

Cardinality	Recommendation
Low (< 10)	One-hot encoding
Medium (10-50)	Target encoding or one-hot
High (50-100)	Target encoding, frequency encoding
Very High (> 100)	Target encoding, embedding (if MLP)
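
Target encoding appears in several rows above. The sketch below shows one common smoothed variant, where each category's mean is pulled toward the global mean in proportion to a `smoothing` weight. The function names and the smoothing formula are assumptions chosen for illustration; the mapping should be fit on training folds only to avoid target leakage.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def fit_target_encoding(
    categories: Sequence[Hashable], targets: Sequence[float], smoothing: float = 10.0
) -> Tuple[Dict[Hashable, float], float]:
    """Return (category -> smoothed target mean, global mean)."""
    global_mean = sum(targets) / len(targets)
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Rare categories (small count) stay close to the global mean.
    mapping = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return mapping, global_mean


def apply_target_encoding(
    categories: Sequence[Hashable], mapping: Dict[Hashable, float], default: float
) -> List[float]:
    """Encode values, falling back to the default for unseen categories."""
    return [mapping.get(c, default) for c in categories]


mapping, global_mean = fit_target_encoding(["a", "a", "b"], [1, 0, 1], smoothing=1.0)
encoded = apply_target_encoding(["a", "b", "unseen"], mapping, default=global_mean)
print(encoded)
```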

Missing Values Strategy

code

IF missing < 5%:
    → Simple imputation (mean/median/mode)

ELSE IF missing 5-20%:
    → Model-based imputation (e.g., IterativeImputer or KNNImputer)
    → Consider creating "is_missing" indicator

ELSE IF missing > 20%:
    → Consider dropping column
    → Or use gradient-boosted trees (LightGBM, XGBoost, and CatBoost handle missing values natively)
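
Applied per column, the thresholds above could be sketched as follows. Both helper names are hypothetical, and the returned strings are plain labels rather than library calls.

```python
import math
from typing import List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Share of entries that are None or NaN."""
    if not values:
        return 0.0
    n_missing = sum(
        1 for v in values if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return n_missing / len(values)


def missing_value_plan(fraction: float) -> List[str]:
    """Map a column's missing fraction to the suggested handling."""
    if fraction < 0.05:
        return ["simple imputation (mean/median/mode)"]
    if fraction <= 0.20:
        return ["model-based imputation", "add an is_missing indicator"]
    return [
        "consider dropping the column",
        "or use a model that handles missing values natively",
    ]


column = [1.0, None, 3.0, float("nan")]
print(missing_fraction(column), missing_value_plan(missing_fraction(column)))
```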

Hyperparameter Guidelines

Quick Tuning Guide

LightGBM:

python

{
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
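    'max_depth': [3, 5, 7, -1],  # -1 means no depth limit
    'num_leaves': [31, 63, 127],  # a tree of depth d has at most 2**d leaves, so small max_depth values cap this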
    'min_child_samples': [20, 50, 100],
}

XGBoost:

python

{
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

Random Forest:

python

{
    'n_estimators': [100, 300, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

MLP (scikit-learn):

python

{
    'hidden_layer_sizes': [(100,), (100, 50), (100, 100)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01],
    'early_stopping': [True],
}
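
These grids are meant to be passed to a search routine. As a sketch, the Random Forest grid above could be used with scikit-learn's `RandomizedSearchCV` as shown below; the synthetic dataset, `n_iter=5`, `cv=3`, and the `roc_auc` scoring choice are illustrative assumptions, not recommendations from this guide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random Forest grid from the tuning guide.
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Small synthetic binary classification problem, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Randomized search samples n_iter combinations instead of trying all 81.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Sampling a handful of combinations keeps the search cheap; an exhaustive `GridSearchCV` over the same dictionary would fit every combination on every fold.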

Response Format

When providing recommendations, use this structure:

markdown

## Problem Analysis
- Problem Type: [classification/regression]
- Dataset Size: [small/medium/large] ([N] rows)
- Feature Dimensions: [low/medium/high] ([N] features)
- Special Considerations: [imbalance, missing values, etc.]

## Recommended Models

### 1. [Model Name] ⭐⭐⭐⭐⭐
**Why:** [Brief rationale]
**Pros:** [List]
**Cons:** [List]
**Hyperparameters:** [Key params to tune]

### 2. [Model Name] ⭐⭐⭐⭐
...

## Feature Engineering
- [Recommendation 1]
- [Recommendation 2]

## Evaluation Strategy
- CV: [Strategy]
- Primary Metric: [Metric]
- Secondary Metrics: [List]

References

For detailed model information, see:

•Classification Models
•Regression Models