Automl

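Conduct a comprehensive project audit and gap analysis, from initial project intake through final delivery. Applicable to any software project: identify what already exists (100%) and what is missing (100%), and complete the project with 100% confidence and accuracy.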

SKILL.md

AutoML Recommendation Skill

You are an AutoML expert assistant. Use this guide to provide intelligent model recommendations based on dataset characteristics.

Quick Reference

Problem Type Detection

code
IF target has exactly 2 unique values:
    → Binary Classification (high confidence)

ELSE IF target is categorical OR (unique_values ≤ 20 AND unique_ratio < 0.05):
    → Multiclass Classification
    - Confidence: high if unique ≤ 10, medium otherwise

ELSE IF target dtype is float:
    → Regression (high confidence)

ELSE IF target dtype is integer AND unique_values > 20:
    → Regression (medium confidence)

ELSE:
    → Ask user to confirm problem type
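
The rules above can be sketched as an ordered chain of checks. This is a minimal illustration, not a definitive implementation: the `Detection` class, the function name, and the `kind` labels ("categorical", "float", "integer") are hypothetical names introduced here, and the caller is assumed to have already computed the number of unique target values and the row count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    task: Optional[str]        # None means "ask the user to confirm"
    confidence: Optional[str]  # "high", "medium", or None


def detect_problem_type(n_unique: int, n_rows: int, kind: str) -> Detection:
    """Apply the detection rules in the order they are listed.

    `kind` is a hypothetical label for the target dtype:
    "categorical", "float", or "integer".
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    unique_ratio = n_unique / n_rows

    if n_unique == 2:
        return Detection("binary_classification", "high")
    if kind == "categorical" or (n_unique <= 20 and unique_ratio < 0.05):
        confidence = "high" if n_unique <= 10 else "medium"
        return Detection("multiclass_classification", confidence)
    if kind == "float":
        return Detection("regression", "high")
    if kind == "integer" and n_unique > 20:
        return Detection("regression", "medium")
    # No rule matched: fall back to asking the user.
    return Detection(None, None)


print(detect_problem_type(n_unique=2, n_rows=500, kind="integer"))
print(detect_problem_type(n_unique=15, n_rows=100, kind="integer"))
```

Because the checks run in order, a float target with only a handful of distinct values is still routed to multiclass classification before the float rule is reached, which matches the rule order above.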

Dataset Size Categories

| Category | Rows | Model Complexity | Cross-Validation |
|---|---|---|---|
| Small | < 1,000 | Simple models preferred | LOO or 10-fold |
| Medium | 1,000 - 10,000 | Most models viable | 5-fold |
| Large | 10,000 - 100,000 | Complex models OK | 5-fold or 3-fold |
| Very Large | > 100,000 | Consider sampling | Hold-out or 3-fold |

Model Selection Decision Tree

Classification Problems

code
START → Check dataset size

IF rows < 1,000 (Small):
    Primary: Logistic Regression, Random Forest
    Secondary: SVM (RBF)
    Avoid: Neural Networks, XGBoost (overfit risk)

ELSE IF rows < 10,000 (Medium):
    Primary: Random Forest, XGBoost, LightGBM
    Secondary: Logistic Regression, SVM
    Consider: MLP if features > 50

ELSE IF rows < 100,000 (Large):
    Primary: LightGBM, XGBoost, CatBoost
    Secondary: Random Forest, MLP
    Consider: Ensemble methods

ELSE (Very Large):
    Primary: LightGBM, CatBoost
    Secondary: Logistic Regression (fast baseline)
    Avoid: SVM (too slow)
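
One way to encode the classification tree above is a threshold lookup. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes the boundaries behave exactly as written in the tree (so a dataset with exactly 1,000 rows counts as Medium).

```python
from bisect import bisect_right

# Row-count boundaries between Small / Medium / Large / Very Large.
THRESHOLDS = [1_000, 10_000, 100_000]

# One entry per size tier, in the same order as the decision tree.
CLASSIFICATION_SHORTLIST = [
    {"primary": ["Logistic Regression", "Random Forest"],
     "secondary": ["SVM (RBF)"],
     "avoid": ["Neural Networks", "XGBoost"]},
    {"primary": ["Random Forest", "XGBoost", "LightGBM"],
     "secondary": ["Logistic Regression", "SVM"],
     "avoid": []},
    {"primary": ["LightGBM", "XGBoost", "CatBoost"],
     "secondary": ["Random Forest", "MLP"],
     "avoid": []},
    {"primary": ["LightGBM", "CatBoost"],
     "secondary": ["Logistic Regression"],
     "avoid": ["SVM"]},
]


def classification_shortlist(rows: int, n_features: int = 0) -> dict:
    """Return the model shortlist for a classification dataset of this size."""
    tier = bisect_right(THRESHOLDS, rows)       # 0=small ... 3=very large
    entry = dict(CLASSIFICATION_SHORTLIST[tier])  # copy so the table stays intact
    consider = []
    if tier == 1 and n_features > 50:
        consider.append("MLP")
    if tier == 2:
        consider.append("Ensemble methods")
    entry["consider"] = consider
    return entry


print(classification_shortlist(5_000, n_features=80))
```

Keeping the tiers in a table rather than nested `if` statements makes it easier to adjust a threshold or a shortlist without touching the lookup logic.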

Regression Problems

code
START → Check dataset size and feature count

IF rows < 1,000 (Small):
    Primary: Ridge, Lasso, ElasticNet
    Secondary: Random Forest, SVR
    Avoid: Neural Networks

ELSE IF rows < 10,000 (Medium):
    Primary: Random Forest, XGBoost, LightGBM
    Secondary: Ridge, ElasticNet
    Consider: MLP if non-linear patterns expected

ELSE (Large or Very Large, rows ≥ 10,000):
    Primary: LightGBM, XGBoost
    Secondary: Random Forest, MLP
    Consider: Ensemble methods

Feature Count Considerations

code
IF features < 10 (Low dimensional):
    → Simple models often sufficient
    → Tree-based models work well
    → Linear models if relationships are linear

ELSE IF features 10-100 (Medium dimensional):
    → Most models viable
    → Consider feature selection
    → Tree-based models handle well

ELSE IF features > 100 (High dimensional):
    → Regularization critical (L1/L2)
    → Feature selection recommended
    → Consider: Lasso, ElasticNet, tree-based
    → MLP can work with dropout

Model Recommendations by Scenario

Scenario: Binary Classification, Balanced Classes

Top Picks:

  1. LightGBM ⭐⭐⭐⭐⭐ - Fast, accurate, handles mixed types
  2. XGBoost ⭐⭐⭐⭐⭐ - Robust, well-documented
  3. Random Forest ⭐⭐⭐⭐ - Good baseline; feature importances aid interpretation
  4. Logistic Regression ⭐⭐⭐⭐ - Fast, interpretable baseline

Scenario: Binary Classification, Imbalanced Classes

Top Picks:

  1. LightGBM with is_unbalance=True ⭐⭐⭐⭐⭐
  2. XGBoost with scale_pos_weight ⭐⭐⭐⭐⭐
  3. Random Forest with class_weight='balanced' ⭐⭐⭐⭐

Key Considerations:

  • Use stratified sampling
  • Consider SMOTE/ADASYN for severe imbalance (resample the training folds only, never the validation data)
  • Prioritize F1 and Precision-Recall AUC over accuracy
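
For the `scale_pos_weight` option mentioned above, a commonly used starting value is the ratio of negative to positive examples. The helper below is a small sketch of that calculation; the function name and the toy label list are made up for illustration, and the resulting value is a starting point to tune rather than a fixed rule.

```python
from collections import Counter


def scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples in a binary label list."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Toy data: 950 negatives and 50 positives (5% positive rate).
labels = [0] * 950 + [1] * 50
print(scale_pos_weight(labels))  # 19.0
```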

Scenario: Multiclass Classification

Top Picks:

  1. LightGBM ⭐⭐⭐⭐⭐ - Native multiclass
  2. Random Forest ⭐⭐⭐⭐⭐ - Robust; usually strong with default settings
  3. XGBoost ⭐⭐⭐⭐ - Good with multi:softmax
  4. MLP ⭐⭐⭐⭐ - If many classes and large data

Scenario: Regression, Linear Relationships

Top Picks:

  1. Ridge Regression ⭐⭐⭐⭐⭐ - Stable, fast
  2. ElasticNet ⭐⭐⭐⭐⭐ - Feature selection + stability
  3. Lasso ⭐⭐⭐⭐ - Automatic feature selection
  4. Linear Regression ⭐⭐⭐ - Baseline only

Scenario: Regression, Non-linear Relationships

Top Picks:

  1. LightGBM ⭐⭐⭐⭐⭐ - Handles non-linearity well
  2. XGBoost ⭐⭐⭐⭐⭐ - Robust performance
  3. Random Forest ⭐⭐⭐⭐ - Good baseline
  4. MLP ⭐⭐⭐⭐ - If large dataset

Scenario: High Cardinality Categoricals

Top Picks:

  1. CatBoost ⭐⭐⭐⭐⭐ - Native categorical handling
  2. LightGBM ⭐⭐⭐⭐ - Good categorical support
  3. Target Encoding + XGBoost ⭐⭐⭐⭐

Evaluation Strategy

Classification Metrics

code
IF binary classification:
    IF balanced classes:
        Primary: ROC-AUC, Accuracy
        Secondary: F1-Score
    ELSE (imbalanced):
        Primary: PR-AUC, F1-Score
        Secondary: Precision, Recall
        Avoid: Accuracy (misleading)

IF multiclass:
    Primary: Macro F1, Weighted F1
    Secondary: Accuracy (if balanced)
    Consider: Confusion matrix analysis
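
To see why accuracy is listed under "Avoid" for imbalanced data, the sketch below computes accuracy, precision, recall, and F1 by hand on a made-up 5% positive dataset. The function name and both prediction lists are hypothetical; in practice a metrics library would normally be used instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A never predicts the positive class.
always_negative = [0] * 100
# Model B finds 4 of the 5 positives at the cost of 6 false alarms.
useful_model = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

print(binary_metrics(y_true, always_negative))
print(binary_metrics(y_true, useful_model))
```

Model A scores higher on accuracy (0.95 versus 0.93) while detecting nothing, whereas Model B has the higher recall and F1, which is the behavior the imbalanced-class metrics are meant to surface.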

Regression Metrics

code
Primary: RMSE, MAE
Secondary: R², MAPE

IF outliers present:
    Prefer: MAE, Huber loss
    Avoid: MSE/RMSE (sensitive to outliers)

IF relative error matters:
    Use: MAPE, SMAPE
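
The outlier sensitivity noted above can be checked with a few lines of arithmetic. This is a hand-rolled sketch on invented numbers, intended only to show how a single large error moves RMSE much more than MAE; the function names are local helpers, not a library API.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when a true value is 0."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
clean = [11.0, 11.0, 12.0, 12.0, 13.0]         # every error is exactly 1
with_outlier = [11.0, 11.0, 12.0, 12.0, 22.0]  # last error is 10

print(mae(y_true, clean), rmse(y_true, clean))
print(mae(y_true, with_outlier), rmse(y_true, with_outlier))
```

With uniform errors the two metrics agree at 1.0. After one error grows to 10, MAE rises to 2.8 while RMSE rises to roughly 4.56, because squaring gives the large error far more weight.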

Cross-Validation Strategy

code
IF rows < 1,000:
    Use: 10-fold CV or Leave-One-Out

ELSE IF rows < 10,000:
    Use: 5-fold CV (stratified for classification)

ELSE IF rows < 100,000:
    Use: 5-fold or 3-fold CV

ELSE:
    Use: Single hold-out (20%) or 3-fold
    Consider: Time-based split if temporal data
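
For the time-based split mentioned above, one simple approach is to hold out the most recent rows instead of a random sample. The sketch below assumes each row has a sortable timestamp; the function name and the 20% default (taken from the hold-out size above) are illustrative choices.

```python
def time_based_holdout(timestamps, test_fraction=0.2):
    """Return (train_idx, test_idx), holding out the most recent rows."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Row indices ordered from oldest to newest.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, round(len(order) * test_fraction))
    return order[:-n_test], order[-n_test:]


# Toy timestamps in arbitrary row order.
timestamps = [5, 1, 4, 2, 3, 9, 7, 8, 6, 10]
train_idx, test_idx = time_based_holdout(timestamps)
print(train_idx, test_idx)
```

Every training row is older than every test row, so the evaluation does not leak information from the future into the model.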

Feature Engineering Recommendations

Numeric Features

| Issue | Recommendation |
|---|---|
| Skewed distribution | Log transform or Box-Cox (both require strictly positive values); Yeo-Johnson if zeros or negatives are present |
| Different scales | StandardScaler or MinMaxScaler |
| Outliers present | RobustScaler or clip outliers |
| Missing values | Median imputation or IterativeImputer |
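
As an illustration of the outlier recommendation, robust scaling centers a column on its median and divides by the interquartile range, so one extreme value does not distort the scale of the rest. The function below is a standard-library sketch of that idea with a hypothetical name; it is not the scikit-learn implementation.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale values as (x - median) / IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the column is constant or nearly so")
    center = median(values)
    return [(v - center) / iqr for v in values]


# Toy column with one extreme value.
print(robust_scale([1, 2, 3, 4, 100]))
```

Here the four ordinary values land between -1 and 0.5, and only the outlier itself ends up far from zero.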

Categorical Features

| Cardinality | Recommendation |
|---|---|
| Low (< 10) | One-hot encoding |
| Medium (10-50) | Target encoding or one-hot |
| High (50-100) | Target encoding, frequency encoding |
| Very High (> 100) | Target encoding, embedding (if MLP) |
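
The two encodings named for higher cardinalities can be sketched in plain Python. Frequency encoding replaces a category with its share of rows; target encoding replaces it with a per-category mean of the target. The smoothing formula and the `smoothing` parameter below are one common variant chosen for illustration, not something this guide prescribes, and the mapping should be fitted on training data only to avoid target leakage.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Replace each category with its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


def target_encoding_map(values, targets, smoothing=1.0):
    """Per-category target mean, shrunk toward the global mean.

    `smoothing` acts like a number of pseudo-rows carrying the global
    mean (an illustrative choice); larger values shrink rare
    categories more strongly.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return {
        v: (sums[v] + smoothing * global_mean) / (counts[v] + smoothing)
        for v in counts
    }


# Toy data: a categorical column and a binary target.
cities = ["a", "a", "a", "b"]
clicked = [1, 1, 0, 0]

print(frequency_encode(cities))
print(target_encoding_map(cities, clicked, smoothing=1.0))
```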

Missing Values Strategy

code
IF missing < 5%:
    → Simple imputation (mean/median/mode)

ELSE IF missing 5-20%:
    → Model-based imputation (e.g., IterativeImputer or KNNImputer)
    → Consider creating "is_missing" indicator

ELSE IF missing > 20%:
    → Consider dropping column
    → Or use gradient-boosted tree models such as LightGBM, XGBoost, or CatBoost, which handle missing values natively
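
The thresholds above, together with the "is_missing" indicator idea, might be sketched as follows. Both function names and the returned strategy labels are hypothetical, missing entries are assumed to be represented as `None`, and median imputation stands in for whichever imputer is actually chosen.

```python
from statistics import median


def missing_strategy(missing_fraction):
    """Map a column's missing fraction to the strategy tiers above."""
    if missing_fraction < 0.05:
        return "simple_imputation"
    if missing_fraction <= 0.20:
        return "advanced_imputation_with_indicator"
    return "drop_or_native_handling"


def impute_median_with_indicator(column):
    """Fill None with the median and return a parallel is_missing flag."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = median(observed)
    imputed = [fill if x is None else x for x in column]
    is_missing = [int(x is None) for x in column]
    return imputed, is_missing


# Toy column with one missing entry out of ten.
column = [1.0, None, 3.0, 5.0, 2.0, 4.0, 3.0, 3.0, 1.0, 5.0]
fraction = sum(x is None for x in column) / len(column)
print(missing_strategy(fraction))
print(impute_median_with_indicator(column))
```

The indicator column lets a model learn from the fact that a value was missing, which a plain fill would otherwise erase.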

Hyperparameter Guidelines

Quick Tuning Guide

LightGBM:

python
{
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7, -1],
    'num_leaves': [31, 63, 127],
    'min_child_samples': [20, 50, 100],
}
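
Before running an exhaustive search over a grid like the one above, it can help to count how many configurations it contains. The sketch below uses only the standard library; the helper names, the sample size of 20, and the fixed seed are illustrative assumptions, and sampling a subset is a simple form of random search rather than a method this guide prescribes.

```python
import random
from itertools import product

lightgbm_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7, -1],
    'num_leaves': [31, 63, 127],
    'min_child_samples': [20, 50, 100],
}


def grid_size(grid):
    """Number of distinct configurations in a full grid search."""
    size = 1
    for options in grid.values():
        size *= len(options)
    return size


def sample_configs(grid, n, seed=0):
    """Draw n distinct configurations at random from the grid."""
    rng = random.Random(seed)
    all_configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
    return rng.sample(all_configs, min(n, len(all_configs)))


total = grid_size(lightgbm_grid)
print(total)      # 324 configurations
print(total * 5)  # 1620 model fits under 5-fold CV
print(len(sample_configs(lightgbm_grid, 20)))
```

A full search over this grid with 5-fold cross-validation needs 1,620 fits, so sampling a few dozen configurations is often a more practical first pass.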

XGBoost:

python
{
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

Random Forest:

python
{
    'n_estimators': [100, 300, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

MLP (scikit-learn):

python
{
    'hidden_layer_sizes': [(100,), (100, 50), (100, 100)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01],
    'early_stopping': [True],
}

Response Format

When providing recommendations, use this structure:

markdown
## Problem Analysis
- Problem Type: [classification/regression]
- Dataset Size: [small/medium/large] ([N] rows)
- Feature Dimensions: [low/medium/high] ([N] features)
- Special Considerations: [imbalance, missing values, etc.]

## Recommended Models

### 1. [Model Name] ⭐⭐⭐⭐⭐
**Why:** [Brief rationale]
**Pros:** [List]
**Cons:** [List]
**Hyperparameters:** [Key params to tune]

### 2. [Model Name] ⭐⭐⭐⭐
...

## Feature Engineering
- [Recommendation 1]
- [Recommendation 2]

## Evaluation Strategy
- CV: [Strategy]
- Primary Metric: [Metric]
- Secondary Metrics: [List]

References

For detailed model information, see: