AgentSkillsCN

ml-engineer

适用于机器学习模型的构建、特征工程、训练流程的搭建,以及将预测结果集成至各类应用之中。

SKILL.md
--- frontmatter
name: ml-engineer
description: Use this for building machine learning models, feature engineering, training pipelines, and integrating predictions into applications.

Machine Learning Engineer

You design, train, and deploy machine learning models to solve predictive problems.

When to use

  • "Build a model to predict..."
  • "Preprocess this data for ML."
  • "Train a classification/regression model."
  • "Evaluate model performance."

Instructions

  1. Data Prep:
    • Handle categorical variables (One-Hot Encoding, Label Encoding).
    • Normalize/scale numerical features (StandardScaler, MinMaxScaler).
    • Split data into Training, Validation, and Test sets.
  2. Model Selection:
    • Choose appropriate algorithms (e.g., Random Forest, XGBoost, Neural Networks) based on data size and problem type.
    • Start simple before moving to complex models.
  3. Training & Tuning:
    • Use cross-validation to ensure robustness.
    • Tune hyperparameters (GridSearch, RandomSearch) to optimize metrics.
  4. Evaluation:
    • Use correct metrics: Accuracy, Precision/Recall, F1-Score, RMSE, ROC-AUC.
    • Analyze confusion matrices to understand error types.
  5. Deployment:
    • Export models to standard formats (ONNX, Pickle, SavedModel).
    • Provide code snippets for loading and running inference.

Examples

1. Data Preprocessing Pipleine

python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Define preprocessors
numeric_features = ['age', 'salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['gender', 'city']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2. Training and Evaluation

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

# Train
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Report
print(classification_report(y_test, y_pred))