Machine Learning Regressor (RF & XGBoost)
Advanced regression techniques using ensemble learning methods for complex, nonlinear relationships.
When to Use
- Nonlinear Relationships: When linear regression assumptions are violated.
- Complex Interactions: Multiple variables interact in non-obvious ways.
- Feature Importance: Need to identify which factors drive the outcome (critical for O-Prize papers).
- High Accuracy Required: When prediction precision is paramount.
- Sufficient Data: At least 100 samples (preferably 500+) for reliable training.
When NOT to Use
- Small Samples: < 50 points → use `grey-forecaster` or simple regression.
- Time Series: Use `arima-forecaster` or LSTM instead (ML methods don't capture temporal dependencies well).
- Need Causality: ML shows correlation, not causation. Use structural equation models if causality is critical.
Model Comparison
| Model | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Random Forest | Robust, less prone to overfitting, fast to train | Often slightly less accurate than XGBoost | Exploratory analysis, interpretability |
| XGBoost | Highest accuracy, handles missing values | Slower, more hyperparameters | Final predictions, competitions |
Algorithm Overview
Random Forest
- Builds multiple decision trees on random subsets of data and features.
- Averages their predictions to reduce variance (see the sketch below).
- Naturally provides feature importance via mean decrease in impurity.
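For intuition, here is a minimal sketch on synthetic data (not the housing example used later) showing that a fitted `RandomForestRegressor` prediction is simply the average of its individual trees' predictions:

```python
# Minimal sketch: a Random Forest prediction is the mean of its trees' predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree predictions by hand and compare with the ensemble output.
per_tree = np.column_stack([tree.predict(X[:5]) for tree in rf.estimators_])
print(per_tree.mean(axis=1))  # manual average over the 50 trees
print(rf.predict(X[:5]))      # identical (up to floating-point error)
```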
XGBoost (Extreme Gradient Boosting)
- Sequentially builds trees, each correcting errors of the previous ones (see the sketch below).
- Uses gradient descent optimization.
- Includes regularization to prevent overfitting.
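A minimal sketch of the boosting idea on synthetic data (assumes xgboost ≥ 1.4, where `predict` accepts `iteration_range`): the training error keeps shrinking as more boosting rounds are included.

```python
# Minimal sketch: boosting builds trees sequentially, each round refining the fit.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=5, noise=0.5, random_state=0)
model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=3,
                         objective='reg:squarederror', random_state=0).fit(X, y)

# Using only the first k boosting rounds shows the error shrinking round by round.
for k in (10, 50, 200):
    pred = model.predict(X, iteration_range=(0, k))
    print(f"{k:>3} rounds -> train MSE {mean_squared_error(y, pred):.3f}")
```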
Implementation Template
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')
class MLRegressor:
"""
Unified interface for Random Forest and XGBoost regression
"""
def __init__(self, model_type='rf', random_state=42):
"""
Args:
model_type (str): 'rf' for Random Forest or 'xgb' for XGBoost
random_state (int): Random seed for reproducibility
"""
self.model_type = model_type
self.random_state = random_state
self.model = None
self.feature_names = None
self.best_params = None
def fit(self, X, y, tune_hyperparams=True, cv_folds=5):
"""
Train the model with optional hyperparameter tuning
Args:
X (pd.DataFrame or np.array): Features
y (pd.Series or np.array): Target variable
tune_hyperparams (bool): Whether to perform hyperparameter search
cv_folds (int): Number of cross-validation folds
"""
# Store feature names
if isinstance(X, pd.DataFrame):
self.feature_names = X.columns.tolist()
X = X.values
else:
self.feature_names = [f'Feature_{i}' for i in range(X.shape[1])]
if isinstance(y, pd.Series):
y = y.values
# Initialize model
if self.model_type == 'rf':
base_model = RandomForestRegressor(random_state=self.random_state)
param_dist = {
'n_estimators': [100, 200, 300, 500],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
else: # xgb
base_model = xgb.XGBRegressor(random_state=self.random_state,
objective='reg:squarederror')
param_dist = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7, 9],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'min_child_weight': [1, 3, 5]
}
# Hyperparameter tuning
if tune_hyperparams:
print(f"🔍 Tuning hyperparameters for {self.model_type.upper()}...")
search = RandomizedSearchCV(
base_model,
param_distributions=param_dist,
n_iter=20, # Try 20 random combinations
cv=cv_folds,
scoring='r2',
random_state=self.random_state,
n_jobs=-1,
verbose=1
)
search.fit(X, y)
self.model = search.best_estimator_
self.best_params = search.best_params_
print(f"✅ Best parameters: {self.best_params}")
else:
# Use default parameters
self.model = base_model
self.model.fit(X, y)
return self
def predict(self, X):
"""Make predictions"""
if isinstance(X, pd.DataFrame):
X = X.values
return self.model.predict(X)
def evaluate(self, X_test, y_test):
"""
Evaluate model performance
Returns:
dict: Metrics including R², RMSE, MAE
"""
y_pred = self.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
return {
'r2': r2,
'rmse': rmse,
'mae': mae,
'predictions': y_pred
}
def cross_validate(self, X, y, cv_folds=5):
"""
Perform cross-validation
Returns:
dict: Mean and std of R² scores
"""
if isinstance(X, pd.DataFrame):
X = X.values
if isinstance(y, pd.Series):
y = y.values
scores = cross_val_score(self.model, X, y, cv=cv_folds,
scoring='r2', n_jobs=-1)
return {
'mean_r2': scores.mean(),
'std_r2': scores.std(),
'all_scores': scores
}
def get_feature_importance(self):
"""
Extract feature importance
Returns:
pd.DataFrame: Features ranked by importance
"""
        # Both RandomForestRegressor and XGBRegressor expose feature_importances_
        importance = self.model.feature_importances_
importance_df = pd.DataFrame({
'feature': self.feature_names,
'importance': importance
}).sort_values('importance', ascending=False)
return importance_df
def plot_feature_importance(importance_df, top_n=10, title='Feature Importance'):
"""Visualize top N most important features"""
plt.figure(figsize=(10, 6))
top_features = importance_df.head(top_n)
plt.barh(range(len(top_features)), top_features['importance'],
color='steelblue', alpha=0.8)
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Importance Score', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()
def plot_predictions(y_true, y_pred, title='Actual vs Predicted'):
"""Scatter plot of predictions vs actual values"""
plt.figure(figsize=(8, 8))
plt.scatter(y_true, y_pred, alpha=0.6, edgecolors='k', s=50)
# Perfect prediction line
min_val = min(y_true.min(), y_pred.min())
max_val = max(y_true.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Values', fontsize=12)
plt.ylabel('Predicted Values', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('prediction_scatter.png', dpi=300)
plt.show()
# --- Usage Example ---
if __name__ == "__main__":
# Mock Data: Housing prices prediction
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='Price')
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("=" * 60)
print("RANDOM FOREST REGRESSION")
print("=" * 60)
# Train Random Forest
rf_model = MLRegressor(model_type='rf')
rf_model.fit(X_train, y_train, tune_hyperparams=True)
# Evaluate
rf_results = rf_model.evaluate(X_test, y_test)
print(f"\nTest Set Performance:")
print(f" R² Score: {rf_results['r2']:.4f}")
print(f" RMSE: {rf_results['rmse']:.4f}")
print(f" MAE: {rf_results['mae']:.4f}")
# Cross-validation
cv_results = rf_model.cross_validate(X_train, y_train)
print(f"\nCross-Validation (5-fold):")
print(f" Mean R²: {cv_results['mean_r2']:.4f} ± {cv_results['std_r2']:.4f}")
# Feature importance
importance = rf_model.get_feature_importance()
print("\nTop 5 Most Important Features:")
print(importance.head())
# Visualizations
plot_feature_importance(importance, top_n=8, title='Random Forest Feature Importance')
plot_predictions(y_test, rf_results['predictions'], title='RF: Actual vs Predicted')
# ========== XGBoost ==========
print("\n" + "=" * 60)
print("XGBOOST REGRESSION")
print("=" * 60)
xgb_model = MLRegressor(model_type='xgb')
xgb_model.fit(X_train, y_train, tune_hyperparams=True)
xgb_results = xgb_model.evaluate(X_test, y_test)
print(f"\nTest Set Performance:")
print(f" R² Score: {xgb_results['r2']:.4f}")
print(f" RMSE: {xgb_results['rmse']:.4f}")
print(f" MAE: {xgb_results['mae']:.4f}")
# Feature importance
xgb_importance = xgb_model.get_feature_importance()
    plot_feature_importance(xgb_importance, top_n=8, title='XGBoost Feature Importance')
```
Hyperparameter Tuning Guide
Random Forest Key Parameters
- n_estimators: Number of trees (100-500). More is better but slower.
- max_depth: Tree depth. Too deep = overfitting. Start with 20-30.
- min_samples_split: Minimum samples to split a node. Higher = less overfitting.
- max_features: Features per split. 'sqrt' is a good default. (A short sketch of these settings follows below.)
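A hedged sketch of these knobs in practice (synthetic data); setting `oob_score=True` gives a quick out-of-bag R² estimate without a separate validation split:

```python
# Sketch: typical Random Forest settings, with the out-of-bag score as a sanity check.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=1.0, random_state=0)
rf = RandomForestRegressor(n_estimators=300, max_depth=20, min_samples_split=5,
                           max_features='sqrt', oob_score=True, random_state=0)
rf.fit(X, y)
print(f"Out-of-bag R²: {rf.oob_score_:.3f}")  # rough estimate of generalization performance
```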
XGBoost Key Parameters
- learning_rate: Step size (0.01-0.3). Lower = more trees needed but better accuracy.
- max_depth: Tree depth (3-9). Lower = less overfitting.
- n_estimators: Number of boosting rounds (100-1000).
- subsample: Fraction of samples per tree (0.6-1.0).
- colsample_bytree: Fraction of features per tree (0.6-1.0). (An early-stopping sketch follows below.)
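Because a lower learning_rate needs more boosting rounds, a common recipe is to set n_estimators generously and let early stopping pick the effective number on a validation split. A hedged sketch (assumes xgboost ≥ 1.6, where `early_stopping_rounds` is a constructor argument):

```python
# Sketch: pair a small learning_rate with early stopping instead of tuning n_estimators by hand.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(n_estimators=2000, learning_rate=0.05, max_depth=4,
                         subsample=0.8, colsample_bytree=0.8,
                         early_stopping_rounds=50, objective='reg:squarederror')
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Boosting rounds actually used:", model.best_iteration + 1)
```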
Common Pitfalls
- Data Leakage: Never include the target variable or its derivatives in the features.
- Scaling: Tree-based models don't require feature scaling, so don't treat it as mandatory; it mainly matters if you also compare against linear baselines.
- Ignoring Feature Importance: This is your paper's goldmine. Always visualize and discuss it.
- Overfitting: Always report cross-validation scores, not just test-set scores (see the sketch below).
- Small Test Set: Use at least 20% of the data for testing (or k-fold CV if data is limited).
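To see why CV scores matter, here is a small sketch on deliberately small, noisy synthetic data contrasting the optimistic training-set R² with cross-validated R²:

```python
# Sketch: training R² looks great even when the model generalizes poorly; CV reveals the gap.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=20, noise=5.0, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X, y)

print(f"Train R²:     {rf.score(X, y):.3f}")  # optimistic
print(f"5-fold CV R²: {cross_val_score(rf, X, y, cv=5, scoring='r2').mean():.3f}")  # realistic
```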
Integration Workflow
- Input: Use `data-cleaner` first to handle missing values and outliers.
- Feature Engineering: Use `pca-analyzer` if you have too many correlated features.
- Validation: Use `robustness-check` to test model stability with perturbed data.
- Sensitivity: Use `sensitivity-master` to analyze how predictions change with input variations.
- Visualization: Use `visual-engineer` for publication-quality plots.
Output Requirements for Paper
- Model Specification: "We trained a Random Forest with 300 trees and max_depth=20."
- Performance Metrics: "The model achieved R²=0.87, RMSE=2.34 on the test set."
- Cross-Validation: "5-fold CV yielded mean R²=0.85±0.03, demonstrating robust generalization."
- Feature Importance Plot: Bar chart showing the top 10 features (MUST include).
- Prediction Scatter Plot: Actual vs Predicted with the perfect-prediction line.
- Interpretation: "GDP per capita was the most influential factor, accounting for 35% of model importance." (See the sketch below for computing such shares.)
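If you quote importance shares like the 35% figure above, they can be read off the frame returned by `get_feature_importance()` in the template; dividing by the column sum turns the scores into percentages regardless of how the library normalizes them. A sketch, assuming `rf_model` is the trained model from the usage example:

```python
# Sketch: convert the importance frame from get_feature_importance() into percentages for the paper.
importance_df = rf_model.get_feature_importance()            # from the template above
share = 100 * importance_df['importance'] / importance_df['importance'].sum()
top = importance_df.assign(share_pct=share.round(1))
print(top.head(3))  # e.g. "the top feature accounts for X% of total importance"
```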
Advanced: Partial Dependence Plot (Optional)
For deeper interpretability, show how predictions change with one feature while holding others constant:
```python
from sklearn.inspection import PartialDependenceDisplay
fig, ax = plt.subplots(figsize=(10, 6))
PartialDependenceDisplay.from_estimator(
    rf_model.model, X_train, features=[0, 1], ax=ax  # pass the fitted underlying estimator (or xgb_model.model)
)
plt.tight_layout()
plt.savefig('partial_dependence.png', dpi=300)
```
Partial dependence analysis is O-Prize-level work: it shows how each feature shapes the predictions, though the relationships remain associational rather than strictly causal.
References & Tutorials
Integrated Course Materials
This skill now includes comprehensive regression tutorials from 数模加油站:
references/ Folder:
- 17-回归分析概述.pdf: Overview of regression analysis theory
- 18-一元线性回归模型.pdf: Simple linear regression (single variable)
- 19-多元线性回归模型.pdf: Multiple linear regression
- 20-非线性回归模型.pdf: Nonlinear regression methods
examples/ Folder:
- Regress.py / Regress.m: Simple linear regression implementation (Lesson 18)
- MultipleLinearRegression.ipynb / .mlx: Multiple regression examples (Lesson 19)
- code1.m, code2.py, code3.py, etc.: Nonlinear regression examples (Lesson 20)
- coe1.py, volum.m: Additional regression utilities
Learning Path
- Beginners: Start with Lesson 17 (overview) → Lesson 18 (simple regression).
- Intermediate: Lesson 19 (multiple regression) → run MultipleLinearRegression.ipynb.
- Advanced: Lesson 20 (nonlinear) → use Random Forest/XGBoost (this skill).
When to Use Traditional Regression vs ML
| Scenario | Method | Reason |
|---|---|---|
| Need interpretable coefficients | Linear regression (Lessons 18-19) | Clear β values |
| Small sample (< 50) | Simple regression (Lesson 18) | Avoid overfitting |
| Complex nonlinear | Random Forest / XGBoost (this skill) | Higher accuracy |
| Feature selection critical | This skill | Feature importance built-in |
Time Budget
- Data preparation: 30-60 min (use `data-cleaner`)
- Model training: 20-40 min (with hyperparameter tuning)
- Feature importance analysis: 15-30 min
- Validation & plots: 30-45 min
- Total: 1.5-3 hours
Related Skills
- `data-cleaner`: Preprocessing before regression
- `pca-analyzer`: Dimensionality reduction for many features
- `sensitivity-master`: Analyze prediction sensitivity
- `visual-engineer`: Publication-quality plots