# Feature Engineering Framework

A comprehensive, modular feature engineering framework for general tabular datasets. It provides strategy-based operations, including numerical scaling, categorical encoding, polynomial features, and feature selection, through a configurable pipeline.
## Core Components

### FeatureEngineeringStrategies

Collection of static methods for feature engineering operations:
#### Numerical Features (if interpretability is not a concern)
- `scale_numerical(df, columns, method)` - Scale using 'standard', 'minmax', or 'robust' (sketched below)
- `create_bins(df, columns, n_bins, strategy)` - Discretize using 'uniform', 'quantile', or 'kmeans'
- `create_polynomial_features(df, columns, degree)` - Generate polynomial and interaction terms
- `create_interaction_features(df, column_pairs)` - Create multiplication interactions
- `create_log_features(df, columns)` - Log-transform for skewed distributions
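The framework's internals aren't reproduced here, but for orientation, here is a minimal sketch of what `scale_numerical` might look like, assuming pandas and scikit-learn; the `_<method>` suffix used for the new column names is an illustrative convention, not confirmed by the source.

```python
# A sketch (not the framework's actual source) of scale_numerical,
# assuming pandas + scikit-learn scalers.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

def scale_numerical(df: pd.DataFrame, columns: list, method: str = "standard") -> pd.DataFrame:
    scalers = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}
    if method not in scalers:
        raise ValueError(f"method must be one of {sorted(scalers)}, got {method!r}")
    out = df.copy()
    # Write scaled values into suffixed columns so the originals are preserved.
    out[[f"{c}_{method}" for c in columns]] = scalers[method]().fit_transform(out[columns])
    return out
```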
#### Categorical Features
- `encode_categorical(df, columns, method)` - Encode using 'onehot', 'label', 'frequency', or 'hash'
- `create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs)` - Group-level statistics (sketched below)
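As a rough illustration of the aggregation helper, a sketch assuming pandas; the `<num>_<fn>_by_<cat>` naming scheme is hypothetical.

```python
# Illustrative sketch of create_category_aggregations: compute per-category
# statistics for numerical columns and merge them back onto each row.
import pandas as pd

def create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs=("mean", "std")):
    agg = df.groupby(categorical_col)[list(numerical_cols)].agg(list(agg_funcs))
    # Flatten the MultiIndex, e.g. ('income', 'mean') -> 'income_mean_by_city'.
    agg.columns = [f"{num}_{fn}_by_{categorical_col}" for num, fn in agg.columns]
    return df.merge(agg, left_on=categorical_col, right_index=True, how="left")
```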
#### Binary Features
- `convert_to_binary(df, columns)` - Convert Yes/No and True/False values to 0/1 (cast to int)
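A minimal sketch of how such a conversion might work, assuming pandas; the exact token set the framework recognizes is an assumption.

```python
# Sketch of convert_to_binary: unrecognized values map to <NA> so they
# surface during validation instead of being silently coerced.
import pandas as pd

_BINARY_MAP = {"yes": 1, "no": 0, "true": 1, "false": 0, "y": 1, "n": 0}

def convert_to_binary(df, columns):
    out = df.copy()
    for col in columns:
        normalized = out[col].astype(str).str.strip().str.lower()
        out[col] = normalized.map(_BINARY_MAP).astype("Int64")  # nullable int
    return out
```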
#### Data Quality Validation
- `validate_numeric_features(df, exclude_cols)` - Verify all features are numeric (except ID columns)
- `validate_no_constants(df, exclude_cols)` - Remove constant columns with no variance
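Plausible one-screen versions of the two validators, assuming pandas; the signatures mirror the list above, while the internals are illustrative guesses.

```python
# Sketches of the data-quality validators.
import pandas as pd

def validate_numeric_features(df, exclude_cols=()):
    non_numeric = df.drop(columns=list(exclude_cols)).select_dtypes(exclude="number").columns
    if len(non_numeric) > 0:
        raise TypeError(f"Non-numeric feature columns remain: {list(non_numeric)}")
    return df

def validate_no_constants(df, exclude_cols=()):
    candidates = [c for c in df.columns if c not in exclude_cols]
    # A single unique value means zero variance and no predictive signal.
    constants = [c for c in candidates if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constants)
```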
#### Feature Selection
- `select_features_variance(df, columns, threshold)` - Remove low-variance features (default threshold: 0.01). Columns whose values are nearly all identical carry little information, so dropping them reduces dimensionality.
- `select_features_correlation(df, columns, threshold)` - Remove highly correlated features (sketched below)
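A sketch of how correlation-based selection typically works (keep the first column of each highly correlated pair, drop the second), assuming pandas; the real method may use a different tie-breaking rule.

```python
# Sketch of select_features_correlation using absolute Pearson correlation.
import pandas as pd

def select_features_correlation(df, columns, threshold=0.95):
    corr = df[columns].corr().abs()
    to_drop = set()
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            if a not in to_drop and b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)  # keep the first column of the pair
    return df.drop(columns=sorted(to_drop))
```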
### FeatureEngineeringPipeline

Orchestrates multiple feature engineering steps with logging.
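The actual class lives in `feature_engineering`; purely as orientation, here is a hedged sketch of the plumbing the usage example below relies on: `add_step` returns `self` to allow chaining, and `execute` logs the columns each step adds.

```python
# Illustrative sketch of the pipeline's plumbing, not the real source.
import pandas as pd

class FeatureEngineeringPipeline:
    def __init__(self, name):
        self.name = name
        self._steps = []        # (func, kwargs, description)
        self._log = []
        self._engineered = []

    def add_step(self, func, description="", **kwargs):
        self._steps.append((func, kwargs, description))
        return self  # enables pipeline.add_step(...).add_step(...)

    def execute(self, df, verbose=False):
        for func, kwargs, desc in self._steps:
            before = set(df.columns)
            df = func(df, **kwargs)
            added = [c for c in df.columns if c not in before]
            self._engineered.extend(added)
            self._log.append({"step": func.__name__, "description": desc, "added": added})
            if verbose:
                print(f"[{self.name}] {func.__name__}: {desc} (+{len(added)} columns)")
        # Prune names that later steps (e.g. variance selection) dropped.
        self._engineered = [c for c in self._engineered if c in df.columns]
        return df

    def get_engineered_features(self):
        return list(self._engineered)

    def get_log(self):
        return pd.DataFrame(self._log)
```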
CRITICAL REQUIREMENTS:
- ALL output features MUST be numeric (int or float) - DID analysis cannot use string/object columns
- Preview data types BEFORE processing: use `df.dtypes` and `df.head()` to check actual values
- Encode ALL categorical variables - strings like "degree" and "age_range" must be converted to numbers
- Verify the output: the final dataframe should satisfy `df.select_dtypes(include='number').shape[1] == df.shape[1] - 1` (excluding the ID column; checked in the snippet below)
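The last requirement can be verified in one assertion; a toy example where `id` stands in for the ID column:

```python
# Quick check of the numeric-only requirement on a toy frame.
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "age": [34, 29], "degree_BSc": [1, 0]})
offenders = df.drop(columns=["id"]).select_dtypes(exclude="number").columns
assert df.select_dtypes(include="number").shape[1] == df.shape[1] - 1, (
    f"Encode these columns before analysis: {list(offenders)}"
)
```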
## Usage Example

```python
from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete contains the original columns plus the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: keep only the ID column plus the engineered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>'] + engineered_features]

# Get execution log
log_df = pipeline.get_log()
```
## Input

- A valid DataFrame that has already gone through any required data processing, imputation, or column drops (REQUIRED)
## Output

- DataFrame with both original and engineered columns
- Engineered feature names accessible via `pipeline.get_engineered_features()`
- Execution log available via `pipeline.get_log()`
## Key Features

- Multiple encoding methods for categorical variables
- Automatic handling of high-cardinality categoricals
- Polynomial and interaction feature generation
- Built-in feature selection for dimensionality reduction
- Pipeline pattern for reproducible transformations
## Best Practices

- Always validate data types before downstream analysis: use `validate_numeric_features()` after encoding
- Check for constant columns that provide no information: use `validate_no_constants()` before modeling
- Convert binary features before other transformations
- Use one-hot encoding for low-cardinality categoricals
- Use KNN imputation when missing values can be inferred from other relevant columns (see the sketch after this list)
- Use hash encoding for high-cardinality features (IDs, etc.)
- Apply a variance threshold to remove near-constant features
- Check the correlation matrix before modeling to avoid multicollinearity
- MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL
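For the KNN-imputation practice above, a sketch using scikit-learn's `KNNImputer`; `knn_impute` is an illustrative helper, not part of the framework:

```python
# Sketch: impute missing values from the k nearest rows in the space of the
# listed numeric columns (do this before feature engineering).
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute(df, columns, n_neighbors=5):
    out = df.copy()
    out[columns] = KNNImputer(n_neighbors=n_neighbors).fit_transform(out[columns])
    return out
```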