# Feature Engineering Framework

A comprehensive, modular feature engineering framework for general tabular datasets. It provides strategy-based operations, including numerical scaling, categorical encoding, polynomial features, and feature selection, through a configurable pipeline.
## Core Components

### FeatureEngineeringStrategies

Collection of static methods for feature engineering operations:
#### Numerical Features (if interpretability is not a concern)
- `scale_numerical(df, columns, method)` - Scale using 'standard', 'minmax', or 'robust' (sketched below)
- `create_bins(df, columns, n_bins, strategy)` - Discretize using 'uniform', 'quantile', or 'kmeans'
- `create_polynomial_features(df, columns, degree)` - Generate polynomial and interaction terms
- `create_interaction_features(df, column_pairs)` - Create multiplication interactions
- `create_log_features(df, columns)` - Log-transform for skewed distributions
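The framework's internals aren't reproduced here, but for orientation, here is a minimal sketch of what `scale_numerical` might look like, assuming pandas and scikit-learn; the `_<method>` suffix used for the new column names is an illustrative convention, not confirmed by the source.

```python
# A sketch (not the framework's actual source) of scale_numerical,
# assuming pandas + scikit-learn scalers.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

def scale_numerical(df: pd.DataFrame, columns: list, method: str = "standard") -> pd.DataFrame:
    scalers = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}
    if method not in scalers:
        raise ValueError(f"method must be one of {sorted(scalers)}, got {method!r}")
    out = df.copy()
    # Write scaled values into suffixed columns so the originals are preserved.
    out[[f"{c}_{method}" for c in columns]] = scalers[method]().fit_transform(out[columns])
    return out
```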
#### Categorical Features
- `encode_categorical(df, columns, method)` - Encode using 'onehot', 'label', 'frequency', or 'hash'
- `create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs)` - Group-level statistics (sketched below)
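As a rough illustration of the aggregation helper, a sketch assuming pandas; the `<num>_<fn>_by_<cat>` naming scheme is hypothetical.

```python
# Illustrative sketch of create_category_aggregations: compute per-category
# statistics for numerical columns and merge them back onto each row.
import pandas as pd

def create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs=("mean", "std")):
    agg = df.groupby(categorical_col)[list(numerical_cols)].agg(list(agg_funcs))
    # Flatten the MultiIndex, e.g. ('income', 'mean') -> 'income_mean_by_city'.
    agg.columns = [f"{num}_{fn}_by_{categorical_col}" for num, fn in agg.columns]
    return df.merge(agg, left_on=categorical_col, right_index=True, how="left")
```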
#### Binary Features
- `convert_to_binary(df, columns)` - Convert Yes/No and True/False values to 0/1 (cast to int)
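A minimal sketch of how such a conversion might work, assuming pandas; the exact token set the framework recognizes is an assumption.

```python
# Sketch of convert_to_binary: unrecognized values map to <NA> so they
# surface during validation instead of being silently coerced.
import pandas as pd

_BINARY_MAP = {"yes": 1, "no": 0, "true": 1, "false": 0, "y": 1, "n": 0}

def convert_to_binary(df, columns):
    out = df.copy()
    for col in columns:
        normalized = out[col].astype(str).str.strip().str.lower()
        out[col] = normalized.map(_BINARY_MAP).astype("Int64")  # nullable int
    return out
```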
#### Data Quality Validation
- `validate_numeric_features(df, exclude_cols)` - Verify all features are numeric (except ID columns)
- `validate_no_constants(df, exclude_cols)` - Remove constant columns with no variance
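Plausible one-screen versions of the two validators, assuming pandas; the signatures mirror the list above, while the internals are illustrative guesses.

```python
# Sketches of the data-quality validators.
import pandas as pd

def validate_numeric_features(df, exclude_cols=()):
    non_numeric = df.drop(columns=list(exclude_cols)).select_dtypes(exclude="number").columns
    if len(non_numeric) > 0:
        raise TypeError(f"Non-numeric feature columns remain: {list(non_numeric)}")
    return df

def validate_no_constants(df, exclude_cols=()):
    candidates = [c for c in df.columns if c not in exclude_cols]
    # A single unique value means zero variance and no predictive signal.
    constants = [c for c in candidates if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constants)
```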
#### Feature Selection
- `select_features_variance(df, columns, threshold)` - Remove low-variance features (default threshold: 0.01). Columns whose values are nearly all identical carry little information, so dropping them reduces dimensionality.
- `select_features_correlation(df, columns, threshold)` - Remove highly correlated features (sketched below)
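A sketch of how correlation-based selection typically works (keep the first column of each highly correlated pair, drop the second), assuming pandas; the real method may use a different tie-breaking rule.

```python
# Sketch of select_features_correlation using absolute Pearson correlation.
import pandas as pd

def select_features_correlation(df, columns, threshold=0.95):
    corr = df[columns].corr().abs()
    to_drop = set()
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            if a not in to_drop and b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)  # keep the first column of the pair
    return df.drop(columns=sorted(to_drop))
```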
### FeatureEngineeringPipeline

Orchestrates multiple feature engineering steps with logging.
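The actual class lives in `feature_engineering`; purely as orientation, here is a hedged sketch of the plumbing the usage example below relies on: `add_step` returns `self` to allow chaining, and `execute` logs the columns each step adds.

```python
# Illustrative sketch of the pipeline's plumbing, not the real source.
import pandas as pd

class FeatureEngineeringPipeline:
    def __init__(self, name):
        self.name = name
        self._steps = []        # (func, kwargs, description)
        self._log = []
        self._engineered = []

    def add_step(self, func, description="", **kwargs):
        self._steps.append((func, kwargs, description))
        return self  # enables pipeline.add_step(...).add_step(...)

    def execute(self, df, verbose=False):
        for func, kwargs, desc in self._steps:
            before = set(df.columns)
            df = func(df, **kwargs)
            added = [c for c in df.columns if c not in before]
            self._engineered.extend(added)
            self._log.append({"step": func.__name__, "description": desc, "added": added})
            if verbose:
                print(f"[{self.name}] {func.__name__}: {desc} (+{len(added)} columns)")
        # Prune names that later steps (e.g. variance selection) dropped.
        self._engineered = [c for c in self._engineered if c in df.columns]
        return df

    def get_engineered_features(self):
        return list(self._engineered)

    def get_log(self):
        return pd.DataFrame(self._log)
```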
CRITICAL REQUIREMENTS:
- ALL output features MUST be numeric (int or float) - DID analysis cannot use string/object columns
- Preview data types BEFORE processing: use `df.dtypes` and `df.head()` to check actual values
- Encode ALL categorical variables - strings like "degree" and "age_range" must be converted to numbers
- Verify the output: the final dataframe should satisfy `df.select_dtypes(include='number').shape[1] == df.shape[1] - 1` (excluding the ID column; checked in the snippet below)
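The last requirement can be verified in one assertion; a toy example where `id` stands in for the ID column:

```python
# Quick check of the numeric-only requirement on a toy frame.
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "age": [34, 29], "degree_BSc": [1, 0]})
offenders = df.drop(columns=["id"]).select_dtypes(exclude="number").columns
assert df.select_dtypes(include="number").shape[1] == df.shape[1] - 1, (
    f"Encode these columns before analysis: {list(offenders)}"
)
```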
## Usage Example

```python
from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete contains the original columns plus the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: keep only the ID column plus the engineered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>'] + engineered_features]

# Get execution log
log_df = pipeline.get_log()
```
## Input

- A valid DataFrame that has already gone through any required data processing, imputation, or column drops (REQUIRED)
## Output

- DataFrame with both original and engineered columns
- Engineered feature names accessible via `pipeline.get_engineered_features()`
- Execution log available via `pipeline.get_log()`
## Key Features

- Multiple encoding methods for categorical variables
- Automatic handling of high-cardinality categoricals
- Polynomial and interaction feature generation
- Built-in feature selection for dimensionality reduction
- Pipeline pattern for reproducible transformations
## Best Practices

- Always validate data types before downstream analysis: use `validate_numeric_features()` after encoding
- Check for constant columns that provide no information: use `validate_no_constants()` before modeling
- Convert binary features before other transformations
- Use one-hot encoding for low-cardinality categoricals
- Use KNN imputation when missing values can be inferred from other relevant columns (see the sketch after this list)
- Use hash encoding for high-cardinality features (IDs, etc.)
- Apply a variance threshold to remove near-constant features
- Check the correlation matrix before modeling to avoid multicollinearity
- MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL
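For the KNN-imputation practice above, a sketch using scikit-learn's `KNNImputer`; `knn_impute` is an illustrative helper, not part of the framework:

```python
# Sketch: impute missing values from the k nearest rows in the space of the
# listed numeric columns (do this before feature engineering).
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute(df, columns, n_neighbors=5):
    out = df.copy()
    out[columns] = KNNImputer(n_neighbors=n_neighbors).fit_transform(out[columns])
    return out
```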