MicroImpute
MicroImpute enables ML-based variable imputation through different statistical methods, with comparison and benchmarking capabilities.
For Users 👥
What is MicroImpute?
When PolicyEngine calculates population impacts, the underlying survey data often lacks variables the calculations need. MicroImpute uses machine learning to predict those missing values from datasets that do contain them.
What imputation does:
- Fills missing data in surveys
- Uses machine learning to predict missing values
- Maintains statistical relationships
- Improves PolicyEngine accuracy
Example:
- Survey asks about income but not capital gains breakdown
- MicroImpute predicts short-term vs long-term capital gains
- Based on patterns from IRS data
- Result: More accurate tax calculations
You benefit from imputation when:
- PolicyEngine calculates capital gains tax accurately
- Benefits eligibility uses complete household information
- State-specific calculations have all needed data
For Analysts 📊
Installation
pip install microimpute
# With image export (for plots)
pip install microimpute[images]
What MicroImpute Does
Imputation problem:
- Donor dataset has complete information (e.g., IRS tax records)
- Recipient dataset has missing variables (e.g., CPS survey)
- Imputation predicts missing values in recipient using donor patterns
Methods available:
- Linear regression
- Random forest
- Quantile forest (preserves full distribution)
- XGBoost
- Hot deck (traditional matching)
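As a rough illustration of the last entry in this list, hot deck matching can be sketched in plain NumPy: each recipient record copies the target value of its nearest donor on the common variables. This is not MicroImpute's implementation. The function name `hot_deck_impute`, the standardization step, and the toy values are assumptions made for the example.

```python
import numpy as np

def hot_deck_impute(donor_X, donor_y, recipient_X):
    """Hypothetical helper: copy the target from the nearest donor record."""
    donor_X = np.asarray(donor_X, dtype=float)
    recipient_X = np.asarray(recipient_X, dtype=float)
    donor_y = np.asarray(donor_y)

    # Standardize with donor statistics so no single variable dominates the distance.
    mean = donor_X.mean(axis=0)
    std = donor_X.std(axis=0)
    std[std == 0] = 1.0
    d = (donor_X - mean) / std
    r = (recipient_X - mean) / std

    # Squared Euclidean distance between every recipient and every donor.
    dist = ((r[:, None, :] - d[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    return donor_y[nearest]

# Toy data: columns are income and age; the target is capital gains.
donor_X = [[50000, 30], [60000, 40], [70000, 50]]
donor_y = [5000, 8000, 12000]
recipient_X = [[52000, 31], [68000, 49]]
imputed = hot_deck_impute(donor_X, donor_y, recipient_X)
```

Because values are copied from real donors, every imputed value is one that actually occurs in the donor data. The model-based methods in the list predict values instead of copying them.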
Quick Example
from microimpute import Imputer
import pandas as pd
# Donor data (complete)
donor = pd.DataFrame({
    'income': [50000, 60000, 70000],
    'age': [30, 40, 50],
    'capital_gains': [5000, 8000, 12000]  # Variable to impute
})
# Recipient data (missing capital_gains)
recipient = pd.DataFrame({
    'income': [55000, 65000],
    'age': [35, 45],
    # capital_gains is missing
})
# Impute using quantile forest
imputer = Imputer(method='quantile_forest')
imputer.fit(
    donor=donor,
    donor_target='capital_gains',
    common_vars=['income', 'age']
)
recipient_imputed = imputer.predict(recipient)
# Now recipient has predicted capital_gains
Method Comparison
from microimpute import compare_methods
# Compare different imputation methods
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='capital_gains',
    common_vars=['income', 'age'],
    methods=['linear', 'random_forest', 'quantile_forest']
)
# Shows quantile loss for each method
print(results)
Quantile Loss (Quality Metric)
Why quantile loss:
- Measures how well imputation preserves the distribution
- Not just mean accuracy, but full distribution shape
- Lower is better
Interpretation:
# Quantile loss around 0.1 = good
# Quantile loss around 0.5 = poor
# Values scale with the target's units, so compare across methods on the same data to choose the best
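For reference, the standard quantile (pinball) loss for a quantile level q penalizes under-predictions by q and over-predictions by (1 - q). The sketch below computes it with NumPy. The function name `quantile_loss` and the toy numbers are illustrative, and whether MicroImpute averages this over several quantile levels is not stated here, so treat that as something to confirm in the source.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Average pinball loss of predictions y_pred for quantile level q (0 < q < 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_true - y_pred
    # Under-predictions (error > 0) cost q * error; over-predictions cost (1 - q) * |error|.
    return float(np.mean(np.maximum(q * error, (q - 1) * error)))

y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 25.0]
median_loss = quantile_loss(y_true, y_pred, q=0.5)
upper_loss = quantile_loss(y_true, y_pred, q=0.9)
```

At q = 0.5 the loss is half the mean absolute error. At q = 0.9 the same under-prediction of 5 costs more, which is why the metric rewards predictions that sit at the right point of the distribution rather than only near its center.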
For Contributors 💻
Repository
Location: PolicyEngine/microimpute
Clone:
git clone https://github.com/PolicyEngine/microimpute
cd microimpute
Current Implementation
To see structure:
tree microimpute/
# Key modules:
ls microimpute/
# - imputer.py - Main Imputer class
# - methods/ - Different imputation methods
# - comparison.py - Method benchmarking
# - utils/ - Utilities
To see specific methods:
# Quantile forest implementation
cat microimpute/methods/quantile_forest.py
# Random forest
cat microimpute/methods/random_forest.py
# Linear regression
cat microimpute/methods/linear.py
Dependencies
Required:
- numpy, pandas (data handling)
- scikit-learn (ML models)
- quantile-forest (distributional imputation)
- optuna (hyperparameter tuning)
- statsmodels (statistical methods)
- scipy (statistical functions)
To see all dependencies:
cat pyproject.toml
Adding New Imputation Methods
Pattern:
# microimpute/methods/my_method.py
class MyMethodImputer:
    def fit(self, X_train, y_train):
        """Train on donor data."""
        # Fit your model
        pass

    def predict(self, X_test):
        """Impute on recipient data."""
        # Return predictions
        pass

    def get_quantile_loss(self, X_val, y_val):
        """Compute validation loss."""
        # Evaluate quality
        pass
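To make the pattern concrete, here is a minimal sketch of a method with the same three-method interface, using ordinary least squares from NumPy. The class name `LeastSquaresImputer`, the `q` argument, and the choice to score at the median are illustrative assumptions. Check the existing modules under microimpute/methods/ for the base class and signatures the package actually expects.

```python
import numpy as np

class LeastSquaresImputer:
    """Hypothetical example method: ordinary least squares with an intercept."""

    def fit(self, X_train, y_train):
        """Train on donor data."""
        X = self._with_intercept(X_train)
        y = np.asarray(y_train, dtype=float)
        # Least-squares coefficients; lstsq also handles rank-deficient inputs.
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X_test):
        """Impute on recipient data."""
        return self._with_intercept(X_test) @ self.coef_

    def get_quantile_loss(self, X_val, y_val, q=0.5):
        """Pinball loss of the point predictions at quantile level q."""
        error = np.asarray(y_val, dtype=float) - self.predict(X_val)
        return float(np.mean(np.maximum(q * error, (q - 1) * error)))

    @staticmethod
    def _with_intercept(X):
        X = np.asarray(X, dtype=float)
        return np.column_stack([np.ones(len(X)), X])

# Noise-free toy data, so the fit should recover y = 2x + 1 exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = 2.0 * X_train[:, 0] + 1.0
model = LeastSquaresImputer().fit(X_train, y_train)
predictions = model.predict(np.array([[5.0], [6.0]]))
validation_loss = model.get_quantile_loss(X_train, y_train)
```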
Usage in policyengine-us-data
To see how data pipeline uses microimpute:
cd ../policyengine-us-data
# Find usage
grep -r "microimpute" policyengine_us_data/
grep -r "Imputer" policyengine_us_data/
Typical workflow:
- Load CPS (has demographics, missing capital gains details)
- Load IRS PUF (has complete tax data)
- Use microimpute to predict missing CPS variables from PUF patterns
- Validate imputation quality
- Save enhanced dataset
Testing
Run tests:
make test
# Or
pytest tests/ -v --cov=microimpute
To see test patterns:
cat tests/test_imputer.py
cat tests/test_methods.py
Common Patterns
Pattern 1: Basic Imputation
from microimpute import Imputer
# Create imputer
imputer = Imputer(method='quantile_forest')
# Fit on donor (complete data)
imputer.fit(
    donor=donor_df,
    donor_target='target_variable',
    common_vars=['age', 'income', 'state']
)
# Predict on recipient (missing target_variable)
recipient_imputed = imputer.predict(recipient_df)
Pattern 2: Choosing Best Method
from microimpute import compare_methods
# Test multiple methods
methods = ['linear', 'random_forest', 'quantile_forest', 'xgboost']
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='target',
    common_vars=common_vars,
    methods=methods
)
# Use method with lowest quantile loss
best_method = results.sort_values('quantile_loss').iloc[0]['method']
Pattern 3: Multiple Variable Imputation
# Impute several variables
variables_to_impute = [
    'short_term_capital_gains',
    'long_term_capital_gains',
    'qualified_dividends'
]
for var in variables_to_impute:
    imputer = Imputer(method='quantile_forest')
    imputer.fit(donor=irs_puf, donor_target=var, common_vars=common_vars)
    cps[var] = imputer.predict(cps)
Advanced Features
Hyperparameter Tuning
Built-in Optuna integration:
from microimpute import tune_hyperparameters
# Automatically find best hyperparameters
best_params, study = tune_hyperparameters(
    donor=donor,
    target_var='target',
    common_vars=common_vars,
    method='quantile_forest',
    n_trials=100
)
# Use tuned parameters
imputer = Imputer(method='quantile_forest', **best_params)
Cross-Validation
Validate imputation quality:
from sklearn.model_selection import cross_val_score
# Split donor for validation
# Impute on validation set
# Measure accuracy
Visualization
Plot imputation results:
import plotly.express as px
# Compare imputed vs actual (on donor validation set)
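fig = px.scatter(
    x=actual_values,
    y=imputed_values,
    labels={'x': 'Actual', 'y': 'Imputed'}
)
# 45-degree reference line: px.line returns a Figure, not a trace, so draw a shape instead
lo, hi = min(actual_values), max(actual_values)
fig.add_shape(type='line', x0=lo, y0=lo, x1=hi, y1=hi)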
Statistical Background
Imputation preserves:
- Marginal distributions (imputed variable distribution matches donor)
- Conditional relationships (imputation depends on common variables)
- Uncertainty (quantile methods preserve full distribution)
Trade-offs:
- Linear: Fast, but assumes linear relationships
- Random forest: Handles non-linearity, may overfit
- Quantile forest: Preserves full distribution, slower
- XGBoost: High accuracy, requires tuning
Integration with PolicyEngine
Full pipeline (policyengine-us-data):
1. Load CPS survey data
↓
2. microimpute: Fill missing variables from IRS PUF
↓
3. microcalibrate: Adjust weights to match benchmarks
↓
4. Validation: Check against administrative totals
↓
5. Package: Distribute enhanced dataset
↓
6. PolicyEngine: Use for population simulations
Comparison to Other Methods
MicroImpute vs traditional imputation:
Traditional (mean imputation):
- Fast but destroys distribution
- All missing values get same value
- Underestimates variance
MicroImpute (ML methods):
- Preserves relationships
- Different predictions per record
- Maintains distribution shape
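The variance claim can be checked with a small synthetic experiment. The sketch below hides half of a target column, fills it once with the observed mean and once with a per-record prediction from a fitted line, and compares the spread of each completed column with the truth. The data and the use of `np.polyfit` as the "model" are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)

# Pretend the second half of the records is missing y.
observed, missing = slice(0, 500), slice(500, 1000)

# Mean imputation: every missing record gets the same observed mean.
mean_filled = y.copy()
mean_filled[missing] = y[observed].mean()

# Conditional imputation: predict from x with a line fit on the observed half.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
model_filled = y.copy()
model_filled[missing] = slope * x[missing] + intercept

true_std = y.std()
mean_std = mean_filled.std()
model_std = model_filled.std()
```

The mean-filled column comes out much less spread out than the true column, while the model-filled column stays close to it. Point predictions still leave out the residual noise around the fitted line, and that remaining gap is the one distributional methods aim to close.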
Quantile forest advantage:
- Predicts full conditional distribution
- Not just point estimates
- Can sample from predicted distribution
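Sampling from a predicted distribution can be sketched with inverse-CDF sampling: given one record's predicted values at several quantile levels, draw a uniform random level and interpolate between the predictions. The helper name `sample_from_quantiles` and the quantile values below are hypothetical, and this is not necessarily how MicroImpute or the quantile-forest package draws samples.

```python
import numpy as np

def sample_from_quantiles(quantile_levels, quantile_values, n_draws, rng):
    """Hypothetical helper: inverse-CDF sampling from one record's predicted quantiles."""
    # Draw levels only inside the predicted range; the tails beyond it are unknown.
    u = rng.uniform(min(quantile_levels), max(quantile_levels), size=n_draws)
    # Linear interpolation between predicted quantiles approximates the inverse CDF.
    return np.interp(u, quantile_levels, quantile_values)

rng = np.random.default_rng(0)
levels = [0.1, 0.25, 0.5, 0.75, 0.9]
values = [0.0, 500.0, 2000.0, 6000.0, 15000.0]  # illustrative predicted capital gains
draws = sample_from_quantiles(levels, values, n_draws=1000, rng=rng)
```

Assigning each recipient record one such draw, rather than a single point estimate, is what lets the imputed column keep the skew and spread of the donor distribution.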
Performance Tips
For large datasets:
# Use random forest (faster than quantile forest)
imputer = Imputer(method='random_forest')
# Or subsample donor
donor_sample = donor.sample(n=10000, random_state=42)
imputer.fit(donor=donor_sample, ...)
For high accuracy:
# Use quantile forest with tuning
best_params, _ = tune_hyperparameters(...)
imputer = Imputer(method='quantile_forest', **best_params)
Related Skills
- l0-skill - Regularization techniques
- microcalibrate-skill - Survey calibration (next step after imputation)
- policyengine-us-data-skill - Complete data pipeline
- microdf-skill - Working with imputed/calibrated data
Resources
Repository: https://github.com/PolicyEngine/microimpute
PyPI: https://pypi.org/project/microimpute/
Documentation: See README and docstrings in source