Data Cleaning Skill: Development Economics Panel Data

Description

Clean and merge cross-country panel datasets for development economics research. Use when processing raw data from the World Bank, Penn World Table, WGI, or similar sources.

Country Code Standardization

Always use ISO 3166-1 alpha-3 codes as the primary identifier. Common issues:

•World Bank uses non-standard codes for some territories (e.g., XKX for Kosovo)
•Some datasets use country names instead of codes — use pycountry for mapping
•Always check: are there aggregates mixed in? (e.g., "World", "OECD", "Sub-Saharan Africa") — drop these

python

# Standard approach
import pycountry

def standardize_country_code(name_or_code):
    """Try to match to ISO 3166-1 alpha-3."""
    try:
        return pycountry.countries.lookup(name_or_code).alpha_3
    except LookupError:
        return None  # Flag for manual review

Common Data Sources & Quirks

World Bank WDI (GDP PPP pc)

•Indicator: NY.GDP.PCAP.PP.KD (constant 2021 international $)
•Watch out for: series breaks, missing years for conflict states
•Download format: wide (years as columns) — reshape to long

World Governance Indicators (Rule of Law)

•Indicator: RL.EST (Rule of Law: Estimate)
•Range: approximately -2.5 to +2.5
•Available from 1996, annual from 2002
•Use average of 2015-2019 as pre-treatment baseline

Penn World Table

•Alternative to WDI for GDP
•rgdpe / pop for GDP per capita
•More methodological transparency, slower to update

Panel Data Assembly Checklist

•Load each dataset separately
•Reshape to long format: columns = [country_code, year, variable]
•Standardize country codes across all datasets
•Merge on (country_code, year) using outer join first to see coverage
•Check merge quality: how many matched? Unmatched?
•Decide on sample: balanced vs. unbalanced panel
•For balanced: keep only countries with full coverage across all variables
•
Handle missing values:
- •Linear interpolation for isolated gaps (max 2 years)
- •Drop countries with >20% missing in key variables
- •Document every decision

Variable Construction

python

# Treatment assignment (time-invariant)
baseline_rol = df[df['year'].between(2015, 2019)].groupby('country_code')['rule_of_law'].mean()
median_rol = baseline_rol.median()
df['high_rol'] = df['country_code'].map(lambda x: int(baseline_rol.get(x, float('nan')) >= median_rol))

# Post-treatment indicator
df['post2022'] = (df['year'] >= 2023).astype(int)

# Interaction (the DiD coefficient)
df['high_rol_post2022'] = df['high_rol'] * df['post2022']

# Log transformation
df['log_gdp_ppp_pc'] = np.log(df['gdp_ppp_pc'])

Quality Checks (run after every merge)

python

# Duplicates
assert df.duplicated(subset=['country_code', 'year']).sum() == 0, "Duplicates found!"

# Panel balance
panel_check = df.groupby('country_code')['year'].nunique()
print(f"Countries: {df['country_code'].nunique()}")
print(f"Years per country — min: {panel_check.min()}, max: {panel_check.max()}, median: {panel_check.median()}")

# Missing values
print(df.isnull().sum())
print(f"Treatment group sizes: {df.groupby('high_rol')['country_code'].nunique()}")