AgentSkillsCN

Data Cleaning

数据清洗

SKILL.md

Data Cleaning Skill: Development Economics Panel Data

Description

Clean and merge cross-country panel datasets for development economics research. Use when processing raw data from the World Bank, Penn World Table, WGI, or similar sources.

Country Code Standardization

Always use ISO 3166-1 alpha-3 codes as the primary identifier. Common issues:

  • World Bank uses non-standard codes for some territories (e.g., XKX for Kosovo)
  • Some datasets use country names instead of codes — use pycountry for mapping
  • Always check: are there aggregates mixed in? (e.g., "World", "OECD", "Sub-Saharan Africa") — drop these
python
# Standard approach
import pycountry

def standardize_country_code(name_or_code):
    """Try to match to ISO 3166-1 alpha-3."""
    try:
        return pycountry.countries.lookup(name_or_code).alpha_3
    except LookupError:
        return None  # Flag for manual review

Common Data Sources & Quirks

World Bank WDI (GDP PPP pc)

  • Indicator: NY.GDP.PCAP.PP.KD (constant 2021 international $)
  • Watch out for: series breaks, missing years for conflict states
  • Download format: wide (years as columns) — reshape to long

World Governance Indicators (Rule of Law)

  • Indicator: RL.EST (Rule of Law: Estimate)
  • Range: approximately -2.5 to +2.5
  • Available from 1996, annual from 2002
  • Use average of 2015-2019 as pre-treatment baseline

Penn World Table

  • Alternative to WDI for GDP
  • rgdpe / pop for GDP per capita
  • More methodological transparency, slower to update

Panel Data Assembly Checklist

  1. Load each dataset separately
  2. Reshape to long format: columns = [country_code, year, variable]
  3. Standardize country codes across all datasets
  4. Merge on (country_code, year) using outer join first to see coverage
  5. Check merge quality: how many matched? Unmatched?
  6. Decide on sample: balanced vs. unbalanced panel
  7. For balanced: keep only countries with full coverage across all variables
  8. Handle missing values:
    • Linear interpolation for isolated gaps (max 2 years)
    • Drop countries with >20% missing in key variables
    • Document every decision

Variable Construction

python
# Treatment assignment (time-invariant)
baseline_rol = df[df['year'].between(2015, 2019)].groupby('country_code')['rule_of_law'].mean()
median_rol = baseline_rol.median()
df['high_rol'] = df['country_code'].map(lambda x: int(baseline_rol.get(x, float('nan')) >= median_rol))

# Post-treatment indicator
df['post2022'] = (df['year'] >= 2023).astype(int)

# Interaction (the DiD coefficient)
df['high_rol_post2022'] = df['high_rol'] * df['post2022']

# Log transformation
df['log_gdp_ppp_pc'] = np.log(df['gdp_ppp_pc'])

Quality Checks (run after every merge)

python
# Duplicates
assert df.duplicated(subset=['country_code', 'year']).sum() == 0, "Duplicates found!"

# Panel balance
panel_check = df.groupby('country_code')['year'].nunique()
print(f"Countries: {df['country_code'].nunique()}")
print(f"Years per country — min: {panel_check.min()}, max: {panel_check.max()}, median: {panel_check.median()}")

# Missing values
print(df.isnull().sum())
print(f"Treatment group sizes: {df.groupby('high_rol')['country_code'].nunique()}")