education-data-source-scorecard

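College Scorecard data source covering post-college earnings (from IRS/Treasury), student debt and repayment (from NSLDS), and completion metrics. Use when analyzing post-graduation earnings, loan repayment, or debt levels, or when understanding Scorecard's Title IV recipient limitation is critical. Covers only Title IV federal aid recipients, not all students.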

SKILL.md
--- frontmatter
name: education-data-source-scorecard
description: >-
  College Scorecard data source for post-college outcomes including earnings
  from IRS/Treasury, student debt and repayment from NSLDS, and completion
  metrics. Use when analyzing post-graduation earnings, loan repayment, debt
  levels, or when understanding Scorecard's Title IV recipient limitation is
  critical. Covers only Title IV federal aid recipients, not all students.
metadata:
  audience: data-analysts
  domain: education-data

Scorecard Data Source Reference

Federal data on post-college outcomes (earnings, debt, and repayment) for students who received Title IV financial aid. Scorecard links education records to IRS tax data for actual earnings, which makes it the primary source for post-college labor market outcomes.

CRITICAL: Value Encoding and Missing Data

The Education Data Portal uses integer encodings for all categorical variables and lowercase, restructured variable names that differ from the original Scorecard column names. Suppression encoding differs by dataset:

  • Earnings/counts: -3 integer code is the primary suppression indicator
  • Yes/No flags (institutional characteristics): null for missing, 0/1 for valid
  • Rates (repayment, default): null for missing
  • The original Scorecard string "PrivacySuppressed" does NOT appear in Portal data
| Context | pred_degree_awarded_ipeds | HBCU / tribal flags | religious_affiliation |
|---|---|---|---|
| Portal (integer) | 0-4 | 0 / 1 | Integer codes 22-200 |
| Original Scorecard | String labels | String labels | String labels |

See ./references/variable-definitions.md for complete encoding tables.

What is College Scorecard?

  • Publisher: U.S. Department of Education
  • Primary value: Post-college labor market outcomes (earnings) and debt/repayment metrics
  • Data sources: NSLDS (loans/aid), IRS/Treasury (earnings), IPEDS (institutional characteristics)
  • Coverage: Title IV federal aid recipients only — not all students
  • Unique feature: Links education to IRS tax records for actual earnings data
  • Access: Education Data Portal mirrors (parquet/CSV); see datasets-reference.md for paths, mirrors.yaml for mirror config, fetch-patterns.md for fetch code
  • Primary identifier: unitid (IPEDS institution ID)

Reference File Structure

| File | Purpose | When to Read |
|---|---|---|
| earnings-data.md | Post-college earnings methodology, cohorts, time horizons | Analyzing earnings outcomes |
| debt-repayment.md | Student debt, repayment rates, default rates | Analyzing debt or loan outcomes |
| completion-rates.md | Completion metrics vs IPEDS | Comparing graduation rates |
| population-coverage.md | Title IV limitation details, who is included/excluded | Understanding data representativeness |
| variable-definitions.md | Key variables, naming conventions, special values | Building queries or interpreting results |
| data-quality.md | Suppression rules, selection bias, known limitations | Assessing data reliability |
| field-of-study.md | Program-level earnings and debt data | Analyzing outcomes by major/CIP code |

Decision Trees

What outcome am I researching?

code
Outcome type?
├─ Post-college earnings
│   ├─ Institution-level → ./references/earnings-data.md
│   └─ By field of study → ./references/field-of-study.md
├─ Student debt levels
│   ├─ Cumulative borrowing → ./references/debt-repayment.md
│   └─ Debt by field → ./references/field-of-study.md
├─ Loan repayment/default
│   └─ Repayment rates → ./references/debt-repayment.md
├─ Completion rates
│   └─ Scorecard completion → ./references/completion-rates.md
└─ Understanding limitations
    ├─ Who is included → ./references/population-coverage.md
    └─ Data quality issues → ./references/data-quality.md

How do I interpret this data?

code
Interpretation question?
├─ Why are earnings suppressed?
│   └─ Privacy thresholds → ./references/data-quality.md
├─ What does "6-year earnings" mean?
│   └─ Cohort timing → ./references/earnings-data.md
├─ Why don't Scorecard rates match IPEDS?
│   └─ Different cohorts → ./references/completion-rates.md
├─ What loans are included in debt?
│   └─ Federal only → ./references/debt-repayment.md
└─ How representative is this data?
    └─ Title IV coverage → ./references/population-coverage.md

Building a query?

code
Query construction?
├─ Variable names and codes → ./references/variable-definitions.md
├─ Suppression flags to handle → ./references/data-quality.md
├─ Understanding cohort years → ./references/earnings-data.md
└─ Field-level queries → ./references/field-of-study.md

Quick Reference: Scorecard Variables

Portal Data Structure (CRITICAL)

The Portal uses LONG format with time horizon as a column, NOT the WIDE format from original Scorecard bulk download files. Portal column names are all lowercase and differ significantly from original Scorecard names.

| Original Scorecard (WIDE) | Portal Column (LONG) | How to Get |
|---|---|---|
| MD_EARN_WNE_P6 | earnings_med | Filter: years_after_entry == 6 |
| MD_EARN_WNE_P10 | earnings_med | Filter: years_after_entry == 10 |
| COUNT_WNE_P6 | count_working | Filter: years_after_entry == 6 |
| MN_EARN_WNE_P6 | earnings_mean | Filter: years_after_entry == 6 |
| CONTROL, INSTNM | NOT IN EARNINGS | Join to IPEDS directory or inst_characteristics dataset |
| CDR3 | default_rate | In repayment_fsa dataset; filter: years_since_entering_repay |
| RPY_3YR_RT | repay_rate | In repayment_nslds dataset; filter: years_since_entering_repay == 3 |
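
When porting code written against the original wide-format files, the translation in this table can be sketched as a small lookup. This is a minimal sketch, not part of any Portal or Scorecard tooling: the helper name `wide_to_long` and the `WIDE_PREFIX_TO_PORTAL` table are hypothetical, and the prefix-to-column pairs are copied from the tables in this document, so verify them against the actual data before relying on them.

```python
import re

# Original Scorecard prefix -> Portal column (pairs copied from this document).
WIDE_PREFIX_TO_PORTAL = {
    "MD_EARN_WNE": "earnings_med",
    "MN_EARN_WNE": "earnings_mean",
    "SD_EARN_WNE": "earnings_sd",
    "COUNT_WNE": "count_working",
    "COUNT_NWNE": "count_not_working",
}

# Wide names end in _P<years>, e.g. MD_EARN_WNE_P10.
_WIDE_NAME = re.compile(r"^(?P<prefix>[A-Z_]+)_P(?P<years>\d+)$")


def wide_to_long(wide_name: str) -> tuple[str, int]:
    """Return (portal_column, years_after_entry) for a wide-format earnings name."""
    match = _WIDE_NAME.match(wide_name)
    if match is None or match["prefix"] not in WIDE_PREFIX_TO_PORTAL:
        raise KeyError(f"no Portal mapping known for {wide_name!r}")
    return WIDE_PREFIX_TO_PORTAL[match["prefix"]], int(match["years"])


column, horizon = wide_to_long("MD_EARN_WNE_P10")
```

The returned pair is meant to feed a filter on years_after_entry followed by a select of the Portal column. Names outside the lookup (for example CDR3) raise KeyError rather than guessing.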

Earnings Dataset Columns (Actual Portal Names)

Source dataset: scorecard/colleges_scorecard_earnings (203,066 rows x 33 columns)

| Portal Column | Type | Description | Original Scorecard |
|---|---|---|---|
| unitid | Int64 | IPEDS institution ID | UNITID |
| opeid | String | OPE ID (8-digit, zero-padded) | OPEID |
| year | Int64 | Data year (2003-2014, 2018) | File year |
| years_after_entry | Int64 | Years since first enrollment (6-10) | Encoded in variable name |
| cohort_year | Int64 | Entry cohort year | Encoded in variable name |
| earnings_med | Int64 | Median earnings (W-2) | MD_EARN_WNE_P* |
| earnings_mean | Int64 | Mean earnings | MN_EARN_WNE_P* |
| earnings_sd | Int64 | Standard deviation of earnings | SD_EARN_WNE_P* |
| earnings_pct10 | Int64 | 10th percentile earnings | PCT10_EARN_WNE_P* |
| earnings_pct25 | Int64 | 25th percentile earnings | PCT25_EARN_WNE_P* |
| earnings_pct75 | Int64 | 75th percentile earnings | PCT75_EARN_WNE_P* |
| earnings_pct90 | Int64 | 90th percentile earnings | PCT90_EARN_WNE_P* |
| count_working | Int64 | Count working and not enrolled | COUNT_WNE_P* |
| count_not_working | Int64 | Count not working and not enrolled | COUNT_NWNE_P* |
| earnings_greater_than_25k_pct | Float64 | Share earning > $25K | GT_25K_P* |
| earnings_lowinc_mean | Int64 | Mean earnings, low-income | MN_EARN_WNE_INC1_P* |
| earnings_midinc_mean | Int64 | Mean earnings, mid-income | MN_EARN_WNE_INC2_P* |
| earnings_highinc_mean | Int64 | Mean earnings, high-income | MN_EARN_WNE_INC3_P* |
| earnings_dep_mean | Int64 | Mean earnings, dependent students | |
| earnings_dep_lowinc_mean | Int64 | Mean earnings, dependent low-income | |
| earnings_ind_mean | Int64 | Mean earnings, independent students | |
| earnings_female_mean | Int64 | Mean earnings, female | |
| earnings_male_mean | Int64 | Mean earnings, male | |
| count_working_* | Int64 | Count working by subgroup | |

Key Identifiers

| ID | Format | Level | Example | Notes |
|---|---|---|---|---|
| unitid | 6-digit integer | Institution | 110635 | Same as IPEDS unitid; primary join key |
| opeid | 8-digit string | OPE (Title IV) | "00100200" | Zero-padded; present in all datasets |
| opeid6 | Integer | 6-digit OPE | 1002 | Numeric, no zero-padding |
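
The two OPE identifier forms can be converted into each other. The sketch below assumes the usual OPEID convention, in which the first six digits identify the institution and the last two identify a branch or location. The example values in the table ("00100200" and 1002) are consistent with that, but the helper names are hypothetical and the rule should be checked against the actual data.

```python
def opeid8_to_opeid6(opeid: str) -> int:
    """Convert an 8-digit zero-padded OPEID string to the numeric 6-digit form."""
    if len(opeid) != 8 or not (opeid.isascii() and opeid.isdigit()):
        raise ValueError(f"expected an 8-digit string, got {opeid!r}")
    # Assumption: the trailing two digits are a branch suffix and are dropped.
    return int(opeid[:6])


def opeid6_to_prefix(opeid6: int) -> str:
    """Zero-pad a numeric opeid6 back to the 6-character OPEID prefix."""
    return f"{opeid6:06d}"


numeric = opeid8_to_opeid6("00100200")
prefix = opeid6_to_prefix(numeric)
```

Keeping opeid as a string until the last step avoids silently losing the leading zeros.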

Data Timing

| Metric | Dimension Column | Values | Typical Lag |
|---|---|---|---|
| Earnings | years_after_entry | 6, 7, 8, 9, 10 | Data from 7+ years ago |
| Default | years_since_entering_repay | 2, 3 | Varies |
| Repayment | years_since_entering_repay | 1, 3, 5, 7 | Varies |

"After entry" means after first enrollment, not after graduation.

Categorical Value Encodings (Institutional Characteristics Dataset)

| Variable | Values |
|---|---|
| pred_degree_awarded_ipeds | 0=Not classified, 1=Certificate, 2=Associate's, 3=Bachelor's, 4=Graduate |
| Yes/No flags (HBCU, tribal, etc.) | 0=No, 1=Yes, null=Missing |
| religious_affiliation | 76 integer codes 22-200 (see variable-definitions.md for complete mapping), null=None/Missing |

Missing Data Codes

| Code | Meaning | Which Datasets |
|---|---|---|
| -3 | Suppressed for privacy | Earnings dataset (earnings and count columns); primary suppression indicator |
| null | Missing/not applicable | Institutional characteristics (yes/no flags), repayment/default (rates) |
| Positive numeric | Actual value | Earnings, debt, counts, rates |
python
import polars as pl

# Filter for valid earnings (handle -3 suppression code)
valid = df.filter(
    (pl.col("earnings_med").is_not_null()) &
    (pl.col("earnings_med") != -3)
)

# Filter for 6-year earnings specifically
six_yr_valid = valid.filter(pl.col("years_after_entry") == 6)

Data Access

Datasets for Scorecard are available via the mirror system. See datasets-reference.md for canonical paths, mirrors.yaml for mirror configuration, and fetch-patterns.md for fetch code patterns.

Codebooks are .xls files co-located with data in all mirrors. Use get_codebook_url() from fetch-patterns.md to construct download URLs.

Truth Hierarchy: When interpreting variable values, apply this priority:

  1. Actual data file (what you observe in the parquet/CSV) — this IS the truth
  2. Live codebook (.xls in mirror) — authoritative documentation, may lag
  3. This skill documentation — convenient summary, may drift from codebook

If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
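
One way to apply this hierarchy in practice is to compare the values actually observed in a column with the values the documentation lists, and to investigate anything unexpected. A minimal standard-library sketch; the function name is hypothetical and the observed list is toy data standing in for a column read from the parquet file.

```python
def unexpected_values(observed, documented, ignore=(None,)):
    """Return observed values that the documentation does not list, sorted."""
    return sorted(set(observed) - set(documented) - set(ignore))


# Codes this document lists for pred_degree_awarded_ipeds.
documented_codes = {0, 1, 2, 3, 4}

# Toy stand-in for the unique values of the real column (5 is deliberately undocumented).
observed_codes = [3, 2, None, 4, 0, 5, 3]

drift = unexpected_values(observed_codes, documented_codes)
```

A non-empty result means the documentation has drifted or the data has changed. Per the hierarchy, trust the data and check the codebook next.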

All Scorecard Datasets (6 total)

| Dataset | Path | Codebook | Type | Years |
|---|---|---|---|---|
| Earnings | scorecard/colleges_scorecard_earnings | scorecard/codebook_colleges_scorecard_earnings | Single | varies |
| Default | scorecard/colleges_scorecard_repayment_fsa | scorecard/codebook_colleges_scorecard_default | Single | 1996-2020 |
| Institutional Characteristics | scorecard/colleges_scorecard_inst_characteristics | scorecard/codebook_colleges_scorecard_institutional-characteristics | Single | 1996-2020 |
| Repayment | scorecard/colleges_scorecard_repayment_nslds | scorecard/codebook_colleges_scorecard_repayment | Single | 2007-2016 |
| Student Characteristics (Aid) | scorecard/colleges_scorecard_student_body_nslds | scorecard/codebook_colleges_scorecard_student-characteristics_aid-applicants | Single | 1997-2016 |
| Student Characteristics (Home) | scorecard/colleges_scorecard_student_body_treasury | scorecard/codebook_colleges_scorecard_student-characteristics_home-neighborhood | Single | 1997-2016 |

Scorecard naming note: Data file paths differ significantly from codebook paths. Notable mismatches: data repayment_fsa vs codebook default; data inst_characteristics vs codebook institutional-characteristics; data repayment_nslds vs codebook repayment; data student_body_nslds vs codebook student-characteristics_aid-applicants; data student_body_treasury vs codebook student-characteristics_home-neighborhood. Always use the exact paths shown above.
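
Since codebook paths cannot be derived from the data paths by a simple rule, an explicit lookup is safer than string manipulation. The sketch below copies the six pairs listed in this document. The dictionary and function names are illustrative, and this is not the get_codebook_url() helper from fetch-patterns.md.

```python
# Data path -> codebook path, copied verbatim from the dataset table in this document.
CODEBOOK_FOR_DATASET = {
    "scorecard/colleges_scorecard_earnings":
        "scorecard/codebook_colleges_scorecard_earnings",
    "scorecard/colleges_scorecard_repayment_fsa":
        "scorecard/codebook_colleges_scorecard_default",
    "scorecard/colleges_scorecard_inst_characteristics":
        "scorecard/codebook_colleges_scorecard_institutional-characteristics",
    "scorecard/colleges_scorecard_repayment_nslds":
        "scorecard/codebook_colleges_scorecard_repayment",
    "scorecard/colleges_scorecard_student_body_nslds":
        "scorecard/codebook_colleges_scorecard_student-characteristics_aid-applicants",
    "scorecard/colleges_scorecard_student_body_treasury":
        "scorecard/codebook_colleges_scorecard_student-characteristics_home-neighborhood",
}


def codebook_path(data_path: str) -> str:
    """Return the codebook path for a Scorecard data path, or raise KeyError."""
    try:
        return CODEBOOK_FOR_DATASET[data_path]
    except KeyError:
        raise KeyError(f"unknown Scorecard dataset path: {data_path!r}") from None


default_codebook = codebook_path("scorecard/colleges_scorecard_repayment_fsa")
```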

Fetching Data

python
import polars as pl
from fetch_utils import fetch_from_mirrors  # See fetch-patterns.md

# Fetch earnings data
earnings = fetch_from_mirrors("scorecard/colleges_scorecard_earnings")

# Filter by time horizon (LONG format — filter, don't use wide column names)
six_yr = earnings.filter(pl.col("years_after_entry") == 6)

# Filter for valid earnings (exclude -3 suppression code)
valid = six_yr.filter(
    (pl.col("earnings_med").is_not_null()) &
    (pl.col("earnings_med") != -3)
)

# Institution names/control are NOT in the earnings dataset.
# Join to inst_characteristics or IPEDS directory:
inst = fetch_from_mirrors("scorecard/colleges_scorecard_inst_characteristics",
                          years=[2020])
valid = valid.join(
    inst.select("unitid", "inst_name", "pred_degree_awarded_ipeds"),
    on="unitid", how="left"
)

Common Pitfalls

| Pitfall | Issue | Solution |
|---|---|---|
| "All graduates" claims | Scorecard covers Title IV recipients only, not all students | Note Title IV limitation prominently in any analysis |
| Wage comparison | Comparing to BLS wages or Census income uses different populations | Use for relative comparisons, not absolute claims; document population differences |
| Ignoring suppression | Many programs have no data due to privacy thresholds | Check suppression rates before analyzing; document coverage |
| Time lag ignored | Earnings reflect old cohorts (6-year = data from 7+ years ago) | Document data vintage and cohort years explicitly |
| Total borrowing assumption | Scorecard debt includes only federal loans, not private | State "federal loans only" when reporting debt figures |
| String codes from docs | Original Scorecard uses string labels; Portal uses integers | Verify actual data types in Portal parquet files; use integer codes |
| Wide-format variable names | Using MD_EARN_WNE_P10 column name on Portal data | Portal uses LONG format; filter on years_after_entry instead |
| Assuming null = suppressed | Earnings dataset uses -3 for suppression, not null | Filter both: is_not_null() & != -3 |
| Using uppercase names | Original Scorecard uses MD_EARN_WNE_P6; Portal uses earnings_med | Always use lowercase Portal names from actual data |

Critical Limitation: Title IV Recipients Only

The single most important caveat for all Scorecard analysis:

Scorecard tracks ONLY students who received federal financial aid (Title IV):

  • Pell Grants
  • Federal student loans (Direct, Perkins, PLUS)
  • Federal work-study
| Excluded Group | Impact |
|---|---|
| Full-pay students | Often higher-income; different outcomes |
| Students with only state/institutional aid | Missing from data |
| International students | Not eligible for federal aid |
| Some graduate students | If they received no federal aid |

Coverage varies dramatically by institution type:

| Institution Type | Typical Title IV Coverage |
|---|---|
| For-profit colleges | 80-90%+ |
| Community colleges | 60-80% |
| Public flagships | 50-70% |
| Selective private colleges | 30-50% |

Data systematically overrepresents lower-income students who are more likely to need federal aid.

What Scorecard Data Does NOT Include

| Excluded | Why It Matters |
|---|---|
| Non-Title IV students | Often higher-income; different outcomes |
| Self-employment income | 1099 income excluded from earnings |
| Students still in school | Earnings cover only those working and not enrolled |
| Private student loans | Only federal loans tracked |
| Students who left the country | Lost to follow-up |

Comparison: Scorecard vs IPEDS

| Aspect | College Scorecard | IPEDS |
|---|---|---|
| Who's tracked | Title IV aid recipients | First-time, full-time students |
| Includes part-time | Yes | No (for grad rates) |
| Includes transfers-in | Yes | No (tracked at origin) |
| Outcome focus | Earnings, debt, repayment | Completion, retention |
| Data source | NSLDS + IRS | Institution-reported |

Related Data Sources

| Source | Relationship | When to Use |
|---|---|---|
| education-data-source-ipeds | Institutional characteristics, enrollment, finance | Join on unitid for institution names, control type, enrollment context |
| education-data-source-pseo | Alternative post-college earnings (Census LEHD) | When broader population coverage needed (not limited to Title IV) |
| education-data-source-fsa | Federal student aid details | Deeper analysis of aid types and disbursements |
| education-data-explorer | Parent discovery skill | Finding available endpoints |
| education-data-query | Data fetching | Downloading parquet/CSV files |

Topic Index

| Topic | Reference File |
|---|---|
| Earnings methodology | ./references/earnings-data.md |
| Cohort definitions | ./references/earnings-data.md |
| IRS data matching | ./references/earnings-data.md |
| Earnings suppression | ./references/data-quality.md |
| Debt metrics | ./references/debt-repayment.md |
| Repayment rates | ./references/debt-repayment.md |
| Default rates | ./references/debt-repayment.md |
| NSLDS data | ./references/debt-repayment.md |
| Completion methodology | ./references/completion-rates.md |
| IPEDS comparison | ./references/completion-rates.md |
| Title IV coverage | ./references/population-coverage.md |
| Who is excluded | ./references/population-coverage.md |
| Selection bias | ./references/population-coverage.md |
| Variable names | ./references/variable-definitions.md |
| Special values | ./references/variable-definitions.md |
| Privacy suppression | ./references/data-quality.md |
| Data limitations | ./references/data-quality.md |
| Program-level data | ./references/field-of-study.md |
| CIP codes | ./references/field-of-study.md |