
education-data-context

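Data provenance, caveats, and interpretation guidance for Urban Institute Education Data Portal datasets. Use after pulling education data to understand data limitations, special values, source-specific notes, and the proper way to cite school, district, or college data.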

SKILL.md
--- frontmatter
name: education-data-context
description: Data provenance, caveats, and interpretation guidance for Urban Institute Education Data Portal datasets. Use after pulling education data to understand limitations, special values, source-specific notes, and proper citation for schools, districts, or colleges data.
metadata:
  audience: data-analysts
  domain: education-data

Education Data Context

This skill provides critical context for interpreting data from the Urban Institute Education Data Portal. Education data has source-specific limitations that can significantly affect analysis validity.

Why Data Context Matters

  • Source-specific limitations: Each data source (CCD, IPEDS, CRDC, etc.) has unique constraints
  • Missing values have meaning: Codes like -1, -2, -3 indicate specific conditions, not random missingness
  • Definitions change over time: Variable definitions, categories, and coding schemes evolve
  • State comparisons require caution: State-level data often cannot be directly compared
  • Citation is required: The ODC Attribution License mandates proper citation

Reference File Structure

Quick Context (This Skill)

| File | Content | When to Read |
| --- | --- | --- |
| ./references/ccd-context.md | K-12 schools/districts caveats | After pulling CCD data |
| ./references/ipeds-context.md | College/university caveats | After pulling IPEDS data |
| ./references/crdc-context.md | Civil rights data caveats | After pulling CRDC data |
| ./references/scorecard-context.md | College Scorecard caveats | After pulling Scorecard data |
| ./references/edfacts-context.md | Assessment/graduation caveats | After pulling EDFacts data |
| ./references/data-relationships.md | Joining tables, identifiers | When merging datasets |

Deep-Dive Source Skills (Comprehensive Documentation)

For comprehensive understanding beyond the quick context files above, load the dedicated data source skill:

| Data Source | Deep-Dive Skill | Key Deep Topics |
| --- | --- | --- |
| CCD | education-data-source-ccd | Survey components, EDFacts submission, state variations, historical changes |
| CRDC | education-data-source-crdc | Civil rights legal context, underreporting issues, year-to-year evolution |
| EDFacts | education-data-source-edfacts | ESSA/NCLB context, why states aren't comparable, ACGR methodology |
| IPEDS | education-data-source-ipeds | All 12+ surveys, graduation rate population limits, GASB vs FASB |
| Scorecard | education-data-source-scorecard | IRS earnings methodology, Title IV selection bias, suppression rules |
| SAIPE | education-data-source-saipe | Model-based estimation, no district confidence intervals |
| FSA | education-data-source-fsa | Title IV programs, financial responsibility scores, 90/10 rule |
| MEPS | education-data-source-meps | Superior to FRPL for cross-state poverty comparison |
| NHGIS | education-data-source-nhgis | Census geography links, boundary changes over time |
| NACUBO | education-data-source-nacubo | Endowment study methodology, voluntary participation bias |
| NCCS | education-data-source-nccs | Form 990 data, NTEE codes, private college relevance |
| EADA | education-data-source-eada | Title IX context, not same as compliance data |
| Campus Safety | education-data-source-campus-safety | Clery Act, underreporting, geography definitions |
| PSEO | education-data-source-pseo | LEHD methodology, experimental status, state coverage |

When to load deep-dive skills:

  • Need to understand data collection methodology in detail
  • Analyzing historical trends and need to know about definition changes
  • Encountering data quality issues that require deeper investigation
  • Writing documentation or reports that require precise methodology descriptions

Decision Trees

What data source did I pull from?

code
What endpoint did you use?
├─ schools/ccd/* → Read ./references/ccd-context.md
│   └─ Need more depth? → Load education-data-source-ccd skill
├─ school-districts/* → Read ./references/ccd-context.md
│   └─ Need more depth? → Load education-data-source-ccd skill
├─ schools/crdc/* → Read ./references/crdc-context.md
│   └─ Need more depth? → Load education-data-source-crdc skill
├─ schools/edfacts/* → Read ./references/edfacts-context.md
│   └─ Need more depth? → Load education-data-source-edfacts skill
├─ schools/meps/* → Load education-data-source-meps skill
├─ college-university/ipeds/* → Read ./references/ipeds-context.md
│   └─ Need more depth? → Load education-data-source-ipeds skill
├─ college-university/scorecard/* → Read ./references/scorecard-context.md
│   └─ Need more depth? → Load education-data-source-scorecard skill
├─ college-university/fsa/* → Load education-data-source-fsa skill
├─ college-university/nacubo/* → Load education-data-source-nacubo skill
├─ college-university/eada/* → Load education-data-source-eada skill
├─ college-university/pseo/* → Load education-data-source-pseo skill
├─ school-districts/saipe/* → Load education-data-source-saipe skill
└─ Multiple sources → Read ./references/data-relationships.md first
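The endpoint tree above can also be expressed as a lookup table. The sketch below is one possible encoding, not part of the portal or of any skill: the `ROUTES` dict and the `context_for` helper are hypothetical names, and the longest-prefix rule is an assumption made here so that `school-districts/saipe/*` is not swallowed by the broader `school-districts/*` branch.

```python
# Routing table transcribed from the decision tree above.
# Value: (quick reference file or None, deep-dive skill name).
ROUTES = {
    "schools/ccd/": ("./references/ccd-context.md", "education-data-source-ccd"),
    "school-districts/": ("./references/ccd-context.md", "education-data-source-ccd"),
    "schools/crdc/": ("./references/crdc-context.md", "education-data-source-crdc"),
    "schools/edfacts/": ("./references/edfacts-context.md", "education-data-source-edfacts"),
    "schools/meps/": (None, "education-data-source-meps"),
    "college-university/ipeds/": ("./references/ipeds-context.md", "education-data-source-ipeds"),
    "college-university/scorecard/": ("./references/scorecard-context.md", "education-data-source-scorecard"),
    "college-university/fsa/": (None, "education-data-source-fsa"),
    "college-university/nacubo/": (None, "education-data-source-nacubo"),
    "college-university/eada/": (None, "education-data-source-eada"),
    "college-university/pseo/": (None, "education-data-source-pseo"),
    "school-districts/saipe/": (None, "education-data-source-saipe"),
}


def context_for(endpoint):
    """Return (reference_file, deep_dive_skill) for an endpoint path, or None.

    Assumption: when several prefixes match, the longest (most specific) wins.
    """
    path = endpoint.strip("/") + "/"
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]
```

For example, `context_for("school-districts/saipe/2020")` resolves to the SAIPE skill rather than the CCD reference file. When more than one source is involved, the tree still applies: read ./references/data-relationships.md first.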

How do I interpret missing values?

code
What value do you see?
├─ In a CATEGORICAL column (grade, race, sex)?
│   └─ These use integer encoding, NOT coded missing values!
│       ├─ grade = -1 means Pre-K (NOT missing!)
│       ├─ race = 1-7 (NOT WH, BL, HI strings)
│       └─ sex = 1-2 (NOT M, F strings)
├─ In a NUMERIC column (enrollment, FTE, counts)?
│   ├─ -1 → Missing/not reported (treat as NULL)
│   ├─ -2 → Not applicable (exclude from that variable's analysis)
│   └─ -3 → Suppressed for privacy (cannot recover)
├─ null/blank?
│   └─ Source matters:
│       ├─ CCD, CRDC, EDFacts → Should use -1/-2/-3 codes
│       └─ Scorecard, MEPS, NACUBO → Use native nulls
├─ Ranges (e.g., "10-20") → EDFacts suppression bounds
└─ Unsure → Check source-specific reference file
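As a rough illustration of the tree above, the following sketch classifies a single raw value. It is plain Python rather than portal code; `interpret` and the `column_kind` argument are hypothetical names introduced here, and the caller is assumed to already know whether the column is categorical or a numeric measure.

```python
# Coded missing values, valid only for numeric measure columns.
CODED_MISSING = {
    -1: "missing / not reported",
    -2: "not applicable",
    -3: "suppressed for privacy",
}


def interpret(value, column_kind):
    """Describe a raw value; column_kind is "categorical" or "numeric"."""
    if value is None:
        return "null (native missing value)"
    if column_kind == "categorical":
        # Integer category code, e.g. grade = -1 is Pre-K, not missing.
        return "category code (look it up in the encoding tables)"
    if isinstance(value, str):
        # Ranges such as "10-20" are EDFacts suppression bounds.
        return "string value (possibly a suppression range)"
    return CODED_MISSING.get(value, "reported value")
```

The same `-1` is read two different ways: `interpret(-1, "categorical")` reports a category code, while `interpret(-1, "numeric")` reports missing data.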

What are the limitations?

code
What type of analysis are you doing?
├─ Cross-state comparison
│   ├─ K-12 assessments → INVALID (states not comparable)
│   ├─ K-12 other metrics → Check state reporting consistency
│   └─ College data → Generally valid (federal definitions)
├─ Time series
│   ├─ Check for definition changes
│   ├─ Check for ID changes (schools/districts merge/split)
│   └─ Check COVID-19 impact (2020-2021)
├─ Subgroup analysis
│   ├─ Check suppression rates
│   ├─ Smaller groups = more suppression
│   └─ Cannot impute suppressed values accurately
└─ Graduate outcomes
    ├─ IPEDS → First-time full-time only
    └─ Scorecard → Title IV recipients only

Universal Data Caveats

Portal Integer Encoding System

CRITICAL: The Education Data Portal uses integer codes, not string labels, for categorical variables. This applies to all sources.

Demographic Variable Encodings

| Variable | Integer Values | NOT Strings |
| --- | --- | --- |
| Race | 1-9, 99 (total) | Not WH, BL, HI, AS, etc. |
| Sex | 1 (Male), 2 (Female), 3 (Another gender, IPEDS 2022+), 4 (Unknown gender, IPEDS 2022+), 9 (Unknown), 99 (Total) | Not M, F |
| Grade | -1 to 13, 99 (total) | Not PK, KG, 01, etc. |

Race codes:

| Value | Meaning |
| --- | --- |
| 1 | White |
| 2 | Black |
| 3 | Hispanic |
| 4 | Asian |
| 5 | American Indian/Alaska Native |
| 6 | Native Hawaiian/Pacific Islander |
| 7 | Two or more races |
| 8 | Nonresident alien (postsecondary only) |
| 9 | Unknown |
| 99 | Total (all races) |

Grade codes:

| Value | Meaning |
| --- | --- |
| -1 | Pre-K (SEMANTIC TRAP: NOT missing data!) |
| 0 | Kindergarten |
| 1-12 | Grades 1-12 |
| 13 | Ungraded |
| 99 | Total (all grades) |

SEMANTIC TRAP - Grade -1: In CCD enrollment data, grade = -1 means Pre-Kindergarten, NOT missing data. This is a common source of errors. Missing data in enrollment uses the separate coded value system (-1/-2/-3) only for numeric fields like enrollment counts, not for the grade categorical variable.

python
import polars as pl

# WRONG - treats grade = -1 as missing and silently drops Pre-K students
no_pre_k = df.filter(pl.col("grade") >= 0)

# RIGHT - Pre-K students have grade = -1
pre_k = df.filter(pl.col("grade") == -1)
k_12 = df.filter(pl.col("grade").is_between(0, 12))
total = df.filter(pl.col("grade") == 99)
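For reports and charts it can help to turn the integer codes back into readable labels. The sketch below transcribes the race and grade tables from this section into plain dictionaries. The names `RACE_LABELS`, `GRADE_LABELS`, and `label` are hypothetical helpers rather than anything the portal provides, and the "Grade N" wording for codes 1-12 is an illustrative choice.

```python
# Labels transcribed from the race and grade code tables in this section.
RACE_LABELS = {
    1: "White",
    2: "Black",
    3: "Hispanic",
    4: "Asian",
    5: "American Indian/Alaska Native",
    6: "Native Hawaiian/Pacific Islander",
    7: "Two or more races",
    8: "Nonresident alien (postsecondary only)",
    9: "Unknown",
    99: "Total (all races)",
}

GRADE_LABELS = {-1: "Pre-K", 0: "Kindergarten", 13: "Ungraded", 99: "Total (all grades)"}
GRADE_LABELS.update({g: f"Grade {g}" for g in range(1, 13)})


def label(variable, code):
    """Map an integer category code to a label; flag anything unrecognized."""
    tables = {"race": RACE_LABELS, "grade": GRADE_LABELS}
    return tables[variable].get(code, f"unrecognized {variable} code {code}")
```

Returning an explicit "unrecognized" string instead of raising keeps unexpected codes visible in the output, which is useful when a coding scheme changes between years.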

Variable Names Are Lowercase

Portal variable names are lowercase, not the uppercase names from original NCES documentation:

  • enrollment not MEMBER or ENROLLMENT
  • grade not GRADE
  • fips not FIPS or STATE

Missing Value Codes

| Code | Meaning | How to Handle |
| --- | --- | --- |
| -1 | Missing/not reported | Treat as NULL; document missingness rate |
| -2 | Not applicable | Exclude from analysis of that variable |
| -3 | Suppressed (privacy) | Cannot be recovered; affects small-cell analyses |
| null/blank | Genuinely missing | Treat as NULL |

IMPORTANT: Coded values (-1/-2/-3) apply to numeric measure columns (enrollment counts, FTE, etc.), NOT to categorical identifier columns like grade, race, or sex. Those use the integer encoding system above.

Missing Data Handling Varies by Source:

| Source | Missing Data Pattern |
| --- | --- |
| CCD, CRDC, EDFacts | Use -1/-2/-3 coded values for numeric fields |
| Scorecard, MEPS, NACUBO | Use native null values |
| IPEDS | Mix of both (check specific variables) |

Important: Filter coded values BEFORE calculating statistics:

python
# WRONG - includes coded values in mean
df["enrollment"].mean()

# RIGHT - exclude coded missing values
df.filter(pl.col("enrollment") >= 0)["enrollment"].mean()
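Filtering alone discards the information about how much data was missing. A minimal standard-library sketch (with made-up values) that computes the mean of reported values while also tallying each kind of gap might look like this; `summarize` is a hypothetical helper, and the list stands in for one numeric measure column.

```python
from statistics import mean

CODED = (-1, -2, -3)  # missing, not applicable, suppressed


def summarize(values):
    """Mean of reported values plus a count of each kind of gap."""
    reported = [v for v in values if v is not None and v not in CODED]
    gaps = {code: sum(1 for v in values if v == code) for code in CODED}
    gaps[None] = sum(1 for v in values if v is None)
    return {
        "mean": mean(reported) if reported else None,
        "n_reported": len(reported),
        "gaps": gaps,
    }


enrollment = [120, 80, -1, -3, None, 100]  # made-up values
result = summarize(enrollment)
```

Here the mean is taken over the three reported values only, and `result["gaps"]` records one missing, one suppressed, and one null entry, which is the missingness rate the table above asks you to document.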

Year Definitions

  • year refers to the FALL of the academic year
  • year=2020 means the 2020-21 school year
  • Graduation rates use cohort entry year (cohort started 4-6 years prior)
  • Finance data may use fiscal year (varies by institution)

| Data Type | Year Interpretation |
| --- | --- |
| Fall enrollment | Fall of indicated year |
| Academic year totals | Full year starting fall of indicated year |
| Graduation rates | Cohort entry year (outcomes measured later) |
| Completions | Degrees awarded during indicated academic year |
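Because `year` is the fall of the academic year, labeling it explicitly in outputs avoids off-by-one confusion. A small sketch follows; `school_year_label` is a hypothetical helper name, not a portal function.

```python
def school_year_label(year):
    """Convert a portal `year` (fall of the academic year) to a label.

    Example: 2020 -> "2020-21".
    """
    return f"{year}-{(year + 1) % 100:02d}"
```

This applies to fall-based data such as enrollment. Per the table above, graduation-rate years refer to cohort entry, and finance data may follow a fiscal year, so those would need their own labels.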

Suppression

Data is suppressed to protect student privacy:

  • Small cell sizes: Typically fewer than 5-10 students
  • Affects disaggregated data: Race, disability, gender breakdowns
  • More suppression in smaller schools: Rural areas most affected
  • Cannot be imputed accurately: Do not attempt to recover
  • Complementary suppression: Other cells may be suppressed to prevent calculation
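The reason for complementary suppression is simple arithmetic: if a total is published and only one cell is hidden, subtraction reveals it. The sketch below illustrates this with made-up counts. `recoverable_by_subtraction` is a hypothetical helper, and it assumes suppressed cells carry the -3 code described in this section.

```python
SUPPRESSED = -3


def recoverable_by_subtraction(total, cells):
    """Return the hidden count if exactly one cell is suppressed, else None."""
    hidden = [c for c in cells if c == SUPPRESSED]
    if total == SUPPRESSED or len(hidden) != 1:
        return None
    return total - sum(c for c in cells if c != SUPPRESSED)


# One small subgroup hidden, total published: the hidden value leaks.
leaked = recoverable_by_subtraction(100, [60, 37, SUPPRESSED])

# A second cell is suppressed as well: subtraction no longer works.
protected = recoverable_by_subtraction(100, [60, SUPPRESSED, SUPPRESSED])
```

This is why a cell with a perfectly ordinary count may still appear as suppressed, and why suppressed values should not be back-filled from totals.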

State Reporting Variation

State education agencies interpret federal definitions differently:

  • Dropout definitions vary (CCD covers grades 7-12, CPS covers 10-12)
  • Average daily attendance calculated differently by state law
  • Discipline categories interpreted inconsistently
  • Missing data tends to cluster by state

Data Quality Checklist

Before analyzing any Education Data Portal data:

  • Check coded values: Filter out -1, -2, -3 before calculations
  • Understand year definition: Fall of academic year vs. cohort year
  • Note suppression rates: Calculate % suppressed by variable
  • Check definition changes: Compare codebooks across years
  • Verify identifier consistency: NCES IDs can change when schools/districts merge
  • Document state anomalies: Note any state-specific reporting issues
  • Check coverage: Not all schools appear in all sources
  • Consider COVID-19: 2020-2021 data may not be comparable to prior years

Quick Coverage Check

python
# Check missingness and suppression by state
df.group_by("fips").agg([
    pl.col("variable").filter(pl.col("variable") == -1).count().alias("missing"),
    pl.col("variable").filter(pl.col("variable") == -3).count().alias("suppressed"),
    pl.col("variable").count().alias("total")
])
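The checklist above also asks for the percentage suppressed by variable. A plain-Python sketch of that calculation, assuming rows are available as a list of dicts, is shown here. The column name `teachers_fte` and all values are made up for illustration, and `gap_rates` is a hypothetical helper.

```python
def gap_rates(rows, columns):
    """Percent of rows coded -1 (missing) and -3 (suppressed) per column."""
    n = len(rows)
    rates = {}
    for col in columns:
        missing = sum(1 for r in rows if r.get(col) == -1)
        suppressed = sum(1 for r in rows if r.get(col) == -3)
        rates[col] = {
            "pct_missing": 100 * missing / n if n else 0.0,
            "pct_suppressed": 100 * suppressed / n if n else 0.0,
        }
    return rates


rows = [  # made-up records
    {"fips": 1, "enrollment": 250, "teachers_fte": 14.5},
    {"fips": 1, "enrollment": -1, "teachers_fte": -3},
    {"fips": 2, "enrollment": 90, "teachers_fte": -3},
    {"fips": 2, "enrollment": 410, "teachers_fte": 22.0},
]
rates = gap_rates(rows, ["enrollment", "teachers_fte"])
```

Running this per state (for instance by grouping rows on `fips` first) shows whether gaps cluster geographically, as the State Reporting Variation section warns.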

Citation Requirements

Full Citation Format

Use for publications, reports, and formal documents:

code
[Dataset name(s)], Education Data Portal (Version X.X.X), 
Urban Institute, accessed [Month DD, YYYY], 
https://educationdata.urban.org/documentation/, 
made available under the ODC Attribution License.

Example:

code
Common Core of Data (CCD) School Directory, Education Data Portal 
(Version 0.20.0), Urban Institute, accessed January 15, 2026, 
https://educationdata.urban.org/documentation/, 
made available under the ODC Attribution License.

Short Citation Format

Use for visualizations, dashboards, and space-constrained contexts:

code
Source: [Dataset name(s)], Education Data Portal v.X.X.X, 
Urban Institute, ODC-By License.

Example:

code
Source: CCD School Directory, Education Data Portal v.0.20.0, 
Urban Institute, ODC-By License.
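To keep citations consistent across outputs, the two templates above can be filled in programmatically. The following is a sketch only: `full_citation` and `short_citation` are hypothetical helper names, the version string must be supplied by the caller, and rendering the day without zero padding is an assumption based on the "January 15, 2026" example.

```python
from datetime import date

# Spelled out explicitly so the output does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def full_citation(datasets, version, accessed):
    """Fill in the full citation template for publications and reports."""
    when = f"{MONTHS[accessed.month - 1]} {accessed.day}, {accessed.year}"
    return (
        f"{datasets}, Education Data Portal (Version {version}), "
        f"Urban Institute, accessed {when}, "
        "https://educationdata.urban.org/documentation/, "
        "made available under the ODC Attribution License."
    )


def short_citation(datasets, version):
    """Fill in the short citation template for charts and dashboards."""
    return (
        f"Source: {datasets}, Education Data Portal v.{version}, "
        "Urban Institute, ODC-By License."
    )
```

Calling `full_citation("Common Core of Data (CCD) School Directory", "0.20.0", date(2026, 1, 15))` reproduces the full example above as a single line.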

License Terms

License: Open Data Commons Attribution License (ODC-By) v1.0

Key requirements:

  • Must attribute the Urban Institute as data source
  • Must indicate if data was modified
  • May use for any purpose including commercial
  • May redistribute with attribution

Notification

Email educationdata@urban.org with any published work using the data. This helps the Urban Institute track usage and improve the portal.

Quick Reference: Source-Specific Caveats

| Source | Key Limitation | Critical For | Quick Reference | Deep Dive |
| --- | --- | --- | --- | --- |
| CCD | Public schools only; state reporting varies | K-12 enrollment, demographics | ./references/ccd-context.md | education-data-source-ccd |
| IPEDS | First-time full-time students only for grad rates | College graduation analysis | ./references/ipeds-context.md | education-data-source-ipeds |
| CRDC | Biennial; self-reported; underreporting | Equity/discipline analysis | ./references/crdc-context.md | education-data-source-crdc |
| Scorecard | Title IV recipients only; earnings suppressed | Earnings/outcomes analysis | ./references/scorecard-context.md | education-data-source-scorecard |
| EDFacts | State assessments NOT comparable across states | Achievement analysis | ./references/edfacts-context.md | education-data-source-edfacts |
| SAIPE | Model-based estimates; no district CIs | District poverty | | education-data-source-saipe |
| FSA | Federal aid only; timing varies | Student aid analysis | | education-data-source-fsa |
| MEPS | Model estimates; 100% FPL only | School poverty (cross-state) | | education-data-source-meps |
| NHGIS | Boundary changes over time | Geography linking | | education-data-source-nhgis |
| EADA | Self-reported; NOT Title IX compliance | Athletics equity | | education-data-source-eada |
| Campus Safety | Underreporting; comparability issues | Campus crime | | education-data-source-campus-safety |
| PSEO | Experimental; partial state coverage | Employment outcomes | | education-data-source-pseo |

What Each Source Covers

| Source | Universe | Update Frequency |
| --- | --- | --- |
| CCD | All public schools and districts | Annual |
| IPEDS | All Title IV postsecondary institutions | Annual |
| CRDC | Sample/universe of public schools | Biennial |
| Scorecard | Title IV aid recipients | Annual |
| EDFacts | Public schools with state assessments | Annual |

Data Lag Reference

Data availability lags behind the current year. As of January 2026:

| Source | Survey Component | Typical Lag | Latest Available |
| --- | --- | --- | --- |
| IPEDS | Directory | ~1 year | 2023 |
| IPEDS | Admissions-Enrollment | ~2 years | 2022 |
| IPEDS | Fall Enrollment | ~2-3 years | 2021 |
| IPEDS | Finance | ~2-3 years | Varies |
| CCD | Directory/Enrollment | ~1-2 years | 2022 |
| CCD | Finance | ~2-3 years | 2020 |
| CRDC | All (biennial) | ~1-2 years | 2021 |
| EDFacts | Assessments | ~1-2 years | 2020 |
| EDFacts | Graduation Rates | ~1-2 years | 2020 |
| SAIPE | Poverty estimates | ~18 months | 2023 |
| Scorecard | Earnings/outcomes | ~2-3 years | 2020 |
| MEPS | School poverty | ~2-3 years | 2019 |

Always verify year availability before building pipelines. Use mirror discovery endpoints (see mirrors.yaml) or filter downloaded data to confirm which years are present. See education-data-query skill for mirror-based fetch patterns.
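As a first sanity check before a pipeline run, requested years can be compared against the table above. The sketch below hard-codes that January 2026 snapshot, so it goes stale and does not replace checking the actual data. `LATEST_AVAILABLE` and `years_beyond_snapshot` are hypothetical names, and IPEDS Finance is left out because the table lists it as "Varies".

```python
# Snapshot of the "Latest Available" column above (as of January 2026).
LATEST_AVAILABLE = {
    ("IPEDS", "Directory"): 2023,
    ("IPEDS", "Admissions-Enrollment"): 2022,
    ("IPEDS", "Fall Enrollment"): 2021,
    ("CCD", "Directory/Enrollment"): 2022,
    ("CCD", "Finance"): 2020,
    ("CRDC", "All (biennial)"): 2021,
    ("EDFacts", "Assessments"): 2020,
    ("EDFacts", "Graduation Rates"): 2020,
    ("SAIPE", "Poverty estimates"): 2023,
    ("Scorecard", "Earnings/outcomes"): 2020,
    ("MEPS", "School poverty"): 2019,
}


def years_beyond_snapshot(source, component, years):
    """Return the requested years later than the snapshot's latest year.

    Raises KeyError if the (source, component) pair is not in the table.
    """
    latest = LATEST_AVAILABLE[(source, component)]
    return [y for y in years if y > latest]
```

For instance, asking for CCD Finance for 2019 through 2022 flags 2021 and 2022 as probably unavailable under this snapshot.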

Common Analysis Mistakes

DO NOT:

  1. Compare state assessment scores across states (EDFacts)
    • Each state has different tests and cut scores
  2. Use IPEDS graduation rates to represent all students
    • Only tracks first-time, full-time students
  3. Assume Scorecard earnings represent all graduates
    • Only covers Title IV aid recipients
  4. Calculate statistics without filtering coded values
    • -1, -2, -3 are not zeros; they corrupt calculations
  5. Compare 2020-2021 data to prior years without noting COVID
    • Testing waivers, discipline changes, enrollment shifts
  6. Merge data across years assuming stable identifiers
    • Schools and districts merge, split, and change IDs

DO:

  1. Check suppression rates before disaggregating
  2. Use within-state comparisons for assessment data
  3. Document all data limitations in your analysis
  4. Verify identifier stability for longitudinal analyses
  5. Cite the data source properly
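One way to verify identifier stability before a longitudinal merge is to compare the sets of IDs present in consecutive years. The sketch below uses made-up ID strings and a hypothetical helper, `id_churn`. IDs are kept as strings so that any leading zeros survive.

```python
def id_churn(ids_by_year):
    """For each pair of consecutive years, list IDs that vanish or appear."""
    years = sorted(ids_by_year)
    report = {}
    for prev, curr in zip(years, years[1:]):
        before, after = set(ids_by_year[prev]), set(ids_by_year[curr])
        report[(prev, curr)] = {
            "dropped": sorted(before - after),
            "added": sorted(after - before),
        }
    return report


ids_by_year = {  # made-up identifiers
    2018: ["010000100001", "010000100002", "010000100003"],
    2019: ["010000100001", "010000100003", "010000100004"],
}
churn = id_churn(ids_by_year)
```

An ID that drops out in the same year another appears is a candidate merge, split, or renumbering, and is worth checking against ./references/data-relationships.md before treating the two as different entities.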

Cross-References

  • Variable definitions: Load education-data-explorer skill to understand what variables measure
  • Query assistance: Load education-data-query skill to re-fetch data with different parameters
  • Joining data: Read ./references/data-relationships.md for identifier mappings
  • Deep source context: Load the appropriate education-data-source-* skill for comprehensive methodology, historical changes, and detailed variable definitions
  • Source-specific gotchas: Load the relevant education-data-source-* skill for variable name mappings, data lags, and endpoint-specific behaviors

Topic Index

| Topic | Location |
| --- | --- |
| Bureau of Indian Education schools | ./references/ccd-context.md |
| Charter school coverage | ./references/ccd-context.md |
| Chronic absenteeism | ./references/crdc-context.md |
| Citation format | This file: Citation Requirements |
| COVID-19 data impact | ./references/crdc-context.md |
| Discipline data | ./references/crdc-context.md |
| Dropout definitions | ./references/ccd-context.md |
| Earnings data limitations | ./references/scorecard-context.md |
| Finance data (colleges) | ./references/ipeds-context.md |
| GASB vs FASB accounting | ./references/ipeds-context.md |
| Graduation rate caveats | ./references/ipeds-context.md |
| Identifier relationships | ./references/data-relationships.md |
| Joining tables | ./references/data-relationships.md |
| LEAID format | ./references/data-relationships.md |
| Locale codes | ./references/ccd-context.md |
| Missing value codes | This file: Universal Data Caveats |
| NCESSCH format | ./references/data-relationships.md |
| Net price calculation | ./references/ipeds-context.md |
| ODC-By License | This file: Citation Requirements |
| OPEID vs UNITID | ./references/data-relationships.md |
| Private schools | ./references/ccd-context.md (not covered) |
| Proficiency data | ./references/edfacts-context.md |
| Race category changes | ./references/ccd-context.md |
| Sampling (CRDC) | ./references/crdc-context.md |
| State assessment comparability | ./references/edfacts-context.md |
| State FIPS codes | ./references/data-relationships.md |
| Student financial aid | ./references/ipeds-context.md |
| Suppression | This file: Universal Data Caveats |
| Title IV institutions | ./references/ipeds-context.md |
| Transfer students | ./references/ipeds-context.md |
| UNITID changes | ./references/ipeds-context.md |
| Year definitions | This file: Universal Data Caveats |