education-data-source-ccd

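An in-depth reference for the Common Core of Data (CCD), the U.S. Department of Education's primary database on public K-12 education. Use it when working with CCD data to understand survey components, variable definitions, data quality issues, historical changes, and differences across states. Essential for interpreting enrollment, staffing, finance, and directory data for public schools and districts.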

SKILL.md
--- frontmatter
name: education-data-source-ccd
description: >-
  Deep reference for the Common Core of Data (CCD), the US Department of
  Education's primary database on public K-12 education. Use when working with
  CCD data to understand survey components, variable definitions, data quality
  issues, historical changes, and state-level variations. Essential for
  interpreting enrollment, staffing, finance, and directory data from public
  schools and districts.
metadata:
  audience: data-analysts
  domain: education-data

CCD Data Source Reference

The CCD is the Department of Education's comprehensive, annual, national database of all public elementary and secondary schools and school districts in the United States. It is the only federal dataset that provides a complete universe census (not a sample) of U.S. public K-12 education.

CRITICAL: Value Encoding

The Education Data Portal uses integer codes for categorical variables that differ from NCES's original string codes. Always verify codes against codebooks.

| Context | `school_type` | `charter` | `urban_centric_locale` |
|---|---|---|---|
| Portal (integers) | 1 (Regular) | 0 (No) / 1 (Yes) | 11 (City-Large) |
| NCES original | 1-Regular school | Yes / No | 11-City: Large |

Note: charter and magnet use 0/1 encoding, NOT 1=Yes / 2=No as some NCES documentation shows.

See ./references/variable-definitions.md for complete encoding tables.

What is CCD?

  • Primary K-12 database: The Department of Education's authoritative source for public elementary/secondary education statistics
  • Universe survey: Covers ALL public schools and districts, not a sample
  • Annual collection: Data submitted by State Education Agencies (SEAs) each year
  • Six major components: Directory, Membership, Staffing, State Finance, District Finance, and Dropout/Completers
  • Coverage: ~100,000 public schools and ~18,000 school districts nationwide
  • Historical depth: Data available from 1986 to present (varies by component)
  • Collector: National Center for Education Statistics (NCES) via EDFacts

Reference File Structure

| File | Purpose | When to Read |
|---|---|---|
| `survey-components.md` | Detailed coverage of each CCD survey component | Understanding what data is collected |
| `data-collection.md` | How data flows from schools to NCES, timelines, respondent universe | Understanding data provenance and timing |
| `variable-definitions.md` | Key variables, coding schemes, special values | Interpreting specific data elements |
| `data-quality.md` | Missing data patterns, suppression, state variations | Assessing data reliability |
| `historical-changes.md` | Definition changes, code revisions over time | Longitudinal analysis |

Decision Trees

What CCD component do I need?

code
What information do you need?
├─ School/district names, addresses, contacts → Directory
│   └─ See ./references/survey-components.md#directory
├─ Student enrollment counts → Membership
│   ├─ By grade → Membership (grade disaggregation)
│   ├─ By race/ethnicity → Membership (race disaggregation)
│   ├─ By sex → Membership (sex disaggregation)
│   └─ See ./references/survey-components.md#membership
├─ Staff/teacher counts → Staffing
│   └─ See ./references/survey-components.md#staffing
├─ Revenue and expenditure → Finance
│   ├─ State-level totals → National Public Education Financial Survey
│   ├─ District-level detail → School District Finance Survey (F-33)
│   └─ See ./references/survey-components.md#finance
├─ Graduation/dropout rates → Dropout and Completers
│   └─ See ./references/survey-components.md#dropout-completers
└─ School type, charter status, locale → Directory
    └─ See ./references/survey-components.md#directory

Is this a data quality issue?

code
Unexpected data values?
├─ Negative numbers (-1, -2, -3, -9) → Missing data codes
│   └─ See ./references/variable-definitions.md#missing-data-codes
├─ Very different from prior year → Check for definition changes
│   └─ See ./references/historical-changes.md
├─ State appears as outlier → Check state-specific reporting
│   └─ See ./references/data-quality.md#state-variations
├─ Large number of zeros → Check suppression rules
│   └─ See ./references/data-quality.md#suppression
└─ Locale codes don't match → Pre/post 2006 locale system change
    └─ See ./references/historical-changes.md#locale-codes

Can I compare across time?

code
Building a time series?
├─ Race/ethnicity categories → Major change in 2010
│   └─ See ./references/historical-changes.md#race-ethnicity
├─ Locale codes → Completely revised in 2006
│   └─ See ./references/historical-changes.md#locale-codes
├─ School/district IDs → Check for ID changes
│   └─ See ./references/variable-definitions.md#identifiers
├─ Free/reduced lunch → CEP and direct certification changes
│   └─ See ./references/data-quality.md#frpl
└─ Finance data → Definition changes and inflation
    └─ See ./references/historical-changes.md#finance

Quick Reference: CCD Components

| Component | Level | Key Variables | Years | Update Cycle |
|---|---|---|---|---|
| Directory | School, LEA, State | Name, address, type, status, locale, charter | 1986+ | Annual |
| Membership | School, LEA, State | Enrollment by grade, race, sex | 1986+ | Annual |
| Staffing | School, LEA, State | FTE teachers, staff by category | 1987+ | Annual |
| Finance (State) | State | Revenue, expenditure by source/function | 1989+ | Annual (1-2 yr lag) |
| Finance (District) | LEA | Revenue, expenditure, per-pupil | 1989+ | Annual (2 yr lag) |
| Dropout/Completers | LEA, State | Dropout counts, diploma recipients | 1991+ | Annual |

Key Identifiers

| Portal Column | Format | Level | Example | Notes |
|---|---|---|---|---|
| `ncessch` | 12 characters | School | 010000100100 | State FIPS (2) + LEA suffix (5) + School (5) |
| `leaid` | 7 characters | District | 0100001 | State FIPS (2) + State-assigned (5) |
| `fips` | 2 digits | State | 01 | Federal Information Processing Standard |
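The identifier layout above can be illustrated with a small helper that splits a school ID into its parts. This is a sketch that assumes all-digit IDs; `split_ncessch` and `NcesSchoolId` are names invented here, not part of any library.

```python
from typing import NamedTuple


class NcesSchoolId(NamedTuple):
    fips: str    # 2-character state FIPS code
    leaid: str   # 7-character district ID (FIPS + 5-character suffix)
    school: str  # 5-character school suffix


def split_ncessch(ncessch) -> NcesSchoolId:
    """Split a school ID into parts, restoring leading zeros lost in an int."""
    text = str(ncessch).zfill(12)
    if len(text) != 12 or not text.isdigit():
        raise ValueError(f"expected a 12-digit ncessch, got {ncessch!r}")
    return NcesSchoolId(fips=text[:2], leaid=text[:7], school=text[7:])


# The integer form of 010000100100 has lost its leading zero.
parts = split_ncessch(10000100100)
```

Here `parts.leaid` recovers the district ID `0100001` from the example row, which is why the first seven characters of `ncessch` can serve as a join key to district data.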

ID Type Warning: ncessch and leaid may be String or Int64 depending on the dataset. In the Schools Directory, ncessch is String (preserving leading zeros); in enrollment data, ncessch is Int64. In the Districts Directory, leaid is Int64; in Finance data, leaid is String. Always check the actual dtype and cast as needed when joining across datasets.

Missing Data Codes

The Portal uses both null and negative integer codes to represent missing/special values. The specific pattern varies by dataset:

| Code | Meaning | When Used |
|---|---|---|
| null | Not available | Common in Directory fields that don't apply to all years |
| -1 | Missing/not reported | Data not reported by state |
| -2 | Not applicable | Item doesn't apply to this entity |
| -3 | Suppressed | Data suppressed for privacy |
| -9 | Not reported | State did not report this item |

Check actual data. Some datasets use null where others use -1 for effectively the same condition. Always check the observed values in the data before applying a blanket missing-value filter.

School Types (school_type)

| Code | Type | Description |
|---|---|---|
| 1 | Regular | Standard public school |
| 2 | Special Education | Focuses on students with disabilities |
| 3 | Vocational | Career/technical education focus |
| 4 | Alternative | Non-traditional programs |
| 5 | Reportable Program | Program within another school (2007-08+) |

LEA Types (agency_type)

| Code | Type | Description |
|---|---|---|
| 1 | Regular | Locally governed school district |
| 2 | Component | District sharing superintendent with others |
| 3 | Supervisory Union | Admin services for multiple districts |
| 4 | Regional Agency | Education service agency |
| 5 | State-operated | State-run schools (deaf, blind, correctional) |
| 6 | Federal-operated | Federal schools (BIE, DoDEA) |
| 7 | Charter Agency | All schools are charters (2007-08+) |
| 8 | Other | Doesn't fit other categories (2007-08+) |
| 9 | Specialized Agency | Specialized public agency (observed in data) |

Grade -1 Encoding

In CCD enrollment data:

  • grade = -1 means Pre-Kindergarten, NOT missing data
  • grade = 99 means Total across all grades

Do NOT filter grade >= 0 — this removes all Pre-K students!

python
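import polars as pl

# df is an enrollment DataFrame that has already been loaded.

# WRONG - removes Pre-K students! (Left commented out: running it would
# overwrite df, and the Pre-K filter below would then return no rows.)
# df = df.filter(pl.col("grade") >= 0)

# CORRECT
pre_k = df.filter(pl.col("grade") == -1)  # Pre-K only
k12 = df.filter(pl.col("grade").is_between(0, 12))  # K-12
total = df.filter(pl.col("grade") == 99)  # All grades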

Portal Column Name Mapping

Variable Name Mapping: The Portal column urban_centric_locale contains locale codes. Some documentation may refer to this as simply locale. Use urban_centric_locale when filtering or selecting columns in Portal data.

Dataset-to-Component Mapping

| Mirror Dataset | CCD Component | Path |
|---|---|---|
| Schools CCD Directory | School Directory | `ccd/schools_ccd_directory` |
| Schools CCD Enrollment | School Membership | `ccd/schools_ccd_enrollment_{year}` |
| Districts LEA Directory | LEA Directory | `ccd/school-districts_lea_directory` |
| Districts CCD Enrollment | LEA Membership | `ccd/schools_ccd_lea_enrollment_{year}` |
| Districts CCD Finance | F-33 District Finance | `ccd/districts_ccd_finance` |

Data Collection Flow

code
Schools → Local Education Agencies (LEAs)
                ↓
    State Education Agencies (SEAs)
                ↓
        EDFacts Submission System
                ↓
    NCES Quality Review & Editing
                ↓
        CCD Public Data Files

Timeline: Data for school year 20XX-YY are typically submitted in spring 20YY and released between fall 20YY (preliminary) and spring 20YY+1 (provisional/final).
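The timeline arithmetic can be sketched as a small helper. It encodes only the typical pattern stated above, so actual release dates may differ; `release_window` is a hypothetical name.

```python
def release_window(school_year: str) -> dict:
    """Estimate collection milestones for a school year written like '2021-22'."""
    start = int(school_year.split("-")[0])
    end = start + 1  # the spring of a school year falls in the next calendar year
    return {
        "submitted": f"spring {end}",
        "preliminary_release": f"fall {end}",
        "provisional_or_final_release": f"spring {end + 1}",
    }


window = release_window("2021-22")
```

For the 2021-22 school year this gives a spring 2022 submission, a fall 2022 preliminary release, and a spring 2023 provisional or final release.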

Data Access

Datasets for CCD are available via the mirror system. See datasets-reference.md for canonical paths, mirrors.yaml for mirror configuration, and fetch-patterns.md for fetch code patterns.

Key datasets (5 datasets; see datasets-reference.md for the authoritative list):

| Dataset | Type | Path | Codebook |
|---|---|---|---|
| School Directory | Single | `ccd/schools_ccd_directory` | `ccd/codebook_schools_ccd_directory` |
| School Enrollment | Yearly (1986-2023) | `ccd/schools_ccd_enrollment_{year}` | `ccd/codebook_schools_ccd_enrollment` |
| District Directory | Single | `ccd/school-districts_lea_directory` | `ccd/codebook_districts_ccd_directory` |
| District Enrollment | Yearly (1986-2023) | `ccd/schools_ccd_lea_enrollment_{year}` | `ccd/codebook_districts_ccd_enrollment` |
| District Finance | Single | `ccd/districts_ccd_finance` | `ccd/codebook_districts_ccd_finance` |
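For the yearly datasets, the `{year}` placeholder has to be filled in before fetching. A minimal sketch, assuming the 1986-2023 range in the table is current; `yearly_path` is a helper invented here, and the actual fetch call is left to fetch-patterns.md.

```python
ENROLLMENT_YEARS = range(1986, 2024)  # 1986 through 2023, per the table above


def yearly_path(template: str, year: int) -> str:
    """Fill a '{year}' path template, rejecting years outside the listed range."""
    if year not in ENROLLMENT_YEARS:
        raise ValueError(f"no enrollment file expected for {year}")
    return template.format(year=year)


school_paths = [
    yearly_path("ccd/schools_ccd_enrollment_{year}", year)
    for year in (2021, 2022, 2023)
]
```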

Codebooks are .xls files co-located with data in all mirrors. Use get_codebook_url() from fetch-patterns.md to construct download URLs:

python
url = get_codebook_url("ccd/codebook_schools_ccd_directory")

Truth Hierarchy: When interpreting variable values, apply this priority:

  1. Actual data file (what you observe in the parquet/CSV) -- this IS the truth
  2. Live codebook (.xls in mirror) -- authoritative documentation, may lag
  3. This skill documentation -- convenient summary, may drift from codebook

If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
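One way to apply the hierarchy in practice is to compare the values you actually observe against the documented codes and flag anything unexplained. The helper name and the observed values below are hypothetical; the documented range 1-9 comes from the LEA types table in this skill.

```python
def undocumented_values(observed, documented, missing_codes=(-1, -2, -3, -9)):
    """Return observed values that neither the docs nor the missing codes explain."""
    known = set(documented) | set(missing_codes)
    return sorted(v for v in set(observed) if v is not None and v not in known)


# Documented agency_type codes from this skill; observed values are made up.
documented_agency_types = range(1, 10)
observed_agency_types = [1, 1, 2, 7, -1, None, 10]
surprises = undocumented_values(observed_agency_types, documented_agency_types)
```

With a Polars frame, the observed values could come from `df["agency_type"].unique().to_list()`. A non-empty result is a cue to consult the codebook, not proof that the data are wrong.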

Filtering

All filtering is done locally with Polars after download:

python
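import polars as pl

# df is a DataFrame already loaded from a mirror. Each filter below is an
# independent example, so results get their own names instead of overwriting df.

# Filter by state (California)
california = df.filter(pl.col("fips") == 6)

# Filter by year
recent = df.filter(pl.col("year").is_in([2020, 2021, 2022]))

# Get totals only (enrollment)
totals = df.filter(pl.col("grade") == 99)

# Get specific grades (K-12)
k12 = df.filter(pl.col("grade").is_between(0, 12))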

Finance Data Notes

  • Finance data lag: The latest available year in the mirror is 2020 (empirically verified). Finance data typically lags 2+ years behind current school year.
  • Finance dataset has 163 columns -- by far the most complex CCD dataset
  • Some finance columns use _total suffix (e.g., exp_current_instruction_total)
  • leaid is String type in Finance data (unlike the Districts Directory where it is Int64)

Common Pitfalls

| Pitfall | Issue | Solution |
|---|---|---|
| Summing grades | Misses ungraded students | Use grade=99 (total) instead |
| Assuming -1 is missing | In grade data, -1 = Pre-K | Check variable format in codebook |
| Cross-state comparison | Different state definitions | Check state methodology first |
| Using FRPL as poverty measure | CEP schools show 100% | Supplement with MEPS or SAIPE data |
| Locale time series | 2006 code system change | Analyze pre/post-2006 separately |
| Charter school counts | Early years incomplete | Verify against state records pre-2010 |
| Dropout rate comparison | State definitions vary | Within-state comparisons only |
| Using NCES string codes | Portal uses integers | See variable-definitions.md for mappings |
| Assuming charter=1/2 | Portal uses 0=No, 1=Yes | Empirically verified; not NCES 1=Yes, 2=No |
| ID type across datasets | leaid/ncessch may be String or Int64 | Always check dtype before joining |

Coverage Notes

What CCD Includes

  • All public schools (traditional, charter, magnet, alternative)
  • All public school districts and LEAs
  • Bureau of Indian Education (BIE) schools
  • Department of Defense Education Activity (DoDEA) schools
  • State-operated schools (deaf, blind, correctional)

What CCD Excludes

  • Private schools (use Private School Universe Survey - PSS)
  • Homeschool students
  • Postsecondary institutions (use IPEDS)
  • Detailed student-level data (CCD is aggregate only)

Related Data Sources

| Source | Relationship | When to Use |
|---|---|---|
| `education-data-source-edfacts` | CCD nonfiscal data flows through EDFacts | Same underlying data |
| `education-data-source-crdc` | Biennial; uses CCD school IDs | Need discipline, course access, equity data |
| `education-data-source-saipe` | Uses CCD district IDs | Need poverty estimates (better than FRPL) |
| `education-data-source-meps` | School-level poverty estimates | Need school-level poverty (better than FRPL) |
| `education-data-source-ipeds` | Separate system for postsecondary | Need college/university data |
| PSS | Private school equivalent | Need private school data |
| `education-data-source-nhgis` | Census geography crosswalks | Need school-Census links |
| `education-data-explorer` | Parent discovery skill | Finding available datasets |
| `education-data-query` | Data fetching (mirror system) | Downloading parquet/CSV files via `fetch_from_mirrors()` |

Topic Index

| Topic | Reference File |
|---|---|
| Directory survey | `./references/survey-components.md` |
| Membership survey | `./references/survey-components.md` |
| Staffing survey | `./references/survey-components.md` |
| Finance surveys | `./references/survey-components.md` |
| Dropout/completers | `./references/survey-components.md` |
| Data collection process | `./references/data-collection.md` |
| EDFacts submission | `./references/data-collection.md` |
| Respondent universe | `./references/data-collection.md` |
| NCES identifiers | `./references/variable-definitions.md` |
| Missing data codes | `./references/variable-definitions.md` |
| Grade codes | `./references/variable-definitions.md` |
| Race/ethnicity codes | `./references/variable-definitions.md` |
| Locale codes | `./references/variable-definitions.md` |
| State-level variations | `./references/data-quality.md` |
| Missing data patterns | `./references/data-quality.md` |
| FRPL limitations | `./references/data-quality.md` |
| Data suppression | `./references/data-quality.md` |
| Locale code changes (2006) | `./references/historical-changes.md` |
| Race/ethnicity changes (2010) | `./references/historical-changes.md` |
| LEA type changes (2007) | `./references/historical-changes.md` |
| ID changes over time | `./references/historical-changes.md` |