---
name: education-data-query
description: Download education data from configured mirror sources. Use when fetching education data for research, downloading datasets by source, or filtering large files locally with Polars.
metadata:
  audience: data-analysts
  domain: education-data
---

Education Data Query

Download datasets from the Education Data Portal via configured mirror sources (defined in mirrors.yaml). Mirrors are tried in priority order. All filtering is done locally with Polars.

What This Skill Does

  • Download education datasets from configured mirrors
  • Handle multiple file formats (parquet, CSV) based on mirror read_strategy
  • Apply year, state, and demographic filters locally with Polars
  • Discover available files via each mirror's discovery endpoint

Reference File Structure

| File | Purpose | When to Read |
|---|---|---|
| mirrors.yaml | Mirror URLs, priority, format, timeouts, metadata config | Understanding mirror configuration |
| fetch-patterns.md | Code patterns for mirror-based fetching | Writing Stage 5 fetch scripts |
| datasets-reference.md | Known dataset file paths by source | Finding the right file path for a dataset |
| filters-reference.md | Complete filter variables | Filtering downloaded data locally |
| query-patterns.md | Endpoint path structure reference | Understanding URL/path naming conventions |

Mirror System Overview

Data is fetched by downloading files from mirrors:

code
Fetch Request (dataset, years, filters)
    → Try each mirror in priority order (per mirrors.yaml)
        → Build URL from mirror's url_template + dataset paths
        → Read using mirror's read_strategy (eager_parquet, lazy_csv, etc.)
    → If all mirrors fail: STOP and escalate
    → On success: save to data/raw/*.parquet
    → CP1 validation (source-agnostic)
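
The flow above can be sketched as a loop over mirrors sorted by priority. This is a minimal illustration under stated assumptions, not the skill's actual fetch_from_mirrors() (see fetch-patterns.md for that). The `Mirror` fields, the `{path}` placeholder in `url_template`, and the injected `reader` callable are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Mirror:
    # Hypothetical shape of one mirrors.yaml entry; real field names may differ.
    name: str
    priority: int
    url_template: str  # assumed to contain a "{path}" placeholder


class AllMirrorsFailed(RuntimeError):
    """Raised when no mirror could serve the requested path."""


def fetch_with_fallback(
    path: str,
    mirrors: list[Mirror],
    reader: Callable[[str], Any],
) -> Any:
    """Try each mirror in priority order; return the first successful read."""
    errors: list[str] = []
    for mirror in sorted(mirrors, key=lambda m: m.priority):
        url = mirror.url_template.format(path=path)
        try:
            return reader(url)
        except Exception as exc:  # any read failure moves on to the next mirror
            errors.append(f"{mirror.name}: {exc}")
    # Every mirror failed: stop and escalate rather than guessing.
    raise AllMirrorsFailed(f"{path} unavailable; " + "; ".join(errors))
```

Passing the reader in as a callable keeps the fallback logic independent of the file format, so the same loop works whichever read_strategy a mirror declares.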

Mirror Configuration

Mirrors are defined in ./references/mirrors.yaml with priority ordering. Each mirror specifies:

  • url_template — how to build download URLs
  • read_strategy — how Polars reads the format (eager_parquet, lazy_csv)
  • discovery — how to check what files are available

See ./references/mirrors.yaml for the full configuration and instructions on adding new mirrors.

Mirror File Discovery

Before fetching, you can check what files are available using each mirror's discovery endpoint (defined in mirrors.yaml):

python
# Generic discovery — works with any mirror that supports it
# See fetch-patterns.md for the full discover_mirror_files() function
from fetch_patterns import discover_mirror_files

# Check primary mirror
files = discover_mirror_files(MIRRORS[0])
if files is not None:
    print(f"Available files: {len(files)}")

This eliminates guessing — if the file exists in a mirror, use it; if not, fall through to the next.
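
The "use it or fall through" rule can be sketched as follows. The `discover` callable stands in for discover_mirror_files() and is assumed to return a collection of canonical paths, or None when a mirror has no usable discovery endpoint. Both the return shape and the helper name `first_mirror_with_file` are assumptions for illustration.

```python
from typing import Callable, Optional


def first_mirror_with_file(
    path: str,
    mirrors: list[str],
    discover: Callable[[str], Optional[set[str]]],
) -> Optional[str]:
    """Return the first mirror (in the given order) whose listing contains path."""
    for mirror in mirrors:
        files = discover(mirror)
        if files is None:
            # Discovery unsupported or failed for this mirror: skip it here.
            continue
        if path in files:
            return mirror
    # No mirror lists the file: the caller should stop and escalate.
    return None
```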

Decision Trees

"How should I get this data?"

code
What dataset do you need?
├─ Know the exact file path?
│   └─ Use fetch_from_mirrors() with that path → ./references/fetch-patterns.md
├─ Know the source but not the exact filename?
│   └─ Check ./references/datasets-reference.md for known paths
├─ Not sure what's available?
│   └─ Query mirror discovery endpoint to list all files → ./references/fetch-patterns.md
├─ Need a codebook or metadata file?
│   └─ Check codebook column in ./references/datasets-reference.md → get_codebook_url() in ./references/fetch-patterns.md
└─ Dataset not in any mirror?
    └─ STOP and escalate — dataset may need to be added to mirror

"Is my dataset a single file or yearly files?"

code
Check datasets-reference.md:
├─ Type = "Single" → One file with all years
│   └─ Use fetch_from_mirrors() → filter years locally
└─ Type = "Yearly" → One file per year
    └─ Use fetch_yearly_from_mirrors() → concatenate results

"How do I filter results?"

All filtering is done locally with Polars after download:

python
import polars as pl

# By state
df = df.filter(pl.col("fips") == 6)  # California

# By year
df = df.filter(pl.col("year").is_in([2020, 2021, 2022]))

# By school type
df = df.filter(pl.col("charter") == 1)

# Multiple filters
df = df.filter(
    (pl.col("fips") == 6) &
    (pl.col("charter") == 1) &
    (pl.col("school_level") == 3)
)

Dataset Path Structure

All mirrors use the same canonical path. Each mirror appends its own format extension (.parquet, .csv) via its url_template in mirrors.yaml:

code
{source}/{filename}

| Component | Description | Examples |
|---|---|---|
| source | Data source | ccd, ipeds, crdc, saipe, edfacts |
| filename | Dataset file | schools_ccd_directory, districts_saipe |

Example paths:

  • saipe/districts_saipe (SAIPE district poverty)
  • ccd/schools_ccd_directory (CCD school directory)
  • ccd/schools_ccd_enrollment_2022 (CCD enrollment, yearly)

See ./references/datasets-reference.md for the complete file path listing.
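
As a rough sketch of how one canonical path yields a different URL per mirror: the helpers below are hypothetical, the `{path}` placeholder is an assumed template convention (check mirrors.yaml for the real one), and the mirror hostnames are made up.

```python
from typing import Optional


def dataset_path(source: str, filename: str, year: Optional[int] = None) -> str:
    """Build the canonical {source}/{filename} path, with an optional year suffix."""
    name = filename if year is None else f"{filename}_{year}"
    return f"{source}/{name}"


def mirror_url(url_template: str, path: str) -> str:
    # Assumes the template exposes a "{path}" placeholder (hypothetical).
    return url_template.format(path=path)


path = dataset_path("ccd", "schools_ccd_directory")
# The same canonical path, with the extension supplied by each mirror's template.
parquet_url = mirror_url("https://mirror-a.example/{path}.parquet", path)
csv_url = mirror_url("https://mirror-b.example/{path}.csv", path)
```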

Format Handling

Format-specific read behavior is driven by each mirror's read_strategy field (see mirrors.yaml):

eager_parquet

python
import polars as pl

df = pl.read_parquet(url)  # Polars reads HTTP URLs natively

lazy_csv

python
import polars as pl

YEARS = [2020, 2021, 2022]  # example values
STATE_FIPS = 6              # California

# Always use lazy loading for large files
df = (
    pl.scan_csv(url, infer_schema_length=10000)
    .filter(pl.col("year").is_in(YEARS))
    .filter(pl.col("fips") == STATE_FIPS)
    .collect()
)

See ./references/fetch-patterns.md for complete code patterns.

Portal Integer Encoding

CRITICAL: The Portal uses integer codes, not string labels. This affects filtering and interpretation.

Demographic Variable Encodings

| Variable | Integer Values | NOT These Strings |
|---|---|---|
| Race | 1-7, 99 (total) | WH, BL, HI, AS, etc. |
| Sex | 1 (Male), 2 (Female), 3 (Another gender, IPEDS 2022+), 4 (Unknown gender, IPEDS 2022+), 9 (Unknown), 99 (Total) | M, F |
| Grade | -1 to 13, 99 (total) | PK, KG, 01, etc. |

Grade Encoding (SEMANTIC TRAP!)

| Value | Meaning | URL Path Equivalent |
|---|---|---|
| -1 | Pre-K (NOT missing!) | grade-pk |
| 0 | Kindergarten | grade-k |
| 1-12 | Grades 1-12 | grade-1 to grade-12 |
| 99 | Total | grade-99 |

python
# WRONG - silently drops Pre-K students (grade = -1) and still keeps the total row (99)
wrong = df.filter(pl.col("grade") >= 0)

# RIGHT - select Pre-K and the total row by their explicit codes
pre_k = df.filter(pl.col("grade") == -1)
total = df.filter(pl.col("grade") == 99)
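
To make the encoding harder to misread, a small decoder can turn the codes from the table above into labels. `grade_label` is a hypothetical helper; it covers only the codes listed in that table and raises on anything else (such as 13) rather than guessing, so consult filters-reference.md for the full set.

```python
def grade_label(code: int) -> str:
    """Translate a Portal grade code into a readable label."""
    if code == -1:
        return "Pre-K"  # -1 is a real grade level, not a missing-value marker
    if code == 0:
        return "Kindergarten"
    if 1 <= code <= 12:
        return f"Grade {code}"
    if code == 99:
        return "Total"
    # Codes not listed in the table above are left unmapped on purpose.
    raise ValueError(f"Unmapped grade code: {code}")
```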

Variable Names Are Lowercase

Portal variable names are lowercase:

  • enrollment not MEMBER
  • grade not GRADE
  • fips not FIPS

See ./references/filters-reference.md for complete encoding tables.

Common FIPS Codes

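| Code | State | Code | State | Code | State |
|---|---|---|---|---|---|
| 1 | Alabama | 17 | Illinois | 36 | New York |
| 2 | Alaska | 18 | Indiana | 37 | North Carolina |
| 4 | Arizona | 19 | Iowa | 39 | Ohio |
| 5 | Arkansas | 20 | Kansas | 40 | Oklahoma |
| 6 | California | 21 | Kentucky | 41 | Oregon |
| 8 | Colorado | 22 | Louisiana | 42 | Pennsylvania |
| 9 | Connecticut | 24 | Maryland | 44 | Rhode Island |
| 10 | Delaware | 25 | Massachusetts | 45 | South Carolina |
| 11 | DC | 26 | Michigan | 47 | Tennessee |
| 12 | Florida | 27 | Minnesota | 48 | Texas |
| 13 | Georgia | 29 | Missouri | 49 | Utah |
| 15 | Hawaii | 32 | Nevada | 51 | Virginia |
| 16 | Idaho | 34 | New Jersey | 53 | Washington |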

See ./references/filters-reference.md for complete list.
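
Because the Portal stores FIPS as integers, a name-to-code lookup avoids typing zero-padded strings such as "06" by habit. The sketch below copies a handful of entries from the table above; `STATE_FIPS` and `fips_for` are hypothetical names, and the mapping is deliberately partial.

```python
# A few codes copied from the table above; extend from filters-reference.md as needed.
STATE_FIPS = {
    "california": 6,
    "florida": 12,
    "illinois": 17,
    "new york": 36,
    "texas": 48,
}


def fips_for(state: str) -> int:
    """Return the integer FIPS code for a state name (case-insensitive)."""
    key = state.strip().lower()
    if key not in STATE_FIPS:
        raise KeyError(f"No FIPS code recorded for {state!r}")
    return STATE_FIPS[key]
```

The returned integer can be passed straight to a filter such as pl.col("fips") == fips_for("California").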

Cross-References

  • Discover endpoints: Load education-data-explorer skill to browse available endpoints and variables
  • Interpret data: Load education-data-context skill after fetching for variable meanings and caveats
  • Deep source understanding: Load education-data-source-* skills for comprehensive methodology

Data Source Skills Quick Reference

| Source | Skill | Key Fetch Considerations |
|---|---|---|
| CCD | education-data-source-ccd | Use grade-99 for totals; FRPL affected by CEP |
| CRDC | education-data-source-crdc | Biennial only; 2015+ for complete coverage; CSV requires schema_overrides for ID cols (see CRDC skill) |
| EDFacts | education-data-source-edfacts | Use _midpt vars; states not comparable |
| IPEDS | education-data-source-ipeds | GRS limited to first-time full-time |
| Scorecard | education-data-source-scorecard | High suppression; Title IV recipients only |
| SAIPE | education-data-source-saipe | Model estimates; population not enrollment |
| FSA | education-data-source-fsa | Federal aid only; 1-3 year lag |
| MEPS | education-data-source-meps | Better than FRPL for cross-state |
| PSEO | education-data-source-pseo | Experimental; check state coverage |

Topic Index

| Topic | Location |
|---|---|
| Mirror configuration | ./references/mirrors.yaml |
| Fetch code patterns | ./references/fetch-patterns.md |
| Dataset file paths | ./references/datasets-reference.md |
| URL/path naming conventions | ./references/query-patterns.md |
| Filter variables | ./references/filters-reference.md |
| Codebook/metadata URLs | ./references/datasets-reference.md (codebook column), ./references/fetch-patterns.md (get_codebook_url) |
| FIPS codes | This file, ./references/filters-reference.md |
| CCD source details | education-data-source-ccd skill |
| CRDC source details | education-data-source-crdc skill |
| EDFacts source details | education-data-source-edfacts skill |
| IPEDS source details | education-data-source-ipeds skill |
| Scorecard source details | education-data-source-scorecard skill |
| SAIPE source details | education-data-source-saipe skill |
| FSA source details | education-data-source-fsa skill |
| MEPS source details | education-data-source-meps skill |
| NHGIS source details | education-data-source-nhgis skill |