
education-data-source-nhgis

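IPUMS NHGIS (National Historical Geographic Information System) census geography and demographic data for education research. Use when working with census geography, demographic data for school communities, time series analysis, geographic crosswalks, or linking schools to census tracts/block groups. Portal data is pre-processed crosswalks; direct NHGIS requires IPUMS registration.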

SKILL.md
---
name: education-data-source-nhgis
description: >-
  IPUMS NHGIS (National Historical Geographic Information System) census
  geography and demographic data for education research. Use when working with
  census geography, demographic data for school communities, time series
  analysis, geographic crosswalks, or linking schools to census tracts/block
  groups. Portal data is pre-processed crosswalks; direct NHGIS requires
  IPUMS registration.
metadata:
  audience: data-analysts
  domain: education-data
---

NHGIS Data Source Reference

NHGIS is a source of census geography and demographic data for education research. It provides the foundation for linking schools to community characteristics via census tracts, block groups, and school district boundaries.

CRITICAL: Value Encoding

When accessing NHGIS data through the Education Data Portal (not NHGIS directly), categorical variables use integer encodings, not string labels. Always verify the exact codes in the mirror codebook.

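| Variable | Integer Code | Meaning |
| --- | --- | --- |
| census_region | 1 | Northeast |
| census_region | 2 | Midwest |
| census_region | 3 | South |
| census_region | 4 | West |
| cbsa_type | 1 | Metropolitan |
| cbsa_type | 2 | Micropolitan |
| geocode_accuracy | 4 | Did not geocode |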

See ./references/variable-catalog.md for complete encoding tables.
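As a minimal sketch of working with these encodings, the snippet below maps integer codes to labels using only the pairs listed in the table above. The `decode` helper and the second `ncessch` value are hypothetical, and codes not listed here (for example `census_region` 9) deliberately decode to `None` rather than a guessed label.

```python
from typing import Optional

# Labels copied from the encoding table above. Other codes may exist,
# so confirm against the codebook before relying on this mapping.
CODE_LABELS = {
    "census_region": {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"},
    "cbsa_type": {1: "Metropolitan", 2: "Micropolitan"},
    "geocode_accuracy": {4: "Did not geocode"},
}


def decode(variable: str, code: Optional[int]) -> Optional[str]:
    """Return the documented label, or None for null or undocumented codes."""
    if code is None:
        return None
    return CODE_LABELS.get(variable, {}).get(code)


# Illustrative rows; the second ncessch value is made up.
rows = [
    {"ncessch": 10000201704, "census_region": 3, "cbsa_type": 1},
    {"ncessch": 10000201705, "census_region": 9, "cbsa_type": None},
]

labeled = [
    {
        **row,
        "census_region_label": decode("census_region", row["census_region"]),
        "cbsa_type_label": decode("cbsa_type", row["cbsa_type"]),
    }
    for row in rows
]
```

Returning `None` for unknown codes keeps undocumented values visible as gaps instead of silently mislabeling them.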

What is NHGIS?

NHGIS (from IPUMS, University of Minnesota) provides free access to census geography and demographic data.

  • Collector: IPUMS, University of Minnesota
  • Coverage: US census data from 1790-present (decennial census + ACS)
  • Content: Summary tables, GIS boundary files, time series tables, geographic crosswalks
  • Frequency: Decennial census (every 10 years) + ACS (annual, 5-year rolling)
  • Available years: 1790-2020 (decennial), 2005-2023 (ACS 5-year)
  • Primary identifiers: GISJOIN (NHGIS internal), GEOID (Census Bureau standard)
  • Education relevance: Links school locations to community demographics via census tracts, block groups, and school district boundaries

Reference File Structure

| File | Purpose | When to Read |
| --- | --- | --- |
| geographic-units.md | Census geography hierarchy (tracts, blocks, districts) | Understanding census geography |
| school-geography-links.md | Linking schools to census areas | Connecting school data to demographics |
| time-series.md | Historical data, harmonization methods | Longitudinal analysis |
| variable-catalog.md | Key demographic variables, codes, special values | Selecting census variables or interpreting encodings |
| boundary-changes.md | How boundaries change between censuses | Handling geographic inconsistencies |
| data-access.md | Direct NHGIS access methods (registration, Data Finder, ipumspy) | Custom census analysis beyond Portal |

Decision Trees

What geographic level should I use?

```
Research question about...
├─ Individual schools
│   ├─ School's immediate neighborhood → Census tract or block group
│   ├─ School attendance zone → SABINS (limited years) or block-to-school crosswalk
│   └─ School district overall → School district boundaries
├─ School districts
│   ├─ District-level demographics → School district geographic level
│   ├─ Within-district variation → Census tracts within district
│   └─ District poverty estimates → SAIPE (via Education Data Portal)
├─ Regional patterns
│   ├─ County-level → County boundaries
│   ├─ Metro area → CBSA (Core Based Statistical Area)
│   └─ State-level → State boundaries
└─ Historical analysis
    ├─ Consistent boundaries needed → Geographically standardized tables
    └─ Original boundaries OK → Nominally integrated tables
```

How do I link schools to census data?

```
Linking schools to census demographics?
├─ Have school coordinates (lat/lon)
│   ├─ Point-in-polygon → Spatial join to tract/block group boundaries
│   └─ Need tract ID only → Geocoding service or FCC API
├─ Have school NCES ID only
│   ├─ Use NCES EDGE files → School District Geographic Relationship Files
│   └─ Use Education Data Portal → NHGIS source provides tract links
├─ Need school attendance zones
│   ├─ 2009-2012 data → SABINS school areas
│   └─ Current data → Contact school district (no national source)
└─ See ./references/school-geography-links.md for details
```
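To make the point-in-polygon branch concrete, here is a standard-library sketch of the ray-casting test that underlies a spatial join. In practice a GIS library working on real NHGIS boundary files would do this. The tract outlines, IDs, and the `locate` helper are hypothetical, and the sketch ignores holes, multipart polygons, and points lying exactly on a boundary.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test. `ring` is a list of (lon, lat) vertices, not closed."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical tract outlines (simple squares), keyed by a made-up tract ID.
tracts = {
    "tract_A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "tract_B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def locate(lon, lat):
    """Return the ID of the first tract containing the point, else None."""
    for tract_id, ring in tracts.items():
        if point_in_polygon(lon, lat, ring):
            return tract_id
    return None
```

A school at `(1.0, 1.0)` would resolve to `tract_A`, while a point outside both squares returns `None` and should be treated as unmatched.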

What time period data do I need?

```
Time period?
├─ Single recent year
│   ├─ Tract/block group level → ACS 5-year (most recent)
│   ├─ Larger areas (65K+ pop) → ACS 1-year
│   └─ Full census count → 2020 Decennial Census
├─ Historical comparison
│   ├─ Same boundaries across time → Geographically standardized tables (to 2010)
│   ├─ Original boundaries → Nominally integrated time series
│   └─ Custom standardization → Use geographic crosswalks
├─ Long time series (1970+)
│   └─ See ./references/time-series.md
└─ Pre-1970
    └─ Limited tract coverage; county/state more complete
```
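For the custom standardization branch, the sketch below shows how interpolation weights from a geographic crosswalk are typically applied: each source-unit count is multiplied by its weight and summed into the target unit. The unit IDs, weights, counts, and the `reallocate` helper are all hypothetical and do not reflect the column layout of an actual NHGIS crosswalk file.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (source_id, target_id, weight).
# Weights for each source unit sum to 1, so totals are preserved.
crosswalk = [
    ("src_A", "tgt_1", 1.0),
    ("src_B", "tgt_1", 0.25),
    ("src_B", "tgt_2", 0.75),
]

# Hypothetical counts reported on the source-year boundaries.
source_counts = {"src_A": 1200, "src_B": 800}


def reallocate(counts, xwalk):
    """Allocate source-unit counts to target units using crosswalk weights."""
    out = defaultdict(float)
    for src, tgt, weight in xwalk:
        out[tgt] += counts.get(src, 0) * weight
    return dict(out)


target_counts = reallocate(source_counts, crosswalk)
```

This weighting is appropriate for counts such as population or households. Medians, rates, and other non-additive measures need their numerators and denominators reallocated separately.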

Quick Reference: Geographic Levels and Variables

Geographic Levels

| Level | Typical Size | Education Use | NHGIS Coverage |
| --- | --- | --- | --- |
| Block | ~40 people | Point locations | 1990-2020 |
| Block Group | ~1,500 people | School neighborhoods | 1990-2020 |
| Census Tract | ~4,000 people | Community context | 1910-2020 |
| County Subdivision | Varies | Rural areas | 1980-2020 |
| Place | City/town | Urban context | 1980-2020 |
| School District | Varies | District analysis | 2000-2020 |
| County | ~100,000 people | Regional patterns | 1790-2020 |
| State | Varies | Policy analysis | 1790-2020 |

Key Identifiers

| ID | Format | Level | Example | Notes |
| --- | --- | --- | --- | --- |
| ncessch | Int64 | School | 10000201704 | NCES school ID (schools Portal data) |
| unitid | Int64 | College | 100654 | IPEDS institution ID (colleges Portal data) |
| GISJOIN | String with prefix | Any | G0600010 | NHGIS internal ID; use for direct NHGIS joins (not in Portal data) |
| GEOID | Numeric string | Any | 06001402100 | Census Bureau standard; use for non-NHGIS joins (not in Portal data) |
| tract | Int64 | Tract | 402100 | Census tract number (in Portal data) |
| block_group | Int64 | Block Group | 1 | Block group within tract (1-9; 0=unassigned) |
| geoid_block | Int64 | Block | 60014021001001 | Full block FIPS code (in Portal data; stored as Int64, not String) |
| cbsa | Int64 | Metro area | 41860 | Core Based Statistical Area code (2000+ census files only) |
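Because `geoid_block` is stored as Int64, the leading zero of a state FIPS code such as `06` is lost. The sketch below restores the 15-digit block code (2 state + 3 county + 6 tract + 4 block digits) and derives a tract-level GEOID. The helper names are hypothetical. The GISJOIN construction assumes the common NHGIS pattern of `G`, then the state and county codes each followed by `0`, which matches the county example `G0600010` above but should be checked against NHGIS documentation for historical units.

```python
def split_block_geoid(geoid_block: int) -> dict:
    """Split an integer block FIPS code into zero-padded string parts."""
    if geoid_block <= 0:
        raise ValueError(f"not a valid block code: {geoid_block}")
    code = str(geoid_block).zfill(15)  # restore a dropped leading zero
    if len(code) != 15:
        raise ValueError(f"expected at most 15 digits, got {code!r}")
    return {
        "state": code[0:2],
        "county": code[2:5],
        "tract": code[5:11],
        "block": code[11:15],
        "block_group": code[11],  # first digit of the block number
    }


def tract_geoid(parts: dict) -> str:
    """Census Bureau tract GEOID: state + county + tract."""
    return parts["state"] + parts["county"] + parts["tract"]


def tract_gisjoin(parts: dict) -> str:
    """Assumed GISJOIN pattern: 'G' + state + '0' + county + '0' + tract."""
    return "G" + parts["state"] + "0" + parts["county"] + "0" + parts["tract"]


parts = split_block_geoid(60014021001001)
```

For the example block code from the table, `tract_geoid(parts)` gives `06001402100`, the same tract GEOID shown in the GEOID row.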

Key Education Variables

| Topic | Example Variables | Source |
| --- | --- | --- |
| Child population | Under 18, 5-17 school-age | Decennial, ACS |
| Race/ethnicity | Hispanic, White, Black, Asian, etc. | Decennial, ACS |
| Poverty | Persons below poverty, SNAP receipt | ACS (sample) |
| Education attainment | HS diploma, BA+ (adults) | ACS (sample) |
| Language | English proficiency, language at home | ACS (sample) |
| Housing | Owner/renter, median value, crowding | Decennial, ACS |
| Family structure | Single-parent, grandparent households | ACS (sample) |
| Immigration | Foreign-born, recent immigrants | ACS (sample) |

Data Sources by Type

| Source | Years | Geographic Detail | Content |
| --- | --- | --- | --- |
| Decennial Census | 1790-2020 | Block (1990+) | 100% count: age, sex, race, housing |
| ACS 5-Year | 2005-2023 | Block group | Sample: income, education, language |
| ACS 1-Year | 2010-2023 | Areas 65K+ pop | Sample: same as 5-year |
| Time Series | 1790-2020 | Varies | Harmonized across years |
| Geographic Crosswalks | 1990-2020 | Block+ | Interpolation weights |

Portal Variables (Schools NHGIS)

Key geographic and identifying columns in the schools NHGIS datasets. Census 2020 files have 47 columns; earlier census years have fewer (e.g., 1990 has 35 columns — no CBSA or legislative district fields).

| Variable | Description | Type |
| --- | --- | --- |
| ncessch | NCES school ID | Int64 |
| leaid | NCES district ID | Int64 |
| tract | Census tract number | Int64 |
| block_group | Block group number (1-9; 0 = unassigned) | Int64 |
| geoid_block | Full block FIPS identifier | Int64 |
| census_region | Census Bureau region (1-4, 9) | Int64 |
| census_division | Census Bureau division (1-9) | Int64 |
| cbsa | CBSA code (2000+ census files only) | Int64 |
| cbsa_type | Metropolitan (1) or Micropolitan (2) | Int64 |
| cbsa_city | Principal city indicator (0=No, 1=Yes; 2000+ only). See note below. | Int64 |
| geocode_accuracy | Geocode confidence (1=High, 2=Medium, 3=Low, 4=Did not geocode, -2=N/A) | Float64 |
| geocode_accuracy_detailed | Geocode match type (1-12) | Int64 |
| class_code | FIPS place class code | Int64 |
| lower_chamber_type | State legislative district lower chamber type (1-8; census 2010 only). See variable-catalog.md for code mapping. | Int64 |
| geo_latitude / geo_longitude | Geocoded coordinates | Float64 |
| latitude / longitude | CCD-reported coordinates (many nulls in early years) | Float64 |
| fips | State FIPS code | Int64 |
| puma | Public Use Microdata Area (2000+ census files only) | Int64 |

Portal Variables (Colleges NHGIS)

Colleges NHGIS datasets have 38 columns (2020 census). Different identifier set from schools.

| Variable | Description | Type |
| --- | --- | --- |
| unitid | IPEDS institution ID | Int64 |
| opeid | Office of Postsecondary Education ID | String |
| tract | Census tract number | Int64 |
| block_group | Block group number (1-9) | Int64 |
| geoid_block | Full block FIPS identifier | Int64 |
| census_region | Census Bureau region (1-4, 9) | Int64 |
| census_division | Census Bureau division (1-9) | Int64 |
| cbsa | CBSA code | Int64 |
| cbsa_type | Metropolitan (1) or Micropolitan (2) | Int64 |
| cbsa_city | Principal city indicator (0=No, 1=Yes; 2000+ only) | Int64 |
| geocode_accuracy | Geocode match score (Int64 in colleges, Float64 in schools) | Int64 |
| county_fips | County FIPS code | Int64 |
| county_name | County name | String |
| state_abbr | State abbreviation | String |

Missing Data Codes

| Code | Meaning | When Used |
| --- | --- | --- |
| -2 | Not geocoded | geocode_accuracy field in Portal data |
| -1 | Missing/not reported | General missing data indicator (e.g., latitude, county_code) |
| 0 | Unassigned | block_group (rare, ~4 rows in schools) |
| null | Not available | Variable not applicable to this record; many columns heavily null in early years |
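One way to handle these codes before analysis is to convert them to nulls column by column, so a `-1` latitude is never averaged as if it were a coordinate. The sketch below is a plain-Python illustration. The `SENTINELS` mapping and `clean_row` helper are hypothetical, the column list covers only the examples named in the table, and treating `block_group` 0 as missing is an analysis choice rather than a rule from the codebook.

```python
# Sentinel codes per column, taken from the table above.
# Extend this mapping after checking the codebook for each dataset.
SENTINELS = {
    "geocode_accuracy": {-2},
    "latitude": {-1},
    "county_code": {-1},
    "block_group": {0},
}


def clean_row(row: dict) -> dict:
    """Replace column-specific sentinel codes with None; keep other values."""
    return {
        col: (None if val in SENTINELS.get(col, set()) else val)
        for col, val in row.items()
    }


# Illustrative record; geocode_accuracy is a float, as in the schools files.
raw = {
    "ncessch": 10000201704,
    "geocode_accuracy": -2.0,
    "latitude": -1,
    "block_group": 1,
    "census_region": 3,
}
cleaned = clean_row(raw)
```

Keeping the mapping per column matters because the same integer can be a valid value elsewhere, for example `0` in `cbsa_city` means "No", not missing.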

Schema Difference: Schools NHGIS 2020 files (47 columns) have a different schema than colleges NHGIS 2020 files (38 columns). Schools data includes school-specific identifiers (ncessch, leaid, school_name, mailing/location address fields) while colleges data includes institution-specific identifiers (unitid, opeid, inst_name, county_name). Both entity types have block-level geographic precision. Earlier census years have fewer columns (e.g., Schools 1990 has 35 columns — no CBSA or legislative district fields). Do not assume identical column structures when working across entities or census years.
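A quick way to act on this warning is to diff column lists before joining or stacking files. The sketch below uses short, illustrative subsets of the column names listed in the tables above, not the full 47- and 38-column schemas, and `compare_schemas` is a hypothetical helper.

```python
def compare_schemas(left_cols, right_cols):
    """Report which column names are shared and which appear on one side only."""
    left, right = set(left_cols), set(right_cols)
    return {
        "shared": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }


# Illustrative subsets only; read the real column lists from the files.
schools_cols = ["ncessch", "leaid", "tract", "block_group", "geoid_block", "cbsa"]
colleges_cols = ["unitid", "opeid", "tract", "block_group", "geoid_block", "cbsa"]

diff = compare_schemas(schools_cols, colleges_cols)
```

The same check applies across census years of one entity, for instance to confirm that a 1990 file lacks `cbsa` before filtering on it.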

Data Access

Datasets for NHGIS are available via the mirror system. See datasets-reference.md for canonical paths, mirrors.yaml for mirror configuration, and fetch-patterns.md for fetch code patterns.

| Dataset | Type | Years | Path | Codebook |
| --- | --- | --- | --- | --- |
| Schools Census 1990 | Single | 1986-2023 | nhgis/schools_nhgis_geog_1990 | nhgis/codebook_schools_nhgis_census1990 |
| Schools Census 2000 | Single | 1986-2023 | nhgis/schools_nhgis_geog_2000 | nhgis/codebook_schools_nhgis_census2000 |
| Schools Census 2010 | Single | 1986-2023 | nhgis/schools_nhgis_geog_2010 | nhgis/codebook_schools_nhgis_census2010 |
| Schools Census 2020 | Single | 1986-2023 | nhgis/schools_nhgis_geog_2020 | nhgis/codebook_schools_nhgis_census2020 |
| Colleges Census 1990 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_1990 | nhgis/codebook_colleges_nhgis_census1990 |
| Colleges Census 2000 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_2000 | nhgis/codebook_colleges_nhgis_census2000 |
| Colleges Census 2010 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_2010 | nhgis/codebook_colleges_nhgis_census2010 |
| Colleges Census 2020 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_2020 | nhgis/codebook_colleges_nhgis_census2020 |

Codebooks are .xls files co-located with data in all mirrors. Use get_codebook_url() from fetch-patterns.md to construct download URLs.
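The dataset and codebook paths in the table follow a regular naming pattern, which the sketch below reproduces. `nhgis_paths` is a hypothetical helper that returns only the relative paths shown in the table. It does not build mirror URLs and is not a substitute for `get_codebook_url()`, whose signature is defined in fetch-patterns.md.

```python
ENTITIES = ("schools", "colleges")
CENSUS_YEARS = (1990, 2000, 2010, 2020)


def nhgis_paths(entity: str, census_year: int) -> dict:
    """Return the relative data and codebook paths for one NHGIS dataset."""
    if entity not in ENTITIES:
        raise ValueError(f"entity must be one of {ENTITIES}, got {entity!r}")
    if census_year not in CENSUS_YEARS:
        raise ValueError(
            f"census_year must be one of {CENSUS_YEARS}, got {census_year!r}"
        )
    return {
        "data": f"nhgis/{entity}_nhgis_geog_{census_year}",
        "codebook": f"nhgis/codebook_{entity}_nhgis_census{census_year}",
    }


paths = nhgis_paths("schools", 2020)
```

Validating the entity and census year up front turns a typo into an immediate error instead of a failed download later.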

Truth Hierarchy: When interpreting variable values, apply this priority:

  1. Actual data file (what you observe in the parquet/CSV) — this IS the truth
  2. Live codebook (.xls in mirror) — authoritative documentation, may lag
  3. This skill documentation — convenient summary, may drift from codebook

If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
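The hierarchy can be applied mechanically by comparing the values observed in a column with the codes a document lists for it. The sketch below is illustrative: `undocumented_codes` is a hypothetical helper, and the observed values are made up rather than read from a real file.

```python
def undocumented_codes(observed, documented):
    """Return observed non-null values that the documentation does not list."""
    seen = {value for value in observed if value is not None}
    return sorted(seen - set(documented))


# Made-up column values; in practice these come from the data file itself.
observed_region = [1, 3, 3, 9, None, 2]

# Codes that have a label in this document's encoding summary.
documented_region = {1, 2, 3, 4}

to_investigate = undocumented_codes(observed_region, documented_region)
```

Any value returned here is a prompt to open the codebook, not a reason to drop rows: the data is the top of the hierarchy, so an unexplained code means the documentation needs checking.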

Filtering

```python
import polars as pl

# Assumes `df` is a schools NHGIS DataFrame already loaded from a mirror
# (see fetch-patterns.md for fetch code patterns).

# Filter to a specific school (ncessch is Int64, so compare to an integer)
school_census = df.filter(pl.col("ncessch") == 10000201704)

# Filter to metropolitan areas only (cbsa_type only in 2000+ census files)
metro = df.filter(pl.col("cbsa_type") == 1)

# Filter to a specific census region (3 = South)
south = df.filter(pl.col("census_region") == 3)

# Filter to a specific year
recent = df.filter(pl.col("year") == 2023)
```

Note: The Portal provides pre-processed school/college-to-census-geography links. For custom census analysis (tract-level demographics, time series, boundary files), use NHGIS directly via methods in ./references/data-access.md (requires free IPUMS registration).

Common Pitfalls

| Pitfall | Issue | Solution |
| --- | --- | --- |
| Boundary changes | Tracts split/merged between censuses break longitudinal analysis | Use crosswalks or geographically standardized tables |
| ACS margins of error | Small-area estimates have high uncertainty | Check MOE; aggregate areas if needed |
| Block data limitations | Only 100% count variables available (no income/poverty) | Use block groups for sample data (ACS) |
| GISJOIN vs GEOID | Different ID formats cause join failures | Use GISJOIN for NHGIS joins, GEOID for Census Bureau joins |
| 2020 Census noise | Differential privacy added noise to small-area counts | Check for negative values; prefer ACS for detailed characteristics |
| Schools vs colleges schema | Different column counts (47 vs 38 for 2020) and identifier sets | Check schema before joining; do not assume identical structures |
| Census year schema drift | Earlier census files have fewer columns (e.g., 1990 lacks CBSA/legislative fields) | Check available columns per census year before relying on them |
| geocode_accuracy type | Float64 in schools, Int64 in colleges | Cast to consistent type before cross-entity comparison |
| Using string codes | Portal data uses integer encodings, not string labels | Always verify codes against codebook (see encoding warning above) |
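For the ACS margin-of-error pitfall, the sketch below shows the usual approximation for combining estimates: the MOE of a sum is the square root of the sum of squared MOEs, and the coefficient of variation divides the standard error (MOE / 1.645, since ACS MOEs are published at the 90 percent level) by the estimate. The tract values and helper names are made up for illustration, and the approximation ignores covariance between areas.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90% confidence level


def aggregate(estimates, moes):
    """Sum estimates and approximate the MOE of the sum (root sum of squares)."""
    total = sum(estimates)
    moe = math.sqrt(sum(m * m for m in moes))
    return total, moe


def coefficient_of_variation(estimate, moe):
    """Standard error (MOE / 1.645) as a share of the estimate."""
    return (moe / Z_90) / estimate


# Made-up counts of children in poverty for two neighboring tracts.
tract_estimates = [100, 200]
tract_moes = [30, 40]

combined_estimate, combined_moe = aggregate(tract_estimates, tract_moes)
combined_cv = coefficient_of_variation(combined_estimate, combined_moe)
single_cv = coefficient_of_variation(tract_estimates[0], tract_moes[0])
```

Here the combined area has a lower coefficient of variation than the first tract alone, which is the reason aggregation helps when small-area estimates are too noisy.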

Related Data Sources

| Source | Relationship | When to Use |
| --- | --- | --- |
| education-data-source-ccd | School identifiers for linking | Join school data to census geography via ncessch |
| education-data-source-saipe | District-level poverty | Use SAIPE for district poverty; NHGIS for tract/block group poverty |
| education-data-source-meps | School-level poverty | MEPS provides school-level poverty estimates; NHGIS provides community context |
| education-data-source-ipeds | College identifiers for linking | Join college data to census geography via unitid |
| education-data-explorer | Parent discovery skill | Finding available endpoints |
| education-data-query | Data fetching | Downloading parquet/CSV files |

Topic Index

| Topic | Reference File |
| --- | --- |
| Census tract definition | ./references/geographic-units.md |
| Block group definition | ./references/geographic-units.md |
| School district boundaries | ./references/geographic-units.md |
| School-to-tract linking | ./references/school-geography-links.md |
| SABINS attendance areas | ./references/school-geography-links.md |
| NCES EDGE files | ./references/school-geography-links.md |
| Time series tables | ./references/time-series.md |
| Geographic standardization | ./references/time-series.md |
| Geographic crosswalks | ./references/time-series.md |
| Population variables | ./references/variable-catalog.md |
| Income/poverty variables | ./references/variable-catalog.md |
| Education variables | ./references/variable-catalog.md |
| Tract boundary changes | ./references/boundary-changes.md |
| 2022 Connecticut changes | ./references/boundary-changes.md |
| TIGER/Line versions | ./references/boundary-changes.md |
| Direct NHGIS access | ./references/data-access.md |
| ipumspy Python package | ./references/data-access.md |
| Data Finder workflow | ./references/data-access.md |