AgentSkillsCN

gcs-data-catalog

当从GCS查询丹麦农业数据时自动激活。 适用于数据发现、查找数据集、理解数据结构, 查询Parquet文件、基于CVR/CHR/BFE标识符进行数据集关联。 关键词:数据、目录、数据集、GCS、Parquet、数据结构、查询、DuckDB、pyarrow

SKILL.md
--- frontmatter
name: gcs-data-catalog
description: |
  Activates when querying Danish agricultural data from GCS.
  Use this skill for: data discovery, finding datasets, understanding schemas,
  querying parquet files, joining datasets on CVR/CHR/BFE identifiers.
  Keywords: data, catalog, datasets, GCS, parquet, schema, query, DuckDB, pyarrow

GCS Data Catalog - Master Index

This skill provides immediate access to Landbruget.dk's GCS data lake containing 18+ Danish agricultural datasets.

Quick Access

GCS Bucket: Set via GCS_BUCKET environment variable (see .env)

Medallion Architecture:

  • bronze/ - Raw data exactly as received
  • silver/ - Cleaned, validated, standardized
  • gold/ - Analysis-ready, joined datasets

Setup Code

python
import os
import pyarrow.parquet as pq
from google.cloud import storage

# Initialize GCS client
client = storage.Client()
bucket_name = os.environ.get('GCS_BUCKET')  # Set in .env
bucket = client.bucket(bucket_name)

# Read parquet from GCS
def read_gcs_parquet(gcs_path: str):
    """Read parquet file from GCS path like 'silver/subsidies/*/data.parquet'"""
    import io
    blob = bucket.blob(gcs_path)
    buffer = io.BytesIO()
    blob.download_to_file(buffer)
    buffer.seek(0)
    return pq.read_table(buffer).to_pandas()

Data Categories (Frontend-Aligned)

CategoryDanish NameSkill PathKey JoinMetrics
FinanceØkonomigcs-data-catalog/okonomi/cvr_number3
Agricultural LandLandbrugsarealgcs-data-catalog/landbrugsareal/field_id, cvr_number4
EnvironmentMiljøgcs-data-catalog/miljo/geometry, field_id8
LivestockHusdyrgcs-data-catalog/husdyr/chr_number6
EmployeesMedarbejderegcs-data-catalog/medarbejdere/cvr_number5

Key Identifiers

IdentifierFormatDescriptionValidation
CVR8 digitsCompany registration number^\d{8}$
CHR6 digitsCentral Husbandry Register (herd ID)^\d{6}$
BFEVariableCadastral parcel numbervaries
field_idStringField identifier from FVMvaries
field_uuidUUIDUnique field identifierUUID format

Dataset Quick Reference

Økonomi (Finance)

DatasetPathRowsKey Columns
Subsidiessilver/subsidies/554Kcvr_number, tilskudsberetigt
CVR Enrichmentgold/cvr_enrichment/*/variescvr_number, company data
Property Ownerssilver/property_owners/8.2MCVRNummer, owner info

Landbrugsareal (Agricultural Land)

DatasetPathRowsKey Columns
FVM Marker (fields)silver/fvm_marker_{year}/617K/yearfield_id, cvr_number, crop_code, geometry
Field Productiongold/field_production_{year}/617K/yearfield_id, yield_estimate, crop_type
Agricultural Blockssilver/agricultural_blocks_{year}/variesblock_id, geometry
Cadastralsilver/cadastral/2.16Mbfe_number, geometry

Miljø (Environment)

DatasetPathRowsKey Columns
Pesticide Disaggregationgold/pesticide_disaggregation_{year}/1.52Mcvr_number, PesticideName, DosageQuantity
NLES5 Nitrogengold/nles5_nitrogen_*/500Kfield_id, nitrogen_washout_kg_ha
BNBO Statussilver/bnbo_status/5.4Kgeometry, status_bnbo
Wetlandssilver/wetlands/1.7Mgeometry, toerv_pct

Husdyr (Livestock)

DatasetPathRowsKey Columns
Svineflytningsilver/svineflytning/*/movements.parquet1.27Msender_chr_number, receiver_chr_number, total_animals
CHR Movementsbronze/chr/*/chr_dyr_movement_summaries.parquet124Kreporting_herd_number, animal_count
Animal Welfaresilver/animal welfare/varieschr_number

Medarbejdere (Employees)

DatasetPathRowsKey Columns
Arbejdstilsynetgold/arbejdstilsynet_inspections/536cvr_number, decision, severity_score
Work Permitssilver/work permits/variescvr_number
Worker Safetysilver/worker safety/variescvr_number

Common Queries

List Available Years for a Dataset

bash
gsutil ls gs://$GCS_BUCKET/silver/fvm_marker_*/

Check Dataset Schema

python
import os
import pyarrow.parquet as pq
from google.cloud import storage
import io

client = storage.Client()
bucket_name = os.environ.get('GCS_BUCKET')
bucket = client.bucket(bucket_name)

# Get first parquet file and read schema
blob = bucket.blob('silver/subsidies/2025-01-10T00:00:26.377177/data.parquet')
buffer = io.BytesIO()
blob.download_to_file(buffer)
buffer.seek(0)
schema = pq.read_schema(buffer)
print(schema)

Query Specific CVR

python
df = read_gcs_parquet('silver/subsidies/2025-01-10T00:00:26.377177/data.parquet')
company_data = df[df['cvr_number'] == '31373077']

Cross-Dataset Joins

CVR-based joins (most common)

python
# Join subsidies with pesticides on CVR
subsidies = read_gcs_parquet('silver/subsidies/*/data.parquet')
pesticides = read_gcs_parquet('gold/pesticide_disaggregation_2024/*/data.parquet')
merged = subsidies.merge(pesticides, on='cvr_number', how='inner')

Field-based joins

python
# Join field production with nitrogen estimates
field_prod = read_gcs_parquet('gold/field_production_2024/*/data.parquet')
nitrogen = read_gcs_parquet('gold/nles5_nitrogen_2024/*/data.parquet')
merged = field_prod.merge(nitrogen, on=['field_id', 'cvr_number'], how='inner')

CHR-based joins

python
# Join movements with animal welfare
movements = read_gcs_parquet('silver/svineflytning/*/movements.parquet')
welfare = read_gcs_parquet('silver/animal welfare/*/data.parquet')
# Join on sender or receiver CHR

Data Update Schedule

LayerFrequencyNotes
BronzeWeekly (Mondays 2AM UTC)Immutable, timestamped
SilverAfter bronze updateCleaned, validated
GoldAfter silver updateAnalysis-ready

Related Skills

  • okonomi/ - Financial data: subsidies, property values
  • landbrugsareal/ - Field and crop data: FVM marker, production
  • miljo/ - Environmental data: pesticides, nitrogen, BNBO
  • husdyr/ - Livestock data: CHR, movements, welfare
  • medarbejdere/ - Employee data: inspections, safety

Troubleshooting

Authentication

bash
# Check GCS access
gcloud auth application-default login
gsutil ls gs://$GCS_BUCKET/

Large Files

For datasets > 1GB, use DuckDB or chunked reading:

python
import duckdb
# Query directly without loading into memory
result = duckdb.query("""
    SELECT cvr_number, SUM(area_ha) as total_area
    FROM 'gs://$GCS_BUCKET/gold/field_production_2024/*/data.parquet'
    GROUP BY cvr_number
""").df()

CRS Conversion

All geometry is stored in EPSG:4326 (WGS84). For Danish coordinates (EPSG:25832):

python
import geopandas as gpd
gdf = gdf.to_crs('EPSG:25832')  # Convert to UTM 32N