AgentSkillsCN

joining-data

本技能介绍了如何将数据与 Our World In Data 的数据进行关联,例如,当你希望利用 OWID 人口数据计算人均数据,或希望基于 OWID 的 GDP 数据绘制人均 GDP 散点图时。此外,本技能也适用于其他场景:当多个数据源各自拥有涵盖多个国家的数据时,可将这些数据源整合在一起。

SKILL.md
--- frontmatter
name: "joining-data"
description: "This skill describes how to join data with Our World In Data data, e.g. when you want to calculate per capita data by using the OWID population data or when you want to create a scatter plot against GDP per capita using OWID's GDP data. It also applies to other cases where several data sources that each have data for multiple countries should be joined together."
allowed-tools:
- "Bash(duckdb:*)"

Our World in Data supplies a lot of important data at a national level for multiple years.

Joining data

When joining data from Our World In Data with that of other sources, the following should be taken into account:

  • OWID data usually comes as a dataframe with two dimensions: time (almost always the year) and entity (almost always the country and/or geographic region like continents or World)
  • OWID data uses harmonized country names and region codes (ISO alpha-3 for standard countries, custom codes for unusual regions like OWID_WRL for World)
  • If external data has ISO alpha-3 codes, use those for joining
  • If external data uses other identifiers (e.g. ISO alpha-2), map them to ISO alpha-3 first
  • OWID uses modern country borders for historical data (e.g. Italy population in 1 CE uses modern Italian borders)
  • Regional aggregates like "Europe" or "Sub-saharan Africa" usually differ between sources - do not join these by name
  • When external data lacks a year column, determine the correct year from documentation or ask the user, then join on that year

OWID data that is commonly used for joins

Some data that OWID provides, like population, is especially useful, for example to convert metrics into per-capita data. Recommendations for these are given below. To understand how to download data given a chart url, consult the fetch-chart-data skill.

For population there are two relevant time series.

  • the long-run population numbers, used in the chart https://ourworldindata.org/grapher/population that merges several data sources to provide data from 10.000 BCE to the present (up to one to three years ago). The precise temporal extent can be quickly queried by fetching the metadata for this chart, and reading $.columns.[0].timespan
  • the UN population data from 1950 with projections up to 2100 (medium scenario) used in the chart https://ourworldindata.org/grapher/population-with-un-projections . Use this if you need years more recent than what is available in the long-run population dataset or for the current or future years.

For GDP per capita, there are similarly two time series: