process-ema-data

Parses EMA-approved product PDF files into structured TSV format. Use when asked to parse, extract, or process raw EMA product PDF files, or to fetch/download new EMA data.

SKILL.md
--- frontmatter
name: process-ema-data
description: Parse EMA Authorised Presentations PDF files into structured TSV format. Use when asked to parse, extract, or process raw EMA product PDFs, or to fetch/download new EMA data.

Process Data: EMA Product PDFs

Parse EMA Authorised Presentations PDF files into structured TSV format.

When to Use

Use this skill when asked to parse an EMA product PDF, extract packaging data from a product folder, update parsed data, or fetch new or updated PDFs from EMA.

Input

  • PDF files in numbered subfolders: data/ema/products/<ema_number>/<ema_number>_<product_name>_<date>.pdf
  • Example: data/ema/products/006156/006156_stoboclo_2026-01-14.pdf
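
The path convention above can be split into its parts with a small helper. This is a minimal sketch, not part of the skill's scripts: `parse_pdf_path` and `PdfInfo` are hypothetical names, and it assumes the EMA number is all digits and the date is in `YYYY-MM-DD` form, as in the example path.

```python
import re
from datetime import date
from pathlib import Path
from typing import NamedTuple

# Assumed filename pattern: <ema_number>_<product_name>_<YYYY-MM-DD>.pdf
PDF_NAME = re.compile(
    r"^(?P<ema_number>\d+)_(?P<product_name>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.pdf$"
)


class PdfInfo(NamedTuple):
    ema_number: str
    product_name: str
    pdf_date: date


def parse_pdf_path(path: str) -> PdfInfo:
    """Split a product PDF path into EMA number, product name, and date."""
    p = Path(path)
    m = PDF_NAME.match(p.name)
    if m is None:
        raise ValueError(f"unexpected PDF filename: {p.name}")
    # The numbered subfolder is expected to match the filename prefix.
    if p.parent.name != m["ema_number"]:
        raise ValueError(f"folder {p.parent.name!r} does not match {m['ema_number']!r}")
    return PdfInfo(m["ema_number"], m["product_name"], date.fromisoformat(m["date"]))


info = parse_pdf_path("data/ema/products/006156/006156_stoboclo_2026-01-14.pdf")
print(info.ema_number, info.product_name, info.pdf_date.isoformat())
```

Keeping the EMA number as a string preserves its leading zeros.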

Output

  • TSV file: parsed_data_<today>.tsv in the same subfolder as the PDF
  • Example: data/ema/products/006156/parsed_data_2026-01-16.tsv

TSV Columns

| Column | Description | PDF Source |
| --- | --- | --- |
| ma_number | Marketing Authorization number | "MA (EU) number" |
| ema_product_number | EMA product identifier (format: EMEA/H/C/XXXXXX) | Derived from folder name |
| strength | Drug strength | "Strength" |
| pharmaceutical_form | Form of the medication | "Pharmaceutical Form" |
| route_of_administration | How the drug is administered | "Route of Administration" |
| packaging | Container type | "Immediate Packaging" |
| content | Volume/amount with concentration | "Content (concentration)" |
| pack_size | Number of units in pack | "Pack size" |
| product_name | Brand/invented name | "(Invented) name" |
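
The column mapping can be written down as data, which keeps the column order and the PDF labels in one place. The sketch below is illustrative only: `COLUMNS`, `PDF_SOURCE`, `ema_product_number`, and `build_row` are hypothetical names, and it assumes the folder name is used verbatim (leading zeros included) after the `EMEA/H/C/` prefix.

```python
# Output column order, as listed in the table above.
COLUMNS = [
    "ma_number", "ema_product_number", "strength", "pharmaceutical_form",
    "route_of_administration", "packaging", "content", "pack_size", "product_name",
]

# PDF label for every column that is read directly from the PDF.
PDF_SOURCE = {
    "ma_number": "MA (EU) number",
    "strength": "Strength",
    "pharmaceutical_form": "Pharmaceutical Form",
    "route_of_administration": "Route of Administration",
    "packaging": "Immediate Packaging",
    "content": "Content (concentration)",
    "pack_size": "Pack size",
    "product_name": "(Invented) name",
}


def ema_product_number(folder_name: str) -> str:
    """Derive the identifier from the folder name (assumed to be used as-is)."""
    return f"EMEA/H/C/{folder_name}"


def build_row(pdf_fields: dict[str, str], folder_name: str) -> list[str]:
    """Order one presentation's PDF fields into a TSV row."""
    row = []
    for column in COLUMNS:
        if column == "ema_product_number":
            row.append(ema_product_number(folder_name))
        else:
            row.append(pdf_fields[PDF_SOURCE[column]].strip())
    return row
```

A missing PDF label raises `KeyError` here, which surfaces extraction gaps early instead of writing an incomplete row.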

Fetching New Data

When asked to fetch, download, or update EMA data, run a single command:

bash
python3 scripts/fetch_ema_updates.py

This automatically:

  1. Detects the most recent PDF date from existing files
  2. Downloads the EMA medicines report and new/updated Authorised Presentations PDFs
  3. Generates ema-info.txt metadata for any new product folders
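
For orientation, step 1 could work roughly as follows. This is a sketch under assumptions, not the actual logic of `fetch_ema_updates.py`: `latest_pdf_date` is a hypothetical helper that reads the date suffix from each PDF filename under the products directory and returns the newest one.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

# Assumed filename ending: _<YYYY-MM-DD>.pdf
DATE_SUFFIX = re.compile(r"_(\d{4}-\d{2}-\d{2})\.pdf$")


def latest_pdf_date(products_dir: str) -> Optional[date]:
    """Return the newest date found in PDF filenames, or None if there are none."""
    dates = []
    # One level of numbered subfolders, each holding its PDFs.
    for pdf in Path(products_dir).glob("*/*.pdf"):
        m = DATE_SUFFIX.search(pdf.name)
        if m:
            dates.append(date.fromisoformat(m.group(1)))
    return max(dates, default=None)


cutoff = latest_pdf_date("data/ema/products")
print(cutoff.isoformat() if cutoff else "no PDFs found")
```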

To override the auto-detected date cutoff:

bash
python3 scripts/fetch_ema_updates.py --since YYYY-MM-DD

After fetching, proceed to parse the new PDFs using the steps below.

Parsing Steps

  1. Locate the PDF in its numbered subfolder under data/ema/products/

  2. Extract the PDF date from the filename (e.g., 2026-01-14)

  3. Check for existing TSV (parsed_data_*.tsv) in the same subfolder:

    • If TSV date >= PDF date: Skip - no processing needed
    • If no TSV exists or TSV date < PDF date: Continue
  4. Read the PDF and extract the table data

  5. Create/Update the TSV:

    • Filename: parsed_data_<today>.tsv
    • Header row with all column names
    • One row per product presentation
    • Tab delimiters, no trailing tabs or spaces
    • Delete old TSV file if updating
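
Steps 3 and 5 can be sketched with the standard library. The helper names (`existing_tsv_date`, `needs_parsing`, `write_tsv`) are hypothetical, and the sketch assumes TSV filenames always carry a `YYYY-MM-DD` date. The new file is written before older ones are deleted, so a failed write does not lose the previous data.

```python
import csv
import re
from datetime import date
from pathlib import Path
from typing import Optional

TSV_NAME = re.compile(r"^parsed_data_(\d{4}-\d{2}-\d{2})\.tsv$")


def existing_tsv_date(folder: Path) -> Optional[date]:
    """Newest date among parsed_data_*.tsv files in the folder, if any."""
    dates = []
    for tsv in folder.glob("parsed_data_*.tsv"):
        m = TSV_NAME.match(tsv.name)
        if m:
            dates.append(date.fromisoformat(m.group(1)))
    return max(dates, default=None)


def needs_parsing(folder: Path, pdf_date: date) -> bool:
    """Step 3: parse only if no TSV exists or the TSV is older than the PDF."""
    tsv_date = existing_tsv_date(folder)
    return tsv_date is None or tsv_date < pdf_date


def write_tsv(folder: Path, header: list[str], rows: list[list[str]], today: date) -> Path:
    """Step 5: write parsed_data_<today>.tsv, then remove older TSV files."""
    target = folder / f"parsed_data_{today.isoformat()}.tsv"
    with target.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        # Strip each cell so no row ends with stray spaces.
        writer.writerows([cell.strip() for cell in row] for row in rows)
    for old in folder.glob("parsed_data_*.tsv"):
        if old != target:
            old.unlink()
    return target
```

Typical use: call `needs_parsing(folder, pdf_date)` first, and call `write_tsv(folder, header, rows, date.today())` only when it returns `True`.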

Example

Input PDF content:

code
MA (EU) number: EU/1/23/1727/001
(Invented) name: BEKEMV
Strength: 300 mg
Pharmaceutical Form: Concentrate for solution for infusion
Route of Administration: Intravenous use
Immediate Packaging: vial (glass)
Content (concentration): 30 ml (10 mg/ml)
Pack size: 1 vial

Output TSV:

code
ma_number	ema_product_number	strength	pharmaceutical_form	route_of_administration	packaging	content	pack_size	product_name
EU/1/23/1727/001	EMEA/H/C/005652	300 mg	Concentrate for solution for infusion	Intravenous use	vial (glass)	30 ml (10 mg/ml)	1 vial	BEKEMV
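
A quick format check can catch trailing tabs or a wrong field count before a TSV is saved. This is an illustrative sketch: `check_tsv` is a hypothetical helper that validates only the layout rules stated in this document (header, nine tab-separated fields, no trailing whitespace), not the values themselves.

```python
EXPECTED_HEADER = [
    "ma_number", "ema_product_number", "strength", "pharmaceutical_form",
    "route_of_administration", "packaging", "content", "pack_size", "product_name",
]


def check_tsv(text: str) -> list[str]:
    """Return a list of layout problems; an empty list means the text passes."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the final newline
    if not lines:
        return ["file is empty"]
    problems = []
    if lines[0].split("\t") != EXPECTED_HEADER:
        problems.append("line 1: unexpected header")
    for n, line in enumerate(lines, start=1):
        if line != line.rstrip(" \t\r"):
            problems.append(f"line {n}: trailing whitespace")
        count = len(line.split("\t"))
        if count != len(EXPECTED_HEADER):
            problems.append(f"line {n}: expected {len(EXPECTED_HEADER)} fields, found {count}")
    return problems


row = "\t".join([
    "EU/1/23/1727/001", "EMEA/H/C/005652", "300 mg",
    "Concentrate for solution for infusion", "Intravenous use",
    "vial (glass)", "30 ml (10 mg/ml)", "1 vial", "BEKEMV",
])
print(check_tsv("\t".join(EXPECTED_HEADER) + "\n" + row + "\n"))
```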

Scripts

All scripts are located in the scripts/ subdirectory of this skill and can be invoked from any directory in the project.

  • scripts/fetch_ema_updates.py - One-command wrapper: detects latest date, downloads updates, generates metadata
  • scripts/download_ema_presentation_files.py - Downloads the EMA medicines report and Authorised Presentations PDFs
  • scripts/generate_ema_info.py - Generates ema-info.txt metadata files from medicines_report.tsv
  • scripts/combine_tsv_files.py - Combines all per-product data and generates ema-to-rxnorm.tsv
  • scripts/find_missing_files.py - Audits product folders for missing files
  • scripts/list_pdfs_by_date.py - Lists PDFs sorted by date