Process Data: EMA Product PDFs

Parse EMA Authorised Presentations PDF files into structured TSV format.

When to Use

Use this skill when asked to parse an EMA product PDF, extract packaging data from a product folder, update parsed data, or fetch new or updated PDFs from EMA.

Input

•PDF files in numbered subfolders: data/ema/products/<ema_number>/<ema_number>_<product_name>_<date>.pdf
•Example: data/ema/products/006156/006156_stoboclo_2026-01-14.pdf
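
The naming convention above can be split into its parts with a small helper. This is only a sketch: the function name `parse_pdf_name`, the `PdfInfo` tuple, and the regular expression are illustrative assumptions, not part of the skill's scripts.

```python
import re
from datetime import date
from pathlib import Path
from typing import NamedTuple

# Assumed pattern: <ema_number>_<product_name>_<YYYY-MM-DD>.pdf
PDF_NAME_RE = re.compile(
    r"^(?P<ema_number>\d+)_(?P<product_name>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.pdf$"
)


class PdfInfo(NamedTuple):
    ema_number: str
    product_name: str
    pdf_date: date


def parse_pdf_name(pdf_path):
    """Split a product PDF filename into EMA number, product name, and date."""
    name = Path(pdf_path).name
    match = PDF_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected PDF filename: {name}")
    return PdfInfo(
        ema_number=match["ema_number"],
        product_name=match["product_name"],
        pdf_date=date.fromisoformat(match["date"]),
    )


info = parse_pdf_name("data/ema/products/006156/006156_stoboclo_2026-01-14.pdf")
print(info.ema_number, info.product_name, info.pdf_date)
```

Keeping the date as a `date` object rather than a string makes the later comparison against the TSV date straightforward.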

Output

•TSV file: parsed_data_<today>.tsv in the same subfolder as the PDF
•Example: data/ema/products/006156/parsed_data_2026-01-16.tsv
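
As a rough illustration, the output path can be built from the PDF path and the current date. The helper name `tsv_path_for` and its `today` parameter are hypothetical and exist only for this sketch.

```python
from datetime import date
from pathlib import Path


def tsv_path_for(pdf_path, today=None):
    """Return the parsed_data_<today>.tsv path next to the given PDF."""
    today = today or date.today()
    return Path(pdf_path).parent / f"parsed_data_{today.isoformat()}.tsv"


out = tsv_path_for(
    "data/ema/products/006156/006156_stoboclo_2026-01-14.pdf",
    today=date(2026, 1, 16),
)
print(out.as_posix())  # data/ema/products/006156/parsed_data_2026-01-16.tsv
```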

TSV Columns

Column	Description	PDF Source
ma_number	Marketing Authorization number	"MA (EU) number"
ema_product_number	EMA product identifier (format: `EMEA/H/C/XXXXXX`)	Derived from folder name
strength	Drug strength	"Strength"
pharmaceutical_form	Form of the medication	"Pharmaceutical Form"
route_of_administration	How the drug is administered	"Route of Administration"
packaging	Container type	"Immediate Packaging"
content	Volume/amount with concentration	"Content (concentration)"
pack_size	Number of units in pack	"Pack size"
product_name	Brand/invented name	"(Invented) name"
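
Only ema_product_number is not read from the PDF itself. A minimal sketch of deriving it from the folder name follows. It assumes the folder name is the numeric part of the identifier, already zero-padded, and the function name is illustrative.

```python
from pathlib import Path


def ema_product_number(pdf_path):
    """Build the EMEA/H/C/XXXXXX identifier from the PDF's parent folder name."""
    folder = Path(pdf_path).parent.name
    # Assumption: the folder name is the numeric part, used verbatim.
    if not folder.isdigit():
        raise ValueError(f"Folder name is not an EMA number: {folder!r}")
    return f"EMEA/H/C/{folder}"


print(ema_product_number("data/ema/products/006156/006156_stoboclo_2026-01-14.pdf"))
# EMEA/H/C/006156
```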

Fetching New Data

When asked to fetch, download, or update EMA data, run a single command:

bash

python3 scripts/fetch_ema_updates.py

This automatically:

•Detects the most recent PDF date from existing files
•Downloads the EMA medicines report and new/updated Authorised Presentations PDFs
•Generates ema-info.txt metadata for any new product folders
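
For orientation, the date-detection step could look roughly like the sketch below. It is not the code in scripts/fetch_ema_updates.py. The function name `latest_pdf_date` and the assumption that dates are read from the PDF filenames are illustrative only.

```python
import re
from datetime import date
from pathlib import Path

# Assumed: every product PDF ends in _<YYYY-MM-DD>.pdf
DATE_RE = re.compile(r"_(\d{4}-\d{2}-\d{2})\.pdf$")


def latest_pdf_date(products_root):
    """Return the newest date found in PDF filenames, or None if there are none."""
    latest = None
    for pdf in Path(products_root).glob("*/*.pdf"):
        match = DATE_RE.search(pdf.name)
        if match is None:
            continue  # not a dated product PDF
        try:
            found = date.fromisoformat(match.group(1))
        except ValueError:
            continue  # digits present but not a real calendar date
        if latest is None or found > latest:
            latest = found
    return latest
```

Calling `latest_pdf_date("data/ema/products")` would then give a value that could serve as the cutoff date.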

To override the auto-detected date cutoff:

bash

python3 scripts/fetch_ema_updates.py --since YYYY-MM-DD

After fetching, proceed to parse the new PDFs using the steps below.

Parsing Steps

•Locate the PDF in its numbered subfolder under data/ema/products/
•Extract the PDF date from the filename (e.g., 2026-01-14)
•Check for an existing TSV (parsed_data_*.tsv) in the same subfolder:
  - If TSV date >= PDF date: skip, no processing needed
  - If no TSV exists or TSV date < PDF date: continue
•Read the PDF and extract the table data
•Create or update the TSV:
  - Filename: parsed_data_<today>.tsv
  - Header row with all column names
  - One row per product presentation
  - Tab delimiters, no trailing tabs or spaces
  - Delete the old TSV file if updating
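
The skip check and the create/update step above can be sketched as follows. The function names (`existing_tsvs`, `needs_parsing`, `write_tsv`) are hypothetical, and rows are assumed to be dictionaries keyed by column name. Extracting the rows from the PDF is outside this sketch.

```python
import csv
import re
from datetime import date
from pathlib import Path

TSV_RE = re.compile(r"^parsed_data_(\d{4}-\d{2}-\d{2})\.tsv$")


def existing_tsvs(folder):
    """Return (date, path) pairs for parsed_data_*.tsv files, oldest first."""
    found = []
    for path in Path(folder).glob("parsed_data_*.tsv"):
        match = TSV_RE.match(path.name)
        if match:
            found.append((date.fromisoformat(match.group(1)), path))
    return sorted(found)


def needs_parsing(folder, pdf_date):
    """True if no TSV exists or the newest TSV is older than the PDF."""
    tsvs = existing_tsvs(folder)
    if not tsvs:
        return True
    newest_date, _ = tsvs[-1]
    return newest_date < pdf_date


def write_tsv(folder, columns, rows, today=None):
    """Write parsed_data_<today>.tsv and delete any older TSV in the folder."""
    today = today or date.today()
    target = Path(folder) / f"parsed_data_{today.isoformat()}.tsv"
    old_paths = [path for _, path in existing_tsvs(folder)]
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)  # header row
        for row in rows:
            # Strip each value so no field ends in stray whitespace.
            writer.writerow([str(row[column]).strip() for column in columns])
    for path in old_paths:
        if path != target:
            path.unlink()  # remove the superseded TSV
    return target
```

Writing the new file before deleting the old one means a failed write leaves the previous TSV in place.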

Example

Input PDF content:

MA (EU) number: EU/1/23/1727/001
(Invented) name: BEKEMV
Strength: 300 mg
Pharmaceutical Form: Concentrate for solution for infusion
Route of Administration: Intravenous use
Immediate Packaging: vial (glass)
Content (concentration): 30 ml (10 mg/ml)
Pack size: 1 vial

Output TSV:

ma_number	ema_product_number	strength	pharmaceutical_form	route_of_administration	packaging	content	pack_size	product_name
EU/1/23/1727/001	EMEA/H/C/005652	300 mg	Concentrate for solution for infusion	Intravenous use	vial (glass)	30 ml (10 mg/ml)	1 vial	BEKEMV
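
The mapping from the example input to the example output row can be reproduced with a short sketch. It assumes the extracted text arrives as "Label: value" lines exactly as shown above, whereas real PDF extraction may yield a table instead. The names `LABELS`, `COLUMNS`, and `record_from_text` are illustrative.

```python
EXAMPLE_TEXT = """\
MA (EU) number: EU/1/23/1727/001
(Invented) name: BEKEMV
Strength: 300 mg
Pharmaceutical Form: Concentrate for solution for infusion
Route of Administration: Intravenous use
Immediate Packaging: vial (glass)
Content (concentration): 30 ml (10 mg/ml)
Pack size: 1 vial
"""

# PDF label -> TSV column, taken from the TSV Columns table.
LABELS = {
    "MA (EU) number": "ma_number",
    "Strength": "strength",
    "Pharmaceutical Form": "pharmaceutical_form",
    "Route of Administration": "route_of_administration",
    "Immediate Packaging": "packaging",
    "Content (concentration)": "content",
    "Pack size": "pack_size",
    "(Invented) name": "product_name",
}

# Column order of the output TSV.
COLUMNS = [
    "ma_number", "ema_product_number", "strength", "pharmaceutical_form",
    "route_of_administration", "packaging", "content", "pack_size",
    "product_name",
]


def record_from_text(text, ema_product_number):
    """Turn 'Label: value' lines into a dict keyed by TSV column name."""
    record = {"ema_product_number": ema_product_number}
    for line in text.splitlines():
        # Split on the first ": " so values may contain colons or parentheses.
        label, separator, value = line.partition(": ")
        if separator and label.strip() in LABELS:
            record[LABELS[label.strip()]] = value.strip()
    return record


record = record_from_text(EXAMPLE_TEXT, "EMEA/H/C/005652")
tsv_line = "\t".join(record[column] for column in COLUMNS)
print(tsv_line)
```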

Scripts

All scripts are located in the scripts/ subdirectory of this skill and can be invoked from any directory in the project.

•scripts/fetch_ema_updates.py - One-command wrapper: detects latest date, downloads updates, generates metadata
•scripts/download_ema_presentation_files.py - Downloads the EMA medicines report and Authorised Presentations PDFs
•scripts/generate_ema_info.py - Generates ema-info.txt metadata files from medicines_report.tsv
•scripts/combine_tsv_files.py - Combines all per-product data and generates ema-to-rxnorm.tsv
•scripts/find_missing_files.py - Audits product folders for missing files
•scripts/list_pdfs_by_date.py - Lists PDFs sorted by date