Process Data: EMA Product PDFs
Parse EMA Authorised Presentations PDF files into structured TSV format.
When to Use
When asked to parse an EMA product PDF, extract packaging data from a product folder, update parsed data, or fetch new/updated PDFs from EMA.
Input
- PDF files in numbered subfolders: data/ema/products/<ema_number>/<ema_number>_<product_name>_<date>.pdf
- Example: data/ema/products/006156/006156_stoboclo_2026-01-14.pdf
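The naming convention above can be unpacked mechanically. The following is a minimal sketch, not one of the skill's scripts: `parse_pdf_path`, `PdfInfo`, and the regular expression are hypothetical names, and the pattern assumes the EMA number is all digits and the date is in ISO `YYYY-MM-DD` form, as in the example path.

```python
import re
from datetime import date
from pathlib import Path
from typing import NamedTuple

# Hypothetical pattern for <ema_number>_<product_name>_<date>.pdf.
# Assumes a numeric EMA number and an ISO date, as in the example above.
PDF_NAME = re.compile(
    r"^(?P<ema_number>\d+)_(?P<product_name>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.pdf$"
)


class PdfInfo(NamedTuple):
    ema_number: str
    product_name: str
    pdf_date: date


def parse_pdf_path(path: str) -> PdfInfo:
    """Split a product PDF path into EMA number, product name, and date."""
    p = Path(path)
    m = PDF_NAME.match(p.name)
    if m is None:
        raise ValueError(f"unexpected PDF filename: {p.name}")
    # The subfolder is expected to carry the same number as the filename prefix.
    if p.parent.name != m["ema_number"]:
        raise ValueError(f"folder {p.parent.name} does not match {m['ema_number']}")
    return PdfInfo(m["ema_number"], m["product_name"], date.fromisoformat(m["date"]))
```

Calling `parse_pdf_path("data/ema/products/006156/006156_stoboclo_2026-01-14.pdf")` yields the EMA number `006156`, the product name `stoboclo`, and the date 2026-01-14.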
Output
- TSV file: parsed_data_<today>.tsv in the same subfolder as the PDF
- Example: data/ema/products/006156/parsed_data_2026-01-16.tsv
TSV Columns
| Column | Description | PDF Source |
|---|---|---|
| ma_number | Marketing Authorization number | "MA (EU) number" |
| ema_product_number | EMA product identifier (format: EMEA/H/C/XXXXXX) | Derived from folder name |
| strength | Drug strength | "Strength" |
| pharmaceutical_form | Form of the medication | "Pharmaceutical Form" |
| route_of_administration | How the drug is administered | "Route of Administration" |
| packaging | Container type | "Immediate Packaging" |
| content | Volume/amount with concentration | "Content (concentration)" |
| pack_size | Number of units in pack | "Pack size" |
| product_name | Brand/invented name | "(Invented) name" |
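The table above translates directly into a label-to-column mapping. The sketch below is illustrative only: `PDF_LABEL_TO_COLUMN`, `ema_product_number`, and `build_row` are hypothetical names, and it assumes the folder name is the six-digit suffix of the EMA product identifier, used verbatim.

```python
# Mapping taken from the "PDF Source" column of the table.
PDF_LABEL_TO_COLUMN = {
    "MA (EU) number": "ma_number",
    "(Invented) name": "product_name",
    "Strength": "strength",
    "Pharmaceutical Form": "pharmaceutical_form",
    "Route of Administration": "route_of_administration",
    "Immediate Packaging": "packaging",
    "Content (concentration)": "content",
    "Pack size": "pack_size",
}


def ema_product_number(folder_name: str) -> str:
    """Derive EMEA/H/C/XXXXXX from the folder name.

    Assumption: the folder name is exactly the six-digit suffix.
    """
    if not (folder_name.isdigit() and len(folder_name) == 6):
        raise ValueError(f"expected a six-digit folder name, got {folder_name!r}")
    return f"EMEA/H/C/{folder_name}"


def build_row(pdf_fields: dict[str, str], folder_name: str) -> dict[str, str]:
    """Turn one presentation's labelled PDF fields into a TSV row dict."""
    row = {col: pdf_fields[label].strip() for label, col in PDF_LABEL_TO_COLUMN.items()}
    # This column does not come from the PDF; it is derived from the folder.
    row["ema_product_number"] = ema_product_number(folder_name)
    return row
```

Keeping the derived column separate from the mapping makes it explicit that eight columns are read from the PDF and one is computed.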
Fetching New Data
When asked to fetch, download, or update EMA data, run a single command:
python3 scripts/fetch_ema_updates.py
This automatically:
- Detects the most recent PDF date from existing files
- Downloads the EMA medicines report and new/updated Authorised Presentations PDFs
- Generates ema-info.txt metadata for any new product folders
To override the auto-detected date cutoff:
python3 scripts/fetch_ema_updates.py --since YYYY-MM-DD
After fetching, proceed to parse the new PDFs using the steps below.
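The internals of fetch_ema_updates.py are not shown here, so the following is only a plausible sketch of the "most recent PDF date" detection, useful for checking by hand which cutoff the script is likely to pick. `latest_pdf_date` is a hypothetical helper, and it assumes every PDF name ends in `_<YYYY-MM-DD>.pdf`.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

# Assumption: the date is the last underscore-separated part of the filename.
DATE_SUFFIX = re.compile(r"_(\d{4}-\d{2}-\d{2})\.pdf$")


def latest_pdf_date(products_dir: str) -> Optional[date]:
    """Return the newest date found in <products_dir>/<ema_number>/*.pdf names."""
    dates = []
    for pdf in Path(products_dir).glob("*/*.pdf"):
        m = DATE_SUFFIX.search(pdf.name)
        if m:
            dates.append(date.fromisoformat(m.group(1)))
    # None signals that no dated PDF exists yet.
    return max(dates, default=None)
```

If the result looks wrong for your checkout, passing the intended date through `--since` avoids relying on auto-detection.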
Parsing Steps
- Locate the PDF in its numbered subfolder under data/ema/products/
- Extract the PDF date from the filename (e.g., 2026-01-14)
- Check for an existing TSV (parsed_data_*.tsv) in the same subfolder:
  - If TSV date >= PDF date: skip, no processing needed
  - If no TSV exists or TSV date < PDF date: continue
- Read the PDF and extract the table data
- Create/update the TSV:
  - Filename: parsed_data_<today>.tsv
  - Header row with all column names
  - One row per product presentation
  - Tab delimiters, no trailing tabs or spaces
  - Delete the old TSV file if updating
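The skip check in the steps above comes down to a date comparison. A minimal sketch follows, assuming TSV names carry an ISO date as in parsed_data_2026-01-16.tsv; `newest_tsv_date` and `needs_parsing` are hypothetical helpers, not part of the skill's scripts.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

TSV_NAME = re.compile(r"^parsed_data_(\d{4}-\d{2}-\d{2})\.tsv$")


def newest_tsv_date(folder: Path) -> Optional[date]:
    """Return the newest parsed_data_<date>.tsv date in a folder, or None."""
    dates = []
    for tsv in folder.glob("parsed_data_*.tsv"):
        m = TSV_NAME.match(tsv.name)
        if m:
            dates.append(date.fromisoformat(m.group(1)))
    return max(dates, default=None)


def needs_parsing(pdf_date: date, tsv_date: Optional[date]) -> bool:
    """Parse when there is no TSV yet or the TSV is older than the PDF."""
    return tsv_date is None or tsv_date < pdf_date
```

A TSV dated the same day as the PDF counts as up to date, matching the ">=" rule in the steps.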
Example
Input PDF content:
MA (EU) number: EU/1/23/1727/001
(Invented) name: BEKEMV
Strength: 300 mg
Pharmaceutical Form: Concentrate for solution for infusion
Route of Administration: Intravenous use
Immediate Packaging: vial (glass)
Content (concentration): 30 ml (10 mg/ml)
Pack size: 1 vial
Output TSV:
ma_number	ema_product_number	strength	pharmaceutical_form	route_of_administration	packaging	content	pack_size	product_name
EU/1/23/1727/001	EMEA/H/C/005652	300 mg	Concentrate for solution for infusion	Intravenous use	vial (glass)	30 ml (10 mg/ml)	1 vial	BEKEMV
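One way to produce output in this shape is Python's standard csv module with a tab delimiter. This is a sketch under assumptions: `COLUMNS` and `render_tsv` are hypothetical names, the column order follows the TSV Columns table, and each row is a dict keyed by column name.

```python
import csv
import io

# Column order as listed in the TSV Columns table.
COLUMNS = [
    "ma_number", "ema_product_number", "strength", "pharmaceutical_form",
    "route_of_administration", "packaging", "content", "pack_size", "product_name",
]


def render_tsv(rows: list[dict[str, str]]) -> str:
    """Render a header row plus one tab-delimited line per presentation."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Strip each cell so no value carries stray leading/trailing spaces.
        writer.writerow({col: row[col].strip() for col in COLUMNS})
    return buf.getvalue()
```

The returned string can then be written to parsed_data_<today>.tsv in the product folder. Note that the csv module quotes any cell containing a tab, quote, or newline, and that an empty final cell would leave a trailing tab, so such rows are worth checking by hand.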
Scripts
All scripts are located in the scripts/ subdirectory of this skill and can be invoked from any directory in the project.
- scripts/fetch_ema_updates.py - One-command wrapper: detects latest date, downloads updates, generates metadata
- scripts/download_ema_presentation_files.py - Downloads the EMA medicines report and Authorised Presentations PDFs
- scripts/generate_ema_info.py - Generates ema-info.txt metadata files from medicines_report.tsv
- scripts/combine_tsv_files.py - Combines all per-product data and generates ema-to-rxnorm.tsv
- scripts/find_missing_files.py - Audits product folders for missing files
- scripts/list_pdfs_by_date.py - Lists PDFs sorted by date