Scientific Data Extraction Skill
Overview
This skill provides comprehensive guidance for extracting structured data from scientific literature across multiple input formats (PDF, HTML, images, plain text). It auto-detects the scientific domain to recommend specialized tools when appropriate (particularly for chemistry and materials science) and employs a hierarchical extraction approach with multi-method validation for high-confidence results.
When to Use This Skill
Use this skill when you need to:
- Extract numerical data from scientific papers, reports, or documents
- Digitize graphs and plots to recover underlying data points
- Parse tables from PDFs or images into structured formats (CSV, DataFrame, JSON)
- Extract chemical/materials data including properties, reactions, compounds, and structures
- Convert unstructured text to structured JSON or tabular formats
- Validate extracted data through multi-method cross-checking
- Process document batches with consistent extraction methodology
Input Format Detection
The first step is identifying the input format and routing to appropriate tools:
Plain Text (.txt, .md)
- Domain detection via keyword analysis
- NLP-based entity extraction (spaCy, Stanza)
- Regex patterns for structured data (numbers with units, chemical formulas)
- LLM-based structured extraction
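The regex step can be sketched as follows. This is a minimal illustration, not a complete grammar: the unit list and the function names `find_values` and `find_formulas` are assumptions, and the formula pattern will also match all-caps acronyms such as "PDF", so its hits need filtering downstream.

```python
import re

# Assumed unit list for illustration; extend it for the target corpus.
UNITS = r"(?:eV|meV|K|°C|GPa|MPa|S/cm|g/mol|nm|%)"

# A signed decimal number, optionally with an exponent, followed by a known unit.
VALUE_WITH_UNIT = re.compile(
    rf"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*(?P<unit>{UNITS})(?![A-Za-z])"
)

# Two or more element-like tokens, each optionally followed by a count (H2O, NaCl, TiO2).
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")


def find_values(text):
    """Return (value, unit) pairs found in the text."""
    return [(float(m.group("value")), m.group("unit")) for m in VALUE_WITH_UNIT.finditer(text)]


def find_formulas(text):
    """Return candidate chemical formulas found in the text."""
    return FORMULA.findall(text)
```

For the sentence "The bandgap of TiO2 is 3.2 eV at 300 K.", `find_values` picks up the two quantities and `find_formulas` picks up TiO2.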
HTML (.html, web pages)
- HTML parsing with BeautifulSoup + lxml
- Table detection and extraction
- Text content extraction with structure preservation
- Domain-specific processing after text extraction
PDF (.pdf)
| Priority | Tool | Speed | Use Case |
|---|---|---|---|
| Quick | PyMuPDF4LLM | ~0.12s | Initial exploration, large batches |
| Standard | GROBID | Medium | Research-grade, reference parsing |
| Standard | Docling | Medium | Layout-aware, complex documents |
| Tables | Camelot | Fast | Bordered tables |
| Tables | Tabula | Fast | General tables |
| Tables | pdfplumber | Medium | Complex table structures |
| Deep | Marker-PDF | Slower | Scanned documents with OCR |
Images (.png, .jpg, .tiff)
| Content Type | Recommended Approach |
|---|---|
| Document scan | OCR (Tesseract/Surya) then text pipeline |
| Graph/Plot | WebPlotDigitizer workflow or LLM vision |
| Table image | Table Transformer or LLM vision |
| Chemical structure | OSRA or DECIMER for SMILES conversion |
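A minimal sketch of the routing step, assuming detection by file extension with an optional check of the file's leading bytes. The function name `detect_input_format`, the extra extensions (.htm, .jpeg, .tif), and the returned labels are illustrative assumptions, not part of any tool listed above.

```python
from pathlib import Path

# Extension-to-format map based on the format headings in this section.
# .htm, .jpeg, and .tif are added here as assumed common variants.
EXTENSION_FORMATS = {
    ".txt": "text", ".md": "text",
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".tif": "image",
}

# Leading bytes that identify a format regardless of the file name.
MAGIC_BYTES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),
    (b"\xff\xd8\xff", "image"),   # JPEG
    (b"II*\x00", "image"),        # TIFF, little-endian
    (b"MM\x00*", "image"),        # TIFF, big-endian
]


def detect_input_format(name, head=b""):
    """Return 'text', 'html', 'pdf', 'image', or 'unknown'.

    `head` holds the first bytes of the file when available; a recognized
    signature takes precedence over the extension.
    """
    for magic, fmt in MAGIC_BYTES:
        if head.startswith(magic):
            return fmt
    return EXTENSION_FORMATS.get(Path(name).suffix.lower(), "unknown")
```

Checking the signature first guards against mislabeled files, such as a PDF saved with a generic extension.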
Domain Detection
The skill automatically detects the scientific domain so that it can apply specialized tools:
Chemistry/Materials Indicators
- Chemical formulas (H2O, NaCl, TiO2)
- SMILES strings, InChI identifiers
- Reaction arrows (→, ⟶, ⇌)
- Property keywords: melting point, bandgap, conductivity, yield, purity
- Material names and IUPAC nomenclature
- Spectroscopic data patterns (NMR shifts, IR peaks)
When Chemistry/Materials Detected
Apply specialized tools:
- ChemDataExtractor v2: Property extraction, entity recognition, table parsing
- OpenChemIE: Reaction extraction from text, tables, and figures
- Domain-specific NER: Chemical named entity recognition
General Scientific Domain
Use general-purpose extraction:
- Standard NLP pipelines
- LLM-based structured extraction
- Template-based parsing
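Domain detection could be approximated with a simple indicator count, as in the sketch below. The scoring scheme, the `threshold` default, and the function name `detect_domain` are assumptions for illustration. Only formula-like tokens that contain a digit are counted, which avoids acronyms but also misses formulas such as NaCl.

```python
import re

# Indicators taken from the chemistry/materials list in this section.
PROPERTY_KEYWORDS = ("melting point", "bandgap", "conductivity", "yield", "purity")
REACTION_ARROWS = ("→", "⟶", "⇌")
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")


def detect_domain(text, threshold=3):
    """Return 'chemistry' when enough indicators are present, else 'general'.

    The threshold is a hypothetical default and should be tuned on real documents.
    """
    lowered = text.lower()
    score = sum(lowered.count(keyword) for keyword in PROPERTY_KEYWORDS)
    score += sum(text.count(arrow) for arrow in REACTION_ARROWS)
    score += text.count("InChI=")
    # Count formula-like tokens only when they contain a digit (H2O, TiO2).
    score += sum(1 for token in FORMULA.findall(text) if any(c.isdigit() for c in token))
    return "chemistry" if score >= threshold else "general"
```

A keyword count like this is only a first-pass filter; borderline documents are better confirmed by a second signal before the chemistry tools are run.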
Extraction Method Hierarchy
Apply methods in order of increasing complexity based on requirements:
Level 1: Quick Extraction (Speed Priority)
When to use: Initial exploration, large document batches, simple structured data
# Quick PDF to text with PyMuPDF4LLM
import pymupdf4llm
text = pymupdf4llm.to_markdown("paper.pdf")
# Quick HTML parsing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'lxml')
tables = soup.find_all('table')
Expected confidence: Lower, suitable for screening
Level 2: Standard Extraction (Balanced)
When to use: Research-grade extraction, structure preservation needed
# GROBID for structured PDF parsing
# (the scipdf_parser package is imported as scipdf and needs a running GROBID server)
import scipdf
article = scipdf.parse_pdf_to_dict("paper.pdf")
# Docling for layout-aware extraction
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("paper.pdf")
# Camelot for bordered tables
import camelot
tables = camelot.read_pdf("paper.pdf", flavor='lattice')
df = tables[0].df
Expected confidence: Medium-high
Level 3: Deep Extraction (Accuracy Priority)
When to use: Publication-quality data, domain-specific extraction
# ChemDataExtractor for chemistry documents
from chemdataextractor import Document
doc = Document.from_file("paper.pdf")
records = doc.records
# OpenChemIE for reaction extraction
from openchemie import OpenChemIE
model = OpenChemIE()
reactions = model.extract_reactions_from_text(text)
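# Marker-PDF with OCR for scanned documents
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
# PdfConverter needs the loaded models passed in as artifact_dict
converter = PdfConverter(artifact_dict=create_model_dict())
result = converter("scanned_paper.pdf")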
Expected confidence: High
Level 4: LLM-Enhanced Extraction
When to use: Complex figures, ambiguous data, validation needed
# LLM-based structured extraction (fill the {text} placeholder with text_prompt.format(text=...))
text_prompt = """
Extract all numerical data from this text as JSON:
- Property name
- Value (number only)
- Unit
- Context (what material/compound)
Text: {text}
"""
# LLM vision for graph interpretation (send together with the image)
graph_prompt = """
Analyze this graph image and extract:
1. X-axis label and range
2. Y-axis label and range
3. All data points as (x, y) pairs
4. Any error bars or uncertainty indicators
"""
Expected confidence: Highest when combined with validation
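Because model replies are free text, the JSON they contain should be parsed and checked before it enters the dataset. The sketch below assumes the model was asked to return a JSON array whose objects use the keys `property`, `value`, `unit`, and `context`; those key names and the function `parse_llm_records` are illustrative assumptions.

```python
import json
import re

# Assumed record keys, mirroring the four fields requested in the text prompt.
REQUIRED_KEYS = {"property", "value", "unit", "context"}


def parse_llm_records(response_text):
    """Return (valid_records, problems) parsed from a model reply."""
    # Replies often wrap the JSON in extra prose; take the outermost [...] span.
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return [], ["no JSON array found"]
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]

    records, problems = [], []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"record {index}: not an object")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"record {index}: missing {sorted(missing)}")
            continue
        value = item["value"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"record {index}: value is not a number")
            continue
        records.append(item)
    return records, problems
```

Rejected records are reported rather than silently dropped, so they can be routed to manual review.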
Multi-Method Validation Pipeline
For high-confidence results, use multiple extraction methods and validate:
Step 1: Primary Extraction
Select a method based on input type and domain, then extract structured data.
Step 2: Secondary Extraction
Run an alternative method on the same source, compare the results, and flag discrepancies.
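One way to compare two extraction passes is sketched below, assuming each pass has been reduced to a mapping from `(entity, property)` to `(value, unit)`. That data shape, the relative tolerance, and the name `compare_extractions` are assumptions for illustration.

```python
import math


def compare_extractions(primary, secondary, rel_tol=0.01):
    """Compare two {(entity, property): (value, unit)} mappings.

    Returns (agreed_keys, discrepancies); each discrepancy is a tuple of
    (key, reason, primary_entry, secondary_entry).
    """
    agreed, discrepancies = [], []
    for key in sorted(set(primary) | set(secondary)):
        a, b = primary.get(key), secondary.get(key)
        if a is None or b is None:
            discrepancies.append((key, "missing in one method", a, b))
        elif a[1] != b[1]:
            discrepancies.append((key, "unit mismatch", a, b))
        elif not math.isclose(a[0], b[0], rel_tol=rel_tol):
            discrepancies.append((key, "value mismatch", a, b))
        else:
            agreed.append(key)
    return agreed, discrepancies
```

Units are compared as plain strings here, so values should be normalized to common units before the comparison.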
Step 3: LLM Verification Queries
Ask targeted questions to verify extracted data:
- "Is this value X consistent with the context Y?"
- "Does unit Z make sense for property P?"
- "Are there any missing data points in the expected range?"
Step 4: Confidence Scoring
confidence = {
    "score": 0.0,              # 0-1 scale
    "level": "HIGH|MEDIUM|LOW|REVIEW",
    "methods_agreed": [],      # List of methods that produced the same result
    "discrepancies": [],       # Any disagreements between methods
    "verification_notes": ""   # LLM verification outcome
}
# Scoring rules:
# - Single method: max 0.7
# - Two methods agree: 0.8
# - Two methods + LLM verification: 0.9
# - Multiple methods + LLM + database cross-reference: 0.95+
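The scoring rules can be turned into a small function, sketched below. The scores follow the rules listed above; the mapping from score to level and the choice to force REVIEW whenever discrepancies exist are assumptions, since the rules do not define the level cutoffs.

```python
def score_confidence(methods_agreed, discrepancies=(), llm_verified=False, database_match=False):
    """Build a confidence record from the scoring rules.

    Level cutoffs (HIGH >= 0.9, MEDIUM >= 0.8, otherwise LOW) are hypothetical.
    """
    count = len(methods_agreed)
    if count == 0:
        score = 0.0
    elif count == 1:
        score = 0.7    # single method is capped at 0.7
    elif llm_verified and database_match:
        score = 0.95   # multiple methods + LLM + database cross-reference
    elif llm_verified:
        score = 0.9    # two methods + LLM verification
    else:
        score = 0.8    # two methods agree

    if discrepancies or count == 0:
        level = "REVIEW"
    elif score >= 0.9:
        level = "HIGH"
    elif score >= 0.8:
        level = "MEDIUM"
    else:
        level = "LOW"

    return {
        "score": score,
        "level": level,
        "methods_agreed": list(methods_agreed),
        "discrepancies": list(discrepancies),
        "verification_notes": "",
    }
```

The rules do not cover two agreeing methods with a database match but no LLM verification; this sketch scores that case at 0.8.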
Step 5: Database Cross-Reference (Optional)
For chemistry/materials, compare against known databases:
- Materials Project
- AFLOW
- PubChem
- NIST databases
Flag significant deviations from expected ranges.
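A range check against reference values might look like the sketch below. In practice the reference ranges would come from one of the databases listed above; here a single hard-coded entry stands in, using the TiO2 bandgap range quoted in this document's output example. The table layout and the name `check_against_reference` are assumptions.

```python
# Stand-in for a database lookup: (entity, property, unit) -> (low, high).
# The single entry reuses the TiO2 bandgap range from this document's example.
REFERENCE_RANGES = {("TiO2", "bandgap", "eV"): (3.0, 3.4)}


def check_against_reference(record, reference=REFERENCE_RANGES):
    """Return 'in_range', 'out_of_range', or 'no_reference' for one record."""
    key = (record["entity"], record["property"], record["unit"])
    if key not in reference:
        return "no_reference"
    low, high = reference[key]
    return "in_range" if low <= record["value"] <= high else "out_of_range"
```

An out-of-range result is a prompt for review, not proof of an extraction error: doped or nanostructured samples can legitimately fall outside bulk ranges.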
Output Format
Structure extracted data consistently:
{
  "extraction_metadata": {
    "source": "path/to/document.pdf",
    "source_type": "pdf",
    "domain_detected": "chemistry",
    "methods_used": ["grobid", "chemdataextractor", "llm_verification"],
    "timestamp": "2025-01-18T..."
  },
  "extracted_data": [
    {
      "data_type": "material_property",
      "entity": "TiO2",
      "property": "bandgap",
      "value": 3.2,
      "unit": "eV",
      "source_location": {
        "page": 4,
        "section": "Results",
        "table_id": "Table 2",
        "row": 3
      },
      "confidence": {
        "score": 0.95,
        "level": "HIGH",
        "methods_agreed": ["chemdataextractor", "llm_extraction"],
        "verification_notes": "Value consistent with literature range 3.0-3.4 eV"
      }
    }
  ],
  "validation_summary": {
    "total_extracted": 47,
    "high_confidence": 38,
    "medium_confidence": 7,
    "needs_review": 2,
    "discrepancies": []
  }
}
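The `validation_summary` block can be derived from the records rather than maintained by hand. The sketch below assumes each record carries a `confidence` object as in the example, and that both LOW and REVIEW records count toward `needs_review`; that grouping and the name `summarize_validation` are assumptions.

```python
from collections import Counter


def summarize_validation(extracted_data):
    """Build a validation_summary dict from a list of extracted records."""
    levels = Counter(record["confidence"]["level"] for record in extracted_data)
    discrepancies = [
        item
        for record in extracted_data
        for item in record["confidence"].get("discrepancies", [])
    ]
    return {
        "total_extracted": len(extracted_data),
        "high_confidence": levels["HIGH"],
        "medium_confidence": levels["MEDIUM"],
        # Assumption: LOW and REVIEW are both routed to manual review.
        "needs_review": levels["LOW"] + levels["REVIEW"],
        "discrepancies": discrepancies,
    }
```

With this grouping the three counts always sum to `total_extracted`, as they do in the example.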
Step-by-Step Instructions
For PDF Data Extraction
- Identify document type: Scanned or text-based PDF
- Choose extraction level: Based on accuracy requirements
- Detect domain: Check for chemistry/materials indicators
- Extract text/structure: Use appropriate tool from hierarchy
- Extract tables separately: Use Camelot, Tabula, or pdfplumber
- Apply domain tools: If chemistry detected, use ChemDataExtractor
- Validate: Run secondary extraction or LLM verification
- Format output: Structure as JSON with confidence scores
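For the first step, a text-based PDF can usually be told apart from a scan by how much text its pages yield. The sketch below assumes the per-page text has already been extracted (for example with pdfplumber's `page.extract_text()`); the function name `looks_scanned` and both thresholds are hypothetical and should be tuned.

```python
def looks_scanned(page_texts, min_chars_per_page=50, max_sparse_fraction=0.5):
    """Guess whether a PDF is a scan from the text extracted from each page.

    A page counts as sparse when it yields fewer than `min_chars_per_page`
    characters; the document is treated as scanned when more than
    `max_sparse_fraction` of its pages are sparse.
    """
    if not page_texts:
        return True  # nothing extractable, so OCR is the only option
    sparse = sum(1 for text in page_texts if len((text or "").strip()) < min_chars_per_page)
    return sparse / len(page_texts) > max_sparse_fraction
```

Documents flagged as scanned would go to the OCR route; the rest can start at Level 1.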
For Graph/Plot Digitization
- Assess graph quality: Resolution, clarity, labeling
- Identify graph type: Line plot, scatter, bar chart, contour
- Choose method:
  - Simple, clear graphs: WebPlotDigitizer (manual calibration)
  - Complex or batch: LLM vision extraction
- Calibrate axes: Define coordinate system
- Extract data points: Manual selection or automatic detection
- Validate: Check extracted points against visual inspection
- Export: CSV or JSON format with uncertainty estimates
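Axis calibration amounts to fitting a linear map from pixel coordinates to data values using two known points per axis, with the fit done in log space for logarithmic axes. The sketch below shows that arithmetic; `make_axis_mapper` is an illustrative name and is not part of WebPlotDigitizer.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Return a function mapping a pixel coordinate to a data value on one axis.

    (pixel_a, value_a) and (pixel_b, value_b) are two calibration points,
    typically read off labeled tick marks.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration points must have different pixel positions")
    if log_scale:
        # Interpolate linearly in log10 space for logarithmic axes.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        value = value_a + slope * (pixel - pixel_a)
        return 10 ** value if log_scale else value

    return to_value


# Image y-coordinates grow downward, so the y calibration simply has a negative slope.
x_of = make_axis_mapper(100, 0.0, 500, 10.0)
y_of = make_axis_mapper(400, 0.0, 50, 7.0)
point = (x_of(300), y_of(225))
```

The same mapper gives a rough uncertainty estimate: a one-pixel error corresponds to the absolute value of the slope in data units on a linear axis.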
For Table Extraction
- Identify table type: Bordered (lattice) or borderless (stream)
- Choose tool:
  - Bordered: Camelot with flavor='lattice'
  - Borderless: Tabula or Camelot with flavor='stream'
  - Complex: pdfplumber for fine-grained control
- Extract to DataFrame: Review structure and headers
- Clean data: Fix merged cells, missing values, formatting
- Apply domain parsing: Convert units, parse chemical formulas
- Validate: Compare against source visually
- Export: CSV, JSON, or integrate into dataset
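Table cells extracted from PDFs arrive as strings, often with an uncertainty attached ("3.20 ± 0.05"), a Unicode minus sign, or a placeholder for missing data. A minimal cleaning sketch follows; the set of missing-value markers and the name `parse_cell` are assumptions.

```python
import re

# Assumed placeholders for missing values; adjust to the tables at hand.
MISSING = {"", "-", "n/a", "na", "nd"}

# Accept ASCII and Unicode minus signs, which PDFs frequently use.
NUMBER = r"[-+−]?\d+(?:\.\d+)?"
CELL = re.compile(rf"^({NUMBER})(?:\s*(?:±|\+/-)\s*({NUMBER}))?$")


def _to_float(text):
    return float(text.replace("−", "-"))


def parse_cell(cell):
    """Parse one table cell into (value, uncertainty) or None if missing.

    Raises ValueError for cells that are neither numeric nor a known placeholder.
    """
    text = cell.strip()
    if text.lower() in MISSING:
        return None
    match = CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognized cell content: {cell!r}")
    value = _to_float(match.group(1))
    uncertainty = _to_float(match.group(2)) if match.group(2) else None
    return value, uncertainty
```

Raising on unrecognized content, rather than guessing, keeps malformed cells visible for the manual validation step.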
For Chemistry/Materials Extraction
- Confirm domain: Verify chemistry/materials content
- Choose specialized tool:
  - Properties: ChemDataExtractor v2
  - Reactions: OpenChemIE
  - Structures from images: OSRA or DECIMER
- Configure extraction: Set up parsers for target properties
- Run extraction: Process document with domain tools
- Post-process: Normalize units, standardize identifiers
- Cross-reference: Compare against databases (Materials Project, PubChem)
- Validate: LLM verification of unusual values
- Export: Structured JSON with confidence scores
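The unit normalization in the post-processing step can be handled with a small conversion table, sketched below. The choice of canonical units (eV, K, GPa), the table contents, and the name `normalize_unit` are assumptions; a units library would be a more complete option.

```python
# unit -> (canonical unit, multiplicative factor, additive offset)
# Canonical units are an illustrative choice, not a requirement of any tool above.
CANONICAL = {
    "eV": ("eV", 1.0, 0.0),
    "meV": ("eV", 1e-3, 0.0),
    "K": ("K", 1.0, 0.0),
    "°C": ("K", 1.0, 273.15),
    "GPa": ("GPa", 1.0, 0.0),
    "MPa": ("GPa", 1e-3, 0.0),
}


def normalize_unit(value, unit):
    """Return (value, unit, converted) with the value expressed in its canonical unit.

    Unknown units are passed through unchanged with converted=False so they
    can be flagged instead of silently misinterpreted.
    """
    if unit not in CANONICAL:
        return value, unit, False
    target, factor, offset = CANONICAL[unit]
    return value * factor + offset, target, True
```

Normalizing before cross-method comparison avoids false discrepancies such as 3200 meV versus 3.2 eV.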
Best Practices
- Always start with format detection: Correct tool selection depends on accurate format identification
- Use the simplest method that works: Start at Level 1 and escalate only if needed
- Preserve source location: Track page numbers, sections, table IDs for traceability
- Validate unusual values: Any value outside expected ranges should be flagged and verified
- Document extraction methodology: Record which tools and settings produced each data point
- Handle uncertainty explicitly: Include error bounds when available, note when values are approximate
- Cross-reference chemistry data: Always compare against known databases for sanity checking
- Use LLM verification judiciously: Most valuable for complex figures and ambiguous cases
Requirements
Core Python Packages
- pymupdf4llm: Quick PDF extraction
- pdfplumber: Detailed PDF analysis
- camelot-py: Table extraction (requires Ghostscript)
- beautifulsoup4, lxml: HTML parsing
- spacy: NLP processing
- pandas: Data manipulation
Domain-Specific (Chemistry)
- chemdataextractor: Chemistry NLP (v2 recommended)
- openchemie: Reaction extraction
Optional
- tabula-py: Table extraction (requires Java)
- grobid (server): Academic PDF parsing
- docling: IBM document converter
- marker-pdf: OCR-capable PDF conversion
- tesseract or surya: OCR engines
Limitations
- Scanned documents require OCR: Quality depends on scan resolution and OCR accuracy
- Complex table structures: Merged cells and nested headers may require manual correction
- Graph digitization is approximate: Precision is limited by image resolution and calibration
- Domain tools are specialized: Chemistry tools won't work well on biology or physics texts
- LLM extraction can hallucinate: Always validate against the source or an alternative method
- Some PDFs are protected: They may not be extractable due to DRM or image-only content
Related Skills
- literature-review: For systematic literature searching and synthesis
- scientific-reviewer: For evaluating extracted data quality
- materials-databases: For cross-referencing extracted chemistry/materials data
- python-plotting: For visualizing extracted data
References
See the references/ directory for detailed documentation on:
- pdf-tools.md: Comprehensive PDF extraction tool comparison
- table-extraction.md: Table extraction methods and code examples
- graph-digitization.md: Graph data extraction techniques
- chemistry-tools.md: ChemDataExtractor and OpenChemIE usage
- llm-extraction.md: LLM-based extraction patterns and validation
See the examples/ directory for complete workflows:
- extract-from-pdf.md: End-to-end PDF extraction example
- extract-table-data.md: Table extraction comparison
- digitize-graph.md: Graph digitization guide
- chemistry-extraction.md: Chemistry-specific extraction workflow