Checking ChEMBL for Structured SAR Data
Overview
ChEMBL is a manually curated database of ~99,000 medicinal chemistry papers with extracted, standardized bioactivity data. If a paper is in ChEMBL, you can access structured data without parsing PDFs.
Core principle: Check ChEMBL first for medicinal chemistry papers. Curated data is more reliable than table parsing.
When to Use
Use this skill when:
- •Paper describes medicinal chemistry / drug discovery
- •Abstract mentions compound series, SAR, or activity data
- •Paper has IC50, MIC, Ki, EC50, or other bioactivity measurements
- •Before attempting to extract data from tables/figures
- •Paper scored ≥ 7 in relevance evaluation
When NOT to use:
- •Non-medicinal chemistry papers (cell biology, genomics, etc.)
- •Papers without activity measurements
- •Reviews without primary data
- •Very recent papers (published within the last 6-12 months; likely not curated yet)
ChEMBL API Basics
Base URL: https://www.ebi.ac.uk/chembl/api/data/
No authentication required
CRITICAL: ChEMBL can ONLY be queried by DOI, NOT by PMID
- •The API returns PMID in results, but does not accept it as a query parameter
- •Always use DOI for lookups: ?doi=10.1234/example
- •PMID queries will return 0 results even if paper exists in ChEMBL
Two-step process:
- •Check if paper (by DOI) is in ChEMBL
- •If yes, retrieve bioactivity data
Step 1: Check if Paper in ChEMBL
Query by DOI (ONLY method that works):
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI"
⚠️ IMPORTANT: Must use DOI, not PMID
# ✅ CORRECT - Use DOI
doi="10.1021/jm401507s"
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi"

# ❌ WRONG - PMID won't work (will return 0 results)
pmid="24446688"
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?pubmed_id=$pmid"  # Does NOT work!
If you only have PMID: Fetch DOI from PubMed first, then query ChEMBL with the DOI.
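The PMID-to-DOI step can be sketched in Python as follows. This is a minimal sketch, assuming the NCBI E-utilities ESummary endpoint with `retmode=json` and assuming its payload lists identifiers under `result[<pmid>]["articleids"]`; the function names `doi_from_esummary` and `fetch_doi` are hypothetical, and the sample payload is hand-written for illustration rather than captured from the API.

```python
import json
import urllib.parse
import urllib.request

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def doi_from_esummary(payload, pmid):
    """Pull the DOI out of an already-decoded ESummary JSON payload."""
    record = payload.get("result", {}).get(str(pmid), {})
    for article_id in record.get("articleids", []):
        if article_id.get("idtype") == "doi":
            return article_id.get("value") or None
    return None  # PubMed has no DOI on file for this record


def fetch_doi(pmid, timeout=10):
    """Network call: ask PubMed for one PMID's summary and extract the DOI."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    with urllib.request.urlopen(f"{ESUMMARY_URL}?{query}", timeout=timeout) as resp:
        return doi_from_esummary(json.load(resp), pmid)


# Offline example: trimmed, hand-written payload in the assumed shape
sample = {"result": {"uids": ["24446688"], "24446688": {"articleids": [
    {"idtype": "pubmed", "value": "24446688"},
    {"idtype": "doi", "value": "10.1021/jm401507s"},
]}}}
print(doi_from_esummary(sample, 24446688))
```

Keeping the parsing separate from the network call means the DOI extraction can be checked offline; the returned DOI is then passed to the ChEMBL document query.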
Response structure:
{
"documents": [
{
"document_chembl_id": "CHEMBL3120156",
"doi": "10.1021/jm401507s",
"title": "Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor.",
"abstract": "Hepatitis C virus is a blood-borne infection...",
"pubmed_id": 24446688,
"journal": "J Med Chem",
"year": 2014,
"doc_type": "PUBLICATION"
}
],
"page_meta": {
"total_count": 1
}
}
Key fields:
- •document_chembl_id - Use this to retrieve activity data
- •doc_type - "PUBLICATION" (from literature) or "DATASET" (deposited)
- •pubmed_id - PMID is in the response, but cannot be used to query ChEMBL
- •If total_count = 0, paper not in ChEMBL
Parse response:
response=$(curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi")
# Default to 0 if the request failed or page_meta is missing
total=$(echo "$response" | jq -r '.page_meta.total_count // 0')
if [ "${total:-0}" -gt 0 ]; then
  chembl_id=$(echo "$response" | jq -r '.documents[0].document_chembl_id')
  echo "✓ Found in ChEMBL: $chembl_id"
else
  echo "✗ Not in ChEMBL"
fi
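The same check can be written in Python once the response has been decoded. The sketch below assumes only the fields shown in the response structure above (`page_meta.total_count`, `documents[0]`); the function name `parse_document_lookup` is hypothetical.

```python
def parse_document_lookup(payload):
    """Return key fields of the first matching document, or None if absent."""
    total = payload.get("page_meta", {}).get("total_count", 0)
    documents = payload.get("documents", [])
    if not total or not documents:
        return None  # paper not in ChEMBL
    doc = documents[0]
    return {
        "chembl_id": doc.get("document_chembl_id"),
        "doc_type": doc.get("doc_type"),
        "pubmed_id": doc.get("pubmed_id"),
    }


# Payload trimmed from the response structure shown above
found = parse_document_lookup({
    "documents": [{
        "document_chembl_id": "CHEMBL3120156",
        "doi": "10.1021/jm401507s",
        "pubmed_id": 24446688,
        "doc_type": "PUBLICATION",
    }],
    "page_meta": {"total_count": 1},
})
missing = parse_document_lookup({"documents": [], "page_meta": {"total_count": 0}})
print(found, missing)
```

Returning None for a miss keeps the "not in ChEMBL" case explicit, so callers can still report it.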
Step 2: Get Activity Data Count
Query activity endpoint:
curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&limit=1"
Extract total count:
activity_url="https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=$chembl_id&limit=1"
activity_count=$(curl -s "$activity_url" | jq -r '.page_meta.total_count')
echo "→ $activity_count bioactivity data points"
Step 3: Report to User and Update Summary
Report immediately:
📄 [15/127] Screening: "Discovery and development of simeprevir"
   Abstract score: 9
   → Fetching full text...
   ✓ ChEMBL: CHEMBL3120156 (101 activity data points)
   → IC50 data for HCV NS3 protease inhibitors available
Add to SUMMARY.md:
### [Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor](https://doi.org/10.1021/jm401507s) (Score: 9)

**DOI:** [10.1021/jm401507s](https://doi.org/10.1021/jm401507s)
**PMID:** [24446688](https://pubmed.ncbi.nlm.nih.gov/24446688/)
**ChEMBL:** [CHEMBL3120156](https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3120156/) (101 data points)

**Key Findings:**
- IC50 data for HCV NS3/4A protease inhibitors (from ChEMBL)
- Lead compound simeprevir (TMC435) approved for HCV treatment
- Structures and full activity data: [ChEMBL API](https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156)

**ChEMBL Activity Summary:**
- IC50 values for HCV NS3/4A protease
- PK parameters (AUC, Cmax, clearance)
- DMPK assays (metabolic stability, permeability)
Always include ChEMBL status:
- •If found: Add ChEMBL ID with link and data point count
- •If not found: Note "Not in ChEMBL" (still valuable information)
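One way to keep the ChEMBL status line consistent across SUMMARY.md entries is to generate it. This is a small sketch, not part of the skill itself; the function name `chembl_status_line` is hypothetical, and the link pattern is the document report card URL used elsewhere in this document.

```python
def chembl_status_line(chembl_id=None, activity_count=0):
    """Render the **ChEMBL:** line for a SUMMARY.md entry."""
    if not chembl_id:
        return "**ChEMBL:** Not in ChEMBL"
    url = f"https://www.ebi.ac.uk/chembl/document_report_card/{chembl_id}/"
    return f"**ChEMBL:** [{chembl_id}]({url}) ({activity_count} data points)"


print(chembl_status_line("CHEMBL3120156", 101))
print(chembl_status_line())
```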
Step 4: Update Tracking Files
Add to papers-reviewed.json:
{
"10.1021/jm401507s": {
"pmid": "24446688",
"status": "relevant",
"score": 9,
"chembl_id": "CHEMBL3120156",
"chembl_activities": 101,
"has_structured_data": true
}
}
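Updating the tracking file can be sketched as a read-merge-write step. The example below assumes papers-reviewed.json is a single JSON object keyed by DOI, as in the entry above; the function name `record_chembl_status` is hypothetical, and the demo writes to a temporary directory rather than a real tracking file.

```python
import json
import tempfile
from pathlib import Path


def record_chembl_status(path, doi, chembl_id=None, activity_count=0):
    """Merge ChEMBL fields into the entry for `doi`, keeping existing keys."""
    path = Path(path)
    tracking = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    entry = tracking.setdefault(doi, {})
    entry["chembl_id"] = chembl_id  # None records "checked, not in ChEMBL"
    entry["chembl_activities"] = activity_count
    entry["has_structured_data"] = bool(chembl_id) and activity_count > 0
    path.write_text(json.dumps(tracking, indent=2) + "\n", encoding="utf-8")
    return entry


# Demo against a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    tracking_file = Path(tmp) / "papers-reviewed.json"
    tracking_file.write_text(json.dumps(
        {"10.1021/jm401507s": {"pmid": "24446688", "status": "relevant", "score": 9}}
    ), encoding="utf-8")
    entry = record_chembl_status(tracking_file, "10.1021/jm401507s", "CHEMBL3120156", 101)
    saved = json.loads(tracking_file.read_text(encoding="utf-8"))
print(saved)
```

Storing a null `chembl_id` for misses preserves the fact that the paper was checked, so it is not queried again later in the session.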
Optional: Extract Structured Data
For papers with rich ChEMBL data (>20 activities), consider extracting:
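# Get IC50 data (raise the limit or paginate if the paper has more than 100 IC50 records)
curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&standard_type=IC50&limit=100" > chembl_CHEMBL3120156_ic50.json

# Summary statistics (records with a null standard_value are skipped, since tonumber fails on null)
jq -r '[.activities[] | .standard_value | select(. != null) | tonumber] | "Min: \(min), Max: \(max), Count: \(length)"' chembl_CHEMBL3120156_ic50.json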
Report to user:
📊 ChEMBL data extracted:
   - IC50 values for HCV NS3/4A protease
   - All structures downloaded
   - Data saved to: chembl_CHEMBL3120156_ic50.json
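For summary statistics in Python, a sketch along these lines groups values by unit before taking min and max, so records in different units are never compared directly. It assumes each activity record carries `standard_value` (a numeric string or null) and `standard_units`; the function name `summarize_by_units` is hypothetical, and the sample records are made up for illustration.

```python
from collections import defaultdict


def summarize_by_units(activities):
    """Return {unit: {count, min, max}} over records with a numeric value."""
    grouped = defaultdict(list)
    for activity in activities:
        try:
            value = float(activity.get("standard_value"))
        except (TypeError, ValueError):
            continue  # null or non-numeric standard_value
        grouped[activity.get("standard_units")].append(value)
    return {
        unit: {"count": len(vals), "min": min(vals), "max": max(vals)}
        for unit, vals in grouped.items()
    }


# Made-up records in the assumed shape
sample_activities = [
    {"standard_type": "IC50", "standard_value": "13.0", "standard_units": "nM"},
    {"standard_type": "IC50", "standard_value": None, "standard_units": None},
    {"standard_type": "IC50", "standard_value": "0.5", "standard_units": "nM"},
]
summary = summarize_by_units(sample_activities)
print(summary)
```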
Integration with Other Skills
During evaluating-paper-relevance workflow:
- •After abstract screening (score ≥7)
- •Before deep dive into full text
- •Check ChEMBL using this skill
- •If found:
  - •Note ChEMBL ID in SUMMARY.md
  - •Extract activity data (faster than PDF parsing)
  - •Still fetch full text for methods, discussion, context
- •If not found:
  - •Proceed with normal PDF evaluation
  - •Parse tables manually if needed
Workflow integration point:
Stage 2: Deep Dive
├─ 1. Fetch Full Text (PMC → DOI → Unpaywall)
├─ 1.5. Check ChEMBL ← ADD THIS STEP
│  ├─ Query by DOI
│  ├─ If found: note ChEMBL ID + activity count
│  └─ Report to user
├─ 2. Scan for Relevant Content
└─ 3. Extract Findings
Common Activity Types in ChEMBL
| Type | Description | Units |
|---|---|---|
| IC50 | Half-maximal inhibitory concentration | nM, µM |
| MIC | Minimum inhibitory concentration | µg/mL, nM |
| Ki | Inhibition constant | nM, µM |
| EC50 | Half-maximal effective concentration | nM, µM |
| Kd | Dissociation constant | nM, µM |
| Potency | General potency measurement | Various |
Filter by activity type:
curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&standard_type=MIC"
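When building these URLs programmatically, letting the standard library encode the query string avoids quoting mistakes. This is a minimal sketch using only the parameters that appear in this document (`document_chembl_id`, `standard_type`, `limit`); the function name `activity_url` is hypothetical.

```python
from urllib.parse import urlencode

ACTIVITY_ENDPOINT = "https://www.ebi.ac.uk/chembl/api/data/activity.json"


def activity_url(document_chembl_id, standard_type=None, limit=100):
    """Build an activity query URL, optionally filtered by activity type."""
    params = {"document_chembl_id": document_chembl_id}
    if standard_type:
        params["standard_type"] = standard_type
    params["limit"] = limit
    return f"{ACTIVITY_ENDPOINT}?{urlencode(params)}"


print(activity_url("CHEMBL3120156", standard_type="MIC"))
print(activity_url("CHEMBL3120156", limit=1))
```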
ChEMBL Coverage
~99,000 documents (as of 2025)
Well represented:
- •Medicinal chemistry papers
- •SAR studies with compound series
- •Lead optimization campaigns
- •Papers in major journals (J Med Chem, Bioorg Med Chem, Eur J Med Chem, etc.)
Poorly represented:
- •Very recent papers (6-12 month curation lag)
- •Papers without extractable structures/activities
- •Non-drug-discovery research
- •Purely mechanistic studies
Typical hit rate:
- •~30-40% of medicinal chemistry papers
- •Higher for SAR-focused journals
Advantages of ChEMBL Data
vs. PDF table parsing:
- •✓ Structures already extracted (SMILES format)
- •✓ Units standardized (standard_value / standard_units; IC50s are typically reported in nM)
- •✓ Values validated and curated
- •✓ Machine-readable JSON
- •✓ No OCR errors
- •✓ Linked to assay protocols
- •✓ Queryable (filter by activity range, target, etc.)
When to still use PDF:
- •Full experimental procedures
- •Synthesis routes
- •Papers not in ChEMBL
- •Very recent papers
- •Context and interpretation
Progress Reporting
CRITICAL: Report ChEMBL check for every relevant paper
Example workflow report:
📄 [15/50] Screening: "Novel MmpL3 inhibitors..."
   Abstract score: 8
   → Checking ChEMBL...
   ✓ ChEMBL: CHEMBL3456789 (34 data points)
   → Fetching full text...
   → Added to SUMMARY.md with ChEMBL link
For papers not in ChEMBL:
📄 [16/50] Screening: "Another paper..."
   Abstract score: 9
   → Checking ChEMBL...
   ✗ Not in ChEMBL (likely too recent or review paper)
   → Fetching full text via Unpaywall...
Helper Script Pattern
For research sessions with many medicinal chemistry papers:
Create check_chembl.py:
#!/usr/bin/env python3
"""Check whether a DOI is in ChEMBL and report its activity count."""
import sys

import requests

API = "https://www.ebi.ac.uk/chembl/api/data"


def check_chembl(doi):
    """Check if DOI is in ChEMBL and return a summary dict (None if the request fails).

    IMPORTANT: Must use DOI, not PMID. ChEMBL API does not accept PMID queries.
    """
    # Query document (ONLY works with DOI); params= URL-encodes the DOI
    try:
        doc_response = requests.get(
            f"{API}/document.json", params={"doi": doi}, timeout=10
        ).json()
    except (requests.RequestException, ValueError):
        return None

    # Check if found
    if doc_response.get('page_meta', {}).get('total_count', 0) == 0:
        return {'in_chembl': False}

    doc = doc_response['documents'][0]
    chembl_id = doc['document_chembl_id']

    # Get activity count (limit=1: only page_meta.total_count is needed)
    try:
        act_response = requests.get(
            f"{API}/activity.json",
            params={"document_chembl_id": chembl_id, "limit": 1},
            timeout=10,
        ).json()
        activity_count = act_response.get('page_meta', {}).get('total_count', 0)
    except (requests.RequestException, ValueError):
        activity_count = 0

    return {
        'in_chembl': True,
        'chembl_id': chembl_id,
        'activity_count': activity_count,
        'doc_type': doc.get('doc_type'),
        'title': doc.get('title')
    }


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: check_chembl.py DOI")
    result = check_chembl(sys.argv[1])
    if result is None:
        # Distinguish a failed lookup from a genuine miss
        print("⚠ ChEMBL lookup failed (network or parse error)")
    elif result['in_chembl']:
        print(f"✓ {result['chembl_id']} ({result['activity_count']} activities)")
    else:
        print("✗ Not in ChEMBL")
Usage:
python3 check_chembl.py "10.1021/jm401507s"
# Output: ✓ CHEMBL3120156 (101 activities)
Common Mistakes
- •Querying by PMID: Using PMID instead of DOI → Always returns 0 results, ChEMBL only accepts DOI queries
- •Skipping ChEMBL check: Not checking medicinal chemistry papers → Missing structured data that's already extracted
- •Checking non-medchem papers: Checking genomics/cell biology papers → Wasting time, won't be in ChEMBL
- •Not reporting status: Silent ChEMBL checks → User can't see what's happening
- •Not adding to SUMMARY.md: Forgetting to include ChEMBL ID → Harder for user to access data later
- •Only using ChEMBL: Not fetching full text when paper in ChEMBL → Missing context, methods, discussion
- •Parsing PDFs when in ChEMBL: Manually extracting tables when structured data available → Wasting time and introducing errors
Quick Reference
| Task | Command |
|---|---|
| Check if DOI in ChEMBL | curl "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI" |
| Get activity count | curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1" |
| Get all activities | curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1000" |
| Filter by activity type | curl "...activity.json?document_chembl_id=ID&standard_type=MIC" |
| ChEMBL paper page | https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL_ID/ |
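For papers whose activity count exceeds a single page, results have to be collected page by page. The sketch below assumes that `page_meta` carries a `next` field holding a relative path to the following page (null on the last page); that field is not shown in this document, so verify it against a live response before relying on it. `iter_activities` and the injected `fetch_json` callable are hypothetical names, and the demo uses canned pages instead of network calls.

```python
from urllib.parse import urljoin

API_ROOT = "https://www.ebi.ac.uk"


def iter_activities(first_url, fetch_json):
    """Yield activity records from every page, following page_meta.next."""
    url = first_url
    while url:
        page = fetch_json(url)  # fetch_json(url) must return the decoded JSON dict
        yield from page.get("activities", [])
        next_path = page.get("page_meta", {}).get("next")
        url = urljoin(API_ROOT, next_path) if next_path else None


# Canned two-page response in the assumed shape (no network involved)
first = "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=2"
pages = {
    first: {
        "activities": [{"activity_id": 1}, {"activity_id": 2}],
        "page_meta": {"next": "/chembl/api/data/activity.json?document_chembl_id=ID&limit=2&offset=2"},
    },
    "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=2&offset=2": {
        "activities": [{"activity_id": 3}],
        "page_meta": {"next": None},
    },
}
collected = list(iter_activities(first, pages.__getitem__))
print(len(collected))
```

Passing the fetch function in as an argument keeps the paging logic testable offline; in a real session it could wrap `requests.get(url, timeout=10).json()`.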
Permissions
Add to .claude/settings.local.json.template:
"Bash(curl*https://www.ebi.ac.uk/chembl/api/data/*)", "WebFetch(domain:www.ebi.ac.uk)"
Success Criteria
ChEMBL check successful when:
- •Every medicinal chemistry paper (score ≥7) checked
- •ChEMBL status reported to user immediately
- •ChEMBL ID added to SUMMARY.md (if found)
- •Activity count noted in summary
- •papers-reviewed.json updated with ChEMBL status
Next Steps
After checking ChEMBL:
- •If found: Consider extracting structured data for highly relevant papers (≥9)
- •Continue with full text evaluation for context
- •For papers not in ChEMBL: Proceed with normal PDF/table parsing
- •Update SUMMARY.md with all findings
Resources
- •Full Documentation: See docs/CHEMBL_INTEGRATION.md
- •ChEMBL API Docs: https://chembl.gitbook.io/chembl-interface-documentation/
- •ChEMBL Interface: https://www.ebi.ac.uk/chembl/