Research Topic Crawler
Collect raw resources (articles, charts, tables) for a financial research topic using the optimized Exa crawler.
Key Features:
- Full HTML content with tags preserved
- Automatic chart/image URL extraction and download
- Organized by source domain
Usage
bash
/research-topic "seasonality in financial assets"
/research-topic "FX macro correlations commodity currencies"
/research-topic "momentum factor investing"
Workflow
Phase 1: Initialize Topic Folder
bash
cd /home/adesola/EpochDev/ClaudeCodeResearch
source .venv/bin/activate
python report_notes/scripts/init_topic.py "$TOPIC"
This creates:
code
report_notes/<topic_slug>/
├── manifest.json
├── sources/
│   ├── quantpedia/
│   ├── academic/
│   ├── broker_research/
│   ├── market_data/
│   └── misc/
└── images/
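The folder name is derived from the topic string. The sketch below shows one way such a slug and folder skeleton could be built. It assumes lowercase words joined by underscores, which matches the manifest example where "FX macro correlations" becomes fx_macro_correlations. The names `slugify` and `init_topic_dir` are hypothetical and are not taken from init_topic.py.

```python
import json
import re
from pathlib import Path

# Category folders taken from the tree above.
SOURCE_FOLDERS = ["quantpedia", "academic", "broker_research", "market_data", "misc"]


def slugify(topic: str) -> str:
    """Lowercase the topic and join alphanumeric runs with underscores (assumed rule)."""
    return "_".join(re.findall(r"[a-z0-9]+", topic.lower()))


def init_topic_dir(root: Path, topic: str) -> Path:
    """Create the folder skeleton and an empty manifest; return the topic directory."""
    topic_dir = root / slugify(topic)
    for folder in SOURCE_FOLDERS:
        (topic_dir / "sources" / folder).mkdir(parents=True, exist_ok=True)
    (topic_dir / "images").mkdir(exist_ok=True)

    manifest = topic_dir / "manifest.json"
    if not manifest.exists():  # never overwrite an existing manifest
        empty = {
            "topic": topic,
            "slug": slugify(topic),
            "sources": [],
            "stats": {"total_sources": 0, "total_charts": 0},
        }
        manifest.write_text(json.dumps(empty, indent=2), encoding="utf-8")
    return topic_dir
```

Calling `init_topic_dir(Path("report_notes"), "momentum factor investing")` would create `report_notes/momentum_factor_investing/` under these assumptions.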
Phase 2: Search and Crawl Sources
Use the optimized Exa crawler script which:
- Preserves HTML tags (includeHtmlTags: true)
- Extracts image URLs (extras.imageLinks: 30)
- Downloads charts automatically
- Organizes by domain
General Search (broad coverage):
bash
python report_notes/scripts/exa_crawler.py search "$TOPIC" --num 15 -o /tmp/search_results.json
Domain-Specific Search (academic/quantitative):
bash
python report_notes/scripts/exa_crawler.py search "$TOPIC academic research" --num 10 -o /tmp/academic_results.json
USDCAD/Oil Example:
bash
python report_notes/scripts/exa_crawler.py search "USDCAD crude oil WTI correlation" --num 10 -o /tmp/usdcad_results.json
Phase 3: Save Sources to Topic
Process search results and save with chart downloads:
python
import json
import sys
sys.path.insert(0, 'report_notes/scripts')
from exa_crawler import save_source_to_topic
from pathlib import Path
from datetime import datetime, timezone
# Load search results
with open('/tmp/search_results.json') as f:
    results = json.load(f)
topic_dir = Path('report_notes/<topic_slug>')
# Save each source with chart downloads
for r in results:
    crawl_result = {
        'url': r['url'],
        'title': r['title'],
        'author': r.get('author'),
        'text': r.get('text', ''),
        'image': r.get('image'),
        'imageLinks': r.get('imageLinks', []),
        'links': r.get('links', []),
        'crawled_at': datetime.now(timezone.utc).isoformat(),
        'source': 'exa_search',
    }
    save_source_to_topic(topic_dir, crawl_result, download_images=True)
Phase 4: Crawl Specific URLs
For individual URLs not in search results:
bash
# Crawl single URL with full content
python report_notes/scripts/exa_crawler.py crawl "<url>" --json -o /tmp/crawl_result.json
# Then save to topic using Python
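The save step mirrors Phase 3. The sketch below assumes the --json output is either a single result object or a list of them, since the exact shape is not documented here. `load_crawl_results` and `save_crawl_file` are hypothetical helpers. The save function is passed in as an argument so the sketch does not depend on importing exa_crawler.

```python
import json
from pathlib import Path


def load_crawl_results(path):
    """Read crawler JSON output and always return a list of result dicts."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data if isinstance(data, list) else [data]


def save_crawl_file(path, topic_dir, save_fn):
    """Pass every crawl result in the file to save_fn; return how many were saved."""
    results = load_crawl_results(path)
    for result in results:
        save_fn(topic_dir, result, download_images=True)
    return len(results)

# Intended use, with save_source_to_topic imported as in Phase 3:
# save_crawl_file('/tmp/crawl_result.json',
#                 Path('report_notes/<topic_slug>'),
#                 save_source_to_topic)
```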
Phase 5: Show Status
bash
python -c "
import json
from pathlib import Path
m = json.load(open('report_notes/<topic_slug>/manifest.json'))
print(f'Sources: {m[\"stats\"][\"total_sources\"]}')
print(f'Charts: {m[\"stats\"][\"total_charts\"]}')
for s in m['sources']:
    print(f'  - [{len(s.get(\"charts\",[]))} charts] {s[\"title\"][:50]}...')
"
Exa Crawler Script Reference
Located at: report_notes/scripts/exa_crawler.py
Commands
| Command | Description |
|---|---|
| search "<query>" --num N | Search and crawl N results |
| crawl "<url>" | Crawl single URL with full content |
| topic <slug> <urls...> | Crawl URLs and save to topic |
API Parameters Used
The script uses the following Exa API parameters:
json
{
"text": {
"maxCharacters": 50000,
"includeHtmlTags": true
},
"extras": {
"imageLinks": 30,
"links": 20
},
"livecrawl": "preferred"
}
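When experimenting with different limits, the same options can be assembled in Python. This sketch only builds the dictionary shown above. `build_contents_options` is a hypothetical name, and how exa_crawler.py attaches these options to its request is not shown here.

```python
import json


def build_contents_options(max_characters=50000, image_links=30, links=20):
    """Return the content options from this section as a plain dict."""
    return {
        "text": {
            "maxCharacters": max_characters,  # cap on returned text length
            "includeHtmlTags": True,          # keep tables, links, formatting
        },
        "extras": {
            "imageLinks": image_links,        # max image URLs per page
            "links": links,                   # max outbound URLs per page
        },
        "livecrawl": "preferred",
    }


if __name__ == "__main__":
    print(json.dumps(build_contents_options(), indent=2))
```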
What Gets Extracted
| Field | Description |
|---|---|
| text | Full content with HTML tags (up to 50,000 characters) |
| imageLinks | Up to 30 image URLs (charts, figures) |
| links | Up to 20 outbound URLs for follow-up |
| image | Main page image |
| author | Article author |
Output Structure
code
report_notes/<topic_slug>/
├── manifest.json
├── sources/
│   ├── quantpedia/
│   │   ├── 001_strategy_name.md
│   │   └── 001_strategy_name/
│   │       └── charts/
│   │           ├── chart_01.png
│   │           └── chart_02.jpg
│   ├── academic/
│   ├── broker_research/
│   ├── market_data/
│   └── misc/
└── images/
    └── chart_index.json
Manifest Format
json
{
"topic": "FX macro correlations",
"slug": "fx_macro_correlations",
"sources": [
{
"url": "https://...",
"domain": "broker_research",
"title": "USDCAD Oil Correlation",
"file": "sources/broker_research/001_usdcad_oil.md",
"charts": ["sources/.../charts/chart_01.png"],
"tables": 0
}
],
"stats": {
"total_sources": 15,
"total_charts": 62
}
}
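The stats block can be cross-checked against the sources list. The sketch below assumes total_sources is the number of entries in sources and total_charts is the sum of their charts lists. The script itself may count differently, and `compute_stats` is a hypothetical helper.

```python
def compute_stats(manifest):
    """Recompute the stats block from the manifest's sources list."""
    sources = manifest.get("sources", [])
    return {
        "total_sources": len(sources),
        "total_charts": sum(len(s.get("charts", [])) for s in sources),
    }


def stats_match(manifest):
    """True when the stored stats agree with the recomputed ones."""
    return manifest.get("stats") == compute_stats(manifest)
```

A mismatch would suggest that a source was added or removed without updating the manifest.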
Source Markdown Format
Each source is saved with frontmatter:
yaml
---
url: https://...
title: Article Title
domain: broker_research
crawled_at: 2026-02-04T01:00:00+00:00
chart_count: 5
image_links:
  - https://example.com/chart1.png
  - https://example.com/chart2.jpg
outbound_links:
  - https://related-article.com
---
<h2>Article Content</h2>
<p>Full HTML preserved...</p>
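To read a saved source back, the frontmatter has to be separated from the HTML body. This sketch assumes the file starts with a `---` line and that the frontmatter ends at the next `---` line. `split_frontmatter` is a hypothetical helper. It returns the frontmatter as raw text, which can then be handed to a YAML parser of your choice.

```python
def split_frontmatter(text):
    """Return (frontmatter, body); frontmatter is '' when no valid block is found."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the two delimiters is frontmatter.
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # opening delimiter without a closing one
```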
Domain Categories
| Domain | Folder | Sources |
|---|---|---|
| quantpedia.com | quantpedia/ | Strategy research |
| ssrn.com, nber.org, arxiv.org | academic/ | Academic papers |
| forex.com, oanda.com | broker_research/ | Broker analysis |
| barchart.com, tradingview.com | market_data/ | Charts, data |
| investopedia.com, babypips.com | educational/ | Tutorials |
| Other | misc/ | Everything else |
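The table above amounts to a hostname lookup with a fallback. The sketch below assumes subdomains such as papers.ssrn.com should match their parent domain. `DOMAIN_FOLDERS` and `folder_for_url` are hypothetical names, and the script's own matching logic may differ.

```python
from urllib.parse import urlparse

# Mapping copied from the Domain Categories table.
DOMAIN_FOLDERS = {
    "quantpedia.com": "quantpedia",
    "ssrn.com": "academic",
    "nber.org": "academic",
    "arxiv.org": "academic",
    "forex.com": "broker_research",
    "oanda.com": "broker_research",
    "barchart.com": "market_data",
    "tradingview.com": "market_data",
    "investopedia.com": "educational",
    "babypips.com": "educational",
}


def folder_for_url(url):
    """Return the category folder for a URL, falling back to 'misc'."""
    host = (urlparse(url).hostname or "").lower()
    for domain, folder in DOMAIN_FOLDERS.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return folder
    return "misc"
```

Matching on a leading dot avoids false positives such as notforex.com being filed under broker_research.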
Tips
- Start with broad search - Then narrow with domain-specific queries
- Charts download automatically - Script filters logos/icons from real charts
- Check manifest.json - Track progress and chart counts
- HTML is preserved - Tables, links, formatting all retained
- Image URLs in frontmatter - Even if download fails, URLs are saved
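The rules the script uses to separate charts from logos and icons are not documented here. As an illustration only, a filename-based heuristic could look like the sketch below. The keyword list, the extension list, and the name `looks_like_chart` are all assumptions, not the script's actual behavior.

```python
import re
from urllib.parse import urlparse

# Assumed lists: tune them for the sites being crawled.
SKIP_TOKENS = {"logo", "icon", "favicon", "avatar", "sprite", "badge"}
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def looks_like_chart(url):
    """Heuristic: keep raster images whose path has no logo/icon-style token."""
    path = urlparse(url).path.lower()  # the query string is ignored
    if not path.endswith(IMAGE_EXTENSIONS):
        return False
    tokens = set(re.split(r"[^a-z0-9]+", path))
    return not (tokens & SKIP_TOKENS)
```

Matching whole tokens rather than substrings keeps a file such as silicon-chart.jpg from being discarded because it contains "icon".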