Web Scraping Pro
Advanced scraping that handles modern anti-bot protections.
When to Use
- •Regular
web_fetchreturns 403/Cloudflare - •Need to scrape JavaScript-rendered content
- •Want structured data extraction (not just HTML)
- •Dealing with bot protection (Cloudflare, Akamai)
Quick Start
Option 1: Jina Reader (Easiest, Free)
bash
# Just prepend r.jina.ai to any URL curl "https://r.jina.ai/https://example.com" # Or use the script node skills/web-scraping-pro/scripts/jina-fetch.js "https://example.com"
Option 2: Crawl4AI (Most Powerful, Free)
bash
# Install pip install crawl4ai playwright install # Use via script python skills/web-scraping-pro/scripts/crawl4ai-fetch.py "https://example.com"
Option 3: Firecrawl (API, has free tier)
bash
# Set API key export FIRECRAWL_API_KEY=your_key # Fetch node skills/web-scraping-pro/scripts/firecrawl-fetch.js "https://example.com"
Available Scripts
scripts/jina-fetch.js
Fetch via Jina Reader API. Free, no setup, handles most sites.
- •✅ Free unlimited
- •✅ Returns clean markdown
- •❌ Some sites block
scripts/crawl4ai-fetch.py
Full browser scraping with Crawl4AI. Most capable option.
- •✅ Free, local
- •✅ Handles JS rendering
- •✅ Built-in extraction
- •❌ Slower, needs setup
scripts/firecrawl-fetch.js
Firecrawl API for reliable scraping.
- •✅ Very reliable
- •✅ Clean output
- •❌ Paid after free tier
scripts/multi-fetch.js
Try multiple methods in fallback order.
scripts/extract-structured.py
Extract structured data (tables, lists, specific elements).
Jina Reader Features
bash
# Basic fetch (returns markdown) curl "https://r.jina.ai/https://example.com" # Search (returns search results) curl "https://s.jina.ai/your+search+query" # With options curl "https://r.jina.ai/https://example.com" \ -H "X-Return-Format: text" \ -H "X-Target-Selector: main"
Crawl4AI Features
python
from crawl4ai import WebCrawler
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
url="https://example.com",
word_count_threshold=10,
extraction_strategy="LLMExtractionStrategy", # AI extraction
chunking_strategy="RegexChunking"
)
print(result.markdown) # Clean markdown
print(result.extracted_content) # Structured data
Comparison
| Tool | Cloudflare | JS Render | Speed | Cost |
|---|---|---|---|---|
| Jina Reader | ⭕ Some | ✅ Yes | Fast | Free |
| Crawl4AI | ✅ Yes | ✅ Yes | Slow | Free |
| Firecrawl | ✅ Yes | ✅ Yes | Fast | Freemium |
| Browser skill | ✅ Yes | ✅ Yes | Slow | Free |
Tips
- •Start with Jina - It's free and works for most sites
- •Fallback to Crawl4AI - When Jina fails
- •Use selectors - Target specific elements for cleaner output
- •Rate limit - Add delays between requests
- •Cache results - Don't re-scrape unchanged content
Handling Common Issues
Cloudflare Challenge
bash
# Use Crawl4AI with stealth python skills/web-scraping-pro/scripts/crawl4ai-fetch.py "https://site.com" --stealth
JavaScript Content
bash
# Jina handles most JS # For complex SPAs, use Crawl4AI with wait python skills/web-scraping-pro/scripts/crawl4ai-fetch.py "https://spa.com" --wait 5000
Rate Limiting
bash
# Use batch script with delays node skills/web-scraping-pro/scripts/batch-scrape.js urls.txt --delay 3000