Debug Scraping Issues
Systematic diagnosis of web scraping failures in Supacrawl.
When This Skill Activates
- Scraping returns empty or incomplete content
- Timeout errors during page load
- Anti-bot detection or CAPTCHA challenges
- JavaScript content not rendering
- Unexpected HTTP errors (403, 429, 503)
- Content structure doesn't match expectations
Diagnostic Process
Step 1: Reproduce the Issue
First, reproduce with debug logging enabled:
SUPACRAWL_LOG_LEVEL=DEBUG supacrawl scrape "URL" --format markdown
Capture:
- The exact error message
- The URL being scraped
- Any correlation ID in the error
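When the same failure has to be reproduced repeatedly, the capture step can be scripted. The sketch below wraps the command from Step 1 using only the Python standard library. The correlation ID pattern is a guess, since the guide does not specify how the ID is printed, and the assumption that debug logs go to stderr should be checked against real output.

```python
import os
import re
import subprocess
from typing import Optional

# Hypothetical pattern: assumes the ID appears as "correlation_id=<token>"
# or "correlation id: <token>". Adjust once the real log format is known.
CORRELATION_RE = re.compile(r"correlation[_ ]id[=:]\s*([\w-]+)", re.IGNORECASE)


def build_debug_command(url: str):
    """Return (argv, env) for the debug reproduction described in Step 1."""
    env = dict(os.environ, SUPACRAWL_LOG_LEVEL="DEBUG")
    argv = ["supacrawl", "scrape", url, "--format", "markdown"]
    return argv, env


def find_correlation_id(log_text: str) -> Optional[str]:
    """Return the first correlation ID found in captured output, if any."""
    match = CORRELATION_RE.search(log_text)
    return match.group(1) if match else None


def reproduce(url: str) -> dict:
    """Run the scrape with debug logging and collect the items to capture."""
    argv, env = build_debug_command(url)
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return {
        "url": url,
        "exit_code": result.returncode,
        "stderr": result.stderr,  # assumed location of the error message
        "correlation_id": find_correlation_id(result.stderr),
    }
```

Calling `reproduce("https://example.com")` returns a dictionary that can be pasted into a bug report as-is.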
Step 2: Categorise the Failure
| Symptom | Likely Cause | Jump To |
|---|---|---|
| Empty markdown output | JS not rendered, content in iframe | Step 3a |
| Timeout error | Slow page, wait strategy wrong | Step 3b |
| 403/Access Denied | Anti-bot detection | Step 3c |
| 429 Too Many Requests | Rate limiting | Step 3d |
| Connection refused | Network/proxy issue | Step 3e |
| Wrong content extracted | Selector/conversion issue | Step 3f |
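The table can also be read as a small lookup. The following sketch maps an error message (and, optionally, the markdown output) to the step to jump to. The keyword lists are illustrative and are not taken from Supacrawl's actual error strings, so treat them as a starting point.

```python
from typing import Optional

# (keywords, step) pairs mirroring the table. Keywords are assumptions.
RULES = [
    (("timeout", "timed out"), "3b"),
    (("403", "access denied", "captcha"), "3c"),
    (("429", "too many requests"), "3d"),
    (("connection refused", "dns", "ssl"), "3e"),
]


def categorise(error_text: str, markdown: Optional[str] = None) -> str:
    """Return the diagnostic step ("3a".."3f") that best matches the symptom."""
    text = error_text.lower()
    for keywords, step in RULES:
        if any(keyword in text for keyword in keywords):
            return step
    # No recognisable error: empty output points at rendering (3a),
    # anything else at extraction or conversion (3f).
    if markdown is not None and not markdown.strip():
        return "3a"
    return "3f"
```

Rules are checked in order, so a message that mentions both a timeout and a 403 is routed to Step 3b first.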
Step 3a: JavaScript Rendering Issues
Symptoms: Empty content, "Loading..." text, missing dynamic elements
Diagnosis:
# Try with longer wait
supacrawl scrape "URL" --wait-for 5000
# Try networkidle wait strategy
supacrawl scrape "URL" --wait-until networkidle
Check:
- Does the site require JavaScript? Compare view source with the rendered DOM
- Is content loaded via XHR/fetch after page load?
- Is content in an iframe?
Fixes:
- Increase --wait-for time for slow JS
- Use --wait-until networkidle for XHR-heavy sites
- Check whether the content is in an iframe (Playwright won't cross iframe boundaries by default)
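The checks above can be approximated offline on a saved copy of the page's HTML (for example, the server response fetched with curl). This sketch uses the standard library's `HTMLParser` to list iframe sources and to flag pages that ship scripts but almost no visible text. The 200-character threshold is an arbitrary assumption.

```python
from html.parser import HTMLParser

MIN_TEXT_CHARS = 200  # hypothetical cut-off for "almost no visible text"


class PageProbe(HTMLParser):
    """Collect iframe sources, script count, and visible text length."""

    def __init__(self):
        super().__init__()
        self.iframes = []
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes.append(dict(attrs).get("src") or "")
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def probe(raw_html: str) -> dict:
    """Summarise signs of JavaScript rendering or iframe-hosted content."""
    parser = PageProbe()
    parser.feed(raw_html)
    parser.close()
    return {
        "iframes": parser.iframes,
        "script_count": parser.script_count,
        "text_chars": parser.text_chars,
        "likely_js_rendered": parser.script_count > 0
        and parser.text_chars < MIN_TEXT_CHARS,
    }
```

A non-empty `iframes` list suggests scraping the iframe's own URL directly, while `likely_js_rendered` suggests trying a longer wait or the networkidle strategy.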
Step 3b: Timeout Issues
Symptoms: "Timeout waiting for page", operation cancelled
Diagnosis:
# Check with extended timeout
supacrawl scrape "URL" --timeout 60000
Check:
- Is the site actually slow or unresponsive?
- Is there a redirect chain?
- Is the network stable?
Fixes:
- Increase timeout: --timeout 60000 (60 seconds)
- Use --wait-until load instead of networkidle for sites that never stop loading
- Check for infinite redirect loops
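Redirect chains and loops can be checked outside the browser. The sketch below follows `Location` headers one hop at a time with the standard library's `http.client` and reports whether the chain ends, loops, or exceeds a hop limit. The function names and the limit of 10 hops are illustrative choices, and some servers answer HEAD requests differently from GET.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def http_next_hop(url: str):
    """Return the redirect target of url, or None if it does not redirect."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in REDIRECT_CODES and location:
            return urljoin(url, location)  # resolve relative redirects
        return None
    finally:
        conn.close()


def trace_redirects(start_url: str, next_hop=http_next_hop, max_hops: int = 10):
    """Follow redirects; return (chain, verdict).

    verdict is "ok", "loop", or "too_many_hops".
    """
    chain = [start_url]
    seen = {start_url}
    while len(chain) <= max_hops:
        target = next_hop(chain[-1])
        if target is None:
            return chain, "ok"
        chain.append(target)
        if target in seen:
            return chain, "loop"
        seen.add(target)
    return chain, "too_many_hops"
```

Passing a custom `next_hop` (such as a dictionary's `get` method) makes it possible to replay a chain recorded from debug logs without touching the network.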
Step 3c: Anti-Bot Detection
Symptoms: 403 Forbidden, CAPTCHA page, "Access Denied", Cloudflare challenge
Diagnosis:
# Try with stealth mode
supacrawl scrape "URL" --stealth
# Check what the bot sees
supacrawl scrape "URL" --format rawHtml | head -100
Check:
- Does the raw HTML show a CAPTCHA or challenge page?
- Is Cloudflare/Akamai/PerimeterX protection active?
- Are browser fingerprints being detected?
Fixes:
- Enable stealth mode: --stealth (uses Patchright)
- Slow down requests if scraping multiple pages
- Some sites require human verification; these cannot be scraped automatically
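Scanning the raw HTML for challenge pages can be automated when many URLs need checking. This is a heuristic sketch: the marker strings are common examples chosen for illustration, not an exhaustive or authoritative list, and a match only means the page deserves a manual look.

```python
# Marker strings are assumptions; real challenge pages vary by vendor
# and change over time.
CHALLENGE_MARKERS = {
    "captcha": ("captcha",),
    "cloudflare": ("cf-chl", "just a moment"),
    "access_denied": ("access denied", "403 forbidden"),
}


def detect_challenge(raw_html: str, head_chars: int = 20000) -> list:
    """Return sorted labels of challenge types whose markers appear early in the page."""
    head = raw_html[:head_chars].lower()
    return sorted(
        label
        for label, markers in CHALLENGE_MARKERS.items()
        if any(marker in head for marker in markers)
    )
```

Only the first `head_chars` characters are scanned, on the assumption that a challenge page replaces the real content rather than appearing deep inside it.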
Step 3d: Rate Limiting
Symptoms: 429 errors, temporary blocks, "Too Many Requests"
Diagnosis:
- Check whether the error occurs on the first request or only after several
- Check response headers for rate limit info
Fixes:
- Add delays between requests when crawling
- Respect Retry-After headers
- Reduce concurrency
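A `Retry-After` header carries either a number of seconds or an HTTP date, so respecting it means handling both forms. The sketch below converts either form into a delay in seconds using only the standard library. The function name is illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value into seconds to wait.

    Returns None when the value cannot be parsed.
    """
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (when - now).total_seconds())
```

A crawler loop would sleep for the returned value before retrying, falling back to its own delay when the result is `None`.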
Step 3e: Network Issues
Symptoms: Connection refused, DNS resolution failed, SSL errors
Diagnosis:
# Test basic connectivity
curl -I "URL"
# Check DNS
dig domain.com
Check:
- Is the site actually accessible?
- Is there a proxy configuration issue?
- Are there SSL certificate problems?
Fixes:
- Verify the URL is correct and the site is up
- Check proxy settings if using one
- For SSL issues, check certificate validity
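The three symptom groups (connection refused, DNS failure, SSL error) correspond to distinct exception types in Python's standard library, which makes them easy to tell apart in a quick probe. The sketch below is independent of Supacrawl and does not go through any proxy Supacrawl may be configured to use. The labels it returns are made up for this example.

```python
import socket
import ssl


def classify_network_error(exc: BaseException) -> str:
    """Map a connection exception to one of the symptom groups above."""
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "ssl_certificate"
    if isinstance(exc, ssl.SSLError):
        return "ssl_other"
    if isinstance(exc, socket.gaierror):
        return "dns"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "unknown"


def check_host(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection to host and report "ok" or the failure group."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            context = ssl.create_default_context()  # verifies the certificate
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except OSError as exc:
        return classify_network_error(exc)
```

For example, `check_host("example.com")` returning `"ssl_certificate"` points at certificate validity rather than at the scraper.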
Step 3f: Content Extraction Issues
Symptoms: Content extracted but wrong/incomplete, formatting broken
Diagnosis:
# Get raw HTML to inspect
supacrawl scrape "URL" --format rawHtml > page.html
# Compare with markdown
supacrawl scrape "URL" --format markdown > page.md
Check:
- Is the content present in raw HTML?
- Is the markdown converter handling it correctly?
- Are there encoding issues?
Fixes:
- Check if only_main_content is excluding desired content
- Look for unusual HTML structures that confuse the converter
- Check for encoding issues in source
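Comparing page.html with page.md by eye gets tedious on long pages. As a rough aid, the sketch below lists text nodes that appear in the raw HTML but not in the markdown. It is a plain substring comparison, so markdown escaping or reflowed text can produce false positives. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collect whitespace-normalised text nodes outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.nodes.append(text)


def missing_from_markdown(raw_html: str, markdown: str, min_chars: int = 4) -> list:
    """Return HTML text nodes (at least min_chars long) absent from the markdown."""
    parser = TextNodes()
    parser.feed(raw_html)
    parser.close()
    flat_markdown = " ".join(markdown.split())
    return [
        node
        for node in parser.nodes
        if len(node) >= min_chars and node not in flat_markdown
    ]
```

Text that shows up in the result was present in the HTML and dropped later, which points at only_main_content or the converter. Text missing from both files points back to rendering (Step 3a).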
Code Investigation
If the issue is in Supacrawl itself, investigate:
| Component | Location | Purpose |
|---|---|---|
| Browser management | src/supacrawl/services/browser.py | Playwright lifecycle, page fetching |
| Content extraction | src/supacrawl/services/scrape.py | Main scrape logic |
| Markdown conversion | src/supacrawl/services/converter.py | HTML to Markdown |
| Stealth mode | Uses Patchright instead of Playwright | Anti-detection |
Common Patterns
Site Uses Heavy JavaScript
supacrawl scrape "URL" --wait-for 5000 --wait-until networkidle
Site Has Anti-Bot Protection
supacrawl scrape "URL" --stealth
Site Is Slow
supacrawl scrape "URL" --timeout 60000 --wait-until load
Need to Debug What Browser Sees
supacrawl scrape "URL" --format screenshot --output debug.png
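When it is unclear which pattern applies, the first three patterns can be tried in sequence after a plain scrape. The sketch below does that with `subprocess`. It assumes that exit code 0 with non-empty output means success, which this guide does not confirm, so verify that against the CLI's actual behaviour before relying on it.

```python
import subprocess

# Flag combinations copied from the patterns above, preceded by a plain run.
PATTERNS = [
    ("default", []),
    ("heavy JavaScript", ["--wait-for", "5000", "--wait-until", "networkidle"]),
    ("anti-bot", ["--stealth"]),
    ("slow site", ["--timeout", "60000", "--wait-until", "load"]),
]


def run_cli(argv):
    """Run one command and return (exit_code, stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout


def first_working_pattern(url: str, runner=run_cli):
    """Return the label of the first pattern that yields output, or None."""
    for label, flags in PATTERNS:
        argv = ["supacrawl", "scrape", url, *flags]
        exit_code, stdout = runner(argv)
        # Assumption: exit code 0 plus non-empty output counts as success.
        if exit_code == 0 and stdout.strip():
            return label
    return None
```

The `runner` parameter exists so the loop can be exercised with a stub, and so a delay can be added between attempts on rate-limited sites.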
Escalation
If none of the above resolves the issue:
- Check GitHub Issues: a similar problem may already be reported
- Capture debug output: SUPACRAWL_LOG_LEVEL=DEBUG with full output
- Test in a real browser: does the URL work in Chrome/Firefox?
- Create a minimal reproduction: a single URL that demonstrates the issue