Anti-Scraping & Web Scraping
When to use: Websites with Cloudflare protection, JavaScript rendering requirements, or anti-bot measures.
Overview
Provides battle-tested solutions for bypassing common anti-scraping measures using Playwright headless browser with stealth configurations.
Key Capabilities
- •✅ Cloudflare challenge bypass
- •✅ JavaScript rendering
- •✅ Real browser context simulation
- •✅ Stealth mode (hides automation detection)
- •✅ Screenshot capture for debugging
Quick Start
Prerequisites
bash
# Install Playwright npm install -g playwright playwright install chromium
Basic Usage Pattern
javascript
// n8n Execute Command node
const { execSync } = require('child_process');
const url = 'https://example.com';
const outputFile = '/tmp/page.html';
// Playwright command with stealth
const command = `node playwright-cloudflare.js "${url}" "${outputFile}"`;
execSync(command);
// Read result
const html = fs.readFileSync(outputFile, 'utf8');
Core Script: playwright-cloudflare.js
Location: n8n-skills/anti-scraping/playwright-cloudflare.js
Key Features:
- •Disables automation detection
- •Sets real browser headers
- •Configures viewport and user agent
- •Handles Cloudflare waiting
- •Captures screenshots on failure
Configuration:
javascript
const config = {
waitForCloudflare: true, // Wait for CF challenge
waitTime: 15000, // Max wait time (ms)
selector: '.product-list', // Element to wait for
screenshotOnError: true, // Debug screenshots
userAgent: 'Mozilla/5.0...' // Real browser UA
};
n8n Workflow Pattern
code
[Manual Trigger]
↓
[Set Parameters]
target_url: https://site.com
wait_selector: .content
↓
[Execute Command: Playwright]
Command: node
Arguments: playwright-cloudflare.js {{$json.target_url}} /tmp/output.html
↓
[Read HTML File]
File: /tmp/output.html
↓
[Parse with Cheerio]
(use html-parsing skill)
Performance
- •Speed: 15-25 seconds per page
- •Success Rate: ~95% for Cloudflare sites
- •Resource Usage: ~200-300MB RAM per browser instance
Troubleshooting
Cloudflare Still Blocking
bash
# Increase wait time --wait 30000 # Add specific selector to wait for --selector '.product-list' # Check screenshot for errors /tmp/error-screenshot.png
Timeout Errors
bash
# Increase timeout in playwright script timeout: 60000 // 60 seconds
Memory Issues
bash
# Close browser properly await browser.close(); # Limit concurrent instances # Use n8n Split Into Batches with batch size = 1
Best Practices
- •Add Delays: Wait 3-5 seconds between requests
- •Rotate User Agents: Change UA periodically
- •Use Residential Proxies: For high-volume scraping
- •Handle Errors: Implement retry logic with exponential backoff
- •Respect robots.txt: Check site policies
Common Patterns
Pattern 1: Single Page Scraping
code
Trigger → Playwright → Parse → Export
Pattern 2: Multi-Page with Pagination
code
Trigger → Generate URLs (pagination skill) → Split Into Batches → Playwright → Wait 5s → Parse → Deduplicate → Export
Pattern 3: With Error Handling
code
Playwright → [Error Trigger] → Retry Logic → Notification
Integration with Other Skills
- •pagination: Generate URLs for multi-page scraping
- •html-parsing: Extract data from rendered HTML
- •error-handling: Retry on failures
- •debugging: Validate extracted data
Full Code and Documentation
Complete implementation with examples:
/mnt/d/work/n8n_agent/n8n-skills/anti-scraping/
Files:
- •
playwright-cloudflare.js- Main scraping script - •
README.md- Detailed documentation - •
example-workflow.json- n8n workflow example - •
config.template.env- Configuration template