Web Scraping with Browser Automation
Objectives
- •Use Playwright to simulate real user behavior and bypass anti-bot detection
- •Handle dynamic content, infinite scroll, and JavaScript-heavy sites
- •Implement robust error handling, retries, and rate limiting
- •Extract structured data efficiently
Core Strategy
1. Stealth Mode (Anti-Bot)
Always use stealth configuration to avoid detection:
python
# Remove automation flags
context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
""")
# Use realistic settings
browser = playwright.chromium.launch(
args=['--disable-blink-features=AutomationControlled']
)
2. Human-like Behavior
Add random delays and smooth interactions:
python
import random await asyncio.sleep(random.uniform(0.5, 2.0)) await page.mouse.move(x, y, steps=random.randint(10, 30))
3. Wait for Content
Use appropriate wait strategies:
python
# For static content
await page.goto(url, wait_until='networkidle')
# For dynamic content
await page.wait_for_selector('.content')
await page.wait_for_function("document.querySelectorAll('.item').length > 10")
Common Patterns
Pattern 1: Article/Blog Content
python
title = await page.locator('h1').first.text_content()
paragraphs = await page.locator('article p').all_text_contents()
content = '\n\n'.join(paragraphs)
Pattern 2: Infinite Scroll
python
while len(items) < max_items:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
current_items = await page.locator('.item').all()
if len(current_items) == previous_count:
break
Pattern 3: Handle Popups
python
# Close cookie banners and modals
try:
await page.click('button:has-text("Accept")', timeout=3000)
except:
pass
Pattern 4: Login Required
python
await page.fill('input[name="username"]', username)
await page.fill('input[name="password"]', password)
await page.click('button[type="submit"]')
await page.wait_for_url('**/dashboard')
cookies = await context.cookies() # Save for reuse
Rate Limiting & Retries
Implement rate limiting to avoid bans:
python
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=4, max=10))
async def scrape_with_retry(url: str):
# Your scraping logic
pass
Track requests per time window:
python
class RateLimiter:
def __init__(self, max_requests: int, time_window: int):
self.max_requests = max_requests
self.time_window = time_window
self.requests = []
async def wait_if_needed(self):
# Remove old requests, wait if limit reached
pass
Data Extraction
Use BeautifulSoup for parsing after Playwright renders:
python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Remove unwanted elements
for element in soup(['script', 'style', 'nav', 'footer']):
element.decompose()
# Extract structured data
article = soup.find('article') or soup.find('main')
paragraphs = [p.get_text(strip=True) for p in article.find_all('p')]
Caching
Cache results to minimize requests:
python
import hashlib
import json
def get_cache_path(url: str) -> Path:
url_hash = hashlib.md5(url.encode()).hexdigest()
return Path(f'.cache/{url_hash}.json')
# Check cache before scraping
cached = load_from_cache(url)
if cached:
return cached
Installation
bash
pip install playwright beautifulsoup4 lxml tenacity playwright install chromium # Or with uv uv add playwright beautifulsoup4 lxml tenacity uv run playwright install chromium
Project Structure
code
scripts/scrapers/ ├── base.py # Base scraper class with stealth mode ├── extractors/ # Site-specific extractors │ ├── medium.py │ ├── github.py │ └── generic.py ├── utils/ │ ├── stealth.py # Anti-bot utilities │ ├── cache.py # Caching logic │ └── rate_limit.py # Rate limiting └── config.py # User agents, timeouts, etc.
Validation Checklist
Before deploying:
- • Uses stealth mode (removes webdriver flag)
- • Implements rate limiting
- • Has retry logic with exponential backoff
- • Uses caching to avoid redundant requests
- • Handles errors gracefully
- • Closes browser resources properly
- • Respects robots.txt
- • Logs all operations
Common Issues
- •"Executable doesn't exist" → Run
playwright install chromium - •Timeout errors → Increase timeout or use
wait_until='domcontentloaded' - •Element not found → Add explicit waits with
wait_for_selector() - •Detected as bot → Use stealth mode, rotate user agents, add random delays
- •Memory leaks → Always close browser in finally block
Best Practices
- •Respect robots.txt - Check before scraping
- •Use caching - Avoid redundant requests
- •Rate limit - Don't overload servers
- •Rotate user agents - Avoid detection
- •Log everything - Debug and monitor
- •Handle errors - Retry with backoff
- •Clean up resources - Close browsers properly
For detailed code examples: See references/examples.md
For site-specific patterns: See references/patterns.md
For advanced anti-bot techniques: See references/stealth-guide.md