Web Scraping and Data Collection
Scrapy spiders, Playwright browser automation, httpx async extraction, rate limiting, pagination, and data cleaning pipelines.
Scraping Tool Decision Table
| Site Type | Tool | Why |
|---|---|---|
| Static HTML, many pages | Scrapy | Async engine, built-in pipelines, middleware, autothrottle |
| JS-rendered SPA | Playwright | Full browser, handles dynamic content, stealth plugins |
| REST/GraphQL API | httpx | Async HTTP client, no browser overhead |
| Simple one-off extraction | BeautifulSoup + httpx | Lightweight, no framework overhead |
| Login-gated + JS-heavy | Playwright + stealth | Cookie handling, human-like interaction |
| Large-scale distributed | Scrapy + Scrapy-Redis | Distributed queue, deduplication, resume |
Scrapy Spider Patterns
CrawlSpider with Rules
python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class ProductSpider(CrawlSpider):
name = "products"
allowed_domains = ["example.com"]
start_urls = ["https://example.com/catalog"]
rules = (
Rule(LinkExtractor(allow=r"/category/"), follow=True),
Rule(LinkExtractor(allow=r"/product/\d+"), callback="parse_product"),
)
def parse_product(self, response):
yield {
"url": response.url,
"title": response.css("h1.product-title::text").get("").strip(),
"price": response.css("span.price::text").re_first(r"[\d.]+"),
"description": response.css("div.description p::text").getall(),
"sku": response.xpath('//meta[@itemprop="sku"]/@content').get(),
}
Item Pipeline
python
import json
from itemadapter import ItemAdapter
class CleaningPipeline:
def process_item(self, item, spider):
adapter = ItemAdapter(item)
for field in adapter.field_names():
val = adapter.get(field)
if isinstance(val, str):
adapter[field] = val.strip()
if not adapter.get("title"):
raise scrapy.exceptions.DropItem(f"Missing title: {item}")
return item
class JsonLinesPipeline:
def open_spider(self, spider):
self.file = open(f"{spider.name}.jsonl", "w")
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
self.file.write(json.dumps(ItemAdapter(item).asdict()) + "\n")
return item
Rate Limiting Settings
python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 4
ROBOTSTXT_OBEY = True
USER_AGENT = "MyBot/1.0 (+https://example.com/bot)"
ITEM_PIPELINES = {
"myproject.pipelines.CleaningPipeline": 100,
"myproject.pipelines.JsonLinesPipeline": 200,
}
Playwright for JS-Rendered Pages
python
from playwright.async_api import async_playwright
async def scrape_spa(url: str) -> list[dict]:
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
viewport={"width": 1280, "height": 720},
)
page = await context.new_page()
# Block images/fonts for speed
await page.route("**/*.{png,jpg,gif,svg,woff2}", lambda r: r.abort())
await page.goto(url, wait_until="networkidle")
await page.wait_for_selector("div.results", timeout=10000)
items = await page.evaluate("""
() => Array.from(document.querySelectorAll('.result-card')).map(el => ({
title: el.querySelector('h3')?.textContent?.trim(),
link: el.querySelector('a')?.href,
}))
""")
await browser.close()
return items
Infinite Scroll Handling
python
async def scrape_infinite_scroll(page, max_items: int = 500) -> list[dict]:
items, prev_count, stale_rounds = [], 0, 0
while len(items) < max_items and stale_rounds < 3:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
cards = await page.query_selector_all(".item-card")
items = [await extract_card(c) for c in cards]
stale_rounds = stale_rounds + 1 if len(items) == prev_count else 0
prev_count = len(items)
return items[:max_items]
httpx Async Client for APIs
python
import httpx
import asyncio
async def fetch_paginated_api(base_url: str, max_pages: int = 50) -> list[dict]:
results = []
async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
cursor = None
for _ in range(max_pages):
params = {"limit": 100}
if cursor:
params["cursor"] = cursor
resp = await client.get(f"{base_url}/items", params=params)
resp.raise_for_status()
data = resp.json()
results.extend(data["items"])
cursor = data.get("next_cursor")
if not cursor:
break
await asyncio.sleep(0.5) # politeness delay
return results
Pagination Patterns
Next-Page Link (Scrapy)
python
def parse_listing(self, response):
for card in response.css("div.item"):
yield {"title": card.css("h3::text").get()}
next_page = response.css("a.next-page::attr(href)").get()
if next_page:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse_listing)
Data Cleaning Pipeline
python
import re
from bs4 import BeautifulSoup
def clean_html_text(raw_html: str) -> str:
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style"]):
tag.decompose()
return re.sub(r"\s+", " ", soup.get_text(separator=" ", strip=True)).strip()
def extract_price(text: str) -> float | None:
match = re.search(r"[\d,]+\.?\d*", text.replace(",", ""))
return float(match.group()) if match else None
def normalize_record(raw: dict) -> dict:
return {
"title": raw.get("title", "").strip()[:500],
"price": extract_price(raw.get("price_text", "")),
"description": clean_html_text(raw.get("desc_html", "")),
}
Legal and Ethical Considerations
| Aspect | Guidance |
|---|---|
| robots.txt | Always obey; ROBOTSTXT_OBEY = True in Scrapy |
| Rate limiting | 1-2 req/sec per domain minimum; use AutoThrottle |
| Terms of Service | Check ToS for scraping prohibitions before starting |
| Personal data | GDPR/CCPA apply; avoid scraping PII without legal basis |
| Caching | Cache responses locally to avoid redundant requests |
| Identification | Set a descriptive User-Agent with contact URL |
Gotchas
- •Scrapy
response.css()returnsSelectorList: use.get()for first match,.getall()for all..extract()is deprecated. - •
robots.txtcaching: Scrapy caches per domain. Stale rules apply on long crawls; consider periodic re-fetch. - •Playwright
wait_until="networkidle"can hang on heavy sites. Use"domcontentloaded"+ explicitwait_for_selector. - •Memory on large crawls: enable
JOBDIRfor pause/resume,CLOSESPIDER_ITEMCOUNTfor bounded runs. - •Cloudflare / anti-bot:
playwright-stealthpatches navigator properties. Rotate user agents and use proxies. - •httpx vs aiohttp: httpx has simpler API and HTTP/2; aiohttp is faster for high-concurrency websockets. Default to httpx.
- •Selector fragility: prefer
data-testidor semantic attributes over nested CSS. XPathcontains(@class, "x")is more resilient. - •Character encoding: Scrapy auto-detects. For httpx, check
resp.encoding; useresp.apparent_encodingas fallback.