Web Scraper
Scrape websites and monitor for changes using browser automation or HTTP requests.
Quick Start
One-time scrape
bash
# Scrape headlines from a news site scripts/scrape.py "https://reuters.com" --selector "h3" --output headlines.json # Scrape with full browser (for JS-heavy sites) scripts/scrape.py "https://twitter.com/elonmusk" --browser --selector "[data-testid='tweetText']"
Monitor for changes
bash
# Monitor a page, alert on new content scripts/monitor.py "https://reuters.com/markets" --selector ".story-title" --interval 60
Commands
scrape.py
Extract content from a webpage.
bash
scripts/scrape.py <url> [options] Options: --selector, -s CSS selector to extract (default: body) --browser, -b Use full browser (for JS sites) --output, -o Output file (json/txt) --links Extract all links --images Extract all images --text Extract text only (clean) --wait Wait N seconds for JS to load --screenshot Take screenshot
Examples:
bash
# Get all headlines scripts/scrape.py "https://news.ycombinator.com" -s ".titleline > a" -o hn.json # Get with browser for JS content scripts/scrape.py "https://polymarket.com" --browser -s ".market-title" --wait 3
monitor.py
Watch a page for changes and alert.
bash
scripts/monitor.py <url> [options] Options: --selector, -s CSS selector to watch --interval, -i Check interval in seconds (default: 60) --output, -o Log changes to file --diff Show what changed --alert Print alert when change detected
Examples:
bash
# Monitor news for new headlines scripts/monitor.py "https://reuters.com" -s "h3.story-title" -i 30 --alert # Monitor and log all changes scripts/monitor.py "https://idealista.com/search" -s ".item-info" -i 60 -o changes.log
Use Cases
- •News monitoring: Watch headlines, detect breaking news
- •Price tracking: Monitor e-commerce prices
- •Availability alerts: Watch for stock/appointments
- •Content extraction: Scrape data for analysis
- •Change detection: Know when a page updates
Tips
- •Use
--browserfor sites that load content with JavaScript - •Use
--waitif content takes time to appear - •Start with short intervals, increase if rate-limited
- •Combine with other tools (jq, grep) for filtering