Web Scraper
A toolkit for extracting content from web pages using Python.
When to Use This Skill
Activate this skill when the user needs to:
- Fetch the HTML content of a web page
- Extract all links from a page
- Get readable text content from HTML
- Scrape data from websites
- Download and analyze web content
Requirements
This skill requires external packages:
bash
pip install requests beautifulsoup4
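To confirm that both packages are importable before running the scripts, a quick check along these lines may help. Note that `beautifulsoup4` is imported under the module name `bs4`. The helper name `missing_modules` is made up for this sketch and is not part of the toolkit.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The beautifulsoup4 distribution installs the importable module `bs4`.
    missing = missing_modules(["requests", "bs4"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```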
Available Scripts
Always run scripts with --help first to see all available options.
| Script | Purpose |
|---|---|
| fetch_page.py | Download HTML content from a URL |
| extract_links.py | Extract all links from a page |
| extract_text.py | Extract readable text from HTML |
Decision Tree
code
Task → What do you need?
│
├─ Raw HTML content?
│ └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│ └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
  └─ Use: extract_text.py <url>
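When the choice has to be made programmatically, the tree above can be expressed as a small lookup. This is only a sketch: the keys `"html"`, `"links"`, and `"text"` and the function `build_command` are hypothetical names, and the `scripts/` prefix is taken from the quick examples in this document.

```python
# Hypothetical mapping from each branch of the decision tree to its script.
SCRIPT_FOR_NEED = {
    "html": "fetch_page.py",      # raw HTML content
    "links": "extract_links.py",  # list of links on a page
    "text": "extract_text.py",    # text content without HTML tags
}


def build_command(need, url, extra_args=()):
    """Return the argument list for the script that matches `need`."""
    if need not in SCRIPT_FOR_NEED:
        raise ValueError(f"Unknown need: {need!r}")
    return ["python", f"scripts/{SCRIPT_FOR_NEED[need]}", url, *extra_args]


if __name__ == "__main__":
    print(build_command("links", "https://example.com", ["--absolute"]))
```

The returned list can be passed directly to `subprocess.run(..., check=True)`.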
Quick Examples
Fetch page HTML:
bash
python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
Extract all links:
bash
python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
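The source of extract_links.py is not shown here, so the following is only a sketch of what `--absolute` and `--filter` presumably do: resolve each href against the page URL, then keep the links that match a regular expression. It uses only the standard library so that it runs without extra packages; the bundled script may well use BeautifulSoup instead. `LinkCollector` and `extract_links` are illustrative names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, absolute=False, pattern=None):
    """Return links from `html`, optionally made absolute and regex-filtered."""
    collector = LinkCollector()
    collector.feed(html)
    links = collector.hrefs
    if absolute:
        # Assumed meaning of --absolute: resolve hrefs against the page URL.
        links = [urljoin(base_url, href) for href in links]
    if pattern:
        # Assumed meaning of --filter: keep links where the regex matches.
        regex = re.compile(pattern)
        links = [link for link in links if regex.search(link)]
    return links


if __name__ == "__main__":
    sample = '<a href="/docs/guide.pdf">Guide</a> <a href="about.html">About</a>'
    print(extract_links(sample, "https://example.com/", absolute=True, pattern=r"\.pdf$"))
```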
Extract text content:
bash
python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
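For intuition, text extraction amounts to walking the HTML, dropping `<script>` and `<style>` contents, and keeping the remaining text. The sketch below assumes that `--paragraphs` returns the text of each `<p>` element separately; that reading is a guess, since the script's source is not shown here. It relies on the standard library only, and `TextExtractor` and `extract_text` are illustrative names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_paragraph = False
        self.chunks = []      # every visible text fragment, in order
        self.paragraphs = []  # accumulated text of each <p> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        self.chunks.append(text)
        if self._in_paragraph:
            self.paragraphs[-1] = f"{self.paragraphs[-1]} {text}".strip()


def extract_text(html, paragraphs=False):
    """Return all visible text as one string, or a list of <p> texts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    if paragraphs:
        return [p for p in parser.paragraphs if p]
    return " ".join(parser.chunks)
```

Fragments are joined with single spaces, so spacing around inline tags and punctuation is approximate.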
Best Practices
- Respect robots.txt: check whether scraping is allowed
- Add delays: don't overwhelm servers with rapid requests
- Use an appropriate User-Agent: identify your scraper properly
- Handle errors gracefully: websites may block requests or time out
- Cache responses: don't re-fetch unchanged pages
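Several of these practices can be combined in a thin wrapper around whatever function does the actual download. The sketch below checks a robots.txt body with the standard library's `urllib.robotparser`, enforces a minimum delay between real requests, and caches results in memory. `PoliteFetcher`, `is_allowed`, and the `USER_AGENT` string are hypothetical and not part of the bundled scripts, and the cache never revalidates, so it only suits short-lived runs.

```python
import time
from urllib import robotparser

# Placeholder identity; replace with a real name and contact address.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Check `url` against the text of a robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay and an in-memory cache."""

    def __init__(self, fetch, min_delay=1.0):
        self._fetch = fetch          # callable: url -> page content
        self._min_delay = min_delay  # seconds between real requests
        self._last_request = None
        self._cache = {}

    def get(self, url):
        if url in self._cache:
            return self._cache[url]  # cached: no request, no delay
        if self._last_request is not None:
            wait = self._min_delay - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        content = self._fetch(url)
        self._last_request = time.monotonic()
        self._cache[url] = content
        return content


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(robots, "https://example.com/private/report.html"))
    fetcher = PoliteFetcher(lambda url: f"<html>{url}</html>", min_delay=0.5)
    print(fetcher.get("https://example.com/a"))
    print(fetcher.get("https://example.com/a"))  # served from the cache
```

Passing the download function in as an argument keeps the delay and cache logic independent of the HTTP library in use.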
Common Issues
- 403 Forbidden: The site may be blocking scrapers. Try the --user-agent flag.
- Timeout: The site may be slow. Increase the --timeout value.
- Empty content: The page may require JavaScript. These scripts handle static HTML only.
- Encoding issues: Use the --encoding flag if text appears garbled.
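The same failure modes show up when fetching pages from your own code. The sketch below maps them to the hints above, using the standard library's `urllib` rather than `requests` so that it runs without extra packages. `diagnose`, `decode_body`, and `fetch_text` are hypothetical helpers, and their parameters mirror the `--user-agent`, `--timeout`, and `--encoding` flags only by assumption.

```python
import socket
import urllib.error
import urllib.request


def diagnose(exc):
    """Map a fetch failure to a hint from the list of common issues."""
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 403:
            return "403 Forbidden: try a different --user-agent"
        return f"HTTP error {exc.code}"
    # URLError wraps the underlying cause in `reason`; bare timeouts have none.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "Timeout: try a larger --timeout"
    return f"Request failed: {reason}"


def decode_body(raw, encoding=None):
    """Decode bytes with the given encoding, defaulting to UTF-8.

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    return raw.decode(encoding or "utf-8", errors="replace")


def fetch_text(url, user_agent="example-scraper/0.1", timeout=10.0, encoding=None):
    """Fetch `url` and return (text, None) on success or (None, hint) on failure."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
            # Prefer the caller's encoding, then the charset the server declared.
            charset = encoding or response.headers.get_content_charset()
    except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
        return None, diagnose(exc)
    return decode_body(raw, charset), None
```

Returning a hint instead of raising lets a calling loop log the failure and continue with the next URL.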
Reference Files
See references/selectors.md for CSS selector syntax reference.
Ethical Considerations
- Only scrape public data
- Respect rate limits and robots.txt
- Don't scrape personal/private information
- Check website terms of service
- Consider using official APIs when available