Web Scraper
Fetch, search, and extract content from websites.
When to use this skill
- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis
Scripts overview
| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | requests, beautifulsoup4, readability-lxml, html2text |
| `search_web.py` | Search the web via DuckDuckGo | ddgs |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | crawl4ai |
| `extract_links.py` | Extract and categorize all links from a page | requests, beautifulsoup4 |
Steps
1. Install dependencies (first time only)
For lightweight scraping (static pages, search, link extraction):
pip install requests beautifulsoup4 readability-lxml html2text ddgs
For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
pip install crawl4ai
crawl4ai-setup
Note: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Install it only if you actually need dynamic page support.
CRITICAL: Dependency error recovery. If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the matching command above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
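An `ImportError` names the import module, which is not always the pip package name (for example, `beautifulsoup4` is imported as `bs4`). The sketch below is a diagnostic only, not a replacement for the scripts: it reports which lightweight packages are not importable so the right `pip install` can be run. The helper name `missing_packages` and the mapping are illustrative assumptions, not part of this skill.

```python
import importlib.util

# Import name -> pip package name. Assumed mapping based on how these
# libraries are normally imported; adjust if a script imports differently.
LIGHTWEIGHT_DEPS = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "readability": "readability-lxml",
    "html2text": "html2text",
    "ddgs": "ddgs",
}


def missing_packages(import_to_package):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for module, package in import_to_package.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(LIGHTWEIGHT_DEPS)
    if missing:
        # Print the install command instead of running it automatically.
        print("pip install " + " ".join(missing))
    else:
        print("All lightweight dependencies are importable.")
```

After installing whatever it reports, re-run the original script command unchanged.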
2. Fetch a web page (static — recommended first choice)
Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.
python scripts/fetch_page.py "URL"
Options:
- `--raw`: Output full-page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"`: Extract only elements matching the CSS selector (e.g. `".article-body"`, `"table"`, `"#content"`)
- `--save OUTPUT_PATH`: Also save output to a file
- `--max-length N`: Truncate output to N characters (default: no limit)
Examples:
# Fetch an article
python scripts/fetch_page.py "https://example.com/article"
# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"
# Fetch raw full-page Markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
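If the script is called from another Python program rather than a shell, the flags listed above map naturally onto an argument list. The following is a minimal sketch under two assumptions: the script lives at `scripts/fetch_page.py` relative to the working directory, and it writes its result to standard output. The helper names `build_fetch_command` and `run_fetch` are hypothetical.

```python
import subprocess
import sys


def build_fetch_command(url, raw=False, selector=None, save=None, max_length=None):
    """Build the argument list for scripts/fetch_page.py from the documented flags."""
    command = [sys.executable, "scripts/fetch_page.py", url]
    if raw:
        command.append("--raw")
    if selector is not None:
        command += ["--selector", selector]
    if save is not None:
        command += ["--save", str(save)]
    if max_length is not None:
        command += ["--max-length", str(max_length)]
    return command


def run_fetch(url, **options):
    """Run the script and return its stdout (assumed to hold the Markdown output).

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    completed = subprocess.run(
        build_fetch_command(url, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with selectors such as `#content` or `.article-body`.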
3. Search the web
Search using DuckDuckGo (no API key required).
python scripts/search_web.py "search query"
Options:
- `--max-results N`: Number of results to return (default: 10)
- `--region REGION`: Region code, e.g. `cn-zh`, `us-en`, `jp-jp` (default: `wt-wt` for worldwide)
- `--news`: Search news instead of general web
Examples:
# General search
python scripts/search_web.py "Python web scraping best practices 2025"
# News search, Chinese region, 5 results
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
4. Crawl a dynamic / JavaScript-rendered page
Use this only when fetch_page.py returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).
python scripts/crawl_dynamic.py "URL"
Options:
- `--wait N`: Wait N seconds after page load for JS to finish (default: 3)
- `--selector "CSS_SELECTOR"`: Wait for a specific element to appear before extracting
- `--scroll`: Scroll to the bottom of the page to trigger lazy loading
- `--save OUTPUT_PATH`: Also save output to a file
- `--max-length N`: Truncate output to N characters
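The difference between `--wait` (a fixed delay) and `--selector` (wait for a condition) can be illustrated with a generic polling loop. This is a conceptual sketch only; it does not claim to show how `crawl_dynamic.py` or crawl4ai implement waiting, and `wait_until` is a hypothetical helper.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A fixed delay always costs the full N seconds and may still be too short, whereas a condition-based wait returns as soon as the element is present and gives up only at the timeout.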
5. Extract links from a page
Extract all links with their text labels, categorized by type (internal, external, resource).
python scripts/extract_links.py "URL"
Options:
- `--filter PATTERN`: Only show links matching a regex pattern (applied to URL)
- `--external-only`: Only show external links
- `--json`: Output as JSON instead of Markdown
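To make the three categories concrete, here is one plausible way to classify a link using only the standard library. The rules are assumptions for illustration (same host means internal, a file extension from a hypothetical list means resource, and the regex is applied with `re.search`); `extract_links.py` may draw the lines differently.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical set of extensions treated as "resource" links.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".css", ".js")


def categorize_link(page_url, href):
    """Resolve href against page_url and label it internal, external, or resource."""
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
        return absolute, "resource"
    if parsed.netloc == urlparse(page_url).netloc:
        return absolute, "internal"
    return absolute, "external"


def filter_links(links, pattern):
    """Keep only (url, category) pairs whose URL matches the regex pattern."""
    regex = re.compile(pattern)
    return [link for link in links if regex.search(link[0])]
```

Resolving relative links with `urljoin` before comparing hosts matters: a bare `guide.html` has no host of its own and would otherwise be misclassified.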
Decision guide: which script to use
- Start with `fetch_page.py`: handles 90% of websites (articles, docs, blogs, wikis).
- If `fetch_page.py` returns empty/garbled content → try `crawl_dynamic.py` (the page likely needs JavaScript).
- Need to find URLs first? → Use `search_web.py` to discover relevant pages.
- Need to navigate a site structure? → Use `extract_links.py` to map out links, then fetch individual pages.
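The static-first, dynamic-fallback rule in this guide can be expressed as a small heuristic. This is a rough sketch: the 200-character threshold is an arbitrary assumption, the function names are hypothetical, and the check only catches empty or very short output, not garbled text.

```python
def looks_incomplete(markdown, min_chars=200):
    """Heuristic: treat whitespace-only or very short output as a failed static fetch."""
    return len(markdown.strip()) < min_chars


def choose_script(static_output, min_chars=200):
    """Pick the script to rely on, given what fetch_page.py returned."""
    if looks_incomplete(static_output, min_chars):
        return "crawl_dynamic.py"
    return "fetch_page.py"
```

Some legitimate pages are short, so the threshold is best treated as a hint to inspect the output rather than a hard rule.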
Common workflows
Research a topic
- `search_web.py "topic"` → get relevant URLs
- `fetch_page.py "best_url"` → read the content
- Repeat for multiple sources, then synthesize
Scrape structured data from a page
- `fetch_page.py "url" --selector "table"` → extract tables
- Or `fetch_page.py "url" --selector ".product-card"` → extract specific elements
Crawl a modern web app (SPA)
- `crawl_dynamic.py "url" --wait 5 --scroll` → full JS-rendered content
Edge cases
- Paywalled sites: May return partial content or login pages. Inform the user.
- Rate limiting / CAPTCHAs: If requests fail with 403/429, wait and retry or inform the user.
- Very large pages: Use `--max-length` to truncate output and avoid overwhelming the context window.
- Encoding issues: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
- Robots.txt: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
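Since the scripts do not consult robots.txt, a caller who wants to respect it can check a URL beforehand with the standard library. The sketch below assumes the robots.txt text has already been retrieved from the location returned by `robots_url_for`; both function names are hypothetical.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url):
    """Return the conventional robots.txt location for the page's host."""
    parsed = urlparse(page_url)
    return f"{parsed.scheme}://{parsed.netloc}/robots.txt"


def is_allowed(robots_txt, page_url, user_agent="*"):
    """Check page_url against the rules in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)
```

Taking the rules as text keeps the check independent of how the file was downloaded, which also makes it easy to test offline.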
Scripts
- fetch_page.py: Fetch and extract readable content as Markdown
- search_web.py: Search the web via DuckDuckGo
- crawl_dynamic.py: Crawl JavaScript-rendered pages
- extract_links.py: Extract and categorize page links