Webcrawler Skill
Intelligent documentation harvesting agent that recursively crawls documentation websites and extracts structured content about specific subjects.
Last Updated: 2026-01-23
Quick Start
bash
# Crawl Python documentation about async/await python skills/webcrawler/scripts/crawl_docs.py \ --url "https://docs.python.org/3/library/asyncio.html" \ --subject "asyncio" \ --depth 2 \ --output .tmp/docs/python-asyncio/ # Crawl React documentation python skills/webcrawler/scripts/crawl_docs.py \ --url "https://react.dev/" \ --subject "React" \ --depth 3 \ --output .tmp/docs/react/ # Extract only API reference pages python skills/webcrawler/scripts/crawl_docs.py \ --url "https://expressjs.com/en/4x/api.html" \ --subject "Express API" \ --filter "api" \ --output .tmp/docs/express-api/
Core Workflow
- •Initialize Crawl — Provide base URL and subject focus
- •Discover Pages — Recursively find all linked documentation pages
- •Filter Content — Keep only pages matching the subject criteria
- •Extract Content — Convert HTML to clean markdown
- •Organize Output — Structure files in a navigable hierarchy
- •Generate Index — Create a master index with all harvested pages
Scripts
crawl_docs.py — Main Documentation Crawler
The primary crawling script that handles recursive page discovery and content extraction.
bash
python skills/webcrawler/scripts/crawl_docs.py \ --url <base-url> # Starting URL (required) --subject <topic> # Subject focus for filtering (required) --output <directory> # Output directory (default: .tmp/crawled/) --depth <n> # Max crawl depth (default: 2) --filter <pattern> # URL path filter pattern (optional) --delay <seconds> # Delay between requests (default: 0.5) --max-pages <n> # Maximum pages to crawl (default: 100) --same-domain # Stay within same domain (default: true) --include-code # Preserve code blocks (default: true) --format <md|json|both> # Output format (default: both)
Outputs:
- •
index.md— Master index with links to all pages - •
pages/*.md— Individual markdown files per page - •
metadata.json— Crawl metadata and page inventory - •
content.json— Structured JSON with all extracted content
extract_page.py — Single Page Extractor
Extract content from a single documentation page.
bash
python skills/webcrawler/scripts/extract_page.py \ --url <page-url> # Page to extract (required) --output <file> # Output file (default: stdout) --format <md|json> # Output format (default: md) --include-links # Include internal links (default: true)
filter_docs.py — Post-Crawl Filtering
Filter already-crawled documentation by subject or pattern.
bash
python skills/webcrawler/scripts/filter_docs.py \ --input <crawl-dir> # Crawled docs directory (required) --subject <topic> # Subject to filter for (required) --output <directory> # Filtered output directory (required) --threshold <0.0-1.0> # Relevance threshold (default: 0.3)
Configuration
Rate Limiting & Politeness
The crawler respects robots.txt and implements polite crawling:
- •Default delay: 0.5s between requests
- •User-Agent: Identifies as documentation harvester
- •robots.txt: Honored by default (disable with
--ignore-robots)
Domain Handling
| Mode | Behavior |
|---|---|
--same-domain | Only crawl pages on the starting domain |
--same-path | Only crawl pages under the starting URL path |
--allow-subdomains | Include subdomains (e.g., api.example.com) |
Content Extraction
The crawler uses intelligent content extraction:
- •Main content detection — Finds
<main>,<article>, or content containers - •Navigation removal — Strips headers, footers, sidebars
- •Code preservation — Maintains code blocks with language hints
- •Link normalization — Converts relative links to absolute
- •Image handling — Optionally downloads and references images
Output Structure
code
.tmp/docs/<subject>/
├── index.md # Master index with TOC
├── metadata.json # Crawl metadata
├── content.json # Structured JSON export
└── pages/
├── getting-started.md
├── installation.md
├── api-reference.md
├── configuration/
│ ├── basic.md
│ └── advanced.md
└── troubleshooting.md
Index Format
markdown
# <Subject> Documentation > Crawled from: <base-url> > Pages: <count> > Date: <timestamp> ## Table of Contents - [Getting Started](pages/getting-started.md) - [Installation](pages/installation.md) - [API Reference](pages/api-reference.md) - Configuration - [Basic](pages/configuration/basic.md) - [Advanced](pages/configuration/advanced.md) - [Troubleshooting](pages/troubleshooting.md)
Common Workflows
1. Harvest API Documentation
bash
# Crawl API docs with deep recursion python skills/webcrawler/scripts/crawl_docs.py \ --url "https://api.example.com/docs" \ --subject "Example API" \ --depth 4 \ --filter "/api/" \ --output .tmp/docs/example-api/
2. Build RAG Knowledge Base
bash
# Crawl and export as JSON for embedding python skills/webcrawler/scripts/crawl_docs.py \ --url "https://docs.example.com" \ --subject "Example Docs" \ --depth 3 \ --format json \ --output .tmp/rag/example/ # The content.json can be fed directly to embedding pipelines
3. Offline Documentation Mirror
bash
# Full documentation harvest python skills/webcrawler/scripts/crawl_docs.py \ --url "https://docs.kubernetes.io/docs/concepts/" \ --subject "Kubernetes Concepts" \ --depth 5 \ --max-pages 500 \ --include-images \ --output .tmp/docs/k8s-concepts/
4. Focused Topic Extraction
bash
# Crawl, then filter to specific topic python skills/webcrawler/scripts/crawl_docs.py \ --url "https://developer.hashicorp.com/terraform/docs" \ --subject "Terraform" \ --depth 3 \ --output .tmp/docs/terraform-full/ # Filter to AWS provider only python skills/webcrawler/scripts/filter_docs.py \ --input .tmp/docs/terraform-full/ \ --subject "AWS Provider" \ --output .tmp/docs/terraform-aws/
Best Practices
Crawling
- •Start shallow — Begin with
--depth 1to test, then increase - •Use filters — Narrow scope with
--filterpatterns - •Set page limits — Use
--max-pagesto prevent runaway crawls - •Respect rate limits — Increase
--delayfor slower servers
Content Quality
- •Subject focus — Be specific with
--subjectfor better filtering - •Review index — Check
index.mdto verify crawl coverage - •Post-filter — Use
filter_docs.pyto refine results
Storage
- •Use
.tmp/— Store crawled docs in the temp directory - •Organize by subject — Create subdirectories per topic
- •Version with dates — Add timestamps for recurring crawls
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| 403 Forbidden | Blocked by server | Increase delay, check robots.txt |
| Empty pages | JavaScript-rendered content | Use --render-js (requires Playwright) |
| Too many pages | Unbounded crawl | Lower depth, use filters |
| Duplicate content | Same page via multiple URLs | Enabled by default (URL normalization) |
| Missing code blocks | Extraction issue | Check --include-code is enabled |
Dependencies
Required Python packages:
bash
pip install requests beautifulsoup4 html2text lxml # Optional for JavaScript rendering: pip install playwright && playwright install
Related Skills
- •qdrant-memory — Store crawled docs in vector database for RAG
- •pdf-reader — Extract text from PDF documentation
External Resources
- •Scrapy Documentation — For complex crawling needs
- •html2text — HTML to Markdown conversion
- •BeautifulSoup — HTML parsing