Crawl4AI - Website Crawling & Content Extraction
Overview
This skill enables crawling entire websites and extracting clean, structured content with proper metadata separation. It uses an efficient 2-phase approach that separates bulk crawling from content processing.
Two-Phase Architecture
Phase 1: Bulk Crawling (bulk_crawl.py)
- •Recursively crawls entire website tree (configurable depth)
- •Parallel loading of multiple pages simultaneously
- •Excludes nav/header/footer elements during crawl
- •Saves raw data:
raw.html,raw.md,metadata.json - •Fast and efficient - no per-page processing overhead
Phase 2: Post-Processing (postprocess.py)
- •Processes all crawled pages offline
- •Cleans markdown content (removes pagination, duplicates, nav remnants)
- •Enriches metadata with AI (description, keywords) or heuristic methods
- •Saves final:
content.md(clean, NO frontmatter!) + enrichedmetadata.json - •Metadata stays in JSON - markdown stays pure
Key Benefits:
- •Separation of Concerns: Crawling ≠ Processing
- •Scalability: Crawl entire sites in parallel, process later
- •Flexibility: Re-process data without re-crawling
- •Clean Output: Markdown without frontmatter, metadata in JSON
- •Cost Efficient: Batch AI calls for metadata generation
Workflow
For Single Pages:
Use crawl_to_markdown.py for quick single-page extraction with intelligent content filtering.
For Entire Websites (Recommended):
Step 1: Bulk Crawl
python scripts/bulk_crawl.py <url> --output-dir ./site --max-depth 3
- •Crawls entire website recursively
- •Parallel loading of pages
- •Saves raw data for all pages
Step 2: Post-Process
python scripts/postprocess.py ./site
- •Cleans all markdown files
- •Generates metadata with AI
- •Creates final
content.md+ enrichedmetadata.jsonfor each page
Result:
site/
index/
raw.html # Original HTML (for reference)
raw.md # Original markdown from crawl4ai
content.md # ✨ Clean markdown (NO frontmatter!)
metadata.json # ✨ All metadata here
about/
raw.html
raw.md
content.md
metadata.json
...
Crawling Process
Phase 1: Bulk Crawl (bulk_crawl.py)
Crawls an entire website recursively and saves raw data:
python scripts/bulk_crawl.py <url> [options]
Parameters:
- •
<url>: Start URL to crawl (required) - •
--output-dir PATH: Output directory (default: ./crawled_site) - •
--max-depth N: Maximum crawl depth (default: 3) - •
--wait-time N: JavaScript wait time in seconds (default: 5.0) - •
--allow-external: Allow crawling external domains (default: same-domain only)
What it does:
- •Starts from given URL
- •Extracts all internal links
- •Crawls pages in parallel (depth-first)
- •Excludes nav/header/footer during crawl
- •Saves for each page:
- •
raw.html- Original HTML - •
raw.md- Raw markdown from crawl4ai - •
metadata.json- Basic metadata (url, title, links, crawled_at)
- •
Examples:
Crawl entire site with max depth 3:
python scripts/bulk_crawl.py https://example.com --max-depth 3
Shallow crawl (only start page + direct links):
python scripts/bulk_crawl.py https://example.com --max-depth 1
Phase 2: Post-Processing (postprocess.py)
Processes bulk-crawled data and generates final output:
python scripts/postprocess.py <crawled_dir> [options]
Parameters:
- •
<crawled_dir>: Directory with bulk-crawled data (required) - •
--no-ai: Disable AI metadata generation (use heuristic methods)
What it does:
- •Finds all pages with
raw.mdfiles - •For each page:
- •Cleans markdown (removes pagination, duplicates, nav remnants)
- •Generates description & keywords (AI or heuristic)
- •Detects language from HTML
- •Calculates content hash and token estimate
- •Saves
content.md(clean, NO frontmatter!) - •Enriches
metadata.jsonwith all metadata
Examples:
Post-process with AI (requires ANTHROPIC_API_KEY):
export ANTHROPIC_API_KEY=sk-ant-... python scripts/postprocess.py ./crawled_site
Post-process without AI (heuristic methods):
python scripts/postprocess.py ./crawled_site --no-ai
Legacy: Single-Page Crawl (crawl_to_markdown.py)
For quick single-page extraction (legacy mode):
python scripts/crawl_to_markdown.py <url> --output-dir ./output
Note: For multi-page sites, use the 2-phase approach (bulk_crawl.py + postprocess.py) instead.
Metadata Structure
Metadata is stored separately in metadata.json (NO frontmatter in markdown!):
{
"url": "https://example.com/page",
"crawled_at": "2025-11-04T15:18:23.075744",
"title": "Page Title",
"links": ["https://example.com/about", ...],
"description": "AI-generated or heuristic description (1-2 sentences)",
"keywords": ["keyword1", "keyword2", ...],
"language": "en",
"content_hash": "b2ddd73c87e2af...",
"estimated_tokens": 1250,
"processed_at": "2025-11-04T15:18:47.280333"
}
Advantages:
- •Clean Separation: Content in
.md, metadata in.json - •Easy Processing: Parse JSON without markdown parsing
- •Flexible: Add metadata fields without touching content
- •Portable: Markdown files work anywhere without frontmatter issues
Content Cleaning
The script automatically removes:
- •Navigation menus (main nav, submenu controls)
- •Page headers and footers
- •Pagination controls (1|2|3, Weiter, Zurück, etc.)
- •Slider/carousel controls
- •Duplicate sections
- •Empty sections and excessive whitespace
- •Menu/navigation links
It preserves:
- •Main content text and headings
- •Content links relevant to the topic
- •Images within the main content
- •Structured data (lists, tables where present)
- •Proper markdown formatting
Output Structure
2-Phase Approach Output
crawled_site/
├── crawl_summary.json # Summary of crawl (total pages, urls, etc.)
├── index/
│ ├── raw.html # Original HTML from crawl
│ ├── raw.md # Original markdown from crawl4ai
│ ├── content.md # ✨ Clean content (after post-processing)
│ └── metadata.json # ✨ All metadata (enriched after post-processing)
├── about/
│ ├── raw.html
│ ├── raw.md
│ ├── content.md
│ └── metadata.json
└── contact/
├── raw.html
├── raw.md
├── content.md
└── metadata.json
File Roles:
- •
raw.html- Reference/debugging, contains original HTML - •
raw.md- Raw output from crawl4ai (before cleaning) - •
content.md- Final clean markdown (NO frontmatter!) - •
metadata.json- All metadata (title, description, keywords, etc.)
Legacy Single-Page Output
crawled_site/ └── index.md # Content + frontmatter (legacy mode)
Usage Workflow
Recommended: 2-Phase Workflow
Step 1: Bulk Crawl
# Crawl entire website python scripts/bulk_crawl.py https://example.com --max-depth 3 --output-dir ./mysite
Step 2: Post-Process
# Process all pages export ANTHROPIC_API_KEY=sk-ant-... # Optional, for AI metadata python scripts/postprocess.py ./mysite
Result: Clean content.md + enriched metadata.json for each page!
Quick Single-Page Extraction (Legacy)
# For single pages only python scripts/crawl_to_markdown.py https://example.com
Advanced Options
Control crawl depth:
python scripts/bulk_crawl.py https://example.com --max-depth 2 # Shallow crawl
JavaScript-heavy sites:
python scripts/bulk_crawl.py https://example.com --wait-time 10.0
Skip AI metadata generation:
python scripts/postprocess.py ./mysite --no-ai # Use heuristic methods
KI-gestützte Metadaten (Optional)
Der Skill kann Claude Haiku nutzen, um intelligente Descriptions und Keywords zu generieren.
Mit KI (Empfohlen):
export ANTHROPIC_API_KEY=sk-ant-...
Vorteile:
- •Intelligente, kontextbezogene Descriptions
- •Semantisch sinnvolle Keywords mit korrekter Großschreibung
- •Versteht den Inhalt und fasst ihn zusammen
Kosten: ~0.2 Cent pro Seite (Claude Haiku)
Ohne KI (Fallback):
Wenn kein API-Key gesetzt ist, nutzt der Skill automatisch heuristische Methoden:
- •Extrahiert erste substantielle Textzeile als Description
- •Ermittelt Keywords nach Häufigkeit
- •Funktioniert gut, aber weniger präzise
Der Skill zeigt beim Start an, welcher Modus verwendet wird:
- •
✨ KI-Metadaten-Generierung aktiviert (Claude Haiku)- KI wird genutzt - •
ℹ️ KI-Metadaten-Generierung nicht verfügbar- Fallback-Methoden werden genutzt
Installation Requirements
The crawl4ai library must be installed in a virtual environment. If the venv doesn't exist yet, create and install:
# Navigate to skill directory cd /Users/tobiasbrummer/.claude/skills/crawl4ai # Create virtual environment (if not exists) python3 -m venv venv # Activate and install crawl4ai source venv/bin/activate pip install crawl4ai # Install anthropic for KI-based metadata generation (optional but recommended) pip install anthropic # Install playwright browsers (one-time setup) playwright install chromium
Important: All scripts must be run using the venv's Python interpreter:
/Users/tobiasbrummer/.claude/skills/crawl4ai/venv/bin/python3 scripts/crawl_to_markdown.py <url>
See references/crawl4ai_reference.md for detailed API documentation.
Common Issues and Solutions
Issue: Pages taking too long
- •Reduce
--wait-timeif site doesn't need much JavaScript - •Increase
--wait-timeif content is missing (up to 10s for heavy JS sites)
Issue: Wrong content extracted
- •The script uses intelligent heuristics to find main content
- •If it picks the wrong element, the site may have unusual structure
- •Check the console output to see what selector was used
Issue: Missing content
- •Some sites may need longer
--wait-timefor JavaScript - •Try increasing from 5s to 8-10s for dynamic content
- •Check if site requires authentication (not supported)
Issue: Still seeing some navigation
- •The 3-stage cleaning is aggressive but may miss some edge cases
- •The script will continuously improve pattern matching
- •Most navigation is successfully removed (85-95% reduction typical)
Tips for Best Results
- •Default settings work well for most sites
- •Check token estimate in output to verify content size
- •Content hash in frontmatter allows easy change detection for re-crawls
- •Review the extracted content to ensure quality
- •Adjust wait-time for JavaScript-heavy sites
- •Use descriptive output directories when crawling multiple pages