
jb-docs-scraper

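Scrape documentation websites into local Markdown files for AI context. Provide a base URL to crawl the documentation and store the results in ./docs (or a custom path). Uses crawl4ai with breadth-first (BFS) deep crawling.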

SKILL.md
--- frontmatter
name: jb-docs-scraper
description: Scrape documentation websites into local markdown files for AI context. Takes a base URL and crawls the documentation, storing results in ./docs (or custom path). Uses crawl4ai with BFS deep crawling.

Documentation Scraper

Scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.

Quick Start

bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>

# Examples
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind

Output goes to ./docs/<auto-detected-name>/ by default.

Prerequisites (First Time Only)

bash
uv run --with crawl4ai playwright install

Usage

bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]

Options

| Option | Description | Default |
|---|---|---|
| -o, --output PATH | Output directory | ./docs/<auto-detected-name> |
| --max-depth N | Maximum link depth | 6 |
| --max-pages N | Maximum pages to scrape | 500 |
| --url-pattern PATTERN | URL filter (glob) | Auto-detected |
| -q, --quiet | Suppress verbose output | False |

Examples

bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://docs.rombo.co/tailwind \
  --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://tanstack.com/start/latest/docs/framework/react/overview \
  --max-pages 50 \
  --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://example.com/docs/api/v2/ \
  --url-pattern "*api/v2/*"
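
To preview which URLs a glob such as "*api/v2/*" would keep before starting a crawl, a small check along these lines may help. It is a minimal sketch that assumes shell-style glob semantics as implemented by Python's fnmatch module; the script's actual matching rules may differ, and the helper name and candidate URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_pattern(url: str, pattern: str) -> bool:
    """Return True when the URL matches a shell-style glob pattern.

    Assumption: --url-pattern behaves like an fnmatch glob, where "*"
    matches any run of characters, including "/".
    """
    return fnmatchcase(url, pattern)


# Hypothetical URLs a crawler might discover on the example site.
candidates = [
    "https://example.com/docs/api/v2/",
    "https://example.com/docs/api/v2/users",
    "https://example.com/docs/api/v1/users",
    "https://example.com/blog/api-v2-release",
]

# Only URLs containing the literal segment "api/v2/" survive the filter.
kept = [url for url in candidates if matches_pattern(url, "*api/v2/*")]

for url in kept:
    print(url)
```

If the list of kept URLs is empty for pages you expected to scrape, the pattern is probably too strict for the site's real URL layout.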

How It Works

  1. Auto-detects domain and URL pattern from the input URL
  2. Crawls using BFS (breadth-first search) strategy
  3. Filters to stay within the documentation section
  4. Converts pages to clean markdown
  5. Saves with directory structure mirroring the URL paths
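
The crawl and filter steps above can be sketched without any network access. The example below is an illustration of breadth-first traversal with depth and page limits, not the script's own code: the link graph, the function name, and the use of fnmatch for the URL filter are all assumptions made for the sketch.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical in-memory link graph standing in for real pages and their links.
LINKS = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/docs/b",
        "https://example.com/blog/",
    ],
    "https://example.com/docs/a": [
        "https://example.com/docs/a/deep",
        "https://example.com/docs/",
    ],
    "https://example.com/docs/b": [],
    "https://example.com/docs/a/deep": [],
}


def bfs_crawl(start: str, pattern: str, max_depth: int, max_pages: int) -> list[str]:
    """Visit pages level by level, honoring the depth and page limits."""
    visited: list[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            # Skip already-queued URLs and anything outside the docs section.
            if link not in seen and fnmatchcase(link, pattern):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = bfs_crawl("https://example.com/docs/", "*example.com/docs/*", max_depth=1, max_pages=10)
print(pages)
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is why lowering --max-depth trims the deepest pages first and lowering --max-pages cuts the crawl off in discovery order.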

Output Structure

code
docs/<name>/
  index.md           # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
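
One way a layout like the tree above could be derived from page URLs is sketched below. The mapping rules (root page becomes index.md, every other page takes its path relative to the base URL plus a .md suffix) are assumptions inferred from the tree, and the function name is hypothetical; the script may handle edge cases differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(page_url: str, base_url: str) -> PurePosixPath:
    """Map a page URL to a markdown path relative to the output directory."""
    base_parts = [p for p in urlparse(base_url).path.split("/") if p]
    page_parts = [p for p in urlparse(page_url).path.split("/") if p]

    # Drop the segments shared with the base URL so paths start at the docs root.
    if page_parts[: len(base_parts)] == base_parts:
        page_parts = page_parts[len(base_parts):]

    if not page_parts:
        return PurePosixPath("index.md")  # the root page itself
    return PurePosixPath(*page_parts[:-1], page_parts[-1] + ".md")


base = "https://example.com/docs/"
for page in (
    "https://example.com/docs/",
    "https://example.com/docs/getting-started/",
    "https://example.com/docs/api/overview",
):
    print(url_to_relative_path(page, base))
```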

Troubleshooting

| Issue | Solution |
|---|---|
| Playwright browser binaries are missing | Run uv run --with crawl4ai playwright install |
| Empty output | Check that the URL pattern matches the actual doc URLs; if not, pass an explicit --url-pattern |
| Missing pages | Increase --max-depth or --max-pages |
| Wrong pages scraped | Use a stricter --url-pattern |

Tips

  1. Test first - Use --max-pages 10 to verify the configuration before a full crawl
  2. Check output name - The script auto-detects it from the URL path segments
  3. Safe to rerun - Files are overwritten and duplicates are skipped
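
To anticipate the auto-detected output name, a rough approximation is to join the URL path segments with underscores, which reproduces the documentation_v3 name shown in the examples. This is a sketch inferred from that single example; the function name and the hostname fallback for URLs without a path are assumptions, not the script's confirmed behavior.

```python
from urllib.parse import urlparse


def detect_output_name(url: str) -> str:
    """Guess the output directory name from the URL path segments."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if segments:
        return "_".join(segments)
    # Assumption: with no path segments, fall back to the hostname.
    return (parsed.hostname or "docs").replace(".", "_")


print(detect_output_name("https://mediasoup.org/documentation/v3/"))
print(detect_output_name("https://docs.rombo.co/tailwind"))
```

When the guessed name is unwieldy, passing --output sidesteps auto-detection entirely.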