What I do
- Turn a user's docs-scraping request into a reusable script and a generated docs snapshot.
- Standardize output under `docs/external/<source>/`, with one page-level `.ext.md` file per discovered page.
- Ensure each generated page has a metadata header and notes footer, and create an `index.ext.md` manifest.
- Prefer routing implementation through the dedicated `scraper` subagent.
Accepted request patterns
Use this skill when the request includes one or more of:
- A docs root URL (for example `https://example.dev/docs`)
- A source/domain name (for example `moonrepo`, `astro`, `pnpm`)
- A request to "scrape", "snapshot", "mirror", or "refresh" external docs
Output contract
- Script: `scripts/scrapes/scrape_<source>_docs.sh.ts`
- Output directory: `docs/external/<source>/`
- Output files:
  - Per-page files: `<stable-page-stem>.ext.md`
  - Index file: `index.ext.md`
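The contract above can be derived mechanically from the source key. The following is a minimal sketch, not the project's actual helper: the `outputContract` name and the "safe slug" pattern (lowercase letters, digits, hyphens) are assumptions made for illustration.

```typescript
// Hypothetical helper: derives the contract's paths from a source key.
interface OutputContract {
  scriptPath: string;
  outputDir: string;
  indexFile: string;
}

function outputContract(source: string): OutputContract {
  // Assumption: a "safe" source key is lowercase letters, digits, and hyphens.
  if (!/^[a-z0-9][a-z0-9-]*$/.test(source)) {
    throw new Error(`unsafe source key: ${source}`);
  }
  const outputDir = `docs/external/${source}/`;
  return {
    scriptPath: `scripts/scrapes/scrape_${source}_docs.sh.ts`,
    outputDir,
    indexFile: `${outputDir}index.ext.md`,
  };
}
```

Rejecting unsafe keys up front keeps a malformed request from writing outside `docs/external/`.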
Behavior requirements
- Resolve source metadata from the user request:
  - `source` key (safe folder/script slug)
  - docs root URL
  - discovery method (`sitemap.xml` preferred)
- Reuse project script conventions:
  - Bun shebang (`#!/usr/bin/env bun`)
  - `*.sh.ts` naming
  - helpers under `scripts/helpers/`
- Use resilient scraping strategy:
  - Primary: `r.jina.ai` markdown proxy
  - Fallback: direct HTML fetch + conversion to markdown
- Normalize filenames from docs paths:
  - deterministic flattening (for example `docs__guides__intro.ext.md`)
- Regenerate output cleanly:
  - remove old `*.ext.md` in target source directory
  - write fresh per-page files and `index.ext.md`
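Two of the requirements above, filename flattening and the proxy-then-direct fetch strategy, can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: `flattenDocsPath`, `capturePage`, and the `Fetcher` type are hypothetical names, the HTML-to-markdown converter is supplied by the caller, the proxy is assumed to accept the target URL appended to `https://r.jina.ai/`, and the `root` stem for an empty path is an arbitrary choice.

```typescript
// Deterministic flattening: "/docs/guides/intro" -> "docs__guides__intro.ext.md".
function flattenDocsPath(pageUrl: string): string {
  const segments = new URL(pageUrl).pathname
    .split("/")
    .map((segment) =>
      segment
        .replace(/\.(html?|md)$/i, "") // drop common page extensions (assumption)
        .replace(/[^A-Za-z0-9._-]+/g, "-"), // keep the stem filesystem-safe (assumption)
    )
    .filter((segment) => segment.length > 0);
  // Assumption: an empty path maps to the stem "root".
  const stem = segments.length > 0 ? segments.join("__") : "root";
  return `${stem}.ext.md`;
}

// Minimal response shape, so a fake fetcher can be injected in tests.
type Fetcher = (url: string) => Promise<{ ok: boolean; text(): Promise<string> }>;

async function capturePage(
  pageUrl: string,
  htmlToMarkdown: (html: string) => string, // hypothetical converter supplied by the caller
  fetcher: Fetcher = fetch,
): Promise<{ via: "proxy" | "direct"; markdown: string }> {
  try {
    // Primary: markdown proxy (assumed URL shape: proxy prefix + target URL).
    const proxied = await fetcher(`https://r.jina.ai/${pageUrl}`);
    if (proxied.ok) return { via: "proxy", markdown: await proxied.text() };
  } catch {
    // Network error on the proxy: fall through to the direct fetch.
  }
  // Fallback: direct HTML fetch, then conversion to markdown.
  const direct = await fetcher(pageUrl);
  if (!direct.ok) throw new Error(`failed to fetch ${pageUrl}`);
  return { via: "direct", markdown: htmlToMarkdown(await direct.text()) };
}
```

Injecting the fetcher keeps the fallback path testable without network access.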
Generated page format
- Top `----` section with summary metadata:
  - captured timestamp
  - source root
  - source page URL/path
  - keywords
  - concise summary
- Middle body with markdown content snapshot
- Bottom `----` notes/comments/lessons section
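One way to assemble a page in this shape is sketched below. The field labels (`captured`, `source_root`, and so on), the `Notes` heading, and the `renderPage` name are assumptions for illustration, and the `----` marker is treated as a literal delimiter line, exactly as written above.

```typescript
// Assumption: the "----" marker is a literal delimiter line.
const PAGE_DELIMITER = "----";

interface PageSnapshot {
  capturedAt: string; // ISO timestamp
  sourceRoot: string;
  pageUrl: string;
  keywords: string[];
  summary: string;
  markdown: string; // captured body
  notes: string[];
}

function renderPage(page: PageSnapshot): string {
  // Top section: summary metadata (label names are hypothetical).
  const header = [
    PAGE_DELIMITER,
    `captured: ${page.capturedAt}`,
    `source_root: ${page.sourceRoot}`,
    `source_page: ${page.pageUrl}`,
    `keywords: ${page.keywords.join(", ")}`,
    `summary: ${page.summary}`,
    PAGE_DELIMITER,
  ];
  // Bottom section: notes/comments/lessons.
  const footer = [PAGE_DELIMITER, "Notes", ...page.notes.map((note) => `- ${note}`)];
  return [...header, "", page.markdown.trim(), "", ...footer, ""].join("\n");
}
```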
Index format
- Top `----` section with source/capture metadata and summary totals
- Full page inventory with links to local page files and capture status
- Bottom `----` notes/comments/lessons section
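The index can be rendered from the per-page capture results. The sketch below is illustrative only: `renderIndex`, the `IndexEntry` shape, the metadata labels, and the inventory line format are assumptions, and `----` is again treated as a literal delimiter line.

```typescript
// Assumption: the "----" marker is a literal delimiter line.
const INDEX_DELIMITER = "----";

interface IndexEntry {
  file: string; // local page file, e.g. "docs__intro.ext.md"
  url: string; // source page URL
  status: "ok" | "failed";
}

function renderIndex(
  source: string,
  sourceRoot: string,
  capturedAt: string,
  entries: IndexEntry[],
  notes: string[],
): string {
  const okCount = entries.filter((entry) => entry.status === "ok").length;
  // Top section: source/capture metadata and summary totals.
  const header = [
    INDEX_DELIMITER,
    `source: ${source}`,
    `source_root: ${sourceRoot}`,
    `captured: ${capturedAt}`,
    `pages: ${entries.length}`,
    `ok: ${okCount}`,
    `failed: ${entries.length - okCount}`,
    INDEX_DELIMITER,
  ];
  // Inventory: one line per page, linking to the local file with its status.
  const inventory = entries.map((entry) => `- [${entry.url}](${entry.file}) (${entry.status})`);
  const footer = [INDEX_DELIMITER, "Notes", ...notes.map((note) => `- ${note}`)];
  return [...header, "", ...inventory, "", ...footer, ""].join("\n");
}
```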
Implementation workflow
- Inspect existing scraper scripts for reuse patterns (`scripts/scrapes/scrape_*.sh.ts`).
- Create or update `scripts/scrapes/scrape_<source>_docs.sh.ts`.
- Run the script once to generate docs output.
- Report totals (`pages`, `ok`, `failed`) and notable blocked pages.
- If a new script entrypoint was introduced, update `README.md` and `AGENTS.md`.
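The reporting step can be as small as a reducer over per-page outcomes. This is a sketch under assumptions: the `CaptureOutcome` shape, the `summarizeCapture` and `formatReport` names, and the report layout are hypothetical; only the `pages`, `ok`, and `failed` totals come from the workflow above.

```typescript
interface CaptureOutcome {
  url: string;
  ok: boolean;
  reason?: string; // e.g. "HTTP 403" for a blocked page
}

interface CaptureTotals {
  pages: number;
  ok: number;
  failed: number;
  blocked: string[];
}

function summarizeCapture(outcomes: CaptureOutcome[]): CaptureTotals {
  const failures = outcomes.filter((outcome) => !outcome.ok);
  return {
    pages: outcomes.length,
    ok: outcomes.length - failures.length,
    failed: failures.length,
    // Keep the failure reason next to the URL when one was recorded.
    blocked: failures.map((f) => (f.reason ? `${f.url} (${f.reason})` : f.url)),
  };
}

function formatReport(totals: CaptureTotals): string {
  const lines = [`pages=${totals.pages} ok=${totals.ok} failed=${totals.failed}`];
  for (const entry of totals.blocked) lines.push(`blocked: ${entry}`);
  return lines.join("\n");
}
```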
Safety constraints
- Never embed secrets, auth headers, or private tokens in script or output files.
- Skip private/authenticated docs pages unless explicit credentials handling is requested and safe.
- Keep scripts idempotent and deterministic where practical.
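Idempotent regeneration mostly comes down to removing only the previously generated `*.ext.md` files and writing the new set in a stable order. The sketch below is one possible shape, not the project's helper: `staleSnapshots` and `regenerate` are hypothetical names, and it assumes a flat output directory with no subfolders.

```typescript
import { mkdirSync, readdirSync, rmSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Pick out previously generated snapshot files; anything else is left alone.
function staleSnapshots(existing: string[]): string[] {
  return existing.filter((name) => name.endsWith(".ext.md")).sort();
}

// Remove old *.ext.md files, then write the fresh set in sorted order so that
// repeated runs with the same input leave the directory in the same state.
function regenerate(dir: string, files: Record<string, string>): void {
  mkdirSync(dir, { recursive: true });
  for (const name of staleSnapshots(readdirSync(dir))) {
    rmSync(join(dir, name));
  }
  for (const name of Object.keys(files).sort()) {
    writeFileSync(join(dir, name), files[name]);
  }
}
```

Filtering on the `.ext.md` suffix means hand-written files in the same directory survive a refresh.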
Return checklist
- Script path(s) added/updated
- Output directory generated/refreshed
- Discovery strategy used
- Capture totals and notable failures
- Exact rerun command (`bun scripts/scrapes/scrape_<source>_docs.sh.ts`)