Web Scraper Skill Guide
Description
The Web Scraper skill fetches a web page, strips UI noise, and converts the remaining content to Markdown. The output is suited to LLM context, documentation, or offline reading.
When to Use the Skill
- •Save a URL as context or documentation: Extract and store site content as Markdown.
- •Summarize HTML pages: Clean and process noisy or unstructured data.
- •Merge pages for context-building: Combine multiple sources into a coherent Markdown file.
Usage Guide
Fetch and Clean a URL
bash
python3 scripts/extract.py <url>
This script performs the following tasks:
- •Downloads the HTML from the given URL.
- •Removes noise (ads, buttons, UI elements).
- •Converts the remaining semantic content to Markdown and writes it to STDOUT.
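The noise-removal and conversion steps above can be illustrated with a minimal sketch. This is not the code of scripts/extract.py, whose internals are not shown in this guide. The names `NOISE_TAGS`, `BLOCK_PREFIX`, `MarkdownSketch`, and `html_to_markdown` are hypothetical, and the list of noise tags is an assumption. Only the standard-library `html.parser` module is used, and the download step is left out.

```python
from html.parser import HTMLParser

# Assumed list of elements treated as UI noise; the real script may differ.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "form", "button"}

# Block-level tags this sketch keeps, with the Markdown prefix for each.
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownSketch(HTMLParser):
    """Collect headings, paragraphs, and list items; skip noise elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a noise element
        self.blocks = []     # finished Markdown blocks
        self.prefix = None   # prefix of the block being collected, if any
        self.text = []       # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self._flush()
            self.prefix = BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self._flush()

    def handle_data(self, data):
        # Text outside a kept block is dropped (a simplification).
        if self.skip_depth == 0 and self.prefix is not None:
            self.text.append(data)

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        content = " ".join("".join(self.text).split())
        if content and self.prefix is not None:
            self.blocks.append(self.prefix + content)
        self.prefix = None
        self.text = []


def html_to_markdown(html):
    """Return the kept blocks of an HTML string as Markdown text."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)
```

Tracking a depth counter rather than a boolean lets nested noise elements (a button inside a nav, for example) close in order without re-enabling output too early.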
Save Web Documentation as Markdown
bash
python3 scripts/extract.py "<url>" > docs/article.md
- •Redirects extracted content to a .md file.
- •Use >> to append to an existing Markdown file.
Batch Processing
For processing lists of URLs:
bash
while read -r url; do python3 scripts/extract.py "$url" >> knowledge-base.md; done < urls.txt
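The same batch workflow could also be driven from Python, which makes it easier to skip failed pages and to mark where each page begins in the merged file. The sketch below assumes the extractor is invoked as `python3 scripts/extract.py <url>` and prints Markdown to STDOUT, as described in this guide. The helper names `parse_url_list` and `extract_all`, the `<!-- source: ... -->` marker, and the rule that lines starting with `#` in the URL list are comments are all assumptions made for illustration.

```python
import subprocess
import sys


def parse_url_list(text):
    """Return the non-empty lines of a URL list, skipping '#' comment lines."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def extract_all(urls, command=("python3", "scripts/extract.py")):
    """Run the extractor once per URL and join the outputs under source markers."""
    sections = []
    for url in urls:
        result = subprocess.run([*command, url], capture_output=True, text=True)
        if result.returncode != 0:
            # Report the failure and continue instead of aborting the batch.
            print(f"skipped {url}: exit code {result.returncode}", file=sys.stderr)
            continue
        sections.append(f"<!-- source: {url} -->\n\n{result.stdout.strip()}\n")
    return "\n".join(sections)
```

A possible usage is `extract_all(parse_url_list(open("urls.txt").read()))`, with the returned string written to knowledge-base.md.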
Inputs and Outputs
Inputs
- •URL: Fully qualified link to the webpage (e.g., https://example.com).
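A quick check that an input is fully qualified can catch mistakes before the script is run. The following is a minimal sketch using the standard-library `urllib.parse` module. The function name `is_fully_qualified` is hypothetical, and restricting the scheme to `http` and `https` is an assumption, since this guide does not say which schemes the script accepts.

```python
from urllib.parse import urlsplit


def is_fully_qualified(url):
    """Return True if the URL has an http(s) scheme and a host part."""
    parts = urlsplit(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

For example, `is_fully_qualified("https://example.com")` is true, while a bare `example.com` or a relative path such as `/docs/page` is rejected.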
Outputs
- •Result Markdown: Noise-free semantic Markdown.
Best Practices
- •Always give saved Markdown files clear, descriptive names (e.g., pandas-summary.md).
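One way to keep file names consistent is to derive them from the URL. The sketch below is only an illustration of that idea. The function `markdown_filename` and its slug rules (lowercase, non-alphanumeric runs replaced by hyphens, optional suffix) are assumptions and not part of the skill.

```python
import re
from urllib.parse import urlsplit


def markdown_filename(url, suffix=""):
    """Build a readable .md file name from a URL's host and path."""
    parts = urlsplit(url)
    raw = f"{parts.netloc}{parts.path}".lower()
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-") or "page"
    if suffix:
        slug = f"{slug}-{suffix}"
    return f"{slug}.md"
```

With these rules, `markdown_filename("https://example.com/docs/user_guide/", "summary")` yields `example-com-docs-user-guide-summary.md`.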
Version History
| Version | Date | Notes |
|---|---|---|
| 1.1.0 | 2026-02-09 | Major reorganization aligned to skill standards |
You are an expert at gathering external information. Your goal is to convert messy HTML from the web into clean, semantic Markdown that is easy for LLMs to read.
⚡ Capabilities
1. Fetch & Clean URL
Use this when the user sends a link and says "read this", "save this as context", or "summarize this page".
- •Script: python3 scripts/extract.py <url>
- •Behavior:
- •Downloads the HTML.
- •Removes UI noise (navbars, ads, buttons).
- •Converts semantic content to Markdown.
- •Outputs to STDOUT.
2. Save Context (Common Workflow)
If the user asks to "save" a URL as documentation:
- •Run the script and redirect output to a file in the user's project.
bash
python3 scripts/extract.py "https://example.com/docs" > ./docs/topic.md
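Shell redirection fails if the target directory (here ./docs) does not exist yet. When the save step is driven from Python instead, the directory can be created first. The helper below is a minimal sketch under that assumption. The name `save_markdown` and its `append` flag are hypothetical, and the Markdown text is assumed to have been captured from the script's STDOUT already.

```python
from pathlib import Path


def save_markdown(markdown, destination, append=False):
    """Write Markdown to a file, creating parent directories as needed."""
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    mode = "a" if append else "w"  # "a" mirrors the shell's >> operator
    with path.open(mode, encoding="utf-8") as handle:
        # Make sure the file ends with a newline so appended pages stay separate.
        handle.write(markdown if markdown.endswith("\n") else markdown + "\n")
    return path
```

Calling it with `append=True` corresponds to the append workflow used when building one large context file from several pages.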
🧠 Best Practices
- •Documentation: When fetching docs, prefer naming the file with a specific suffix (e.g., libs/pandas-guide.md).
- •Append Mode: If building a large context file from multiple pages, use >> to append.