Blog Scraper Skill
Overview
This skill fetches blog articles from tech-lab.sios.jp/archives/*, compresses the HTML content by removing unnecessary attributes and whitespace, and saves the result to the docs/data/ directory with metadata.
When to Use
- •User requests to fetch a specific blog article
- •User wants to update existing cached articles
- •User needs to scrape multiple articles for analysis or documentation
Usage
Single Article
URL=https://tech-lab.sios.jp/archives/[article-id] npm run scraper
Example:
URL=https://tech-lab.sios.jp/archives/48397 npm run scraper
Multiple Articles
For multiple articles, run the command sequentially for each URL.
Output
The scraper will:
- •Fetch and parse the HTML from the specified URL
- •Extract content using the CSS selector section.entry-content
- •Compress by removing:
  - •Scripts, styles, and noscript tags
  - •Class, ID, and style attributes
  - •Whitespace between tags
- •Preserve:
  - •Image alt text as [画像: alt] ("画像" is Japanese for "image")
  - •Image src URLs
  - •Link href attributes
- •Add metadata as HTML comment:
  - •OGP title
  - •Source URL
  - •OGP image URL
  - •Extraction timestamp
- •Save to docs/data/tech-lab-sios-jp-archives-[id].html
- •Report compression statistics:
  - •Token count reduction (estimated for Claude)
  - •Compression ratio percentages
  - •File size
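The removal steps listed above can be illustrated with a minimal sketch. The real implementation uses cheerio; this version swaps in plain regular expressions so that it runs without dependencies, and the function name compressHtml is hypothetical. Regexes cannot tell markup from text, so treat this as an illustration of the three removal rules rather than a replacement for a parser.

```typescript
// Hypothetical sketch: a regex-based stand-in for the cheerio-based
// compression in scraper.ts. It covers only the three removal rules.
function compressHtml(html: string): string {
  return html
    // 1. Drop <script>, <style>, and <noscript> elements with their content.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // 2. Drop class, id, and style attributes (quoted or unquoted values).
    //    The leading \s+ keeps attributes such as data-id untouched.
    .replace(/\s+(?:class|id|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, "")
    // 3. Collapse whitespace that sits between two tags.
    .replace(/>\s+</g, "><")
    .trim();
}

const sample =
  '<div class="a" id="b"><script>x()</script> <p style="c">Hi</p>\n  <a href="/x">L</a></div>';
console.log(compressHtml(sample));
```

Link href attributes pass through unchanged because only class, id, and style are matched.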
Cache Behavior
- •If the target HTML file already exists in docs/data/, the scraper skips fetching and reports the existing file size
- •To re-fetch, delete the existing HTML file first
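A minimal sketch of the cache rule above, assuming article IDs are numeric (as in the example URL) and that the output filename is derived directly from the ID. The names ARTICLE_URL, outputPathFor, and planFetch are hypothetical and are not taken from scraper.ts.

```typescript
// Hypothetical helpers. The numeric-ID pattern is an assumption based on
// the example URL https://tech-lab.sios.jp/archives/48397.
const ARTICLE_URL = /^https:\/\/tech-lab\.sios\.jp\/archives\/(\d+)\/?$/;

// Map an article URL to the documented output path, rejecting other URLs.
function outputPathFor(articleUrl: string): string {
  const match = ARTICLE_URL.exec(articleUrl);
  if (match === null) {
    throw new Error(`Not a tech-lab.sios.jp article URL: ${articleUrl}`);
  }
  return `docs/data/tech-lab-sios-jp-archives-${match[1]}.html`;
}

interface FetchPlan {
  action: "skip" | "fetch";
  path: string;
}

// `exists` is injected so the decision can be checked without touching disk.
function planFetch(
  articleUrl: string,
  exists: (path: string) => boolean
): FetchPlan {
  const path = outputPathFor(articleUrl);
  return exists(path) ? { action: "skip", path } : { action: "fetch", path };
}

const plan = planFetch("https://tech-lab.sios.jp/archives/48397", () => true);
console.log(plan.action, plan.path);
```

In a Node.js script, existsSync from node:fs could be passed as the exists argument. Rejecting URLs outside tech-lab.sios.jp/archives/ also narrows what the scraper will fetch.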
Token Estimation
The scraper estimates Claude token usage for Japanese content:
- •Hiragana/Katakana: ~1.5 chars/token
- •Kanji: ~1 char/token
- •ASCII: ~4 chars/token
- •Other: ~2 chars/token
Typical compression achieves 60-85% token reduction.
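The ratios above can be turned into a rough estimator. This is a sketch under assumptions: the Unicode ranges used to classify characters and the names estimateTokens and reductionPercent are hypothetical, and the real scraper may bucket characters differently.

```typescript
// Hypothetical estimator built from the chars-per-token ratios listed above.
// The Unicode ranges are an assumption about how characters are classified.
function estimateTokens(text: string): number {
  let kana = 0;
  let kanji = 0;
  let ascii = 0;
  let other = 0;
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    if (code >= 0x3040 && code <= 0x30ff) {
      kana++; // Hiragana (U+3040-309F) and Katakana (U+30A0-30FF)
    } else if (code >= 0x4e00 && code <= 0x9fff) {
      kanji++; // CJK Unified Ideographs
    } else if (code <= 0x7f) {
      ascii++;
    } else {
      other++; // everything else, including characters outside the BMP
    }
  }
  // Divide each count by its chars-per-token ratio, then round up.
  return Math.ceil(kana / 1.5 + kanji / 1 + ascii / 4 + other / 2);
}

// Percentage of tokens removed by compression (0 when there was nothing to remove).
function reductionPercent(before: number, after: number): number {
  return before === 0 ? 0 : (1 - after / before) * 100;
}

console.log(estimateTokens("漢字abcd")); // 2 kanji + 4 ASCII chars -> 3
console.log(reductionPercent(1000, 250)); // 75
```

Counting characters per class first and dividing once keeps the arithmetic free of accumulated floating-point error before rounding.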
Implementation Details
See application/tools/scraper.ts for the TypeScript implementation using:
- •node-fetch for HTTP requests
- •cheerio for HTML parsing
- •OGP metadata extraction
- •Custom token estimation for Japanese text
Permissions Required
This skill requires the following permissions in .claude/settings.local.json:
{
  "permissions": {
    "allow": [
      "Bash(npm run scraper:*)",
      "Bash(URL=:*)"
    ]
  }
}
Note: The Bash(URL=:*) permission uses prefix matching, so it allows any command that starts with URL=, regardless of the value or what follows it. This is a broad permission; consider restricting it to specific domains if security is a concern.