
web-to-pdf

Scrape webpages with full Chinese support and convert them to formatted PDFs with a table of contents and internal links

SKILL.md
--- frontmatter
name: web-to-pdf
description: Scrape webpages with full Chinese support and convert to formatted PDF with table of contents and internal links
license: MIT
compatibility: opencode
metadata:
  audience: developers
  workflow: research
  languages: "English, Chinese, Multi-language"

What I do

  • Fetch and parse the root webpage
  • Extract all internal links from the page
  • Scrape content from each internal link (1 level deep)
  • Aggregate all content with proper formatting and structure
  • Generate a professional PDF with full Chinese and multi-language support including:
    • Cover page with URL, timestamp, and metadata
    • Table of contents with clickable links
    • Root page content
    • Linked pages content (organized alphabetically)
    • Reference section with all URLs
    • Clickable hyperlinks throughout
    • Native Chinese character rendering

When to use me

Use this skill when you need to:

  • Archive web documentation for offline access
  • Convert web content to a shareable PDF format
  • Compile research by gathering all related pages into one document
  • Back up content by saving website snapshots with internal navigation
  • Consolidate knowledge by combining multiple related web pages into a structured document
  • Process Chinese websites with full UTF-8 and Chinese character support
  • Handle multi-language content in English, Chinese, Japanese, Korean, and other languages

How to use me

Ask me to scrape a URL and convert it to PDF:

code
Use the web-to-pdf skill to scrape https://www.cnblogs.com/blog and create a PDF

Or with a custom output filename:

code
Use the web-to-pdf skill to scrape https://www.cnblogs.com/blog and save as blog-archive.pdf

What I need

  1. Target URL (required) - The webpage you want to scrape

    • Must be a valid HTTP/HTTPS URL
    • Supports Chinese domain names (IDN)
    • Internal links extracted from this page will be followed
  2. Output filename (optional) - Name for the generated PDF

    • Default: {domain}_{timestamp}.pdf
    • Example: cnblogs.com_20260215_220315.pdf
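
As a rough illustration of the default naming rule, the sketch below builds a `{domain}_{timestamp}.pdf` name from a URL. The helper name `default_pdf_name`, the stripping of a leading `www.`, the timestamp layout (modeled on the `cnblogs.com_20260215_220315.pdf` name shown in the example workflow), and the character whitelist are all assumptions, not the skill's confirmed behavior.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def default_pdf_name(url, now=None):
    """Build a '{domain}_{timestamp}.pdf' name (hypothetical helper)."""
    now = now or datetime.now()
    host = urlparse(url).hostname or "page"
    if host.startswith("www."):
        host = host[4:]
    # Assumed whitelist: word characters (including CJK), dots and hyphens
    # are kept; anything else is replaced with an underscore.
    safe_host = re.sub(r"[^\w.-]", "_", host)
    return f"{safe_host}_{now:%Y%m%d_%H%M%S}.pdf"


print(default_pdf_name("https://www.cnblogs.com/blog",
                       datetime(2026, 2, 15, 22, 3, 15)))
# prints: cnblogs.com_20260215_220315.pdf
```

Passing `now` explicitly keeps the function deterministic, which makes the naming rule easy to check.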

What I provide

  1. Dependency setup - Automatically installs required packages:

    • requests - HTTP client for fetching pages
    • beautifulsoup4 - HTML parsing and link extraction
    • reportlab - PDF generation with full Unicode/CJK support
    • lxml - XML/HTML processing backend
    • Pillow - Image processing and optimization
  2. Web scraping with:

    • Automatic user-agent headers (respect robots.txt guidelines)
    • Connection timeout handling (30 seconds per page, 10 seconds per image)
    • Delay between requests to avoid overwhelming servers (0.5 seconds)
    • Clean HTML content extraction (removes scripts, styles, ads, navigation)
    • Proper UTF-8 and charset encoding detection (automatic)
    • Full Chinese character support
    • Complete image extraction from all pages
  3. Internal link detection:

    • Filters out external links
    • Removes duplicate URLs
    • Handles relative URL resolution
    • Skips anchors and special links (#, javascript:, mailto:, etc.)
    • Respects domain boundaries
    • Supports Chinese domain names
    • Extracts and processes images from each linked page
  4. Image handling with:

    • Extracts images from root page AND all first-level linked pages
    • Downloads and embeds images directly into PDF
    • Automatic image scaling and optimization
    • Size limits (5MB per image) to keep PDF manageable
    • Configurable maximum images per page
    • Handles multiple image formats (JPEG, PNG, WebP, GIF)
    • Graceful fallback if images fail to download
    • Maintains image aspect ratios in PDF
  5. PDF generation with:

    • Full Unicode and CJK font support
    • Professional formatting with margins and spacing
    • Clickable table of contents with page numbers
    • Clickable hyperlinks throughout document
    • Page headers and footers
    • Proper typography and readability
    • Native rendering of Chinese, Japanese, and Korean characters
    • All images embedded and visible in PDF
    • Multi-language support
    • Cover page with metadata
    • Reference section with all source URLs
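
The internal link detection rules listed above (relative URL resolution, de-duplication, skipping anchors and special schemes, staying on the same domain) can be sketched as follows. This is a minimal illustration, not the skill's actual code: it uses the standard library's `html.parser` instead of beautifulsoup4 so that it runs without third-party packages, the function name `extract_internal_links` is hypothetical, and "internal" is assumed to mean an exact host match.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_internal_links(html, base_url):
    """Return sorted, de-duplicated same-host links found in html."""
    collector = _HrefCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    root, _ = urldefrag(base_url)
    links = set()
    for href in collector.hrefs:
        href = href.strip()
        if href.startswith("#"):
            continue  # pure in-page anchor
        # Resolve relative URLs, then drop any #fragment.
        absolute, _ = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # drops javascript:, mailto:, tel:, ...
        if parsed.netloc.lower() != base_host:
            continue  # external domain
        if absolute == root:
            continue  # link back to the root page itself
        links.add(absolute)
    return sorted(links)
```

With an exact host match, `www.example.com` and `example.com` count as different sites; a real implementation may want a looser comparison.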

Important notes

  • One level deep: Only follows links directly present on the root page
  • Internal only: Ignores links to external domains
  • Timeouts: 30 seconds per page, 10 seconds per image (adjustable)
  • Image extraction: Automatically extracts and embeds images from the root page and first-level linked pages, subject to the per-page and per-image limits
  • Request delays: 0.5 seconds between requests to be respectful to servers
  • Performance: Large sites with many images may take several minutes to process
  • Robots.txt: Follow the target site's robots.txt guidelines and keep crawl delays reasonable
  • Error handling: Continues processing if individual pages or images fail
  • Image limits: Up to 20 images per page by default, 5MB size limit per image
  • PDF size: Output depends on content volume (typically 1-100 MB with images)
  • Chinese support: Tested with Chinese-language pages on sites such as cnblogs.com and github.com
  • Encoding: Automatically detects and handles UTF-8, GBK, and other common encodings
  • Image formats: Supports JPEG, PNG, WebP, GIF and automatically optimizes sizing
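
To make the image limits concrete, here is a small sketch of how the 5MB size cap, the 20-images-per-page default, and aspect-ratio-preserving scaling could be expressed. The function names, the `(url, size_in_bytes)` input shape, and the reading of 5MB as 5 × 1024 × 1024 bytes are assumptions for illustration only.

```python
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # 5MB per image (binary megabytes assumed)
MAX_IMAGES_PER_PAGE = 20           # default per-page cap from the notes


def select_images(candidates, max_images=MAX_IMAGES_PER_PAGE,
                  max_bytes=MAX_IMAGE_BYTES):
    """Keep the first max_images (url, size_in_bytes) pairs within the size cap."""
    kept = [(url, size) for url, size in candidates if size <= max_bytes]
    return kept[:max_images]


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit a box, preserving aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)  # never upscale
    return round(width * scale), round(height * scale)


print(fit_within(2000, 1000, 400, 600))
# prints: (400, 200)
```

Oversized images are dropped before the per-page cap is applied, so a page with a few huge images can still contribute up to the full number of smaller ones.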

Error handling

The skill gracefully handles:

  • Unreachable URLs (404, 5xx errors)
  • Network timeouts
  • Malformed HTML
  • Missing or inaccessible internal links
  • Images that fail to download (skips them and continues)
  • Broken image links (gracefully skipped, no PDF corruption)
  • Encoding issues (handled with proper UTF-8 detection)
  • Chinese character encoding (known issues resolved)
  • Invalid characters in filenames
  • Multi-language content mixed in single page
  • Oversized images (automatically skipped if > 5MB)
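
One common way to handle mixed UTF-8 and GBK pages is to try a short list of candidate encodings in order. The sketch below assumes that approach; the function name `decode_html`, the candidate order, and the use of GB18030 (a superset of GBK) as the fallback are illustrative choices, and the skill may detect encodings differently.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes, trying the declared charset, then UTF-8, then GB18030."""
    for encoding in (declared, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode("utf-8", errors="replace")
```

Here `declared` would typically come from the HTTP `Content-Type` header or a `<meta charset>` tag; an unknown charset name is treated the same as a failed decode.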

Implementation approach

python
1. Validate and normalize input URL
2. Install/verify required dependencies (requests, bs4, reportlab, Pillow, lxml)
3. Fetch root page with error handling and encoding detection
4. Parse HTML and extract internal links
5. For each page (root + internal links):
   - Fetch page content with proper charset detection
   - Parse and clean HTML while preserving Unicode text
   - Extract all images from page and download them
   - Store images in memory with unique keys
   - Extract readable text with UTF-8 encoding
6. Generate PDF document with reportlab:
   - Create professional cover page with metadata
   - Add table of contents with proper font rendering
   - For each page: title, URL, images, and content text
   - Add reference section with all clickable URLs
   - Save to file with full Unicode and image support
7. Return PDF file path and summary with image count
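
Step 5 of the outline above, together with the request delay and the continue-on-error behavior, can be sketched as a small loop. The function `crawl` and its parameters are hypothetical; `fetch` stands in for whatever downloads a page (for example, a wrapper around `requests.get(url, timeout=30)`), and is injected so the loop itself has no network dependency.

```python
import time


def crawl(root_url, links, fetch, delay=0.5, sleep=time.sleep):
    """Fetch root_url and then each link; return (pages, failures).

    fetch is any callable that takes a URL and returns page text
    (an assumption of this sketch, not the skill's actual interface).
    """
    pages, failures = {}, {}
    for index, url in enumerate([root_url, *links]):
        if index:
            sleep(delay)  # pause between requests, not before the first one
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # keep going if a single page fails
            failures[url] = str(exc)
    return pages, failures
```

Returning the failures alongside the pages lets the final summary report which URLs were skipped instead of silently dropping them.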

Example workflow with Chinese website

User input:

code
Use the web-to-pdf skill to archive https://www.cnblogs.com/rossiXYZ/p/18785601

Skill execution:

  1. Fetches https://www.cnblogs.com/rossiXYZ/p/18785601
  2. Automatically detects UTF-8 encoding
  3. Extracts Chinese title: "探秘Transformer系列之文章列表"
  4. Finds internal links with Chinese text
  5. Scrapes all pages preserving Chinese characters
  6. Generates PDF with:
    • Cover: Title with Chinese characters rendered perfectly
    • TOC: 21 entries with Chinese titles like "注意力机制", "位置编码"
    • Pages: Full Chinese content with proper formatting
    • References: All URLs as clickable links
  7. Saves to cnblogs.com_20260215_220315.pdf
  8. Output: 0.10 MB PDF with perfectly rendered Chinese text

Output example

The generated PDF includes:

code
┌─────────────────────────────────────┐
│   网页内容存档                      │
│   Archived Web Content              │
│   源URL: https://www.cnblogs.com... │
│   存档时间: 2026-02-15 22:03:45 UTC │
│   总页数: 21                        │
└─────────────────────────────────────┘

目录 (TABLE OF CONTENTS)
1. 探秘Transformer系列之文章列表
2. 注意力机制 (Attention Mechanism)
3. 总体架构 (Overall Architecture)
...
21. 残差网络和归一化

Troubleshooting

If images don't appear in PDF:

  • Images from all pages (root and linked pages) are now included
  • Check that the website images are accessible and not behind authentication
  • Very large images (>5MB) are automatically skipped
  • Some dynamically-loaded images may not be captured (static HTML only)
  • Verify PDF viewer supports image display

If Chinese characters don't display:

  • The PDF is generated with reportlab, which has native CJK support
  • Chinese characters should display correctly
  • If problems persist, check that your PDF viewer can render CJK fonts

If pages don't load:

  • Check URL is accessible in your browser
  • Verify network connectivity
  • Check for authentication requirements (not supported)
  • Some JavaScript-heavy sites may not scrape well

If PDF is incomplete:

  • Site may use JavaScript for content loading
  • Skill fetches static HTML only
  • Some dynamically-loaded content may not appear
  • Check network connectivity and timeouts

If PDF is very large:

  • Many images are included (expected with image extraction)
  • Consider reducing the max_images_per_page parameter
  • Very large sites may take several minutes to process

If file path is very long:

  • Windows limits paths to 260 characters by default
  • Skill shortens filenames if needed
  • Check output message for actual filename used
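
A filename can be shortened to fit the path limit by trimming only the stem and keeping the extension. The sketch below assumes that strategy; the helper `shorten_filename` is hypothetical and expects `directory` to be an absolute path without a trailing separator.

```python
import os


def shorten_filename(directory, filename, max_path=260):
    """Trim the file stem so directory + separator + filename fits the limit."""
    stem, ext = os.path.splitext(filename)
    # The 260-character limit counts a terminating NUL, so one character
    # is kept spare; another is reserved for the path separator.
    budget = max_path - 1 - len(directory) - 1 - len(ext)
    if budget < 1:
        raise ValueError("output directory is too long for any filename")
    return stem[:budget] + ext
```

Names that already fit are returned unchanged, so the helper can be applied to every output path.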

Dependencies to be installed

bash
pip install requests beautifulsoup4 reportlab lxml Pillow

These will be automatically installed if not present.
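
One plausible way to implement the automatic check is to look for each module before calling pip. The sketch below is an assumption about how that could work, not the skill's actual installer; note that the import names `bs4` and `PIL` differ from the pip package names `beautifulsoup4` and `Pillow`.

```python
import importlib.util
import subprocess
import sys

# import name -> pip package name
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "reportlab": "reportlab",
    "lxml": "lxml",
    "PIL": "Pillow",
}


def missing_packages(required=REQUIRED):
    """Return pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


def install(packages):
    """Install packages with the current interpreter's pip (no-op if empty)."""
    if packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *packages])
```

Calling `install(missing_packages())` installs only what is absent, and using `sys.executable -m pip` targets the same environment the script is running in.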

Recent updates

Image extraction now fully implemented:

  • Images from all pages are now extracted and embedded
  • Root page images included
  • All first-level linked page images included (previously missing)
  • Automatic image downloading and optimization
  • Image size limits to keep PDFs manageable (5MB per image)
  • Graceful handling of broken image links
  • Proper aspect ratio maintenance

Chinese support fully implemented:

  • Using reportlab for native CJK rendering
  • Full UTF-8 encoding support
  • Automatic charset detection
  • Tested with cnblogs.com and other Chinese websites
  • All Chinese characters display perfectly in output PDF

PDF Quality improvements:

  • Professional cover page with metadata
  • Clickable table of contents
  • Improved typography and readability
  • Reference section with all clickable URLs
  • Better page organization