What I do
- •Fetch and parse the root webpage
- •Extract all internal links from the page
- •Scrape content from each internal link (1 level deep)
- •Aggregate all content with proper formatting and structure
- •Generate a professional PDF with full Chinese and multi-language support, including:
  - •Cover page with URL, timestamp, and metadata
  - •Table of contents with clickable links
  - •Root page content
  - •Linked pages content (organized alphabetically)
  - •Reference section with all URLs
  - •Clickable hyperlinks throughout
  - •Native Chinese character rendering
When to use me
Use this skill when you need to:
- •Archive web documentation for offline access
- •Convert web content to a shareable PDF
- •Research compilation - gather all related pages into one document
- •Content backup - save website snapshots with internal navigation
- •Knowledge consolidation - combine multiple related web pages into one structured document
- •Chinese websites - Full UTF-8 and Chinese character support
- •Multi-language content - Works with English, Chinese, Japanese, Korean, and other languages
How to use me
Ask me to scrape a URL and convert to PDF:
Use the web-to-pdf skill to scrape https://www.cnblogs.com/blog and create a PDF
Or with custom output filename:
Use the web-to-pdf skill to scrape https://www.cnblogs.com/blog and save as blog-archive.pdf
What I need
- •Target URL (required) - The webpage you want to scrape
  - •Must be a valid HTTP/HTTPS URL
  - •Supports Chinese domain names (IDN)
  - •Internal links extracted from this page will be followed
- •Output filename (optional) - Name for the generated PDF
  - •Default: {domain}_{timestamp}.pdf
  - •Example: cnblogs_2026-02-15_220315.pdf
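As a rough illustration of the `{domain}_{timestamp}.pdf` default, the sketch below derives a name from the target URL. The function name `default_pdf_name`, the stripping of a leading `www.`, and the `%Y%m%d_%H%M%S` timestamp layout are assumptions made for this example; the skill's actual naming rule may differ.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def default_pdf_name(url: str, now: Optional[datetime] = None) -> str:
    """Build a '{domain}_{timestamp}.pdf' name (hypothetical helper)."""
    # Fall back to a generic label if the URL has no hostname.
    host = urlparse(url).hostname or "site"
    # Assumption: a leading "www." is dropped from the domain part.
    if host.startswith("www."):
        host = host[4:]
    # Assumption: the timestamp uses UTC and a compact date_time layout.
    now = now or datetime.now(timezone.utc)
    return f"{host}_{now.strftime('%Y%m%d_%H%M%S')}.pdf"
```

Passing `now` explicitly keeps the function deterministic, which is convenient when checking the result in a test.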
What I provide
- •Dependency setup - Automatically installs required packages:
  - •requests - HTTP client for fetching pages
  - •beautifulsoup4 - HTML parsing and link extraction
  - •reportlab - PDF generation with full Unicode/CJK support
  - •lxml - XML/HTML processing backend
  - •Pillow - Image processing and optimization
- •Web scraping with:
  - •Automatic user-agent headers (respect robots.txt guidelines)
  - •Connection timeout handling (30 seconds per page, 10 seconds per image)
  - •Delay between requests to avoid overwhelming servers (0.5 seconds)
  - •Clean HTML content extraction (removes scripts, styles, ads, navigation)
  - •Automatic charset detection, including UTF-8
  - •Full Chinese character support
  - •Complete image extraction from all pages
- •Internal link detection:
  - •Filters out external links
  - •Removes duplicate URLs
  - •Handles relative URL resolution
  - •Skips anchors and special links (#, javascript:, mailto:, etc.)
  - •Respects domain boundaries
  - •Supports Chinese domain names
  - •Extracts and processes images from each linked page
- •Image handling with:
  - •Extracts images from root page AND all first-level linked pages ✅
  - •Downloads and embeds images directly into PDF
  - •Automatic image scaling and optimization
  - •Size limits (5MB per image) to keep PDF manageable
  - •Configurable maximum images per page
  - •Handles multiple image formats (JPEG, PNG, WebP, GIF)
  - •Graceful fallback if images fail to download
  - •Maintains image aspect ratios in PDF
- •PDF generation with:
  - •Full Unicode and CJK font support
  - •Professional formatting with margins and spacing
  - •Clickable table of contents with page numbers
  - •Clickable hyperlinks throughout document
  - •Page headers and footers
  - •Proper typography and readability
  - •Native rendering of Chinese, Japanese, Korean characters
  - •All images embedded and visible in PDF ✅
  - •Multi-language support
  - •Cover page with metadata
  - •Reference section with all source URLs
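The internal link detection rules listed above (same domain only, relative URLs resolved, duplicates and special links dropped) can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual code: `internal_links`, `SKIPPED_PREFIXES`, and the inclusion of `tel:` are assumptions, and the `href` values are assumed to have been collected already (for example with beautifulsoup4).

```python
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse

# Assumption: these prefixes stand in for "anchors and special links".
SKIPPED_PREFIXES = ("#", "javascript:", "mailto:", "tel:")


def internal_links(root_url: str, hrefs: List[str]) -> List[str]:
    """Return unique same-host URLs found on the root page, in page order."""
    root_host = urlparse(root_url).hostname
    # Seed with the root itself so a self-link is not scraped twice.
    seen = {urldefrag(root_url).url}
    result = []
    for href in hrefs:
        href = href.strip()
        if not href or href.lower().startswith(SKIPPED_PREFIXES):
            continue
        # Resolve relative links, then drop the #fragment for deduplication.
        absolute = urldefrag(urljoin(root_url, href)).url
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.hostname != root_host:
            continue  # external domain
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Comparing hostnames exactly treats subdomains such as `blog.example.com` as external; whether the skill does the same is not stated in this document.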
Important notes
- •One level deep: Only follows links directly present on the root page
- •Internal only: Ignores links to external domains
- •Timeouts: 30 seconds per page, 10 seconds per image (adjustable)
- •Image extraction: Automatically extracts and embeds ALL images from root page AND first-level linked pages ✅
- •Request delays: 0.5 seconds between requests to be respectful to servers
- •Performance: Large sites with many images may take several minutes to process
- •Respect robots.txt: Follow the target site's crawl rules and keep crawl delays reasonable
- •Error handling: Continues processing if individual pages or images fail
- •Image limits: Up to 20 images per page by default, 5MB size limit per image
- •PDF size: Output depends on content volume (typically 1-100 MB with images)
- •Chinese support: Tested with sites that serve Chinese content, such as cnblogs.com and github.com
- •Encoding: Automatically detects and handles UTF-8, GBK, and other common encodings
- •Image formats: Supports JPEG, PNG, WebP, GIF and automatically optimizes sizing
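To make the encoding note concrete, here is one simple way to decode a fetched page when the charset is uncertain: try the declared charset first, then UTF-8, then a GBK-compatible codec. This is a sketch under assumptions, not the skill's detection logic; `decode_html` and `FALLBACK_ENCODINGS` are hypothetical names, and `gb18030` is used because it is a superset of GBK.

```python
from typing import Optional

# Assumption: UTF-8 first, then GB18030 (a superset of GBK) as the fallback.
FALLBACK_ENCODINGS = ("utf-8", "gb18030")


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes, preferring the declared charset when it works."""
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, or a charset name Python does not know.
            continue
    # Last resort: keep what is readable and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

A declared single-byte charset such as ISO-8859-1 never raises a decode error, so a wrong declaration of that kind would still produce garbled text; a real implementation would need a stronger heuristic for that case.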
Error handling
The skill gracefully handles:
- •Unreachable URLs (404, 5xx errors)
- •Network timeouts
- •Malformed HTML
- •Missing or inaccessible internal links
- •Images that fail to download (skips them and continues)
- •Broken image links (gracefully skipped, no PDF corruption)
- •Encoding issues (handled through charset detection)
- •Chinese character encoding
- •Invalid characters in filenames
- •Multi-language content mixed in single page
- •Oversized images (automatically skipped if > 5MB)
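For the "invalid characters in filenames" case, a small sanitizer along these lines would be enough. It is only a sketch: `safe_filename`, the replacement character, and the 100-character cap are assumptions chosen for the example, not values taken from the skill.

```python
import re

# Characters that Windows rejects in filenames, plus ASCII control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(name: str, max_length: int = 100) -> str:
    """Replace invalid characters and cap the length (hypothetical helper)."""
    cleaned = INVALID_CHARS.sub("_", name).strip(" .")
    if not cleaned:
        cleaned = "output"
    if not cleaned.lower().endswith(".pdf"):
        cleaned += ".pdf"
    # Trim the stem, not the extension, so the result always ends in ".pdf".
    stem = cleaned[:-4]
    return stem[: max_length - 4] + ".pdf"
```

Non-ASCII characters such as Chinese are left untouched, since modern filesystems accept them.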
Implementation approach
1. Validate and normalize input URL
2. Install/verify required dependencies (requests, bs4, reportlab, Pillow, lxml)
3. Fetch root page with error handling and encoding detection
4. Parse HTML and extract internal links
5. For each page (root + internal links):
   - Fetch page content with proper charset detection
   - Parse and clean HTML while preserving Unicode text
   - Extract all images from page and download them
   - Store images in memory with unique keys
   - Extract readable text with UTF-8 encoding
6. Generate PDF document with reportlab:
   - Create professional cover page with metadata
   - Add table of contents with proper font rendering
   - For each page: title, URL, images, and content text
   - Add reference section with all clickable URLs
   - Save to file with full Unicode and image support
7. Return PDF file path and summary with image count
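The crawl portion of this approach (steps 3 to 5) can be sketched as a single loop that fetches the root page, follows its links one level deep, waits between requests, and continues when an individual page fails. The names `crawl_one_level`, `fetch`, and `extract_links` are hypothetical; the fetching and link extraction are passed in as functions so the control flow can be shown without any network access.

```python
import time
from typing import Callable, Dict, List


def crawl_one_level(
    root_url: str,
    fetch: Callable[[str], str],
    extract_links: Callable[[str, str], List[str]],
    delay: float = 0.5,
) -> Dict[str, str]:
    """Fetch the root page and each page it links to; never go deeper."""
    # Assumption: a failing root page aborts the run, since nothing can follow.
    pages = {root_url: fetch(root_url)}
    for url in extract_links(root_url, pages[root_url]):
        if url in pages:
            continue  # already fetched
        time.sleep(delay)  # polite pause between requests
        try:
            pages[url] = fetch(url)
        except Exception as exc:
            # One broken link should not stop the whole archive.
            print(f"skipped {url}: {exc}")
    return pages
```

In practice `fetch` might wrap `requests.get(url, timeout=30)` and `extract_links` might use beautifulsoup4; links found on the first-level pages are deliberately ignored.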
Example workflow with Chinese website
User input:
Use the web-to-pdf skill to archive https://www.cnblogs.com/rossiXYZ/p/18785601
Skill execution:
- •Fetches https://www.cnblogs.com/rossiXYZ/p/18785601
- •Automatically detects UTF-8 encoding
- •Extracts Chinese title: "探秘Transformer系列之文章列表"
- •Finds internal links with Chinese text
- •Scrapes all pages preserving Chinese characters
- •Generates PDF with:
  - •Cover: Title with Chinese characters rendered perfectly
  - •TOC: 21 entries with Chinese titles like "注意力机制", "位置编码"
  - •Pages: Full Chinese content with proper formatting
  - •References: All URLs as clickable links
- •Saves to cnblogs.com_20260215_220315.pdf
- •Output: 0.10 MB PDF with perfectly rendered Chinese text
Output example
The generated PDF includes:
┌─────────────────────────────────────┐
│ 网页内容存档                        │
│ Archived Web Content                │
│ 源URL: https://www.cnblogs.com...   │
│ 存档时间: 2026-02-15 22:03:45 UTC   │
│ 总页数: 21                          │
└─────────────────────────────────────┘

目录 (TABLE OF CONTENTS)
1. 探秘Transformer系列之文章列表
2. 注意力机制 (Attention Mechanism)
3. 总体架构 (Overall Architecture)
...
21. 残差网络和归一化
Troubleshooting
If images don't appear in PDF:
- •✅ This issue is now fixed! Images from all pages (root + linked pages) are included
- •Check that the website images are accessible and not behind authentication
- •Very large images (>5MB) are automatically skipped
- •Some dynamically-loaded images may not be captured (static HTML only)
- •Verify PDF viewer supports image display
If Chinese characters don't display:
- •The PDF is generated with reportlab, which has native CJK support
- •Chinese characters should display correctly
- •If problems persist, check that your PDF viewer can render CJK text
If pages don't load:
- •Check URL is accessible in your browser
- •Verify network connectivity
- •Check for authentication requirements (not supported)
- •Some JavaScript-heavy sites may not scrape well
If PDF is incomplete:
- •Site may use JavaScript for content loading
- •Skill fetches static HTML only
- •Some dynamically-loaded content may not appear
- •Check network connectivity and timeouts
If PDF is very large:
- •Many images are included (expected with image extraction)
- •Consider reducing max_images_per_page parameter
- •Very large sites may take several minutes to process
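The two limits that drive PDF size, the per-image size cap and the per-page image count, can be sketched together with aspect-ratio-preserving scaling. Everything here is illustrative: `select_images` and `fit_within` are hypothetical helpers, and the order of operations (drop oversized images first, then apply the count limit) is an assumption.

```python
from typing import List, Tuple

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # 5MB per image, as stated in this document
MAX_IMAGES_PER_PAGE = 20           # default limit stated in this document


def select_images(
    images: List[bytes],
    max_images: int = MAX_IMAGES_PER_PAGE,
    max_bytes: int = MAX_IMAGE_BYTES,
) -> List[bytes]:
    """Skip empty or oversized downloads, then keep the first max_images."""
    kept = [data for data in images if 0 < len(data) <= max_bytes]
    return kept[:max_images]


def fit_within(
    width: float, height: float, max_width: float, max_height: float
) -> Tuple[float, float]:
    """Shrink (never enlarge) a box to fit the limits, keeping its ratio."""
    scale = min(max_width / width, max_height / height, 1.0)
    return width * scale, height * scale
```

Lowering `max_images` has the same effect as reducing the `max_images_per_page` parameter mentioned above: fewer embedded images and a smaller file.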
If file path is very long:
- •Windows has 260-character path limits
- •Skill shortens filenames if needed
- •Check output message for actual filename used
Dependencies to be installed
pip install requests beautifulsoup4 reportlab lxml Pillow
These will be automatically installed if not present.
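One way such an automatic check could work is to test whether each package's import name can be found before calling pip. The sketch below is an assumption about the mechanism, not the skill's code; `missing_packages` and `REQUIRED` are hypothetical names. Note that two packages are imported under a different name than they are installed with: beautifulsoup4 as `bs4` and Pillow as `PIL`.

```python
import importlib.util
from typing import Dict, List, Optional

# pip package name -> module name used in "import ..."
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "reportlab": "reportlab",
    "lxml": "lxml",
    "Pillow": "PIL",
}


def missing_packages(required: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    required = REQUIRED if required is None else required
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

The returned names could then be handed to `python -m pip install`, for example through `subprocess.check_call([sys.executable, "-m", "pip", "install", *names])`.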
Recent updates
✅ IMAGE EXTRACTION NOW FULLY IMPLEMENTED:
- •Images from ALL pages are now extracted and embedded ✅
- •Root page images included
- •All first-level linked page images included ✅ (This was the missing feature!)
- •Automatic image downloading and optimization
- •Image size limits to keep PDFs manageable (5MB per image)
- •Graceful handling of broken image links
- •Proper aspect ratio maintenance
✅ Chinese support fully implemented:
- •Using reportlab for native CJK rendering
- •Full UTF-8 encoding support
- •Automatic charset detection
- •Tested with cnblogs.com and other Chinese websites
- •All Chinese characters display perfectly in output PDF
✅ PDF Quality improvements:
- •Professional cover page with metadata
- •Clickable table of contents
- •Improved typography and readability
- •Reference section with all clickable URLs
- •Better page organization