fetcher

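Fetch web pages, PDFs, and other documents with automatic fallbacks and content extraction. Use this skill when the user says "fetch this URL", "download this page", "crawl this website", "extract content", or "get the PDF", or provides a URL that needs to be retrieved.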

SKILL.md
--- frontmatter
name: fetcher
description: >
  Fetch web pages, PDFs, and documents with automatic fallbacks and content extraction.
  Use when user says "fetch this URL", "download this page", "crawl this website",
  "extract content from", "get the PDF", or provides URLs needing retrieval.
allowed-tools: Bash, Read
triggers:
  - fetch this URL
  - download page
  - crawl website
  - extract content from
  - get the PDF
  - scrape this site
  - retrieve document
metadata:
  short-description: Web crawling and document fetching CLI

Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

This skill is self-contained: the run.sh wrapper installs fetcher from git via uv run, so no pre-installation is needed.

Simplest Usage

bash
# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com

Common Commands

bash
./run.sh get https://example.com                   # Fetch single URL
./run.sh get-manifest urls.txt                     # Fetch list of URLs
./run.sh get-manifest - < urls.txt                 # Fetch from stdin

Common Patterns

Fetch a single URL

bash
fetcher get https://www.nasa.gov --out run/nasa

Outputs to run/nasa/:

  • consumer_summary.json - structured result
  • Walkthrough.md - human-readable summary
  • downloads/ - raw content files

Fetch multiple URLs

bash
# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
printf '%s\n' https://example.com https://nasa.gov | fetcher get-manifest -
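
When the URL list is assembled programmatically, a small helper can produce the one-URL-per-line manifest that `get-manifest` reads. The following is a minimal sketch: the helper name `write_manifest` and the choice to drop blank entries and duplicates are illustrative assumptions, not part of fetcher.

```python
from pathlib import Path
from typing import Iterable


def write_manifest(urls: Iterable[str], path: Path) -> int:
    """Write one URL per line, skipping blanks and duplicates.

    Hypothetical helper for illustration; returns the number of URLs written.
    """
    seen = set()
    lines = []
    for raw in urls:
        url = raw.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    # Create the parent directory so the write does not fail on a fresh checkout.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{u}\n" for u in lines), encoding="utf-8")
    return len(lines)
```

The resulting file can then be passed to `fetcher get-manifest urls.txt --out run/batch`.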

ETL mode (full control)

bash
fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo

Check environment

bash
fetcher doctor                    # Check dependencies and config
fetcher get --dry-run https://example.com   # Validate without fetching
fetcher-etl --help-full           # All options
fetcher-etl --find metrics        # Search options

Output Structure

code
run/artifacts/<run-id>/
├── results.jsonl              # Fetch results per URL
├── consumer_summary.json      # Summary stats
├── Walkthrough.md             # Human-readable summary
├── downloads/                 # Raw files (HTML, PDF, etc.)
├── text_blobs/                # Extracted text
├── markdown/                  # LLM-friendly markdown
├── fit_markdown/              # Pruned markdown for LLM input
├── junk_results.jsonl         # Failed/junk URLs
└── junk_table.md              # Quick triage table
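
For a quick look at how a run went, the per-URL results can be tallied by verdict. The sketch below assumes that each line of `results.jsonl` is a JSON object carrying a `content_verdict` key, mirroring the FetchResult fields; that layout is an assumption, so inspect a real file before relying on it. The function name `verdict_counts` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def verdict_counts(results_path: Path) -> Counter:
    """Count content_verdict values in a JSONL file (one JSON object per line)."""
    counts = Counter()
    with results_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # ASSUMPTION: records expose "content_verdict"; count as "unknown" if absent.
            counts[record.get("content_verdict", "unknown")] += 1
    return counts
```

Calling `verdict_counts(Path("run/artifacts/<run-id>/results.jsonl"))` with the real run directory substituted returns a `Counter` mapping each verdict to the number of URLs that received it.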

Content Extraction

Enable markdown output

bash
export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com

Rolling windows (for chunking)

bash
export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
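
To make the window and step settings concrete, here is a minimal sketch of overlapping windows over a string. It assumes the sizes are measured in characters and that windows are cut at fixed offsets; fetcher's actual unit and boundary handling may differ, and `rolling_windows` is a hypothetical name used only for illustration.

```python
def rolling_windows(text: str, size: int = 6000, step: int = 3000) -> list:
    """Split text into windows of `size` characters, advancing `step` each time.

    With step < size, consecutive windows overlap by (size - step) characters.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows
```

With a size of 6000 and a step of 3000, each window shares its first half with the previous one, so a passage that straddles a window boundary still appears whole in at least one chunk.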

Advanced Features

HTTP caching

bash
# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache

PDF discovery

bash
# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com

Proxy rotation (rate-limited sites)

bash
export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl

Brave/Wayback fallbacks

bash
# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl

Python API

python
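import asyncio
from pathlib import Path

from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results


async def main():
    # concurrency and per_domain cap parallel requests (overall and per host, as the names suggest)
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)

    # Each entry is a dict carrying the URL to fetch
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)

    # Make sure the output directory exists before writing the results
    out_path = Path("artifacts/nasa.jsonl")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    write_results(results, out_path)
    print(audit)


asyncio.run(main())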
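
The `concurrency` and `per_domain` settings suggest two nested limits: one on total in-flight requests and one per host. The sketch below shows one common way to build such a limiter with asyncio semaphores. It is not fetcher's implementation; `fetch_all` and the `fetch_one` callback are hypothetical names, and the behavior is only an assumption about what the two parameters mean.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


async def fetch_all(urls, fetch_one, concurrency=4, per_domain=2):
    """Run fetch_one(url) for every URL under a global and a per-domain cap."""
    global_sem = asyncio.Semaphore(concurrency)
    # One semaphore per host, created lazily the first time the host is seen.
    domain_sems = defaultdict(lambda: asyncio.Semaphore(per_domain))

    async def guarded(url):
        domain = urlsplit(url).netloc
        # Take the per-domain slot first so a busy host does not
        # hold global slots while its requests are waiting.
        async with domain_sems[domain]:
            async with global_sem:
                return await fetch_one(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(u) for u in urls))
```

Any coroutine function that takes a URL can be passed as `fetch_one`, which keeps the limiting logic separate from the transport.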

Single URL helper

python
import asyncio
from fetcher.workflows.fetcher import fetch_url

async def main():
    result = await fetch_url("https://example.com")  # must be awaited inside a coroutine
    print(result.content_verdict)  # "ok", "empty", "paywall", etc.
    print(result.text)             # Extracted text

asyncio.run(main())

FetchResult Fields

| Field | Description |
|-------|-------------|
| `url` | Original URL |
| `final_url` | URL after redirects |
| `content_verdict` | `ok`, `empty`, `paywall`, `error`, etc. |
| `text` | Extracted text content |
| `file_path` | Path to raw download |
| `markdown_path` | Path to markdown (if enabled) |
| `from_cache` | Whether result came from cache |
| `content_sha256` | Content hash for change detection |
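
The `content_sha256` field lends itself to comparing two runs. The sketch below assumes the field holds the hex SHA-256 digest of the raw content and that you have already collected a URL-to-hash mapping from each run; both are assumptions, and `sha256_hex` and `changed_urls` are hypothetical helper names.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hex SHA-256 digest of raw bytes (assumed format of content_sha256)."""
    return hashlib.sha256(content).hexdigest()


def changed_urls(previous: dict, current: dict) -> list:
    """Return URLs whose hash is new or differs from the previous run, sorted.

    Both arguments map url -> content_sha256.
    """
    return sorted(
        url for url, digest in current.items() if previous.get(url) != digest
    )
```

URLs present only in the previous run are ignored here; handling removals would need a second pass over `previous`.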

Environment Variables

| Variable | Purpose |
|----------|---------|
| `BRAVE_API_KEY` | Enable Brave search fallbacks |
| `FETCHER_EMIT_MARKDOWN` | Generate LLM-friendly markdown |
| `FETCHER_EMIT_FIT_MARKDOWN` | Generate pruned markdown |
| `FETCHER_DOWNLOAD_MODE` | `text`, `download_only`, or `rolling_extract` |
| `FETCHER_HTTP_CACHE_DISABLE` | Disable HTTP caching |
| `FETCHER_ENABLE_PDF_DISCOVERY` | Auto-fetch embedded PDFs |
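
When driving the CLI from a script, these variables can be set for a single invocation instead of exporting them in the shell. The sketch below uses only the `fetcher get <url> --out <dir>` form shown earlier; the helper names `build_fetch_invocation` and `run_fetch` are hypothetical, and it assumes `fetcher` is on the PATH.

```python
import os
import subprocess


def build_fetch_invocation(url: str, out_dir: str, overrides: dict):
    """Return (command, environment) for one `fetcher get` call.

    The environment is a copy of os.environ with the overrides applied,
    so the calling process's own environment is left untouched.
    """
    cmd = ["fetcher", "get", url, "--out", out_dir]
    env = dict(os.environ)
    env.update(overrides)
    return cmd, env


def run_fetch(url: str, out_dir: str, overrides: dict) -> int:
    """Run the command and return its exit code (requires fetcher to be installed)."""
    cmd, env = build_fetch_invocation(url, out_dir, overrides)
    return subprocess.run(cmd, env=env, check=False).returncode
```

For example, `run_fetch("https://example.com", "run/demo", {"FETCHER_EMIT_MARKDOWN": "1"})` enables markdown output for that call only.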

Troubleshooting

| Problem | Solution |
|---------|----------|
| Playwright missing | `uv run playwright install --with-deps chromium` |
| Rate limited | Configure proxy rotation or reduce concurrency |
| Paywall detected | Check `content_verdict` and use alternates |
| Empty content | Check `junk_results.jsonl` for diagnosis |

Run fetcher doctor to check environment and dependencies.