Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

Self-contained skill - auto-installs via uv run from git (no pre-installation needed).

Simplest Usage

bash

# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com

Common Commands

bash

./run.sh get https://example.com                   # Fetch single URL
./run.sh get-manifest urls.txt                     # Fetch list of URLs
./run.sh get-manifest - < urls.txt                 # Fetch from stdin

Common Patterns

Fetch a single URL

bash

fetcher get https://www.nasa.gov --out run/nasa

Outputs to run/nasa/:

•consumer_summary.json - structured result
•Walkthrough.md - human-readable summary
•downloads/ - raw content files

Fetch multiple URLs

bash

# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
echo -e "https://example.com\nhttps://nasa.gov" | fetcher get-manifest -

ETL mode (full control)

bash

fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo

Check environment

bash

fetcher doctor                    # Check dependencies and config
fetcher get --dry-run <url>       # Validate without fetching
fetcher-etl --help-full           # All options
fetcher-etl --find metrics        # Search options

Output Structure

code

run/artifacts/<run-id>/
├── results.jsonl              # Fetch results per URL
├── consumer_summary.json      # Summary stats
├── Walkthrough.md             # Human-readable summary
├── downloads/                 # Raw files (HTML, PDF, etc.)
├── text_blobs/                # Extracted text
├── markdown/                  # LLM-friendly markdown
├── fit_markdown/              # Pruned markdown for LLM input
├── junk_results.jsonl         # Failed/junk URLs
└── junk_table.md              # Quick triage table

Content Extraction

Enable markdown output

bash

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com

Rolling windows (for chunking)

bash

export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com

Advanced Features

HTTP caching

bash

# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache

PDF discovery

bash

# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com

Proxy rotation (rate-limited sites)

bash

export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl

Brave/Wayback fallbacks

bash

# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl

Python API

python

import asyncio
from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results
from pathlib import Path

async def main():
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)
    write_results(results, Path("artifacts/nasa.jsonl"))
    print(audit)

asyncio.run(main())

Single URL helper

python

from fetcher.workflows.fetcher import fetch_url

result = await fetch_url("https://example.com")
print(result.content_verdict)  # "ok", "empty", "paywall", etc.
print(result.text)             # Extracted text

FetchResult Fields

Field	Description
`url`	Original URL
`final_url`	After redirects
`content_verdict`	`ok`, `empty`, `paywall`, `error`, etc.
`text`	Extracted text content
`file_path`	Path to raw download
`markdown_path`	Path to markdown (if enabled)
`from_cache`	Whether result came from cache
`content_sha256`	Content hash for change detection

Environment Variables

Variable	Purpose
`BRAVE_API_KEY`	Enable Brave search fallbacks
`FETCHER_EMIT_MARKDOWN`	Generate LLM-friendly markdown
`FETCHER_EMIT_FIT_MARKDOWN`	Generate pruned markdown
`FETCHER_DOWNLOAD_MODE`	`text`, `download_only`, `rolling_extract`
`FETCHER_HTTP_CACHE_DISABLE`	Disable HTTP caching
`FETCHER_ENABLE_PDF_DISCOVERY`	Auto-fetch embedded PDFs

Troubleshooting

Problem	Solution
Playwright missing	`uv run playwright install --with-deps chromium`
Rate limited	Configure proxy rotation or reduce concurrency
Paywall detected	Check `content_verdict` and use alternates
Empty content	Check `junk_results.jsonl` for diagnosis

Run fetcher doctor to check environment and dependencies.