AgentSkillsCN

firecrawl

由Firecrawl API驱动的全面网络爬虫、抓取与数据提取工具包。 提供单页爬虫脚本(scrape.py)、网络搜索脚本(search.py)、 URL发现脚本(map.py)、多页抓取脚本(crawl.py)、结构化数据提取脚本(extract.py), 以及自主数据采集脚本(agent.py)。 适用于以下场景:(1) 从网页中提取内容,(2) 搜索并爬取网络, (3) 发现网站上的URL,(4) 抓取多页内容,(5) 通过JSON Schema提取结构化数据, 或(6) 自主从网络各处收集数据。 需设置FIRECRAWL_API_KEY环境变量。

SKILL.md
--- frontmatter
name: firecrawl
description: |
  Comprehensive web scraping, crawling, and data extraction toolkit powered by Firecrawl API.
  Provides scripts for single-page scraping (scrape.py), web search (search.py),
  URL discovery (map.py), multi-page crawling (crawl.py), structured data extraction (extract.py),
  and autonomous data gathering (agent.py).
  Use when you need to: (1) extract content from web pages, (2) search and scrape the web,
  (3) discover URLs on websites, (4) crawl multiple pages, (5) extract structured data with JSON schemas,
  or (6) autonomously gather data from anywhere on the web.
  Requires FIRECRAWL_API_KEY environment variable.

Firecrawl Web Scraping & Data Extraction

Installation

bash
pip install firecrawl-py

Environment Setup

Set your Firecrawl API key:

bash
export FIRECRAWL_API_KEY="your-api-key-here"

Scripts

scrape.py - Single Page Scraping

The most powerful and reliable scraper. Use when you know exactly which page contains the information.

bash
# Basic scrape (returns markdown)
./scripts/scrape.py "https://example.com"

# Get HTML format
./scripts/scrape.py "https://example.com" --format html

# Extract only main content (removes headers, footers, etc.)
./scripts/scrape.py "https://example.com" --only-main

# Combine options
./scripts/scrape.py "https://docs.example.com/api" --format markdown --only-main

search.py - Web Search

Search the web when you don't know which website has the information.

bash
# Basic search
./scripts/search.py "latest AI research papers 2024"

# Limit results
./scripts/search.py "Python web scraping tutorials" --limit 5

# Search with scraping (get full content)
./scripts/search.py "firecrawl documentation" --limit 3

map.py - URL Discovery

Discover all URLs on a website. Use before deciding what to scrape.

bash
# Map a website
./scripts/map.py "https://docs.example.com"

# Limit number of URLs
./scripts/map.py "https://example.com" --limit 100

# Search within mapped URLs
./scripts/map.py "https://docs.example.com" --search "authentication"

crawl.py - Multi-Page Crawling

Extract content from multiple related pages. Warning: can be slow and return large results.

bash
# Basic crawl
./scripts/crawl.py "https://docs.example.com"

# Limit pages
./scripts/crawl.py "https://docs.example.com" --limit 20

# Control crawl depth
./scripts/crawl.py "https://docs.example.com" --limit 10 --depth 2

extract.py - Structured Data Extraction

Extract specific structured data using LLM capabilities.

bash
# Extract with prompt
./scripts/extract.py "https://example.com/pricing" \
  --prompt "Extract all pricing tiers with their features and prices"

# Extract with JSON schema
./scripts/extract.py "https://example.com/team" \
  --prompt "Extract team member information" \
  --schema '{"type":"object","properties":{"members":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"role":{"type":"string"},"bio":{"type":"string"}}}}}}'

# Extract from multiple URLs
./scripts/extract.py "https://example.com/page1" "https://example.com/page2" \
  --prompt "Extract product information"

agent.py - Autonomous Data Gathering

Autonomous agent that searches, navigates, and extracts data from anywhere on the web.

bash
# Simple research task
./scripts/agent.py --prompt "Find the founders of Firecrawl and their backgrounds"

# Complex data gathering
./scripts/agent.py --prompt "Find the top 5 AI startups founded in 2024 and their funding amounts"

# Focus on specific URLs
./scripts/agent.py \
  --prompt "Compare the features and pricing" \
  --urls "https://example1.com,https://example2.com"

# With output schema
./scripts/agent.py \
  --prompt "Find recent tech layoffs" \
  --schema '{"type":"object","properties":{"layoffs":{"type":"array","items":{"type":"object","properties":{"company":{"type":"string"},"count":{"type":"number"},"date":{"type":"string"}}}}}}'

Output Format

All scripts output JSON to stdout. Errors are written to stderr.

Success Response

json
{
  "success": true,
  "data": { ... }
}

Error Response

json
{
  "success": false,
  "error": "Error message"
}

Tips

  1. Performance: Use scrape for single pages - it's 500% faster with caching
  2. Discovery: Use map first to find URLs, then scrape specific pages
  3. Large sites: Prefer map + scrape over crawl for better control
  4. Structured data: Use extract with a JSON schema for consistent output
  5. Research: Use agent when you don't know where to find the data