Firecrawl Web Scraping & Data Extraction

Installation

bash

pip install firecrawl-py

Environment Setup

Set your Firecrawl API key:

bash

export FIRECRAWL_API_KEY="your-api-key-here"

Scripts

scrape.py - Single Page Scraping

The most powerful and reliable scraper. Use when you know exactly which page contains the information.

bash

# Basic scrape (returns markdown)
./scripts/scrape.py "https://example.com"

# Get HTML format
./scripts/scrape.py "https://example.com" --format html

# Extract only main content (removes headers, footers, etc.)
./scripts/scrape.py "https://example.com" --only-main

# Combine options
./scripts/scrape.py "https://docs.example.com/api" --format markdown --only-main

search.py - Web Search

Search the web when you don't know which website has the information.

bash

# Basic search
./scripts/search.py "latest AI research papers 2024"

# Limit results
./scripts/search.py "Python web scraping tutorials" --limit 5

# Search with scraping (get full content)
./scripts/search.py "firecrawl documentation" --limit 3

map.py - URL Discovery

Discover all URLs on a website. Use before deciding what to scrape.

bash

# Map a website
./scripts/map.py "https://docs.example.com"

# Limit number of URLs
./scripts/map.py "https://example.com" --limit 100

# Search within mapped URLs
./scripts/map.py "https://docs.example.com" --search "authentication"

crawl.py - Multi-Page Crawling

Extract content from multiple related pages. Warning: can be slow and return large results.

bash

# Basic crawl
./scripts/crawl.py "https://docs.example.com"

# Limit pages
./scripts/crawl.py "https://docs.example.com" --limit 20

# Control crawl depth
./scripts/crawl.py "https://docs.example.com" --limit 10 --depth 2

extract.py - Structured Data Extraction

Extract specific structured data using LLM capabilities.

bash

# Extract with prompt
./scripts/extract.py "https://example.com/pricing" \
  --prompt "Extract all pricing tiers with their features and prices"

# Extract with JSON schema
./scripts/extract.py "https://example.com/team" \
  --prompt "Extract team member information" \
  --schema '{"type":"object","properties":{"members":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"role":{"type":"string"},"bio":{"type":"string"}}}}}}'

# Extract from multiple URLs
./scripts/extract.py "https://example.com/page1" "https://example.com/page2" \
  --prompt "Extract product information"

agent.py - Autonomous Data Gathering

Autonomous agent that searches, navigates, and extracts data from anywhere on the web.

bash

# Simple research task
./scripts/agent.py --prompt "Find the founders of Firecrawl and their backgrounds"

# Complex data gathering
./scripts/agent.py --prompt "Find the top 5 AI startups founded in 2024 and their funding amounts"

# Focus on specific URLs
./scripts/agent.py \
  --prompt "Compare the features and pricing" \
  --urls "https://example1.com,https://example2.com"

# With output schema
./scripts/agent.py \
  --prompt "Find recent tech layoffs" \
  --schema '{"type":"object","properties":{"layoffs":{"type":"array","items":{"type":"object","properties":{"company":{"type":"string"},"count":{"type":"number"},"date":{"type":"string"}}}}}}'

Output Format

All scripts output JSON to stdout. Errors are written to stderr.

Success Response

json

{
  "success": true,
  "data": { ... }
}

Error Response

json

{
  "success": false,
  "error": "Error message"
}

Tips

•Performance: Use scrape for single pages - it's 500% faster with caching
•Discovery: Use map first to find URLs, then scrape specific pages
•Large sites: Prefer map + scrape over crawl for better control
•Structured data: Use extract with a JSON schema for consistent output
•Research: Use agent when you don't know where to find the data