Firecrawl Web Data API + RAG Skill
⚠️ MANDATORY SKILL INVOCATION ⚠️
YOU MUST invoke this skill (NOT optional) when the user mentions ANY of these triggers:
- •"scrape website", "crawl site", "extract web data"
- •"search web", "map website", "Firecrawl"
- •"web scraping", "site crawling", "extract content"
- •Any mention of Firecrawl or web data extraction
Failure to invoke this skill when triggers occur violates your operational requirements.
Custom-Enhanced Firecrawl CLI with integrated Qdrant vector database for building production RAG (Retrieval-Augmented Generation) systems.
Extract LLM-ready data from websites with automatic embedding, semantic search, and knowledge base management.
Purpose
This skill enables comprehensive web data extraction and semantic indexing through a custom-enhanced Firecrawl CLI:
Core Capabilities
- •Scrape: Extract single page content in multiple formats (markdown, HTML, links, screenshots, summaries)
- •Search: Query the web with optional content scraping
- •Map: Discover all URLs on a website without scraping
- •Crawl: Traverse entire websites systematically with depth and path controls
RAG & Vector Database Features
- •Extract: Structured data extraction with JSON schemas and prompts
- •Embed: Automatic vector embedding with Qdrant integration
- •Query: Semantic search over embedded content
- •Retrieve: Full document retrieval by URL
- •Batch: Batch scraping with job management
- •Manage: Complete vector database lifecycle (sources, stats, domains, delete, history, info)
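Type: Read-Only with respect to target websites (data extraction only, no modifications to the sites being scraped). The RAG commands do write to, and can delete from, the configured Qdrant collection.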
Use Cases:
- •Build RAG systems with automatic web content indexing
- •Extract website content for AI processing
- •Semantic search across crawled documentation
- •Collect and index training data from web sources
- •Monitor competitor websites with change tracking
- •Research and documentation gathering with retrieval
- •Site structure analysis
- •Content aggregation with vector search
Setup
Quick Setup
- •Install Firecrawl CLI globally:

npm install -g @firecrawl/cli
firecrawl --version

- •Add credentials to the .env file at ~/workspace/homelab/.env:

# Firecrawl API (Cloud or Self-Hosted)
FIRECRAWL_API_KEY="fc-your-api-key"
FIRECRAWL_API_URL="https://api.firecrawl.dev"  # Optional

# Qdrant Vector Database (optional - RAG features only)
QDRANT_URL="http://localhost:6333"
QDRANT_API_KEY=""  # Optional for cloud
QDRANT_COLLECTION="firecrawl"

# Text Embeddings Inference (optional - RAG features only)
TEI_URL="http://localhost:8080"
TEI_MODEL="BAAI/bge-small-en-v1.5"
TEI_DIMENSIONS=384

- •Get a Firecrawl API key: visit https://firecrawl.dev/ → Account → API Keys → Generate New Key
For detailed setup instructions, see README.md
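A quick pre-flight check can catch a missing credential before the first command runs. The sketch below is illustrative only: `load_env` and `missing_vars` are hypothetical helper names, not part of the Firecrawl CLI, and it assumes the .env file contains plain `NAME="value"` lines like the ones above, so that a POSIX shell can source it.

```shell
# Hypothetical helpers (not part of the Firecrawl CLI).

# Source a .env file and export every variable it defines.
load_env() {
  env_file="$1"
  if [ ! -f "$env_file" ]; then
    echo "env file not found: $env_file" >&2
    return 1
  fi
  set -a          # auto-export variables assigned while sourcing
  . "$env_file"
  set +a
}

# Print the names (space-separated) of the given variables that are unset or empty.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s\n' "$missing"
}
```

For example, `load_env "$HOME/workspace/homelab/.env" && missing_vars FIRECRAWL_API_KEY` prints an empty line when the key is set. Adding `QDRANT_URL` and `TEI_URL` to the list only makes sense if the RAG features are in use.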
Core Commands
Quick syntax reference. For complete command reference with all parameters, see references/commands.md
Scrape Single URL
firecrawl scrape <url> [options]
Common options: --only-main-content, --format, --wait-for, --screenshot, --no-embed, -o
Search the Web
firecrawl search "<query>" [options]
Common options: --limit, --scrape, --sources, --tbs, --location, -o
Map Website URLs
firecrawl map <url> [options]
Common options: --limit, --search, --include-subdomains, --sitemap, --exclude-paths, --exclude-extensions, --no-default-excludes, --no-filtering, --verbose, --json, -o
NEW: URL filtering support with 143 default exclude patterns (language routes, blog paths, WordPress, etc.)
Crawl Entire Website
firecrawl crawl <url> [options]
Common options: --limit, --max-depth, --include-paths, --exclude-paths, --delay, --wait, --progress
Extract Structured Data
firecrawl extract <url> --prompt "<prompt>" [options]
Common options: --schema, --system-prompt, --enable-web-search, --show-sources
For all parameters and detailed examples, see references/parameters.md
RAG & Vector Database Commands
Quick syntax reference. For complete RAG documentation, see references/vector-database.md
Embed Content
firecrawl embed <url|file|stdin> [options]
Common options: --url, --no-chunk, --collection
Query Semantic Search
firecrawl query "<text>" [options]
Common options: --limit, --domain, --full, --group, --collection
Retrieve Document
firecrawl retrieve <url> [options]
Common options: --collection, -o
Database Management
firecrawl sources [options]          # List all indexed URLs
firecrawl stats [options]            # Database statistics
firecrawl domains [options]          # List unique domains
firecrawl delete --url <url> --yes   # Delete vectors
firecrawl history [options]          # View indexing history
firecrawl info <url> [options]       # URL details
Note: Auto-embedding is enabled by default. Use --no-embed to disable.
Job Management
Quick syntax reference. For complete job management documentation, see references/job-management.md
Check Status
firecrawl status [options]           # Show all active jobs
firecrawl status --crawl <job-id>    # Check crawl job
firecrawl status --batch <job-id>    # Check batch job
firecrawl status --extract <job-id>  # Check extract job
firecrawl status --embed [job-id]    # Check embedding queue
Batch Operations
firecrawl batch <url1> <url2> ... [options]  # Start batch scrape
firecrawl batch status <job-id>              # Check batch status
firecrawl batch cancel <job-id>              # Cancel batch job
firecrawl batch errors <job-id>              # Get batch errors
List & Manage
firecrawl list                     # List active crawl jobs
firecrawl embed cancel <job-id>    # Cancel embedding job
Workflows
When extracting web data:
- •Choose extraction method:
  - •Single page → Use scrape
  - •Search web → Use search
  - •Discover URLs → Use map
  - •Full website → Use crawl
- •Select output format:
  - •LLM processing → Use markdown (--format markdown)
  - •Preserve structure → Use HTML (--format html)
  - •Extract links → Use links (--format links)
  - •Visual reference → Use screenshot (--screenshot)
- •Configure options:
  - •JavaScript sites → Add --wait-for <ms>
  - •Clean content → Add --only-main-content
  - •Rate limiting → Add --delay <ms>
  - •Path filtering (crawl) → Add --include-paths or --exclude-paths
  - •URL filtering (map) → Add --exclude-paths, --exclude-extensions, or --no-filtering
- •Execute and save:
  - •Output to file → Add -o <path>
  - •Pretty JSON → Add --pretty
  - •Progress tracking → Add --progress (crawl only)
- •Process results:
  - •Single format returns raw content
  - •Multiple formats return JSON object
  - •Pipe to other tools or save to file
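The first two decisions in the workflow above (which subcommand, which format) can be sketched as a small lookup. This is a minimal illustration, not part of the CLI: `choose_method` and `print_scrape_cmd` are hypothetical names, the goal keywords (`page`, `search`, `urls`, `site`) are invented for the example, and the second helper only prints a command rather than running it.

```shell
# Hypothetical helper: map a goal keyword to the subcommand named in the workflow.
# The keywords (page, search, urls, site) are invented for this sketch.
choose_method() {
  case "$1" in
    page)   echo "scrape" ;;
    search) echo "search" ;;
    urls)   echo "map" ;;
    site)   echo "crawl" ;;
    *)      echo "unknown goal: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print (not run) a scrape command for a URL.
# $1 = URL, $2 = format (markdown, html, links), $3 = optional output path.
print_scrape_cmd() {
  url="$1"; format="${2:-markdown}"; out="${3:-}"
  cmd="firecrawl scrape $url --format $format"
  if [ -n "$out" ]; then
    cmd="$cmd -o $out"
  fi
  printf '%s\n' "$cmd"
}
```

Printing the command first makes it easy to review the chosen flags before anything is fetched.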
RAG Workflow
Build a Documentation RAG System:
# Step 1: Scrape documentation (auto-embeds by default)
firecrawl scrape https://docs.example.com/guide

# Step 2: Crawl entire docs site (auto-embeds all pages)
firecrawl crawl https://docs.example.com --include-paths "/docs/*" --wait --progress

# Step 3: Query the embedded content
firecrawl query "How do I configure authentication?"

# Step 4: Retrieve full document
firecrawl retrieve https://docs.example.com/auth-guide
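When scripting these steps, it can help to preview the commands before running a long crawl. The sketch below assumes nothing about the CLI beyond the `crawl` and `query` commands shown above; `run` and `index_and_query` are hypothetical helper names, and `DRY_RUN` is a variable invented for this example (it defaults to printing instead of executing).

```shell
# Hypothetical dry-run wrapper: prints each command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Crawl a docs site (auto-embedding is on by default), then query the result.
index_and_query() {
  base_url="$1"; question="$2"
  run firecrawl crawl "$base_url" --wait --progress || return 1
  run firecrawl query "$question"
}
```

Calling `index_and_query https://docs.example.com "How do I configure authentication?"` prints the two commands; setting `DRY_RUN=0` would execute them instead. Note that the printed form drops the quotes around the question, so it is for review rather than copy-paste.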
Extract and Index Product Data:
# Extract structured data
firecrawl extract https://example.com/products \
  --prompt "Extract product name, price, description" \
  --show-sources

# Query products (single quotes keep the shell from expanding $50)
firecrawl query 'products under $50 with free shipping'
Database Maintenance:
# Show database stats
firecrawl stats --verbose

# List all sources
firecrawl sources

# Delete old content
firecrawl delete --domain old.example.com --yes

# View cleanup history
firecrawl history --days 30
Parameter Constraints
CRITICAL: Do NOT add --limit, --max-depth, or other constraint parameters unless the user explicitly requests them.
- •❌ Don't: Automatically add --limit 10 or --max-depth 3
- •✅ Do: Only add constraints when user says "max 10 results", "depth 3", etc.
- •✅ Do: Let operations run unlimited by default - user will stop them if needed
- •✅ Do: Trust the user to set appropriate limits for their use case
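One way to honor this rule in a script is to build the command so that constraint flags appear only when a value was actually supplied. The helper below is a hypothetical sketch (`print_crawl_cmd` is not part of the CLI) that prints the resulting command instead of running it.

```shell
# Hypothetical helper: print a crawl command, adding constraints only when given.
# $1 = URL, $2 = user-requested limit (optional), $3 = user-requested max depth (optional)
print_crawl_cmd() {
  url="$1"; limit="${2:-}"; depth="${3:-}"
  cmd="firecrawl crawl $url"
  if [ -n "$limit" ]; then
    cmd="$cmd --limit $limit"
  fi
  if [ -n "$depth" ]; then
    cmd="$cmd --max-depth $depth"
  fi
  printf '%s\n' "$cmd"
}
```

With no extra arguments the output is the bare `firecrawl crawl <url>`; a limit or depth shows up only if the caller passed one through from the user's request.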
Notes
Auto-Embedding Behavior
Default: All scrape, crawl, search, and extract operations automatically embed content into Qdrant.
Disable embedding:
- •Add the --no-embed flag to any command
- •Content is scraped but NOT embedded
When to disable:
- •Testing scraping logic
- •Extracting data for non-RAG purposes
- •Avoiding duplicate embeddings
- •Performance-sensitive operations
URL Filtering (Map Command)
Default Behavior: Map command applies 143 default exclude patterns to filter unwanted URLs:
- •Language routes (e.g., /en/, /de/, /fr/, /es/)
- •Blog paths (e.g., /blog/, /news/, /article/)
- •WordPress admin (e.g., /wp-admin, /wp-login)
- •Common excludes (e.g., /login, /logout, /cart, /checkout)
Filtering Options:
- •--exclude-paths <paths...> - Add custom patterns (merged with defaults)
- •--exclude-extensions <exts...> - Filter file types (e.g., .pdf, .zip, .exe)
- •--no-default-excludes - Skip default patterns (only apply custom excludes)
- •--no-filtering - Master override: Disable ALL filtering (defaults + custom)
- •--verbose - Show excluded URLs and which patterns matched
Pattern Matching:
- •Simple patterns use substring matching: /blog/ matches any URL containing /blog/
- •Regex patterns are auto-detected: \.pdf$ matches URLs ending with .pdf
- •Invalid regex patterns are caught and logged; filtering continues with the remaining patterns
Examples:
# Default filtering (excludes language routes, blog, wp-admin, etc.)
firecrawl map docs.example.com --limit 50

# Custom excludes (merged with defaults)
firecrawl map example.com --exclude-paths /api,/admin --exclude-extensions .pdf

# No defaults, only custom excludes
firecrawl map example.com --no-default-excludes --exclude-paths /test

# Disable all filtering (--no-filtering overrides --exclude-paths)
firecrawl map example.com --exclude-paths /api --no-filtering

# Verbose mode (see what was filtered)
firecrawl map example.com --verbose
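The substring-versus-regex behavior described under Pattern Matching can be approximated in a few lines of shell. This is a sketch under stated assumptions, not the CLI's implementation: the rule used here to decide that a pattern is a regex (it contains a metacharacter such as `\`, `$`, `^`, `*` or a bracket) is a guess, and `is_regex` and `url_excluded` are hypothetical names.

```shell
# Heuristic (an assumption of this sketch): a pattern is treated as a regex
# if it contains any of these metacharacters: ] [ \ ^ $ * + ? ( ) { } |
is_regex() {
  printf '%s\n' "$1" | grep -q '[][\\^$*+?(){}|]'
}

# Return 0 if the URL matches the exclude pattern, non-zero otherwise.
url_excluded() {
  url="$1"; pattern="$2"
  if is_regex "$pattern"; then
    # Regex pattern: an invalid regex makes grep fail, so the URL is kept.
    printf '%s\n' "$url" | grep -Eq -- "$pattern" 2>/dev/null
  else
    # Plain pattern: substring match anywhere in the URL.
    case "$url" in
      *"$pattern"*) return 0 ;;
      *) return 1 ;;
    esac
  fi
}
```

For instance, `url_excluded "https://example.com/file.pdf" '\.pdf$'` succeeds, while the same pattern leaves `https://example.com/pdf-guide` alone. An unparseable pattern such as `[unclosed` simply never matches, which mirrors the documented behavior of continuing with the remaining patterns.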
Authentication
- •Cloud API: Requires API key (get from https://firecrawl.dev/)
- •Self-hosted: Authentication is skipped automatically
- •API key sources: .env file, CLI login, or --api-key flag
- •Environment variable: FIRECRAWL_API_KEY is read by the CLI
Output Behavior
- •Single format: Returns raw content (pipe-friendly)
- •Multiple formats: Returns JSON object with all formats
- •--pretty flag: Human-readable JSON formatting
- •-o flag: Saves to file instead of stdout
Performance Tips
- •Use --only-main-content to reduce data size
- •Set --delay and --max-concurrency to avoid rate limits
- •Filter paths with --include-paths/--exclude-paths (crawl) or --exclude-paths/--exclude-extensions (map)
- •Use --max-depth to control crawl scope
- •Map first with URL filtering, then crawl specific paths for large sites
- •Use --no-embed when testing scraping logic
- •Use batch operations for multiple URLs
- •Use --no-filtering to bypass all URL filters when you need complete results
Vector Database
- •Collection: Configured via the QDRANT_COLLECTION env var
- •Chunking: Automatic by default (disable with --no-chunk)
- •Metadata: Stores URL, domain, source command, timestamp
- •Deduplication: Automatic by URL and content hash
Reference
Core Documentation:
- •README.md - Complete user-facing documentation with setup
- •Firecrawl Official Docs - Official Firecrawl documentation
- •Firecrawl CLI Docs - Official CLI documentation
- •Firecrawl GitHub - Firecrawl repository
- •Firecrawl CLI GitHub - CLI repository
Reference Files (Detailed Documentation):
- •references/commands.md - Complete command reference with all parameters
- •references/parameters.md - All parameters organized by category
- •references/job-management.md - Job management and async operations
- •references/vector-database.md - RAG features, Qdrant, and semantic search
- •references/quick-reference.md - Quick command examples
- •references/troubleshooting.md - Common issues and solutions
- •references/api-endpoints.md - API endpoint reference
Example Scripts (Working Code):
- •examples/README.md - Examples overview and usage guide
- •examples/basic-scrape.sh - Simple scraping workflows
- •examples/rag-pipeline.sh - Complete RAG system build
- •examples/batch-processing.sh - Batch scraping operations
- •examples/monitor-website.sh - Change monitoring workflow
Additional Resources:
- •Qdrant Documentation - Vector database docs
- •Text Embeddings Inference (TEI) - TEI GitHub
- •Hugging Face Embedding Models - Model catalog
Wrapper Scripts
For common operations, use provided wrapper scripts in scripts/:
# Scrape with standard settings
./scripts/scrape.sh https://example.com

# Search and scrape top results
./scripts/search-scrape.sh "AI agents" 5

# Map website to file
./scripts/map-site.sh https://example.com sitemap.json

# Crawl with progress
./scripts/crawl-site.sh https://example.com 100 3
All scripts:
- •Source credentials from ~/workspace/homelab/.env
- •Include error handling and validation
- •Return JSON output where appropriate
- •Support the --help flag
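The scripts themselves are not reproduced here. As a rough idea of the pattern the list above describes, a wrapper might be structured as follows. Everything in this sketch is hypothetical: the function names, the `ENV_FILE` override, the usage text, and the choice of `--only-main-content` as the "standard setting" are assumptions, and the real scripts in scripts/ may differ.

```shell
# Hypothetical wrapper skeleton; the real scripts in scripts/ may differ.
ENV_FILE="${ENV_FILE:-$HOME/workspace/homelab/.env}"

usage() {
  echo "Usage: scrape.sh <url>"
}

scrape_wrapper() {
  # Validate arguments: accept --help or an http(s) URL, reject anything else.
  case "${1:-}" in
    -h|--help) usage; return 0 ;;
    http://*|https://*) url="$1" ;;
    *) usage >&2; return 2 ;;
  esac
  # Load credentials if the env file exists.
  if [ -f "$ENV_FILE" ]; then
    set -a; . "$ENV_FILE"; set +a
  fi
  # Assumed "standard setting": keep only the main content of the page.
  firecrawl scrape "$url" --only-main-content
}
```

Validating the argument before sourcing credentials means a typo in the URL fails fast, with exit status 2 and the usage line on stderr.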
🔧 Agent Tool Usage Requirements
CRITICAL: When invoking scripts from this skill via the zsh-tool, ALWAYS use pty: true.
Without PTY mode, command output will not be visible even though commands execute successfully.
Correct invocation pattern:
<invoke name="mcp__plugin_zsh-tool_zsh-tool__zsh"> <parameter name="command">./skills/firecrawl/scripts/SCRIPT.sh [args]</parameter> <parameter name="pty">true