Crawl Recipe Skill
Execute web crawling recipes exported from the Crawl-Bot Chrome extension.
Recipe JSON Format
A crawl recipe defines how to extract structured data from web pages:
typescript
interface CrawlRecipeExport {
$schema: 'https://crawl-bot/recipe.schema.json';
name: string;
url_pattern: string; // URL pattern to match (e.g., "https://example.com/products/*")
version: '1.0';
fields: ExportField[];
pagination?: PaginationConfig;
}
interface ExportField {
field_name: string; // snake_case field name
selector: string; // CSS selector
selector_type: 'css';
fallback_selectors?: string[]; // Alternative selectors if main fails
extract: ExtractConfig;
transforms: TransformStep[];
multiple: boolean; // Whether to extract all matches or just first
list_container?: string; // CSS selector for list container (if extracting from list)
}
interface ExtractConfig {
type: 'text' | 'html' | 'attribute';
attribute?: string; // Required when type is 'attribute' (e.g., "href", "src")
}
interface TransformStep {
type: 'trim' | 'strip_html' | 'extract_number' | 'regex' | 'replace' | 'default';
pattern?: string; // For regex/replace
replacement?: string; // For replace
default_value?: string; // For default transform
}
interface PaginationConfig {
type: 'next_button' | 'url_pattern' | 'infinite_scroll';
selector?: string; // For next_button: CSS selector of next page button
url_template?: string; // For url_pattern: e.g., "https://example.com/page/{page}"
max_pages?: number;
wait_ms?: number;
}
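The constraints in the interfaces above (required keys, the allowed `type` values, and `attribute` being required for attribute extraction) can be checked before a crawl starts. The following is a minimal sketch, not the validation the script itself performs; `validate_recipe` and `url_matches` are hypothetical helper names, and treating `*` in `url_pattern` as a glob wildcard is an assumption.

```python
from fnmatch import fnmatchcase

EXTRACT_TYPES = {"text", "html", "attribute"}
PAGINATION_TYPES = {"next_button", "url_pattern", "infinite_scroll"}
REQUIRED_FIELD_KEYS = ("field_name", "selector", "selector_type",
                       "extract", "transforms", "multiple")


def validate_recipe(recipe: dict) -> list[str]:
    """Return a list of problems; an empty list means the recipe looks valid."""
    errors = []
    for key in ("name", "url_pattern", "version", "fields"):
        if key not in recipe:
            errors.append(f"missing top-level key: {key}")
    for i, field in enumerate(recipe.get("fields", [])):
        label = field.get("field_name", f"fields[{i}]")
        for key in REQUIRED_FIELD_KEYS:
            if key not in field:
                errors.append(f"{label}: missing key {key}")
        extract = field.get("extract")
        if extract is not None:
            if extract.get("type") not in EXTRACT_TYPES:
                errors.append(f"{label}: unknown extract type {extract.get('type')!r}")
            elif extract["type"] == "attribute" and not extract.get("attribute"):
                errors.append(f"{label}: extract type 'attribute' needs an 'attribute' name")
    pagination = recipe.get("pagination")
    if pagination is not None and pagination.get("type") not in PAGINATION_TYPES:
        errors.append(f"pagination: unknown type {pagination.get('type')!r}")
    return errors


def url_matches(url_pattern: str, url: str) -> bool:
    # Assumption: '*' is a glob-style wildcard. Note that fnmatch also treats
    # '?' and '[' as special, so patterns containing them need escaping.
    return fnmatchcase(url, url_pattern)
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a recipe at once.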
Quick Start
1. Execute a Recipe
bash
python .agents/skills/crawl-recipe/scripts/execute_recipe.py \
  --recipe product-scraper.recipe.json \
  --url "https://example.com/products" \
  --output results.json
2. Output Formats
The script supports both JSON and CSV output:
bash
# JSON output (default)
python execute_recipe.py --recipe recipe.json --url URL --output data.json

# CSV output
python execute_recipe.py --recipe recipe.json --url URL --output data.csv --format csv
3. Headless Mode
Run in headless mode (no browser window):
bash
python execute_recipe.py --recipe recipe.json --url URL --headless
Recipe Execution Flow
code
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Load Recipe    │────▶│  Navigate to    │────▶│  Extract Data   │
│  (JSON file)    │     │  URL            │     │  (CSS selectors)│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Output Results │◄────│  Handle         │◄────│  Apply          │
│  (JSON/CSV)     │     │  Pagination     │     │  Transforms     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
Field Extraction
Basic Extraction
json
{
"field_name": "title",
"selector": "h1.product-title",
"selector_type": "css",
"extract": { "type": "text" },
"transforms": [{ "type": "trim" }],
"multiple": false
}
Extracting Attributes
json
{
"field_name": "image_url",
"selector": "img.product-image",
"selector_type": "css",
"extract": { "type": "attribute", "attribute": "src" },
"transforms": [],
"multiple": false
}
Extracting Multiple Items
json
{
"field_name": "tags",
"selector": ".tag-item",
"selector_type": "css",
"extract": { "type": "text" },
"transforms": [{ "type": "trim" }],
"multiple": true
}
Fallback Selectors
json
{
"field_name": "price",
"selector": ".price-current",
"selector_type": "css",
"fallback_selectors": [".price", "[data-price]"],
"extract": { "type": "text" },
"transforms": [{ "type": "extract_number" }],
"multiple": false
}
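The fallback behavior described by `fallback_selectors` amounts to trying each selector in order and keeping the first one that matches. A minimal sketch of that idea follows; `first_match` is a hypothetical helper, and `query` stands in for whatever function looks a selector up on the page (here a plain dictionary lookup, so the example runs without a browser).

```python
from typing import Callable, Optional


def first_match(
    query: Callable[[str], Optional[str]],
    selector: str,
    fallback_selectors: Optional[list[str]] = None,
) -> Optional[str]:
    """Try the main selector, then each fallback in order.

    Returns the first non-None result, or None if nothing matched.
    """
    for candidate in [selector, *(fallback_selectors or [])]:
        value = query(candidate)
        if value is not None:
            return value
    return None


# Stand-in for a real page: maps selectors to extracted text.
fake_page = {".price": "$19.99"}
print(first_match(fake_page.get, ".price-current", [".price", "[data-price]"]))  # $19.99
```

Order matters: put the most specific selector first and the most generic fallback last, since the first hit wins.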
Data Transforms
Transforms are applied in order after extraction:
| Transform | Description | Options |
|---|---|---|
| trim | Remove leading/trailing whitespace | - |
| strip_html | Remove HTML tags from content | - |
| extract_number | Extract numeric value (removes non-digits except decimal point) | - |
| regex | Apply regex pattern match | pattern (required) |
| replace | Replace substring or regex | pattern, replacement |
| default | Set default if value is empty | default_value |
Transform Examples
json
// Extract price as number
"transforms": [
  { "type": "trim" },
  { "type": "extract_number" }
]
// "$1,299.99" → "1299.99"

// Extract using regex
"transforms": [
  { "type": "regex", "pattern": "\\d+" }
]
// "Page 5 of 10" → "5"

// Replace text
"transforms": [
  { "type": "replace", "pattern": "USD", "replacement": "$" }
]
// "100 USD" → "100 $"

// Default value
"transforms": [
  { "type": "trim" },
  { "type": "default", "default_value": "N/A" }
]
// "" → "N/A"
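The transform pipeline above can be modeled as a loop that feeds each step's output into the next. The sketch below reproduces the four examples, but it is an illustration rather than the script's actual code: `apply_transforms` is a hypothetical name, `regex` is assumed to keep the first match, and `replace` is simplified to a literal substring replacement (the table notes the real transform also accepts a regex).

```python
import re


def apply_transforms(value, transforms):
    """Apply each transform step in order and return the resulting string."""
    value = "" if value is None else str(value)
    for step in transforms:
        kind = step["type"]
        if kind == "trim":
            value = value.strip()
        elif kind == "strip_html":
            # Naive tag removal; fine for simple markup, not a full HTML parser.
            value = re.sub(r"<[^>]+>", "", value)
        elif kind == "extract_number":
            # Drop everything except digits and the decimal point.
            value = re.sub(r"[^\d.]", "", value)
        elif kind == "regex":
            # Assumption: keep the first match, or "" when nothing matches.
            match = re.search(step["pattern"], value)
            value = match.group(0) if match else ""
        elif kind == "replace":
            # Simplification: literal substring replacement only.
            value = value.replace(step["pattern"], step.get("replacement", ""))
        elif kind == "default":
            if value == "":
                value = step.get("default_value", "")
        else:
            raise ValueError(f"unknown transform type: {kind}")
    return value


print(apply_transforms(" $1,299.99 ", [{"type": "trim"}, {"type": "extract_number"}]))  # 1299.99
```

Because steps run in order, placing `default` last means it also catches values that an earlier step (for example a non-matching `regex`) reduced to an empty string.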
Pagination
Next Button Pagination
json
{
"pagination": {
"type": "next_button",
"selector": "a.next-page",
"max_pages": 10,
"wait_ms": 1000
}
}
URL Pattern Pagination
json
{
"pagination": {
"type": "url_pattern",
"url_template": "https://example.com/products?page={page}",
"max_pages": 5,
"wait_ms": 1500
}
}
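With `url_pattern` pagination, the list of pages to visit can be computed up front by substituting a page number into `url_template`. A minimal sketch, assuming numbering starts at page 1 (the source does not say); `page_urls` is a hypothetical helper.

```python
def page_urls(url_template: str, max_pages: int, start: int = 1) -> list[str]:
    """Expand the {page} placeholder into one URL per page.

    str.replace is used instead of str.format so that any other braces
    in the URL are left untouched.
    """
    return [
        url_template.replace("{page}", str(number))
        for number in range(start, start + max_pages)
    ]


for url in page_urls("https://example.com/products?page={page}", max_pages=3):
    print(url)
```

If a site numbers its first page 0, pass `start=0`.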
Infinite Scroll
json
{
"pagination": {
"type": "infinite_scroll",
"max_pages": 20,
"wait_ms": 2000
}
}
Complete Example
json
{
"$schema": "https://crawl-bot/recipe.schema.json",
"name": "E-commerce Product Scraper",
"url_pattern": "https://example.com/products/*",
"version": "1.0",
"fields": [
{
"field_name": "product_name",
"selector": "h1.product-title",
"selector_type": "css",
"extract": { "type": "text" },
"transforms": [{ "type": "trim" }],
"multiple": false
},
{
"field_name": "price",
"selector": ".price",
"selector_type": "css",
"fallback_selectors": ["[data-price]"],
"extract": { "type": "text" },
"transforms": [
{ "type": "trim" },
{ "type": "extract_number" }
],
"multiple": false
},
{
"field_name": "description",
"selector": ".product-description",
"selector_type": "css",
"extract": { "type": "html" },
"transforms": [{ "type": "strip_html" }, { "type": "trim" }],
"multiple": false
},
{
"field_name": "image_urls",
"selector": ".gallery img",
"selector_type": "css",
"extract": { "type": "attribute", "attribute": "src" },
"transforms": [],
"multiple": true
}
],
"pagination": {
"type": "next_button",
"selector": "a.pagination-next",
"max_pages": 5,
"wait_ms": 1500
}
}
Script Usage
code
usage: execute_recipe.py [-h] --recipe RECIPE --url URL [--output OUTPUT]
                         [--format {json,csv}] [--headless] [--timeout TIMEOUT]
                         [--wait WAIT]

Execute a crawl recipe using Playwright

options:
  -h, --help            show this help message and exit
  --recipe RECIPE, -r RECIPE
                        Path to the recipe JSON file
  --url URL, -u URL     Starting URL to crawl
  --output OUTPUT, -o OUTPUT
                        Output file path (default: results.json)
  --format {json,csv}, -f {json,csv}
                        Output format (default: json)
  --headless            Run browser in headless mode
  --timeout TIMEOUT     Page load timeout in seconds (default: 30)
  --wait WAIT           Additional wait time after page load in ms (default: 0)
Error Handling
The script handles common scenarios:
- Missing selectors: Returns null for fields that don't match
- Network errors: Retries with exponential backoff
- Pagination end: Stops when next button is disabled/missing
- Rate limiting: Respects wait_ms between pages
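Retrying with exponential backoff, as mentioned in the list above, means doubling the wait after each failed attempt. The following sketch shows the general pattern; the attempt count, base delay, and the `retry_with_backoff` name are assumptions, not values taken from the script.

```python
import time


def retry_with_backoff(action, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call action(); after each failure wait base_delay_s * 2**attempt and retry.

    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            # In real code, catch only network-related exceptions here.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.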
Tips
- Test selectors first: Use browser DevTools to verify CSS selectors
- Use fallbacks: Add fallback selectors for more robust extraction
- Set appropriate wait times: Account for JavaScript-rendered content
- Handle dynamic content: Use longer wait_ms for SPAs and lazy-loaded content
- Respect robots.txt: Check the website's robots.txt before crawling
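The robots.txt check from the last tip can be automated with Python's standard `urllib.robotparser` module. This is a sketch of one way to do it, separate from the crawl script, which is not stated to perform this check; `is_allowed` is a hypothetical helper, and the rules are passed in as text so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt content permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """User-agent: *
Disallow: /admin/
"""
print(is_allowed(rules, "https://example.com/products/1"))   # True
print(is_allowed(rules, "https://example.com/admin/users"))  # False
```

To check a live site instead, `RobotFileParser` can download the file itself via `set_url()` followed by `read()`.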