Crawl Site
Extract data from multiple pages or handle dynamically loaded lists.
Trigger
The user wants to:
- Crawl an entire site to a specific depth
- Scrape data across many URLs provided in a file
- Extract results from a page with infinite scroll
- Automatically click "Next" buttons to paginate through results
- Process multiple URLs in parallel (concurrency)
- Parse sitemaps or RSS feeds
Workflow
- Crawl a Domain: Use `webscraper crawl site` with a starting URL and depth.

      webscraper crawl site "https://example.com" --depth 2 --extract "h1"
- Sitemap/RSS: Parse structured feeds directly.

      webscraper crawl sitemap "https://example.com/sitemap.xml"
      webscraper crawl rss "https://example.com/feed.xml"
- Infinite Scroll: Use `extract infinite` for pages that load content as you scroll down.

      webscraper extract infinite --url "https://example.com/gallery" --extract ".item" --max-items 100
- Auto-Pagination: Use `extract paginate` to click a "Next" button repeatedly.

      webscraper extract paginate --url "https://example.com/blog" --next "a.next" --extract "h2" --max-pages 10
- Batch Processing: Use `batch urls` with a file containing many URLs to scrape them all at once.

      webscraper batch urls urls.txt --extract "h1" --concurrency 5
- With Proxy: Use a proxy for large crawls to reduce the chance of being rate limited.

      webscraper --proxy "http://proxy:8080" crawl site "URL" --depth 3
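The batch step depends on a clean URL file, so a short preparation pass can help before a large run. The sketch below is only an illustration: it assumes `urls.txt` holds one URL per line (the source does not specify the file format), and `raw_urls.txt` and the temporary working directory are hypothetical names. It keeps only http(s) URLs, removes duplicates, and then runs the batch command shown above only if the `webscraper` CLI is installed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical raw input: candidate URLs mixed with blanks, comments, and duplicates.
workdir="$(mktemp -d)"
cat > "$workdir/raw_urls.txt" <<'EOF'
https://example.com/a
https://example.com/b

# a comment line
https://example.com/a
ftp://example.com/skip-me
EOF

# Keep only http(s) URLs, then drop duplicates while preserving first-seen order.
# Assumption: the batch command expects one URL per line.
grep -E '^https?://' "$workdir/raw_urls.txt" | awk '!seen[$0]++' > "$workdir/urls.txt"

url_count="$(wc -l < "$workdir/urls.txt" | tr -d ' ')"
echo "Prepared $url_count URLs in $workdir/urls.txt"

# Run the batch command from the workflow only when the CLI is available.
if command -v webscraper >/dev/null 2>&1; then
  webscraper batch urls "$workdir/urls.txt" --extract "h1" --concurrency 5
fi
```

Deduplicating up front avoids fetching the same page twice, which matters more as the concurrency setting and list size grow.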
Output
- Extracted data from multiple pages (stdout or saved to a directory)
- Progress logs for crawl or batch operations
- Error reports for failed URLs (can be retried with `batch retry`)