Scrapy: Web Scraping Framework
Overview
Scrapy is a fast, high-level web scraping framework for Python. These scripts implement spiders that crawl web pages, extract structured data (e.g., quotes with authors), follow pagination links, and export the results as JSON Lines, CSV, or XML.
Quick Start
bash
python scripts/quotes_jsonl.py output.jsonl
# Optional: override source (URL or file://...)
# python scripts/quotes_jsonl.py --source "file:///workspace/data/fixtures/scrapy/page1.html" output.jsonl
Scripts
scripts/quotes_jsonl.py
Scrape quotes and export as JSON Lines (one JSON object per line).
bash
python scripts/quotes_jsonl.py <output.jsonl>
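To consume the JSON Lines output, each line can be parsed independently. The helper below is a small sketch using only the standard library; the name read_jsonl is hypothetical and not part of the scripts.

```python
import json


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Since every line is a complete JSON document, a large file can also be processed one line at a time instead of being collected into a list.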
scripts/quotes_csv.py
Scrape quotes and export as CSV with author and text columns.
bash
python scripts/quotes_csv.py <output.csv>
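The CSV output can be read back with the standard csv module. This sketch assumes the file starts with a header row naming the author and text columns; read_quotes_csv is a hypothetical helper name.

```python
import csv


def read_quotes_csv(path):
    """Return rows as dicts keyed by the header row (assumed: author, text)."""
    # newline="" lets the csv module handle quoted fields that contain newlines.
    with open(path, newline="", encoding="utf-8") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
```

Quote text often contains commas, so relying on a CSV parser rather than splitting on "," avoids broken rows.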
scripts/quotes_xml.py
Scrape quotes and export as XML.
bash
python scripts/quotes_xml.py <output.xml>
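One plausible way to produce all three formats from the same spider is Scrapy's FEEDS setting, which maps an output path to export options. Whether the scripts actually use FEEDS is an assumption; the FORMATS table and feed_settings helper below are hypothetical names for illustration.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a Scrapy feed format name.
FORMATS = {".jsonl": "jsonlines", ".csv": "csv", ".xml": "xml"}


def feed_settings(output_path):
    """Build a settings dict that exports items to output_path."""
    fmt = FORMATS[Path(output_path).suffix.lower()]  # KeyError on unknown extensions
    options = {"format": fmt, "encoding": "utf8", "overwrite": True}
    if fmt == "csv":
        # Pin the column order so the CSV always reads author, text.
        options["fields"] = ["author", "text"]
    return {"FEEDS": {output_path: options}}
```

A dict like this could be passed as the settings argument when constructing a scrapy.crawler.CrawlerProcess, so the spider itself stays unaware of the output format.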
Parameters (all scripts):
- First positional argument: Output file path
- --output: Output file path (alternative to the positional argument)
- --source: Start source URL (supports file:// local fixtures)
- --log-level: Scrapy log level
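The parameter list above could be expressed with argparse roughly as follows. This is a sketch under assumptions: the scripts' real parser is not shown here, the "INFO" default for the log level is a placeholder, and letting --output take precedence over the positional argument is a guess.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Scrape quotes to a file.")
    # The positional output path is optional so that --output can replace it.
    parser.add_argument("output_positional", nargs="?", help="output file path")
    parser.add_argument("--output", help="output file path (alternative)")
    parser.add_argument("--source", help="start URL; file:// URLs are allowed")
    # Assumed default; the document does not state one.
    parser.add_argument("--log-level", default="INFO", help="Scrapy log level")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed precedence: --output wins over the positional argument.
    args.output = args.output or args.output_positional
    if not args.output:
        parser.error("an output path is required")
    return args
```

For example, parse_args(["output.jsonl"]) and parse_args(["--output", "output.jsonl"]) both yield args.output == "output.jsonl".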
Output Schema
Each record contains:
- author: Quote author name
- text: Full quote text
Important Notes
- Default target: Uses the local fixture data/fixtures/scrapy/page1.html for offline reproducibility
- Pagination: Automatically follows "next" page links
- Dependency: Requires the scrapy package (pip install scrapy)
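Because the source may be either a web URL or a local fixture, a script needs some way to turn a plain file path into a file:// URL that Scrapy can fetch. The helper below is one possible approach, not necessarily what the scripts do; to_source_url is a hypothetical name.

```python
from pathlib import Path


def to_source_url(value):
    """Return value unchanged if it is already a URL, else a file:// URL."""
    if "://" in value:
        return value
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(value).resolve().as_uri()
```

With this, to_source_url("data/fixtures/scrapy/page1.html") yields a file:// URL rooted at the current working directory, while http(s) URLs pass through untouched.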