Scrapy: Web Scraping Framework

Overview

Scrapy is a fast, high-level web scraping framework for Python. The scripts here implement spiders that crawl web pages, extract structured data (e.g., quotes with their authors), follow pagination links, and export the results as JSON Lines, CSV, or XML.

Quick Start

bash

python scripts/quotes_jsonl.py output.jsonl
# Optional: override source (URL or file://...)
# python scripts/quotes_jsonl.py --source "file:///workspace/data/fixtures/scrapy/page1.html" output.jsonl

Scripts

`scripts/quotes_jsonl.py`

Scrape quotes and export as JSON Lines (one JSON object per line).

bash

python scripts/quotes_jsonl.py <output.jsonl>
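Because each line of the output is an independent JSON object, the file can be read back one line at a time. A small sketch, assuming the output is UTF-8 encoded; the helper name `read_jsonl` is hypothetical and not part of the scripts:

```python
import json


def read_jsonl(path):
    """Return the records in a JSON Lines file, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("output.jsonl")` would return a list of dictionaries, one per scraped quote.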

`scripts/quotes_csv.py`

Scrape quotes and export as CSV with author and text columns.

bash

python scripts/quotes_csv.py <output.csv>
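The CSV output can be loaded with the standard library. This sketch assumes the file starts with a header row naming the `author` and `text` columns and is UTF-8 encoded; the helper name `read_quotes_csv` is hypothetical:

```python
import csv


def read_quotes_csv(path):
    """Return the CSV rows as dictionaries keyed by the header row."""
    # newline="" lets the csv module handle quoted fields that contain line breaks.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```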

`scripts/quotes_xml.py`

Scrape quotes and export as XML.

bash

python scripts/quotes_xml.py <output.xml>
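The XML layout is not spelled out here. The sketch below assumes the layout that Scrapy's built-in XML exporter produces by default (an `<items>` root holding one `<item>` per record, with `<author>` and `<text>` children); if the script uses different element names, adjust `item_tag` accordingly. The helper name `read_quotes_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_quotes_xml(path, item_tag="item"):
    """Return one dict per <item> element (element names are an assumption)."""
    root = ET.parse(path).getroot()
    return [
        {"author": item.findtext("author"), "text": item.findtext("text")}
        for item in root.iter(item_tag)
    ]
```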

Parameters (all scripts):

• First positional argument: output file path
• --output: output file path (alternative to the positional argument)
• --source: start URL (also accepts file:// URLs for local fixtures)
• --log-level: Scrapy log level (e.g., DEBUG, INFO, WARNING, ERROR)
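One way such a command-line interface could be wired up with `argparse` is sketched below. It is not the scripts' actual parser: the function names, the `INFO` default for the log level, and the rule that `--output` wins when both forms are given are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Scrape quotes and export them.")
    # The positional output path is optional so that --output can be used instead.
    parser.add_argument("output_positional", nargs="?", metavar="OUTPUT",
                        help="output file path")
    parser.add_argument("--output", dest="output_option",
                        help="output file path (alternative to the positional argument)")
    parser.add_argument("--source", default=None,
                        help="start URL; file:// URLs point at local fixtures")
    parser.add_argument("--log-level", default="INFO",  # assumed default
                        help="Scrapy log level")
    return parser


def parse_args(argv=None):
    """Return (output, source, log_level); exit with an error if no output is given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: --output takes precedence when both forms are supplied.
    output = args.output_option or args.output_positional
    if output is None:
        parser.error("an output file path is required")
    return output, args.source, args.log_level
```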

Output Schema

Each record contains:

• author: quote author name
• text: full quote text
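A quick check that a parsed record matches this schema might look like the sketch below. The function name `is_valid_record` is hypothetical, and requiring both fields to be non-empty strings is an assumption that goes beyond what the schema states.

```python
REQUIRED_FIELDS = ("author", "text")


def is_valid_record(record):
    """True if both documented fields are present as non-empty strings."""
    if not isinstance(record, dict):
        return False
    return all(
        isinstance(record.get(field), str) and record[field].strip() != ""
        for field in REQUIRED_FIELDS
    )
```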

Important Notes

• Default target: uses the local fixture data/fixtures/scrapy/page1.html for offline reproducibility
• Pagination: automatically follows "next" page links
• Dependency: requires the scrapy package (pip install scrapy)
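Since the source can be given as a file:// URL, a local fixture path has to be turned into an absolute URI first. A minimal sketch using `pathlib`; the helper name `fixture_uri` is hypothetical and not part of the scripts:

```python
from pathlib import Path


def fixture_uri(path):
    """Convert a filesystem path (relative or absolute) to a file:// URI."""
    # as_uri() only works on absolute paths, so resolve the path first.
    return Path(path).resolve().as_uri()
```

The result of `fixture_uri("data/fixtures/scrapy/page1.html")` depends on the current working directory and can be passed as the value of `--source`.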