AgentSkillsCN

scrapy-web-scraping

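Scrape structured data from websites and export it as JSONL, CSV, or XML. Suited to extracting quotes, articles, or tabular data from paginated web pages, with configurable output formats.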

SKILL.md
--- frontmatter
name: scrapy-web-scraping
description: Scrape structured data from websites and export to JSONL, CSV, or XML formats. Use when extracting quotes, articles, or tabular data from web pages with pagination support and configurable output formats.

Scrapy: Web Scraping Framework

Overview

Scrapy is a fast, high-level web scraping framework for Python. The scripts in this skill implement spiders that crawl web pages, extract structured data (e.g., quotes with authors), follow pagination links, and export the results as JSONL, CSV, or XML.

Quick Start

```bash
python scripts/quotes_jsonl.py output.jsonl
# Optional: override the source (URL or file://...)
# python scripts/quotes_jsonl.py --source "file:///workspace/data/fixtures/scrapy/page1.html" output.jsonl
```

Scripts

scripts/quotes_jsonl.py

Scrape quotes and export as JSON Lines (one JSON object per line).

```bash
python scripts/quotes_jsonl.py <output.jsonl>
```
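Because JSON Lines stores one JSON object per line, the output can be read back line by line without loading a single large document. A minimal sketch, assuming each line holds one record; the helper name `read_jsonl` is hypothetical and not part of the skill.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the records of a JSON Lines file as a list of dicts."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("output.jsonl")` would return one dict per scraped quote.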

scripts/quotes_csv.py

Scrape quotes and export as CSV with author and text columns.

```bash
python scripts/quotes_csv.py <output.csv>
```
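The CSV output can be loaded with the standard library. This sketch assumes the file starts with a header row naming the `author` and `text` columns, which the source does not state explicitly; `read_quotes_csv` is a hypothetical helper.

```python
import csv


def read_quotes_csv(path):
    """Return CSV rows as dicts keyed by the header row (assumed: author, text)."""
    # newline="" lets the csv module handle quoted fields that contain line breaks.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Using `csv.DictReader` rather than splitting on commas matters here, since quote text often contains commas and is therefore written as a quoted field.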

scripts/quotes_xml.py

Scrape quotes and export as XML.

```bash
python scripts/quotes_xml.py <output.xml>
```
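The source does not document the XML layout. The sketch below assumes the layout that Scrapy's built-in XML item exporter produces by default: an `items` root containing one `item` element per record, with a child element per field. If the script customizes the element names, the tag strings need adjusting; `read_quotes_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def read_quotes_xml(path):
    """Return records from an XML file shaped like <items><item>...</item></items>."""
    root = ET.parse(path).getroot()
    return [
        # findtext returns None when a field element is missing.
        {"author": item.findtext("author"), "text": item.findtext("text")}
        for item in root.iter("item")
    ]
```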

Parameters (all scripts):

  • First positional argument — Output file path
  • --output — Output file path (alternative to positional argument)
  • --source — Start source URL (supports file:// local fixtures)
  • --log-level — Scrapy log level
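One way the shared command-line interface listed above could be wired up with `argparse` is sketched below. The precedence rule (`--output` wins over the positional path), the `INFO` default for `--log-level`, and the function name `parse_args` are assumptions for illustration; the actual scripts may behave differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the options shared by the quote scripts (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Scrape quotes to a file.")
    # The output path may be given positionally or via --output.
    parser.add_argument("output_positional", nargs="?", metavar="output",
                        help="output file path")
    parser.add_argument("--output", dest="output_flag",
                        help="output file path (alternative to the positional)")
    parser.add_argument("--source", default=None,
                        help="start URL; file:// URLs point at local fixtures")
    parser.add_argument("--log-level", default="INFO",  # assumed default
                        help="Scrapy log level, e.g. DEBUG, INFO, WARNING")
    args = parser.parse_args(argv)

    # Assumed precedence: the explicit flag overrides the positional argument.
    args.output = args.output_flag or args.output_positional
    if args.output is None:
        parser.error("an output path is required (positional or --output)")
    return args
```

Making the positional optional with `nargs="?"` is what allows both calling styles to coexist; the explicit check afterwards restores the requirement that some output path is given.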

Output Schema

Each record contains:

  • author — Quote author name
  • text — Full quote text

Important Notes

  • Default target — Uses local fixture data/fixtures/scrapy/page1.html for offline reproducibility
  • Pagination — Automatically follows "next" page links
  • Dependency — Requires scrapy package (pip install scrapy)
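Since `--source` accepts `file://` URLs, a local fixture path has to be turned into an absolute URI before it is passed on the command line. A minimal sketch using `pathlib`; the helper name `fixture_uri` and the `base_dir` parameter are hypothetical.

```python
from pathlib import Path


def fixture_uri(relative_path, base_dir="."):
    """Build a file:// URI for a fixture path relative to base_dir."""
    # as_uri() requires an absolute path, so resolve against base_dir first.
    return (Path(base_dir) / relative_path).resolve().as_uri()
```

Calling `fixture_uri("data/fixtures/scrapy/page1.html")` from the workspace root should yield a value suitable for `--source`, with the exact prefix depending on where the workspace lives on disk.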