Add Scraper for a New Website
You are helping the user add support for scraping property listings from a new website. This is a multi-step process that requires creating a mapping file, capturing a test fixture, and wiring everything up.
Inputs
The user will provide $ARGUMENTS, which should be the URL of a sample listing page from the target website or just the website name. If no URL is provided, ask for one.
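As a minimal sketch of the input handling described above, the argument can be classified before deciding whether to ask for a URL. The `ScraperInput` type and `classifyArgument` helper are hypothetical names used only for illustration; they are not part of the project.

```typescript
// Hypothetical helper: decides whether $ARGUMENTS holds a URL, a bare site name, or nothing.
type ScraperInput =
  | { kind: "url"; url: URL }
  | { kind: "name"; name: string }
  | { kind: "missing" };

function classifyArgument(raw: string): ScraperInput {
  const value = raw.trim();
  if (value === "") {
    return { kind: "missing" };
  }
  try {
    const url = new URL(value);
    // Only http(s) URLs count as sample listing pages.
    if (url.protocol === "http:" || url.protocol === "https:") {
      return { kind: "url", url };
    }
  } catch {
    // new URL() throws on strings without a scheme; treat those as a site name.
  }
  return { kind: "name", name: value };
}
```

When the result is `name` or `missing`, the workflow still needs a sample URL, so that is the point at which to ask the user for one.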
Step-by-step workflow
Phase 1: Gather information
- •Get a sample URL if the user didn't provide one
- •Determine the scraper name. Derive it from the hostname (e.g. www.example-realty.com → example_realty). Use lowercase, underscores for separators, keep it short. Ask the user to confirm the name.
- •Get the HTML. Ask the user how they want to provide it:
  - •If the site is static (no JS rendering needed), we can fetch it directly
  - •If the site is JS-heavy (React/Vue/Angular SPA), the user should save the page from their browser ("Save As" → "Web Page, HTML Only") and provide the file path
  - •They can also pipe it:
    curl ... | npm run capture-fixture -- --stdin --url <url>
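The naming rule from Phase 1 can be sketched roughly as follows. `deriveScraperName` is a hypothetical helper, and dropping the last hostname label is a naive assumption that mishandles multi-part suffixes such as .co.uk, which is one reason the user should still confirm the name.

```typescript
// Hypothetical helper: proposes a scraper name from a sample listing URL.
function deriveScraperName(sampleUrl: string): string {
  const hostname = new URL(sampleUrl).hostname.toLowerCase();
  // Drop a leading "www" label, then (naively) drop the final label as the TLD.
  const labels = hostname.split(".").filter((label) => label !== "www");
  const base = labels.slice(0, Math.max(1, labels.length - 1)).join("_");
  // Lowercase alphanumerics only, with underscores as separators.
  return base.replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

// Example from the text: www.example-realty.com → example_realty
const suggestedName = deriveScraperName("https://www.example-realty.com/listing/123");
```

Treat the result as a suggestion only. A shorter name may be preferable for long hostnames.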
Phase 2: Analyze the HTML and create the mapping
- •Capture the fixture using the capture-fixture utility:
    cd astro-app
    # For URL fetch:
    npm run capture-fixture -- <url> --name <scraper_name>
    # For local file:
    npm run capture-fixture -- --file <path> --url <url> --name <scraper_name> --no-extract
- •Analyze the saved HTML fixture at astro-app/test/fixtures/<name>.html. Look for:
  - •Title: Usually in <h1>, <title>, or og:title meta tag
  - •Price: Look for price elements, currency symbols, itemprop="price" microdata
  - •Address/Location: Address strings, postal codes, og: meta tags, or structured data
  - •Coordinates: Often in <script> tags as JSON (lat/lng), or in meta tags
  - •Bedrooms/Bathrooms: Usually near icons or labels like "bed", "bath", "BR"
  - •Images: <img> tags in gallery sections, or og:image meta tags
  - •Reference/ID: Property ID in URL path, hidden inputs, or data-* attributes
  - •For sale/rent: Status indicators, URL patterns (/to-rent/, /for-sale/)
  - •Description: meta[name=description] or content sections
  - •Area/Size: Square footage/meters, usually near bed/bath counts
- •Create the mapping JSON at config/scraper_mappings/<name>.json. Follow the format described in the reference file at .claude/skills/add-scraper/reference.md.
  Key rules:
  - •Each field must appear in exactly ONE section. If the same field appears in multiple sections, the last one processed wins (processing order: defaultValues → images → features → intFields → floatFields → textFields → booleanFields)
  - •Always add cssCountId: "0" when a CSS selector might match multiple elements
  - •Use cssAttr to read attribute values (e.g. content from meta tags)
  - •Use scriptRegEx for data embedded in <script> tags
  - •Use urlPathPart to extract segments from the URL path
  - •Prefer class-based or ID-based selectors over nth-child chains
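The "exactly ONE section" rule can be checked mechanically. The sketch below assumes each section of the parsed mapping is an object keyed by field name; that shape is an assumption made for illustration (the authoritative format is in reference.md), and sections with any other shape are skipped. `findDuplicateFields` is a hypothetical helper.

```typescript
// Processing order as stated in the key rules above.
const SECTION_ORDER = [
  "defaultValues",
  "images",
  "features",
  "intFields",
  "floatFields",
  "textFields",
  "booleanFields",
] as const;

// Hypothetical helper: returns each field that appears in more than one section,
// with the sections listed in processing order (the last one listed would win).
function findDuplicateFields(mapping: Record<string, unknown>): Map<string, string[]> {
  const seen = new Map<string, string[]>();
  for (const section of SECTION_ORDER) {
    const value = mapping[section];
    // Assumption: sections are plain objects keyed by field name; skip anything else.
    if (typeof value !== "object" || value === null || Array.isArray(value)) {
      continue;
    }
    for (const field of Object.keys(value)) {
      const sections = seen.get(field) ?? [];
      sections.push(section);
      seen.set(field, sections);
    }
  }
  const duplicates = new Map<string, string[]>();
  seen.forEach((sections, field) => {
    if (sections.length > 1) {
      duplicates.set(field, sections);
    }
  });
  return duplicates;
}
```

An empty result means no field is defined twice. Any entry it does report should be resolved by keeping the field in a single section.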
Phase 3: Validate and register
- •Run the capture-fixture utility again with extraction enabled to preview results:
    cd astro-app
    npm run capture-fixture -- --file test/fixtures/<name>.html --url <source_url> --name <name> --force
- •Review the extraction preview. If fields are wrong:
  - •Check selectors against the actual HTML
  - •Fix the mapping JSON
  - •Re-run capture-fixture to verify
- •Add to the hostname map. Add entries for the new hostname to:
  - •astro-app/scripts/capture-fixture.ts in the HOSTNAME_MAP constant
  - •astro-app/src/lib/services/url-validator.ts in the LOCAL_HOST_MAP constant
- •Add to the test manifest. Copy the manifest stub from the capture-fixture output into astro-app/test/fixtures/manifest.ts. Review and adjust expected values: remove fields that are zero/empty and not meaningful, and ensure types are correct (integers vs strings vs booleans).
- •Run the test suite to verify everything passes:
    cd astro-app && npx vitest run
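The manifest clean-up step can be approximated with a small filter. This is a sketch that assumes the stub's expected values form a flat object. `pruneExpected`, the `ExpectedValue` type, and the field names in the example are hypothetical. It drops every zero or empty value, so any zero that is actually meaningful has to be restored by hand.

```typescript
type ExpectedValue = string | number | boolean | null | string[];

// Hypothetical helper: removes null, "", 0 and [] from a flat expected-values object.
// Booleans are always kept, because false can be a meaningful expectation.
function pruneExpected(stub: Record<string, ExpectedValue>): Record<string, ExpectedValue> {
  const kept: Record<string, ExpectedValue> = {};
  for (const field of Object.keys(stub)) {
    const value = stub[field];
    if (value === undefined) {
      continue;
    }
    const isEmpty =
      value === null ||
      value === "" ||
      value === 0 ||
      (Array.isArray(value) && value.length === 0);
    if (!isEmpty) {
      kept[field] = value;
    }
  }
  return kept;
}

// Illustrative stub with made-up field names.
const prunedExample = pruneExpected({
  title: "Flat",
  price: 250000,
  count_bedrooms: 0,
  description: "",
  for_sale: true,
  for_rent: false,
  image_urls: [],
});
```

Type checks still need a manual pass. The filter does not notice a price captured as the string "250000" where an integer is expected.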
Phase 4: Summary
- •Present a summary of all files created/modified and offer to commit the changes.
Important notes
- •The mapping file format is JSON but parsed with JSON5 (comments are allowed)
- •The name field in the mapping must match the scraper name used everywhere else
- •defaultValues always produce strings (e.g. "true" not true)
- •stripPunct removes . and , only, not currency symbols
- •stripString removes the first occurrence of an exact substring, runs after split
- •Cheerio converts <br> to \n, never \r
- •parseFloat/parseInt return 0 on failure, not an error
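Several of these notes interact when a price includes a currency symbol. The functions below are stand-alone re-implementations of the behaviour described above, written for illustration. They are not the project's code, and the order in which the real pipeline applies them is not asserted here.

```typescript
// Removes "." and "," only, as described for stripPunct.
function stripPunct(value: string): string {
  return value.replace(/[.,]/g, "");
}

// Removes the first occurrence of an exact substring, as described for stripString.
function stripString(value: string, target: string): string {
  return value.replace(target, "");
}

// Integer parsing that yields 0 instead of an error when the text is not numeric.
function toInt(value: string): number {
  const parsed = parseInt(value, 10);
  return Number.isNaN(parsed) ? 0 : parsed;
}

const rawPrice = "€1.250.000";
const withoutPunct = stripPunct(rawPrice); // "€1250000": the currency symbol is still there
const naive = toInt(withoutPunct); // 0, because the text starts with "€"
const cleaned = toInt(stripString(withoutPunct, "€")); // 1250000
```

A price of 0 in the extraction preview is therefore more likely a sign of leftover non-numeric text than a genuine value.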
Existing scrapers for reference
These scrapers are already configured: idealista, rightmove, zoopla, realtor, fotocasa, pisos, realestateindia, forsalebyowner, mlslistings, wyomingmls, inmo1, pwb, carusoimmobiliare, cerdfw, weebrix.
See config/scraper_mappings/ for their mapping files and astro-app/test/fixtures/manifest.ts for expected values.