Site Harvest Skill
Purpose: Extract complete website content, design system, and assets for rebuilding or migration.
Trigger: /site-harvest [url] or "harvest [url]"
Architecture: Token-Efficient Hybrid
| Task | Tool | Why |
|---|---|---|
| URL Discovery | Firecrawl map | Their tokens, instant, free 500 pages |
| Bulk Content | Firecrawl crawl | Their tokens, parallel extraction |
| Structured Data | Firecrawl extract | Schema-based, accurate |
| Sitemap Analysis | Claude + WebFetch | Quick XML parse, freshness check |
| Crawl Import | Screaming Frog CSV | Pre-crawled, comprehensive |
| Design Screenshots | Firecrawl scrape | Branding format extracts design |
| Side-by-Side Compare | Chrome Tab | Only for old vs new comparison |
Token Strategy: Firecrawl does the heavy lifting (their tokens) → Chrome is used only for visual comparison (our tokens)
Prerequisites
- Firecrawl MCP - Primary extraction engine
- Chrome Tab MCP - Only for side-by-side comparison phase
- URL sources (one or more):
  - Site URL (Firecrawl discovers pages)
  - Screaming Frog export (CSV)
  - sitemap.xml (auto-detected)
Quick Start
/site-harvest https://example.com
9-Phase Workflow
Phase 1: URL Discovery & Site Structure (FOUNDATION)
⚠️ THIS IS THE BASIS FOR EVERYTHING. Get this wrong = rebuild has gaps.
- Ask for Screaming Frog export first (most comprehensive)
- Analyse sitemap.xml via WebFetch (freshness, coverage)
- Run Firecrawl map (live discoverable pages)
- Check AJAX/Pagination (load more, infinite scroll, WordPress API)
- Four-way comparison: Sitemap vs Firecrawl vs SF vs AJAX
- Document site structure (page types, navigation, hierarchy)
- Report findings and wait for confirmation
Detailed instructions: See references/url-discovery.md
Save: /url-discovery.json
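The four-way comparison can be sketched as a set merge that records which sources saw each URL. This is a minimal illustration, not the skill's actual implementation: the function names `normalise` and `compare_sources`, the source labels, and the normalisation rules (lower-cased host, no trailing slash, no fragment) are all assumptions.

```python
from urllib.parse import urlsplit


def normalise(url: str) -> str:
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def compare_sources(sources: dict[str, list[str]]) -> dict:
    """Merge URL lists and report which sources missed each URL."""
    seen: dict[str, set[str]] = {}
    for name, urls in sources.items():
        for url in urls:
            seen.setdefault(normalise(url), set()).add(name)
    all_names = set(sources)
    return {
        "merged": sorted(seen),
        # Only URLs that at least one source failed to report end up here.
        "missing_from": {
            url: sorted(all_names - found)
            for url, found in sorted(seen.items())
            if found != all_names
        },
    }


# Hypothetical inputs standing in for the four discovery sources.
report = compare_sources({
    "sitemap": ["https://example.com/", "https://example.com/about/"],
    "firecrawl": ["https://example.com", "https://example.com/about", "https://example.com/blog"],
    "screaming_frog": ["https://example.com/", "https://example.com/about", "https://example.com/blog"],
    "ajax": ["https://example.com/blog?page=2"],
})
```

The length of `report["merged"]` is the natural candidate for the crawl limit in Phase 2, and `report["missing_from"]` is the list to review before asking for confirmation.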
Phase 2: Content Extraction (Firecrawl)
firecrawl_crawl({
  url: "https://example.com",
  limit: [merged URL count], // placeholder: total URLs merged in Phase 1
  scrapeOptions: {
    formats: ["markdown", "html", "links"],
    onlyMainContent: true
  }
})
For each page, save:
- /pages/[slug].md (clean markdown)
- /pages/[slug].json (structured: title, meta, headings)
Extract media references (images, videos, documents).
Save: /content-manifest.json
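One possible way to derive the `[slug]` file names and a content-manifest entry from a crawled URL is sketched below. The slug rule (path segments joined with hyphens, homepage mapped to `index`) and the entry fields are assumptions for illustration; the skill does not specify them.

```python
import re
from urllib.parse import urlsplit


def slug_for(url: str) -> str:
    """Turn a page URL into a filesystem-safe slug; the homepage becomes 'index'."""
    path = urlsplit(url).path.strip("/")
    slug = re.sub(r"[^a-z0-9]+", "-", path.lower()).strip("-")
    return slug or "index"


def manifest_entry(url: str, title: str, media: list[str]) -> dict:
    """Describe one harvested page: where its files live and which media it references."""
    slug = slug_for(url)
    return {
        "url": url,
        "title": title,
        "markdown": f"/pages/{slug}.md",
        "json": f"/pages/{slug}.json",
        "media": sorted(set(media)),  # de-duplicate repeated references
    }


entry = manifest_entry(
    "https://example.com/services/web-design/",
    "Web Design",
    ["/img/hero.jpg", "/img/hero.jpg", "/docs/brochure.pdf"],
)
```

Note that this rule maps `/a/b` and `/a-b` to the same slug, so a real harvest would need a collision check before writing files.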
Phase 3: Design System Extraction
// Branding extraction
firecrawl_scrape({
  url: "https://example.com",
  formats: ["branding"]
})

// Full CSS capture
firecrawl_scrape({
  url: "https://example.com",
  formats: ["html", "rawHtml"]
})
- Parse all stylesheet URLs and download CSS files
- Extract CSS variables (--color-*, --font-*, --spacing-*)
- Capture typography scale (h1-h6, p, small)
Save: /design-tokens.json
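Extracting the CSS variables from downloaded stylesheets might look like the sketch below. It is a regex-based approximation rather than a full CSS parser, and the grouping into `color`, `font`, `spacing`, and `other` is an assumed shape for design-tokens.json, not one defined by the skill.

```python
import re

# A custom property name must not be glued to a preceding identifier,
# which keeps BEM selectors such as .btn--ghost:hover from matching.
CUSTOM_PROPERTY = re.compile(r"(?<![\w-])(--[\w-]+)\s*:\s*([^;}]+)")


def extract_tokens(css: str) -> dict[str, dict[str, str]]:
    """Group CSS custom properties by their first name segment."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)  # strip comments first
    groups: dict[str, dict[str, str]] = {"color": {}, "font": {}, "spacing": {}, "other": {}}
    for name, value in CUSTOM_PROPERTY.findall(css):
        prefix = name[2:].split("-", 1)[0]
        bucket = prefix if prefix in ("color", "font", "spacing") else "other"
        groups[bucket][name] = value.strip()
    return groups


sample_css = """
:root { --color-primary: #0a5; --font-body: "Inter", sans-serif; --spacing-lg: 2rem; --radius: 4px; }
.btn--ghost:hover { color: var(--color-primary); }
"""
tokens = extract_tokens(sample_css)
```

If the same variable is declared more than once (for example inside a dark-mode media query), the last declaration wins here; a fuller version would record the selector or media context alongside each value.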
Phase 4: Component Style Catalogue (EXHAUSTIVE)
Capture EVERY visual pattern - this prevents "footer links unstyled" problems.
Extract computed styles for:
- Navigation (header, links, mobile menu, dropdowns)
- Footer (container, columns, links, social icons)
- Typography (headings, paragraphs, lists, blockquotes)
- Buttons & CTAs (primary, secondary, ghost, hover states)
- Sections (padding, backgrounds, alternating patterns)
- Dividers (hr, borders, SVG waves, clip-path angles)
- Cards (container, hover, image, content)
- Icons (download exact SVGs, don't substitute!)
- Forms (inputs, labels, error states)
Detailed element list: See references/component-styles.md
Save: /component-styles.json
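Because the catalogue is meant to be exhaustive, a quick coverage check before saving can catch a skipped category. The sketch below assumes component-styles.json is a dictionary keyed by category; the key names are hypothetical and simply mirror the list above.

```python
# Hypothetical top-level keys, one per category in the list above.
REQUIRED_COMPONENTS = [
    "navigation", "footer", "typography", "buttons", "sections",
    "dividers", "cards", "icons", "forms",
]


def missing_components(catalogue: dict) -> list[str]:
    """Return the categories that are absent or empty in the catalogue."""
    return [name for name in REQUIRED_COMPONENTS if not catalogue.get(name)]


catalogue = {
    "navigation": {"header": {"background-color": "#ffffff"}},
    "footer": {"links": {"color": "#333333", "text-decoration": "none"}},
    "typography": {"h1": {"font-size": "48px"}},
    "buttons": {},  # captured as a key, but no styles recorded
}
gaps = missing_components(catalogue)
```

Any entry in `gaps` is a candidate for the warnings section of the manifest.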
Phase 5: Visual Capture (Screenshots)
firecrawl_scrape({
  url: "https://example.com",
  formats: [
    { type: "screenshot", fullPage: true, viewport: { width: 1920, height: 1080 } },
    { type: "screenshot", fullPage: true, viewport: { width: 768, height: 1024 } },
    { type: "screenshot", fullPage: true, viewport: { width: 375, height: 812 } }
  ]
})
Screenshot key pages (homepage, about, services, blog, contact) and components (header, footer, hero, cards, dividers).
Save to: /screenshots/
Phase 6: Asset Download
- Images: Download all, maintain folder structure → /media/
- Fonts: Parse @font-face, download all formats → /assets/fonts/
- Icons: Extract inline SVGs exactly, download external SVGs → /assets/icons/
- JavaScript: Download external JS, note inline scripts → /assets/scripts/
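The font step depends on resolving every `url(...)` inside `@font-face` rules against the stylesheet that declared it. A minimal sketch follows, assuming the CSS text has already been downloaded; `font_urls` is an illustrative name, and the regexes are a simplification that does not handle every valid CSS form.

```python
import re
from urllib.parse import urljoin

FONT_FACE = re.compile(r"@font-face\s*\{(.*?)\}", re.S)
URL_REF = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


def font_urls(css: str, stylesheet_url: str) -> list[str]:
    """List absolute font file URLs from @font-face rules, in source order."""
    urls: list[str] = []
    for block in FONT_FACE.findall(css):
        for ref in URL_REF.findall(block):
            if ref.startswith("data:"):
                continue  # embedded fonts have nothing to download
            absolute = urljoin(stylesheet_url, ref.strip())
            if absolute not in urls:
                urls.append(absolute)
    return urls


sample_css = """
@font-face {
  font-family: "Inter";
  src: url("../fonts/inter.woff2") format("woff2"),
       url(../fonts/inter.woff) format("woff");
}
"""
fonts = font_urls(sample_css, "https://example.com/assets/css/main.css")
```

Relative paths are resolved against the stylesheet URL rather than the page URL, which is where font downloads most often go wrong.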
Phase 7: Theme Skill Generation
Generate: /[project-name]-theme.md
Document:
- Brand identity (colours with hex + Tailwind classes)
- Typography (fonts, sizes, weights)
- Section patterns (padding, backgrounds, dividers)
- Component specs (buttons, cards, links)
- Layout patterns (grids, flexbox)
- Special elements (wave SVGs, icons)
- Tailwind classes reference
Template: See references/theme-generation.md
This is the single source of truth during rebuild.
Phase 8: Manifest Generation
Generate a comprehensive manifest.json containing:
- Harvest metadata (URL, date, tool version)
- URL discovery results (all sources compared)
- Pages list with files
- Design assets references
- Screenshots index
- Warnings and flags
Example manifest: See references/output-structure.md
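As a rough illustration of how those sections could be assembled, the sketch below builds a manifest dictionary and serialises it. The field names are assumptions made for this example; the authoritative shape is the one in references/output-structure.md.

```python
import json
from datetime import datetime, timezone


def build_manifest(site_url: str, pages: list[dict], screenshots: list[str],
                   warnings: list[str], tool_version: str = "unknown") -> dict:
    """Assemble the manifest sections into one dictionary (field names are illustrative)."""
    return {
        "harvest": {
            "url": site_url,
            "date": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "tool_version": tool_version,
        },
        "url_discovery": "url-discovery.json",
        "pages": pages,
        "design_assets": ["design-tokens.json", "component-styles.json"],
        "screenshots": sorted(screenshots),
        "warnings": warnings,
    }


manifest = build_manifest(
    "https://example.com",
    pages=[{"url": "https://example.com/", "files": ["pages/index.md", "pages/index.json"]}],
    screenshots=["screenshots/home-desktop.png"],
    warnings=["Font file blocked: may need manual download"],
)
manifest_json = json.dumps(manifest, indent=2)
```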
Phase 9: Side-by-Side Comparison (Chrome Tab)
Only runs when BOTH old and new sites exist.
- Open both sites in Chrome tabs
- Screenshot both at same viewport
- Compare: header, hero, sections, footer, dividers, icons
- Flag mismatches with specifics
- Generate comparison report
Detailed workflow: See references/rebuild-workflow.md
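Flagging mismatches "with specifics" is easier when the comparison produces exact property paths. The sketch below is one possible supplement to the visual check: it assumes style values for both sites have been captured as nested dictionaries (for example in the shape of component-styles.json) and diffs them. It does not compare screenshots, and `diff_styles` is a hypothetical helper.

```python
def diff_styles(old: dict, new: dict, path: str = "") -> list[str]:
    """Walk two nested style dictionaries and describe every difference."""
    mismatches: list[str] = []
    for key in sorted(set(old) | set(new)):
        here = f"{path}.{key}" if path else key
        if key not in new:
            mismatches.append(f"{here}: missing in new site")
        elif key not in old:
            mismatches.append(f"{here}: only in new site")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            mismatches.extend(diff_styles(old[key], new[key], here))
        elif old[key] != new[key]:
            mismatches.append(f"{here}: {old[key]!r} -> {new[key]!r}")
    return mismatches


old_site = {
    "footer": {"link": {"color": "#333", "text-decoration": "none"}},
    "button": {"radius": "4px"},
}
new_site = {
    "footer": {"link": {"color": "#000", "text-decoration": "none"}},
}
issues = diff_styles(old_site, new_site)
```

Each string in `issues` can go straight into the comparison report next to the matching screenshot pair.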
Critical Rules
❌ DON'T
- Substitute icons with similar ones from icon libraries
- Ignore wave/angle dividers
- Skip footer link styling
- Assume section spacing without measuring
- Miss hover/active/focus states
✅ DO
- Extract EXACT SVG markup for all custom icons
- Capture ALL divider types (hr, border, SVG, clip-path)
- Document EVERY link style (nav, footer, inline, CTA)
- Measure actual padding/margin values
- Screenshot unusual patterns for reference
Error Handling
| Error | Action |
|---|---|
| Firecrawl rate limit | Wait, retry with smaller batch |
| sitemap.xml missing | Continue with Firecrawl + Screaming Frog |
| CSS file 404 | Log warning, check for inline styles |
| Font file blocked | Note in manifest, may need manual download |
| SVG divider complex | Screenshot + extract raw HTML |
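The "wait, retry with smaller batch" action for rate limits can be sketched as a loop that halves the batch size and doubles the wait after each failure. Everything named here is hypothetical: `RateLimitError` stands in for whatever error the crawl call surfaces, and `fetch_batch` is a caller-supplied function wrapping the real request.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Placeholder for the rate-limit error raised by the crawl call."""


def crawl_in_batches(urls: list[str],
                     fetch_batch: Callable[[list[str]], list],
                     batch_size: int = 50,
                     wait_seconds: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> list:
    """Fetch URLs in batches, shrinking the batch and backing off on rate limits."""
    results: list = []
    index = 0
    while index < len(urls):
        batch = urls[index:index + batch_size]
        try:
            results.extend(fetch_batch(batch))
            index += len(batch)  # advance only after a successful batch
        except RateLimitError:
            if batch_size <= 1:
                raise  # cannot shrink further; surface the error
            sleep(wait_seconds)
            batch_size = max(1, batch_size // 2)
            wait_seconds *= 2
    return results
```

The `sleep` parameter is injectable so the backoff can be exercised without real delays.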
Example Usage
# Basic harvest
/site-harvest https://client-site.co.uk

# With Screaming Frog export
/site-harvest https://example.com --urls screaming-frog-export.csv

# Comparison mode (after rebuild)
/site-harvest compare https://old-site.com https://new-site.vercel.app
Output Structure
See references/output-structure.md for complete folder layout and manifest example.
/scraped-data/[site-name]/
├── manifest.json
├── [site-name]-theme.md
├── url-discovery.json
├── design-tokens.json
├── component-styles.json
├── pages/
├── screenshots/
├── media/
├── assets/
└── comparison/
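Creating this folder layout up front keeps later phases from failing on missing directories. A minimal sketch, assuming the harvest is written beneath a caller-chosen root directory; `create_layout` is an illustrative name.

```python
from pathlib import Path

# Subfolders taken from the layout above; the files are written by later phases.
SUBFOLDERS = ["pages", "screenshots", "media", "assets", "comparison"]


def create_layout(root: Path, site_name: str) -> Path:
    """Create scraped-data/<site_name>/ with its subfolders and return the base path."""
    base = root / "scraped-data" / site_name
    for name in SUBFOLDERS:
        # exist_ok makes a repeated harvest of the same site safe to re-run.
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

For example, `create_layout(Path.cwd(), "client-site")` prepares `scraped-data/client-site/` in the current working directory.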