Web Scraping Skill
Extract structured data from websites using Playwright MCP for browser automation and dynamic content handling.
Capabilities
- •Dynamic page scraping (JavaScript-rendered content)
- •Form submission and interaction
- •Multi-page crawling
- •Screenshot capture
- •PDF generation from pages
- •Authentication handling (where ethical)
MCP Integration
Uses: @modelcontextprotocol/server-puppeteer (if available)
Fallback: Manual Playwright scripts
Use Cases
Data Collection
bash
"scrape top 100 prompts from prompthero.com
Extract: prompt text, category, likes, model used
Save to: temp/scraped-data/prompts-{timestamp}.json"
Competitive Intelligence
bash
"scrape competitor pricing pages: - example.com/pricing - competitor2.com/pricing Extract: plans, features, prices Compare with our roadmap Save: temp/research/competitive-pricing.json"
Design Inspiration
bash
"scrape these design showcase sites: - awwwards.com (top 10 sites this month) - dribbble.com (top UI designs) Take full-page screenshots Save to: temp/design-inspiration/"
Documentation Extraction
bash
"scrape LangGraph documentation Extract all code examples for supervisor pattern Save to: temp/research/langgraph-examples.md"
Output Formats
Structured Data (JSON)
json
{
"source": "https://example.com",
"scraped_at": "2025-12-31T10:00:00Z",
"data": [
{
"title": "...",
"content": "...",
"metadata": {}
}
]
}
Screenshots
- •Location:
temp/screenshots/{site}-{timestamp}.png - •Format: PNG, 1920x1080
- •Options: Full page or viewport
Ethical Guidelines
MUST FOLLOW:
- •✅ Respect robots.txt
- •✅ Rate limit: Max 1 request per second
- •✅ Only scrape public data
- •✅ Attribute sources
- •✅ Check Terms of Service
NEVER:
- •❌ Bypass authentication without permission
- •❌ Solve CAPTCHAs automatically
- •❌ Scrape personal/private data
- •❌ Overload servers (DDoS)
- •❌ Violate copyright
Usage Examples
Basic Scraping
bash
"Using Playwright MCP, scrape https://example.com/blog Extract all article titles and URLs Save to temp/scraped-articles.json"
Interactive Scraping
bash
"Using Playwright MCP: 1. Navigate to https://example.com/search 2. Enter query: 'machine learning' 3. Click search button 4. Wait for results to load 5. Extract first 20 results 6. Save to temp/search-results.json"
Multi-Page Crawling
bash
"Using Playwright MCP, crawl paginated list: Start: https://example.com/items?page=1 Extract: item name, price, description Continue: until no 'Next' button or max 100 pages Save: temp/items-catalog.json"
Screenshot Collection
bash
"Using Playwright MCP, take screenshots:
Sites: shadcn.com, ui.aceternity.com, magicui.design
Type: Full-page screenshots
Save: temp/design-inspiration/{site-name}.png"
Best Practices
- •Always check robots.txt first
- •Use user-agent string identifying yourself
- •Respect rate limits (1 req/sec default)
- •Cache results to avoid re-scraping
- •Handle errors gracefully (404, timeout, etc.)
- •Validate data before saving
Remember: Scrape responsibly. Respect website owners and terms of service!