AgentSkillsCN

web-scraping

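Web scraping with Playwright MCP tools. Choose pw-writer (user's Chrome, complex sites, 100% reliable) or pw-fast (headless mode, batch processing, simple sites) as needed. Use for data scraping, information extraction, browser automation, or web crawling, or when the user mentions "extract", "scrape", "crawl", or "browser automation".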

SKILL.md
--- frontmatter
name: web-scraping
description: Web scraping with Playwright MCP tools. Choose pw-writer (user Chrome, complex sites, 100% reliable) or pw-fast (headless, batch mode, simple sites). Use for scraping, data extraction, browser automation, crawling, or when user mentions extract, scrape, crawl, automate browser.
user-invocable: true
version: 1.0.0

Web Scraping Skill

Expert guidance for choosing between pw-writer and pw-fast MCP tools for web scraping.

Quick Decision Tree

code
Is site complex? (SPAs, accordions, cookie dialogs, external redirects, auth)
├── YES → Use pw-writer
│   ├── Token-constrained? → getCleanHTML (40x fewer tokens)
│   ├── Need navigation? → accessibilitySnapshot + aria-ref
│   └── Visual layout complex? → screenshotWithAccessibilityLabels
│
└── NO → Selectors unambiguous? (unique IDs, test-ids)
    ├── YES → Use pw-fast batch_execute (fastest: ~2s)
    └── NO → Use pw-writer (more reliable)

Need to discover hidden APIs?
├── YES → pw-writer network interception
└── NO → DOM-based extraction
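
The first tree above can also be read as a small pure function. The following is a minimal sketch, not part of either tool: `chooseTool` and the boolean fields on `site` are hypothetical names chosen for illustration, and the returned method names are the ones listed in the tree.

```javascript
// Hypothetical helper mirroring the decision tree; the field names on `site`
// (complex, tokenConstrained, ...) are illustrative assumptions.
function chooseTool(site) {
  const {
    complex,
    tokenConstrained,
    needsNavigation,
    complexLayout,
    unambiguousSelectors,
  } = site;

  if (complex) {
    // Complex sites always go to pw-writer; the method depends on the constraint.
    if (tokenConstrained) return { tool: 'pw-writer', method: 'getCleanHTML' };
    if (needsNavigation) return { tool: 'pw-writer', method: 'accessibilitySnapshot' };
    if (complexLayout) return { tool: 'pw-writer', method: 'screenshotWithAccessibilityLabels' };
    return { tool: 'pw-writer', method: null };
  }

  // Simple sites: pw-fast only when selectors cannot match more than one element.
  return unambiguousSelectors
    ? { tool: 'pw-fast', method: 'browser_batch_execute' }
    : { tool: 'pw-writer', method: null };
}

console.log(chooseTool({ complex: true, tokenConstrained: true }));
console.log(chooseTool({ complex: false, unambiguousSelectors: true }));
```

When in doubt the function falls through to pw-writer, matching the tree's preference for reliability over speed.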

Tool Comparison

| Aspect | pw-writer | pw-fast |
|---|---|---|
| Browser | User's Chrome (extension) | Headless Chromium |
| Mean Time | 22s (getCleanHTML) | 2s (batch) |
| Success Rate | 100% (complex sites) | 0-33% (complex sites) |
| Token Usage | ~239 (getCleanHTML) | ~2,138 |
| Strict Mode | Handles gracefully | Fails on multiple matches |
| Cookies/Auth | Native (user session) | Manual handling |
| New Tabs/Popups | Full support | Limited |
| Best For | Complex sites, reliability | Simple sites, speed |

When to Use Each Tool

Use pw-writer when:

  • Site requires login/authentication (reuses user session)
  • Complex SPAs with dynamic content
  • Sites with cookie consent dialogs
  • Pages with accordions, tabs, or lazy-loaded content
  • External redirects during navigation
  • Multiple elements match selectors (would fail strict mode)
  • Token budget is constrained (use getCleanHTML)
  • Need to discover APIs via network interception

Use pw-fast when:

  • Simple, static pages with unique selectors
  • Batch operations on predictable pages (forms, lists)
  • No authentication required
  • Speed is critical and site structure is known
  • Running automated pipelines (headless)

Quick Start Patterns

pw-writer: Token-Efficient Extraction (RECOMMENDED)

javascript
// Navigate and wait
await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });
await waitForPageLoad({ page, timeout: 5000 });

// Extract with getCleanHTML (40x fewer tokens than snapshot)
const html = await getCleanHTML({
  locator: page.locator('table.data'),
  search: /price|item/i  // Optional: filter results
});
console.log(html);

pw-writer: Navigation with aria-ref

javascript
// Get accessibility snapshot for navigation
console.log(await accessibilitySnapshot({ page, search: /menu|button/i }));

// Click using the aria-ref from the snapshot; the ref value is unquoted (aria-ref=e14, not aria-ref="e14")
await page.locator('aria-ref=e14').click();
await waitForPageLoad({ page });
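
To pick a ref programmatically rather than by eye, the snapshot text can be scanned for matching lines. This is only a sketch under an assumption: it supposes each snapshot line carries its ref as `[ref=eN]` (for example `- button "Save" [ref=e14]`), which should be checked against real `accessibilitySnapshot` output. `findRefs` is a hypothetical helper, not part of pw-writer.

```javascript
// Hypothetical helper. ASSUMPTION: snapshot lines look like
//   - button "Save" [ref=e14]
// Verify this against actual accessibilitySnapshot output before relying on it.
function findRefs(snapshot, pattern) {
  const refs = [];
  for (const line of snapshot.split('\n')) {
    const match = line.match(/\[ref=(e\d+)\]/);
    // Pass a non-global regex: a /g regex keeps state between test() calls.
    if (match && pattern.test(line)) {
      refs.push({
        ref: match[1],
        selector: `aria-ref=${match[1]}`, // unquoted, as page.locator expects
        line: line.trim(),
      });
    }
  }
  return refs;
}

const sampleSnapshot = [
  '- navigation [ref=e3]',
  '  - button "Save" [ref=e14]',
  '  - link "Home" [ref=e15]',
].join('\n');

console.log(findRefs(sampleSnapshot, /button/i));
```

The resulting `selector` string can be passed to `page.locator(...)` in the same way as the hand-written `aria-ref=e14` above.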

pw-fast: Batch Execution (Simple Sites)

json
{
  "name": "browser_batch_execute",
  "arguments": {
    "steps": [
      { "tool": "browser_navigate", "arguments": { "url": "https://example.com" }},
      { "tool": "browser_type", "arguments": {
        "selectors": [{ "css": "#search" }], "text": "query"
      }},
      { "tool": "browser_click", "arguments": {
        "selectors": [{ "role": "button", "text": "Search" }]
      }}
    ],
    "globalExpectation": { "includeSnapshot": false }
  }
}
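
When several batches share the same shape, the payload can be assembled in code instead of written by hand. The sketch below uses only the fields shown in the JSON above (`name`, `arguments.steps`, `globalExpectation.includeSnapshot`); `buildBatch` and its tuple-based input format are hypothetical conveniences, not part of pw-fast.

```javascript
// Hypothetical builder for a browser_batch_execute payload.
// Each step is given as a [toolName, toolArguments] pair.
function buildBatch(steps, { includeSnapshot = false } = {}) {
  return {
    name: 'browser_batch_execute',
    arguments: {
      steps: steps.map(([tool, args]) => ({ tool, arguments: args })),
      globalExpectation: { includeSnapshot },
    },
  };
}

// Reproduces the search example: navigate, type a query, click Search.
const searchBatch = buildBatch([
  ['browser_navigate', { url: 'https://example.com' }],
  ['browser_type', { selectors: [{ css: '#search' }], text: 'query' }],
  ['browser_click', { selectors: [{ role: 'button', text: 'Search' }] }],
]);

console.log(JSON.stringify(searchBatch, null, 2));
```

Keeping `includeSnapshot` off by default follows the example above, where snapshots are suppressed to save tokens.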

pw-writer: API Discovery (Network Interception)

javascript
// Setup listener
state.responses = [];
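page.on('response', async res => {
  // Only inspect responses whose URL looks like an API endpoint
  if (res.url().includes('/api/')) {
    try {
      state.responses.push({ url: res.url(), body: await res.json() });
    } catch {
      // Body was not valid JSON (or was unavailable); skip it
    }
  }
});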

// Trigger actions (scroll, click, navigate)
await page.click('button.load-more');

// Analyze captured API responses
console.log(`Captured ${state.responses.length} API calls`);
state.responses.forEach(r => console.log(r.url));

// Cleanup
page.removeAllListeners('response');
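
Once responses are captured, grouping them by path makes recurring endpoints easier to spot. The following sketch assumes entries shaped like those pushed into `state.responses` above (`{ url, body }`); `summarizeResponses` and the sample data are hypothetical and exist only to illustrate the analysis step.

```javascript
// Hypothetical helper: group captured responses by URL path and collect
// the top-level keys seen in each JSON body.
function summarizeResponses(responses) {
  const byPath = new Map();
  for (const { url, body } of responses) {
    const { pathname } = new URL(url); // drops the query string and host
    const entry = byPath.get(pathname) ?? { count: 0, keys: new Set() };
    entry.count += 1;
    if (body && typeof body === 'object' && !Array.isArray(body)) {
      Object.keys(body).forEach(key => entry.keys.add(key));
    }
    byPath.set(pathname, entry);
  }
  return [...byPath].map(([path, { count, keys }]) => ({
    path,
    count,
    keys: [...keys].sort(),
  }));
}

// Made-up sample data in the same shape as state.responses.
const sampleResponses = [
  { url: 'https://example.com/api/items?page=1', body: { items: [], next: 2 } },
  { url: 'https://example.com/api/items?page=2', body: { items: [] } },
  { url: 'https://example.com/api/user', body: { name: 'x' } },
];

console.log(summarizeResponses(sampleResponses));
```

A path that appears repeatedly with a changing query string, like the paginated one in the sample, is usually a good candidate for calling directly instead of scraping the DOM.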

Reference Files

| File | Use When |
|---|---|
| PW-WRITER.md | Using pw-writer for complex scraping, need full API docs |
| PW-FAST.md | Using pw-fast batch execution, selector system details |
| TROUBLESHOOTING.md | Encountering errors (strict mode, redirects, timeouts) |
| PATTERNS.md | Looking for reusable patterns (tables, forms, pagination) |

Playwright Selector Priority

For both tools, prefer selectors in this order:

  1. Best: [data-testid="submit"] - Explicit test attributes
  2. Good: getByRole('button', { name: 'Save' }) - Semantic ARIA
  3. Good: getByText('Sign in'), getByLabel('Email') - User-facing
  4. OK: input[name="email"] - Semantic HTML attributes
  5. Avoid: .btn-primary, #submit - Classes/IDs change frequently
  6. Last resort: div > form > button - Fragile path selectors
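
The ordering above can be applied mechanically when several candidate selectors are available. This is a rough sketch: `selectorTier` and `sortByPriority` are hypothetical helpers, and the regular expressions only recognize the example shapes from the list, not every valid selector.

```javascript
// Hypothetical classifier: maps a selector string to its tier (1 = best, 6 = last resort).
// The patterns cover only the shapes shown in the priority list.
function selectorTier(selector) {
  if (/\[data-testid=/.test(selector)) return 1;        // explicit test attribute
  if (/^getByRole\(/.test(selector)) return 2;          // semantic ARIA
  if (/^getBy(Text|Label)\(/.test(selector)) return 3;  // user-facing text or label
  if (/^[a-z]+\[name=/.test(selector)) return 4;        // semantic HTML attribute
  if (/[>\s]/.test(selector)) return 6;                 // structural path selector
  return 5;                                             // bare class or ID
}

// Returns a new array with the most robust selectors first.
function sortByPriority(selectors) {
  return [...selectors].sort((a, b) => selectorTier(a) - selectorTier(b));
}

console.log(sortByPriority([
  'div > form > button',
  '.btn-primary',
  'input[name="email"]',
  '[data-testid="submit"]',
]));
```

Sorting candidates this way puts the most robust option first, which fits pw-fast's selector arrays, where later entries serve as fallbacks.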

Key Differences Summary

| Feature | pw-writer | pw-fast |
|---|---|---|
| Primary Tool | execute (single tool, full API) | 30+ specialized tools |
| Token Optimization | getCleanHTML (best) | expectation parameter |
| Element Selection | aria-ref=eN from snapshot | Selector arrays with fallback |
| State Persistence | state object | None between calls |
| Multiple Pages | context.pages() | browser_tab_* tools |
| Error Recovery | Full Playwright try/catch | continueOnError in batch |