crawler-scraper

Use when adding new scraping targets to the crawler.

SKILL.md
---
name: crawler-scraper
description: Use when adding new scraping targets to the crawler
---

Crawler Scraper Skill

Checklist (MUST complete all)

  • Add URL to packages/meta/src/urls.ts
  • Create scraper function in apps/crawler/src/scrapers/
  • Define types in packages/db/src/types.ts
  • Add a repository if new data storage is needed
  • Integrate into apps/crawler/src/index.ts
  • Add unit tests for parsers

File Locations

| Purpose | Location |
| --- | --- |
| URLs | packages/meta/src/urls.ts |
| Scrapers | apps/crawler/src/scrapers/*.ts |
| Types | packages/db/src/types.ts |
| Repositories | packages/db/src/repositories/ |
| Parsers | apps/crawler/src/parsers.ts |
| Entry point | apps/crawler/src/index.ts |

Template

typescript
// apps/crawler/src/scrapers/my-feature.ts
import type { MyData, MyItem } from "@moneyforward-daily-action/db/types";
import type { Page } from "playwright";
import { mfUrls } from "@moneyforward-daily-action/meta";
import { debug } from "../logger.js";
import { parseJapaneseNumber } from "../parsers.js";

export async function getMyData(page: Page): Promise<MyData> {
  debug("Getting my data from /path...");

  await page.goto(mfUrls.myFeature, {
    waitUntil: "domcontentloaded",
  });
  await page.waitForTimeout(2000);

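  // Scraping logic: walk each table row and read its first cell.
  const rows = page.locator("table tbody tr");
  const count = await rows.count();

  const results: MyItem[] = [];
  for (let i = 0; i < count; i++) {
    const row = rows.nth(i);
    // Short timeout plus .catch() so a missing cell yields "" instead of hanging.
    const text = await row
      .locator("td")
      .first()
      .textContent({ timeout: 1000 })
      .catch(() => "");

    results.push({
      // Build a MyItem from `text` here, e.g. via parseJapaneseNumber(text ?? "")
    });
  }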

  return { items: results };
}
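
The template relies on `MyData` and `MyItem` without showing them. As a rough sketch of what the "Define types" step might add to packages/db/src/types.ts, the block below uses assumed field names (`name`, `amount`) and a hypothetical `toMyItem` helper that is not part of the repository. Keeping the row-to-item conversion in a pure function like this makes it unit-testable without a browser.

```typescript
// Hypothetical shapes; the real field names depend on the page being scraped.
export interface MyItem {
  /** Label taken from a table cell. */
  name: string;
  /** Amount in yen, already parsed to a number. */
  amount: number;
}

export interface MyData {
  items: MyItem[];
}

// Hypothetical pure helper: turns already-extracted cell text into a MyItem.
// The amount parser is injected, so parseJapaneseNumber can be passed in by the scraper.
export function toMyItem(
  nameText: string | null,
  amountText: string | null,
  parseAmount: (text: string) => number,
): MyItem {
  return {
    name: (nameText ?? "").trim(),
    amount: parseAmount((amountText ?? "").trim()),
  };
}
```

Inside the scraper loop, a call such as `results.push(toMyItem(nameText, amountText, parseJapaneseNumber))` would then replace the placeholder object, assuming `parseJapaneseNumber` accepts a string and returns a number.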

URL Registration

typescript
// packages/meta/src/urls.ts
export const mfUrls = {
  // existing urls...
  myFeature: "https://moneyforward.com/path/to/feature",
};

Testing

Test Types

| Type | When to Use | Location |
| --- | --- | --- |
| Unit | Service-independent logic (parsers, data transformations) | *.test.ts next to source |
| E2E | Requires actual account/service interaction | vitest.config.ts e2e project |

Rules (MUST follow)

  • NEVER write tests that depend on actual data (personal financial data changes constantly)
  • Unit tests: Use hardcoded mock strings, not real scraped data
  • E2E tests: Only verify page navigation and element existence, not actual values
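
To illustrate the unit-test rule, here is a minimal sketch of a parser check driven entirely by hardcoded mock strings. `parseYen` is a hypothetical stand-in written for this example; the real `parseJapaneseNumber` in apps/crawler/src/parsers.ts may behave differently. The runner is plain TypeScript so the block stays self-contained.

```typescript
// Hypothetical stand-in for the parser under test (not the repository's implementation).
function parseYen(text: string): number {
  // Drop thousands separators, the 円 suffix, and any whitespace.
  const cleaned = text.replace(/[,円\s]/g, "");
  const value = Number(cleaned);
  if (cleaned === "" || Number.isNaN(value)) {
    throw new Error(`Cannot parse amount: "${text}"`);
  }
  return value;
}

// Hardcoded mock strings: none of these come from a real account.
const cases: Array<[input: string, expected: number]> = [
  ["1,234円", 1234],
  ["0円", 0],
  ["-5,000円", -5000],
  [" 12,345,678円 ", 12345678],
];

// Returns a description of every failing case; an empty array means all passed.
function runParserCases(): string[] {
  const failures: string[] = [];
  for (const [input, expected] of cases) {
    const actual = parseYen(input);
    if (actual !== expected) {
      failures.push(`${input}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}
```

In the repository the same table of cases would more likely sit in a `*.test.ts` file and be fed to Vitest's `it.each`, with `expect(...).toBe(...)` in place of the manual comparison.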

Running Tests

  • Unit: pnpm --filter @moneyforward-daily-action/crawler test
  • E2E: pnpm --filter @moneyforward-daily-action/crawler test:e2e
  • Local manual testing: SKIP_REFRESH=true pnpm --filter @moneyforward-daily-action/crawler start
  • Put debug scripts in the debug/ directory
  • Screenshots are saved to the debug/ directory

Notes

  • Use parseJapaneseNumber() for Japanese currency format (e.g., "1,234円" → 1234)
  • Use debug() from logger for debug output
  • Handle missing elements gracefully with .catch(() => defaultValue)
  • Always use { timeout: 1000 } for individual element queries to avoid hanging
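
The last two notes can be folded into one small helper. The sketch below is an assumption-laden illustration: `safeText` and `TextSource` are hypothetical names, not part of the repository. `TextSource` mirrors only the `textContent({ timeout })` method of Playwright's `Locator`, so a real locator can be passed in while tests use a plain object.

```typescript
// Minimal structural type covering the one Locator method used here,
// so the helper can be exercised without launching a browser.
interface TextSource {
  textContent(options?: { timeout?: number }): Promise<string | null>;
}

// Hypothetical helper: reads text with a short timeout and never rejects.
async function safeText(
  source: TextSource,
  fallback = "",
  timeoutMs = 1000,
): Promise<string> {
  const text = await source
    .textContent({ timeout: timeoutMs })
    .catch(() => null); // missing element or timeout: fall through to the default
  // null (no text or failed read) becomes the fallback; real text is trimmed.
  return text?.trim() ?? fallback;
}
```

A scraper could then write `const text = await safeText(row.locator("td").first());` and hand the result to `parseJapaneseNumber()`, keeping the timeout and default value in one place.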