AgentSkillsCN

salesforce-developer-site-scraper

Scrape Salesforce Developer documentation into clean Markdown using headless Chromium, Readability, and a fallback to the docs content API. Use it when content loads asynchronously or is blocked by a OneTrust cookie banner.

SKILL.md
--- frontmatter
name: salesforce-developer-site-scraper
description: 'Scrape Salesforce Developer documentation into clean Markdown using headless Chromium, Readability, and the docs content API fallback. Use when content is async or blocked by OneTrust cookie banners.'
license: Forward Proprietary
compatibility: VS Code 1.x+, Node.js 18+

Salesforce Developer Site Scraper

Use this skill to capture Salesforce Developer documentation pages as clean Markdown even when content loads asynchronously or is blocked by OneTrust cookie banners.

When to Use This Skill

  • A Salesforce Developer doc page renders key content after async requests.
  • A OneTrust cookie banner hides content until consent is accepted.
  • You need a readable Markdown snapshot for Apex, LWC, or platform docs.
  • NOT for: high-volume crawling or scraping behind access restrictions.

Prerequisites

  • Node.js 18+
  • npm dependencies for the script (see below)

How to Use

1) Install script dependencies (if not done already)

bash
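npm install playwright @mozilla/readability jsdom turndown
# Recent Playwright releases do not download browsers during npm install;
# fetch Chromium explicitly if it is missing.
npx playwright install chromium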

2) Run the script

bash
node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
  --url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
  --out "./artifacts/online-research/apex_intro.md" \
  --consent-selector "#onetrust-accept-btn-handler" \
  --wait 2000
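
The script's source is not reproduced here. As a rough sketch of the pipeline the description names (headless Chromium, then Readability, then Markdown conversion), something like the following could work. The function name `scrapeToMarkdown`, its option names, the 5-second click timeout, and the Turndown settings are all assumptions made for illustration. The sketch leaves out the docs content API fallback, `--content-selector`, and `--remove-selectors`.

```javascript
// Sketch only: not the actual scrape-to-markdown.js, which this document does not show.
async function scrapeToMarkdown({ url, consentSelector, waitMs = 0, timeoutMs = 45000 }) {
  // Dependencies are loaded lazily so this file can be required before
  // `npm install` has been run.
  const { chromium } = require('playwright');
  const { JSDOM } = require('jsdom');
  const { Readability } = require('@mozilla/readability');
  const TurndownService = require('turndown');

  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });

    if (consentSelector) {
      // Click the banner's accept button if it shows up; carry on if it does not.
      await page.locator(consentSelector).click({ timeout: 5000 }).catch(() => {});
    }
    // Give asynchronously loaded content time to render.
    if (waitMs > 0) await page.waitForTimeout(waitMs);

    // Hand the rendered HTML to Readability to isolate the main article.
    const html = await page.content();
    const dom = new JSDOM(html, { url });
    const article = new Readability(dom.window.document).parse();
    if (!article) throw new Error('Readability found no article content');

    // Convert the extracted article HTML to Markdown.
    const turndown = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
    return `# ${article.title}\n\n${turndown.turndown(article.content)}\n`;
  } finally {
    await browser.close();
  }
}
```

A caller would await `scrapeToMarkdown({ url, consentSelector: '#onetrust-accept-btn-handler', waitMs: 2000 })` and write the returned string to the `--out` path.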

Script Options

| Option | Required | Description |
| --- | --- | --- |
| --url | Yes | Target URL to fetch and extract. |
| --out | Yes | Output Markdown file path. |
| --consent-selector | No | CSS or text selector for a cookie banner accept button. |
| --wait | No | Milliseconds to wait after navigation or consent click. |
| --content-selector | No | Extract only this element instead of Readability parsing. |
| --remove-selectors | No | Comma-separated selectors to remove before extraction. |
| --cookie | No | Consent/session cookie, e.g. name=value;domain=example.com;path=/. |
| --storage-state | No | Playwright storage state JSON file to reuse consent/session. |
| --timeout | No | Navigation timeout in ms (default 45000). |
| --no-default-removals | No | Disable default cookie/consent element removals. |
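
To make the option semantics concrete, here is a minimal sketch of how these flags could be parsed with Node's built-in `util.parseArgs` (available from Node 18.3). The function name `parseCliOptions`, the returned property names, and the default of 0 for `--wait` are assumptions; only the 45000 ms timeout default comes from the table. The real script may parse its arguments differently.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical parser for the options listed above.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      url: { type: 'string' },
      out: { type: 'string' },
      'consent-selector': { type: 'string' },
      wait: { type: 'string' },
      'content-selector': { type: 'string' },
      'remove-selectors': { type: 'string' },
      cookie: { type: 'string' },
      'storage-state': { type: 'string' },
      timeout: { type: 'string' },
      'no-default-removals': { type: 'boolean' },
    },
  });

  // --url and --out are the only required options.
  if (!values.url || !values.out) {
    throw new Error('--url and --out are required');
  }

  return {
    url: values.url,
    out: values.out,
    consentSelector: values['consent-selector'],
    waitMs: Number(values.wait ?? 0), // assumed default: no extra wait
    contentSelector: values['content-selector'],
    // "nav, .footer" becomes ['nav', '.footer']
    removeSelectors: (values['remove-selectors'] ?? '')
      .split(',')
      .map((selector) => selector.trim())
      .filter(Boolean),
    cookie: values.cookie,
    storageState: values['storage-state'],
    timeoutMs: Number(values.timeout ?? 45000), // default from the table
    defaultRemovals: !values['no-default-removals'],
  };
}
```

Calling `parseCliOptions(process.argv.slice(2))` would yield a plain options object. Because `parseArgs` is strict by default, a misspelled flag throws instead of being silently ignored.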

Compliance Notes

  • Respect robots.txt and site terms before scraping.
  • Use consent cookies or storage state only when you have permission.
  • Avoid collecting personal data unless you have a legal basis.

Examples

Example: Reuse a consent state

bash
node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
  --url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
  --out "./artifacts/online-research/apex_intro.md" \
  --storage-state "./artifacts/online-research/consent-state.json"
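
A Playwright storage state file is JSON with a `cookies` array and an `origins` array. The usual way to create one is to accept the banner once in a Playwright session and call `context.storageState({ path })`. Purely to illustrate the file's shape, and the `name=value;domain=...;path=...` format shown for `--cookie`, the sketch below builds such an object by hand. The helper names `parseCookieOption` and `toStorageState` are hypothetical. The cookie name in the commented usage is one OneTrust commonly sets, but whether it alone suppresses the banner on this site is an untested assumption, as are the example value and domain.

```javascript
// Hypothetical helper: parse "name=value;domain=example.com;path=/" into a
// cookie object with the fields Playwright expects.
function parseCookieOption(spec) {
  const parts = spec.split(';').map((part) => part.trim()).filter(Boolean);
  if (parts.length === 0) throw new Error('Empty cookie specification');

  const [pair, ...attrs] = parts;
  const eq = pair.indexOf('=');
  if (eq < 1) throw new Error(`Expected name=value, got "${pair}"`);

  const cookie = { name: pair.slice(0, eq), value: pair.slice(eq + 1), path: '/' };
  for (const attr of attrs) {
    const i = attr.indexOf('=');
    if (i < 1) continue; // skip attributes without a value
    const key = attr.slice(0, i).toLowerCase();
    if (key === 'domain' || key === 'path') cookie[key] = attr.slice(i + 1);
  }
  if (!cookie.domain) throw new Error('A domain attribute is required');
  return cookie;
}

// Hypothetical helper: wrap cookies in the storage state layout, filling
// the remaining cookie fields with conservative defaults.
function toStorageState(cookies) {
  return {
    cookies: cookies.map((cookie) => ({
      expires: -1, // session cookie
      httpOnly: false,
      secure: false,
      sameSite: 'Lax',
      ...cookie,
    })),
    origins: [], // no localStorage entries
  };
}

// Example usage (illustrative cookie; nothing is written unless uncommented):
// const state = toStorageState([
//   parseCookieOption('OptanonAlertBoxClosed=2024-01-01T00:00:00.000Z;domain=.salesforce.com;path=/'),
// ]);
// require('node:fs').writeFileSync('consent-state.json', JSON.stringify(state, null, 2));
```

A state captured from a real browser session is generally more reliable than a hand-built one, since it includes every cookie and localStorage entry the consent flow actually sets.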

Example: Extract a specific content container

bash
node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
  --url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
  --out "./artifacts/online-research/apex_intro.md" \
  --content-selector "main article"

Troubleshooting

Issue: Output is empty or too short

Solution: Add or increase --wait so asynchronously loaded content has time to render, or pass --content-selector pointing at the primary content node.

Issue: Cookie banner blocks content

Solution: Pass --consent-selector with the selector for the banner's accept button (for OneTrust, #onetrust-accept-btn-handler), or reuse a --storage-state file in which consent is already saved.
