authenticated-scrape

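Scrape data from authenticated websites by automatically capturing network requests with their auth headers. Use when the user wants to extract data from logged-in pages, private dashboards, or authenticated APIs.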

SKILL.md
--- frontmatter
name: authenticated-scrape
description: Scrape data from authenticated websites by capturing network requests with auth headers automatically. Use when the user wants to extract data from logged-in pages, private dashboards, or authenticated APIs.
allowed-tools: mcp__chrome-devtools__list_pages, mcp__chrome-devtools__new_page, mcp__chrome-devtools__navigate_page, mcp__chrome-devtools__take_snapshot, mcp__chrome-devtools__list_network_requests, mcp__chrome-devtools__get_network_request, Bash, Write, Read

Authenticated Scrape

A guided workflow for scraping authenticated pages using Chrome DevTools automation. This skill uses dev-browser under the hood to capture network requests with auth headers automatically.

What This Skill Does

Helps you scrape data from authenticated pages by:

  1. Opening a browser and navigating to the target site
  2. Letting you log in normally (or automating it)
  3. Capturing authenticated API requests automatically
  4. Extracting the data you need
  5. Creating reusable code/scripts for future scraping

Workflow

Step 1: Launch Browser & Navigate

  • Use mcp__chrome-devtools__list_pages to check existing pages
  • Use mcp__chrome-devtools__new_page or mcp__chrome-devtools__navigate_page to open the target site
  • Ask user to log in manually, OR offer to automate login if they want

Step 2: Navigate to Target Content

  • Once authenticated, navigate to the page with the data they want to scrape
  • Use mcp__chrome-devtools__take_snapshot to verify page loaded correctly

Step 3: Capture Network Requests

  • Use mcp__chrome-devtools__list_network_requests to capture all API calls
  • Filter for XHR/Fetch requests (these usually contain the data)
  • Show user a clean list of endpoints captured (with their URLs and types)
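
The filtering described in this step could look roughly like the sketch below. The record shape (`url`, `method`, `resourceType`, `status`) and the helper names are assumptions made for illustration; the actual output of `mcp__chrome-devtools__list_network_requests` may be structured differently.

```python
# Resource types that usually carry API data (assumed labels, not confirmed
# against the real tool output).
DATA_TYPES = {"xhr", "fetch"}


def filter_api_requests(requests: list) -> list:
    """Keep only the XHR/Fetch entries from a list of captured request records."""
    return [r for r in requests if r.get("resourceType", "").lower() in DATA_TYPES]


def summarize(requests: list) -> list:
    """Render one display line per request: METHOD URL - STATUS."""
    return [f"{r['method']} {r['url']} - {r['status']}" for r in requests]


# Hypothetical captured records, mirroring the kind of list shown to the user.
captured = [
    {"url": "/api/users", "method": "GET", "resourceType": "fetch", "status": 200},
    {"url": "/static/app.js", "method": "GET", "resourceType": "script", "status": 200},
    {"url": "/api/events", "method": "POST", "resourceType": "xhr", "status": 204},
]
api_requests = filter_api_requests(captured)
```

Static assets such as scripts and stylesheets drop out, leaving a short list the user can pick from.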

Step 4: Identify Target Request

  • Ask user which request contains the data they want
  • Use mcp__chrome-devtools__get_network_request to show details
  • Display the request URL, headers (including auth), and response preview

Step 5: Extract Data

  • Get the full response data from the request
  • Parse JSON or HTML as needed
  • Ask what specific data points they want to extract
  • Use jq, JavaScript, or other tools to extract the data
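
As a minimal sketch of the extraction step, assuming the response body is JSON and is either a bare list of records or a list wrapped in a `data` key (both shapes are assumptions, as is the `extract_field` helper name):

```python
import json


def extract_field(response_body: str, field: str) -> list:
    """Pull one field out of every record in a JSON response body."""
    data = json.loads(response_body)
    # Assumed shapes: a top-level list, or an object wrapping the list in "data".
    records = data.get("data", []) if isinstance(data, dict) else data
    # Skip records that are not objects or that lack the requested field.
    return [rec[field] for rec in records if isinstance(rec, dict) and field in rec]


# Hypothetical response body with the kind of fields a user list might have.
sample_body = (
    '[{"id": 1, "email": "a@example.com"},'
    ' {"id": 2, "email": "b@example.com"},'
    ' {"id": 3}]'
)
emails = extract_field(sample_body, "email")
```

Records missing the field are skipped rather than raising, which tends to suit exploratory scraping where the schema is not yet known.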

Step 6: Make It Reusable

  • Offer to create a standalone script that:
    • Uses the same headers/cookies
    • Makes the request programmatically
    • Parses and extracts the data
  • Save as a Node.js script, Python script, or simple curl command
  • Remind them that auth tokens expire
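
A standalone script along these lines might be sketched as follows, using only the Python standard library. The function names and the `SCRAPE_TOKEN` environment variable are hypothetical, and the URL in the usage comment is the placeholder from the example interaction; substitute the values captured in Step 4.

```python
import json
import urllib.request


def build_request(url: str, token: str, cookie: str = "") -> urllib.request.Request:
    """Build a GET request that replays the captured auth headers."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if cookie:
        # Only needed for sites that rely on cookie-based sessions.
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the JSON body (performs a real network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (reads the token from the environment instead of hard-coding it):
#   import os
#   req = build_request("https://example.com/api/users", os.environ["SCRAPE_TOKEN"])
#   users = fetch_json(req)
```

Keeping the token out of the script makes it easier to swap in a fresh one when the old one expires.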

Important Reminders

  • Security: Captured requests contain sensitive auth tokens. Do not log, share, or commit them.
  • Token Expiration: Session tokens expire. Scripts may need token refresh logic.
  • Ethics: Only scrape your own authenticated sessions and respect the site's ToS.
  • Rate Limiting: Space out automated requests so they do not overload the site.
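
One way to act on the security reminder is to mask sensitive header values before showing or saving request details. The sketch below is illustrative only: the header list and the `redact_headers` helper are assumptions, not part of the skill.

```python
# Header names treated as sensitive (an assumed, non-exhaustive list).
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}


def redact_headers(headers: dict, keep: int = 6) -> dict:
    """Return a copy of headers with sensitive values truncated to `keep` characters."""
    redacted = {}
    for name, value in headers.items():
        if name.lower() in SENSITIVE_HEADERS:
            # Keep a short prefix so the user can still recognize the scheme.
            redacted[name] = value[:keep] + "..."
        else:
            redacted[name] = value
    return redacted


safe = redact_headers(
    {"Authorization": "Bearer eyJhbGciOi", "Accept": "application/json"}
)
```

The original dictionary is left untouched, so the full values remain available for the actual request.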

Known Limitations

Automated Login Detection: Many sites (GitHub, Google, banking sites) detect automated browsers via Chrome DevTools Protocol and block login attempts. This is a security feature.

Workarounds:

  1. Manual login approach: Ask user to log in manually in the browser window
  2. Regular browser first: Have user log in via regular Chrome, then capture requests
  3. Focus on data extraction: Skip automated login, focus on capturing already-authenticated sessions
  4. API-friendly sites: Some demo/test APIs (ReqRes, JSONPlaceholder) are more lenient

What Works Well:

  • ✅ Capturing network requests from any page
  • ✅ Extracting headers, cookies, auth tokens
  • ✅ Parsing JSON/HTML responses
  • ✅ Generating reusable scripts
  • ✅ Sites without strict bot detection

What May Not Work:

  • ❌ Automated login on major platforms (GitHub, Google, Facebook)
  • ❌ Sites with aggressive bot detection
  • ❌ Multi-factor authentication (requires manual intervention)

Troubleshooting

"Could not log in - This browser may not be secure"

Cause: Site detects automated browser (DevTools Protocol)

Solution: Have user log in manually in the browser window instead of automating login

Empty network requests / No XHR captured

Cause: Page hasn't loaded data yet or uses different request types

Solution:

  • Wait for page to fully load
  • Check all request types, not just XHR/Fetch
  • Navigate to the page that actually loads the data

"Unexpected token '<'" or HTML instead of JSON

Cause: API endpoint requires authentication or returns error page

Solution:

  • Verify the endpoint URL is correct
  • Check if authentication headers were captured
  • Try the request in the browser first to confirm it works
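
A generated script can surface this failure more clearly by checking for HTML before parsing. The sketch below assumes the body and `Content-Type` value are already in hand; `parse_json_response` is a hypothetical helper name.

```python
import json


def parse_json_response(body: str, content_type: str = ""):
    """Parse a JSON body, raising a descriptive error if the server sent HTML."""
    looks_like_html = (
        body.lstrip().startswith("<") or "text/html" in content_type.lower()
    )
    if looks_like_html:
        # Typically a login page or error page returned in place of the API data.
        raise ValueError(
            "Got HTML instead of JSON; the auth headers may be missing or expired"
        )
    return json.loads(body)


parsed = parse_json_response('{"ok": true}', "application/json")
```

This turns the cryptic "Unexpected token '<'" style of failure into a message that points at the likely cause.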

Tips

  • If the page uses token-based auth (OAuth, JWT), capture the Authorization header
  • For cookie-based auth, capture the Cookie header
  • If requests fail later, you may need to recapture with fresh tokens
  • For pagination, help identify the pagination parameters in the request
  • Best practice: Let user log in manually, then capture the authenticated session
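
For the pagination tip, a rough sketch of a page-walking loop is shown below. It assumes the endpoint takes a numeric page parameter and returns an empty list once the data runs out; both are assumptions, and many APIs use cursors or offsets instead. The `fetch_page` callable stands in for whatever function performs the authenticated request.

```python
from typing import Callable, Iterator


def iter_pages(
    fetch_page: Callable[[int], list], start: int = 1, max_pages: int = 100
) -> Iterator[list]:
    """Yield pages of results until an empty page or the max_pages cap is reached."""
    for page in range(start, start + max_pages):
        items = fetch_page(page)
        if not items:
            break
        yield items


# Stand-in for a real authenticated fetch: page number -> list of records.
fake_pages = {1: ["a", "b"], 2: ["c"]}
all_items = [
    item for page in iter_pages(lambda p: fake_pages.get(p, [])) for item in page
]
```

The `max_pages` cap guards against looping forever if the endpoint never returns an empty page.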

Example Interaction

User: /authenticated-scrape

You: I'll help you scrape authenticated content. First, let me check if there's already a browser page open.

[Lists pages, or creates new one]

You: I've opened https://example.com/dashboard. Please log in manually in the browser.

[User logs in]

You: Great! Now navigate to the page with the data you want to scrape.

[User navigates to /api/users endpoint or data page]

You: I've captured 12 network requests. Here are the XHR/Fetch requests:
1. GET /api/users - 200 OK (JSON, 45KB)
2. GET /api/analytics - 200 OK (JSON, 12KB)
3. POST /api/events - 204 No Content

Which request has the data you want?

User: The first one

You: [Shows request details with headers and preview]

Found Authorization: Bearer eyJhbG... and Cookie: session_id=abc123...

The response contains an array of 200 users with fields: id, email, name, created_at.

What data would you like to extract?

User: All the emails

You: [Extracts emails and offers to save]

I can create a reusable Node.js script that makes this request with the same auth headers. Would you like me to do that?

Start Here

When the skill is invoked, begin by asking the user:

  1. What website/service they want to scrape
  2. Whether they're already logged in or need help with authentication

Then proceed with Step 1 of the workflow.