AgentSkillsCN

oxylabs-web-scraper

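Production-grade web scraping with built-in automatic anti-bot bypass, structured JSON parsing for 40+ targets, and geo-targeting. Use when the user needs to scrape web pages, extract product data, fetch search results, or collect structured data from supported e-commerce and search engine platforms without worrying about being blocked, and when geo-targeting is required.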

SKILL.md
--- frontmatter
name: oxylabs-web-scraper
description: Production-grade web scraping with automatic anti-bot bypass, structured JSON parsing for 40+ targets, and geo-targeting. Use when the user needs to scrape web pages, extract product data, get search results, or collect structured data from supported e-commerce and search platforms without worrying about getting blocked and when geo targeting is required.

Oxylabs Web Scraper API

Authentication

Requires HTTP Basic Auth with credentials from environment variables:

bash
curl -u "$OXY_WSA_USERNAME:$OXY_WSA_PASSWORD" ...
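
Because every request depends on those two environment variables, it can help to fail fast when either one is missing. The following is a minimal sketch; the function name require_oxy_credentials is hypothetical and not part of the Oxylabs tooling.

```bash
# Hypothetical helper: returns non-zero when either credential variable is unset or empty.
require_oxy_credentials() {
  if [ -z "${OXY_WSA_USERNAME:-}" ] || [ -z "${OXY_WSA_PASSWORD:-}" ]; then
    echo "Set OXY_WSA_USERNAME and OXY_WSA_PASSWORD before calling the API" >&2
    return 1
  fi
}
```

A script could call `require_oxy_credentials || exit 1` once before issuing any curl command.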

Endpoint

code
POST https://realtime.oxylabs.io/v1/queries
Content-Type: application/json

Core Parameters

| Parameter | Required | Description |
|---|---|---|
| source | Yes | Target scraper (e.g., universal, amazon_product, google_search) |
| url | Conditional | URL to scrape (for universal and *_url sources) |
| query | Conditional | Search query or product ID (for *_search and *_product sources) |
| parse | No | Enable structured data parsing (recommended for supported sources) |
| render | No | JavaScript rendering: html or png |
| geo_location | No | Geographic targeting (country, state, or ZIP code) |
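
The url/query rule above can be sketched as a small payload builder. This is an illustration only: build_payload and json_escape are hypothetical helper names, the escaping covers only backslashes and double quotes, and any source that does not match the naming patterns listed above is rejected rather than guessed at.

```bash
# Escape backslashes and double quotes for use inside a JSON string.
# Sketch only: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# build_payload SOURCE TARGET [true]
# Picks "url" for universal and *_url sources, "query" for *_search and *_product sources.
build_payload() {
  local src="$1" target="$2" parse="${3:-false}" field
  case "$src" in
    universal|*_url)    field=url ;;
    *_search|*_product) field=query ;;
    *) echo "cannot infer url/query for source: $src" >&2; return 1 ;;
  esac
  printf '{"source": "%s", "%s": "%s"' \
    "$(json_escape "$src")" "$field" "$(json_escape "$target")"
  if [ "$parse" = true ]; then printf ', "parse": true'; fi
  printf '}\n'
}
```

The output can be passed to curl with `-d "$(build_payload google_search 'best laptops' true)"`.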

Quick Start

Scrape any URL:

bash
curl -X POST 'https://realtime.oxylabs.io/v1/queries' \
  -u "$OXY_WSA_USERNAME:$OXY_WSA_PASSWORD" \
  -H 'Content-Type: application/json' \
  -d '{"source": "universal", "url": "https://example.com"}'

Google search with parsing:

bash
curl -X POST 'https://realtime.oxylabs.io/v1/queries' \
  -u "$OXY_WSA_USERNAME:$OXY_WSA_PASSWORD" \
  -H 'Content-Type: application/json' \
  -d '{"source": "google_search", "query": "best laptops", "parse": true}'

Amazon product by ASIN:

bash
curl -X POST 'https://realtime.oxylabs.io/v1/queries' \
  -u "$OXY_WSA_USERNAME:$OXY_WSA_PASSWORD" \
  -H 'Content-Type: application/json' \
  -d '{"source": "amazon_product", "query": "B07FZ8S74R", "parse": true}'

Choosing the Right Source

  1. Use specific sources when available (amazon_product, google_search) - better parsing and reliability
  2. Use universal for unsupported sites - works with any URL
  3. Enable parse: true for structured JSON output on supported sources

Response Structure

json
{
  "results": [{
    "content": "...",
    "status_code": 200,
    "url": "https://..."
  }]
}

With parse: true, content contains structured data (title, price, reviews, etc.) instead of raw HTML.
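
To pull fields out of that structure from a shell, one option is jq. The sketch below assumes jq is installed; the helper names first_status and first_content are hypothetical, and the sample is the response shape shown above rather than real API output.

```bash
# Sample response copied from the structure shown above (not real API output).
sample='{"results": [{"content": "...", "status_code": 200, "url": "https://..."}]}'

# Read a response on stdin and print the first result's status code.
first_status() {
  jq -r '.results[0].status_code'
}

# Read a response on stdin and print the first result's content:
# raw text when it is an HTML string, compact JSON when parse: true returned an object.
first_content() {
  jq -r '.results[0].content | if type == "string" then . else tojson end'
}
```

For example, `printf '%s' "$sample" | first_status` reads the per-result status code, and the same pipeline works on the output of any of the curl commands above.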

Available Sources

For the complete list of 40+ supported sources organized by category, see sources.md.

More Examples

For detailed request/response examples including geo-location, JavaScript rendering, and custom headers, see examples.md.

Error Handling

| Code | Meaning |
|---|---|
| 200 | Success |
| 400 | Invalid parameters |
| 401 | Authentication failed |
| 403 | Access denied |
| 429 | Rate limit exceeded |
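
One way to act on these codes from a script is sketched below. The policy is an assumption, not Oxylabs guidance: only 429 is retried, at most three attempts are made, and the delay doubles from one second. The function names are hypothetical, and the third argument to scrape_with_retry exists only so the sender can be swapped for a stub.

```bash
# Map the status codes listed above to a coarse action.
classify_status() {
  case "$1" in
    200)         echo ok ;;
    429)         echo retry ;;
    400|401|403) echo fail ;;
    *)           echo unknown ;;
  esac
}

# post_query PAYLOAD BODY_FILE: send one request, write the body to BODY_FILE,
# and print the HTTP status code.
post_query() {
  curl -s -o "$2" -w '%{http_code}' \
    -X POST 'https://realtime.oxylabs.io/v1/queries' \
    -u "$OXY_WSA_USERNAME:$OXY_WSA_PASSWORD" \
    -H 'Content-Type: application/json' \
    -d "$1"
}

# scrape_with_retry PAYLOAD BODY_FILE [SENDER]
# Retries only on 429 (assumed policy: 3 attempts, delay doubling from 1 second).
scrape_with_retry() {
  local payload="$1" out="$2" send="${3:-post_query}" delay=1 attempt=1 code
  while [ "$attempt" -le 3 ]; do
    code=$("$send" "$payload" "$out")
    case "$(classify_status "$code")" in
      ok)    return 0 ;;
      retry) sleep "$delay"; delay=$((delay * 2)) ;;
      *)     echo "request failed with HTTP $code" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "still rate limited after 3 attempts" >&2
  return 1
}
```

A 400, 401, or 403 is treated as a caller-side problem and returned immediately, since repeating the same request would not be expected to change the outcome.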

Key Guidelines

  • Always set parse: true for supported sources to get structured data
  • Use ZIP codes for US e-commerce geo-location (e.g., "90210")
  • Use country/state format for search engines (e.g., "California,United States")
  • Add render: "html" for JavaScript-heavy pages
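
The guidelines above translate into request payloads along these lines. This is a sketch under assumptions: geo_query, rendered_page, and json_escape are hypothetical helper names, and the sample ASIN, query, and locations are reused from earlier in this document purely for illustration.

```bash
# Escape backslashes and double quotes for use inside a JSON string (sketch only).
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# geo_query SOURCE QUERY GEO_LOCATION: parsed request pinned to a location.
geo_query() {
  printf '{"source": "%s", "query": "%s", "parse": true, "geo_location": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# rendered_page URL: universal request with JavaScript rendering enabled.
rendered_page() {
  printf '{"source": "universal", "url": "%s", "render": "html"}\n' "$(json_escape "$1")"
}

# ZIP code for US e-commerce, country/state format for a search engine.
geo_query amazon_product B07FZ8S74R 90210
geo_query google_search 'best laptops' 'California,United States'
rendered_page 'https://example.com'
```

Each printed line is a JSON body that can be passed to the curl commands in Quick Start via `-d`.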