AgentSkillsCN

oxylabs-web-scraper

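Expert implementation of the Oxylabs Web Scraper API for collecting data from websites, search engines, and e-commerce platforms. Use this skill when you need to build web scrapers, deploy data extraction pipelines, integrate the Oxylabs API, or scrape Amazon, Google, e-commerce sites, or any other website. It covers the Realtime, Push-Pull, and Proxy Endpoint integration methods, along with custom parsers, browser instructions, and geo-targeting.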

SKILL.md
--- frontmatter
name: oxylabs-web-scraper
description: Expert implementation of Oxylabs Web Scraper API for data collection from websites, search engines, and e-commerce platforms. Use when building web scrapers, implementing data extraction pipelines, integrating Oxylabs API, or scraping Amazon, Google, e-commerce sites, or any website. Covers Realtime, Push-Pull, and Proxy Endpoint integration methods, custom parsers, browser instructions, and geo-targeting.

Oxylabs Web Scraper API

Overview

Oxylabs Web Scraper API handles the complete scraping workflow: URL crawling, IP blocking mitigation, data extraction, and cloud storage delivery. It supports 40+ platforms including search engines, e-commerce sites, and general websites.

Quick Reference

Base URLs:

  • Realtime (sync): https://realtime.oxylabs.io/v1/queries
  • Push-Pull (async): https://data.oxylabs.io/v1/queries
  • Proxy Endpoint: realtime.oxylabs.io:60000

Authentication: HTTP Basic Auth with USERNAME:PASSWORD from Oxylabs dashboard.
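
As a minimal sketch of what Basic Auth amounts to on the wire, the helper below builds the `Authorization` header by hand. The environment variable names `OXYLABS_USERNAME` and `OXYLABS_PASSWORD` are assumptions made for illustration, not names Oxylabs defines. With `requests`, passing `auth=(username, password)` produces the same header.

```python
import base64
import os
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Auth `Authorization` header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def credentials_from_env() -> Tuple[str, str]:
    """Read credentials from the environment instead of hard-coding them."""
    # Hypothetical variable names, chosen for this sketch only.
    return os.environ["OXYLABS_USERNAME"], os.environ["OXYLABS_PASSWORD"]
```

Reading credentials from the environment keeps them out of source control; the examples in this document use literal `"USERNAME"` and `"PASSWORD"` placeholders for brevity.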

Integration Methods

Realtime (Synchronous)

The connection stays open until the job completes. Best when you need results immediately.

python
import requests

payload = {
    "source": "universal",
    "url": "https://example.com",
    "geo_location": "United States",
    "render": "html",
    "parse": True
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload
)
response.raise_for_status()  # Surface HTTP errors (e.g. bad credentials) before parsing
result = response.json()

Response structure:

json
{
  "results": [{
    "content": "<html>...</html>",
    "created_at": "2024-06-26 13:13:06",
    "url": "https://example.com/",
    "job_id": "12345678900987654321",
    "status_code": 200
  }]
}
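
Given the response structure above, a small accessor can guard against an empty `results` list before reading the page content. This is a sketch against the sample shown here; the helper name `first_result` is hypothetical, and real responses may carry more fields.

```python
from typing import Any, Dict


def first_result(response_json: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first entry of `results`, or raise if there is none."""
    results = response_json.get("results") or []
    if not results:
        raise ValueError("response contains no results")
    return results[0]


# Sample mirroring the response structure documented above.
sample = {
    "results": [{
        "content": "<html>...</html>",
        "created_at": "2024-06-26 13:13:06",
        "url": "https://example.com/",
        "job_id": "12345678900987654321",
        "status_code": 200,
    }]
}

page = first_result(sample)
print(page["status_code"], page["url"])
```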

Push-Pull (Asynchronous)

Submit a job and retrieve the results later. Recommended for large volumes.

python
import requests
import time

# Submit job
payload = {
    "source": "universal",
    "url": "https://example.com",
    "callback_url": "https://your-webhook.com/callback"  # Optional
}

response = requests.post(
    "https://data.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload
)
job = response.json()
job_id = job["id"]

# Poll for results (or use callback_url)
while True:
    status = requests.get(
        f"https://data.oxylabs.io/v1/queries/{job_id}",
        auth=("USERNAME", "PASSWORD")
    ).json()

    if status["status"] == "done":
        break
    elif status["status"] == "faulted":
        raise Exception("Job failed")
    time.sleep(2)

# Retrieve results
results = requests.get(
    f"https://data.oxylabs.io/v1/queries/{job_id}/results",
    auth=("USERNAME", "PASSWORD")
).json()
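
The polling loop above runs until the job finishes, however long that takes. One way to bound it is sketched below: the status lookup is passed in as a callable, so the loop can be tested without network access. The function name `wait_for_job` and the 300-second default timeout are assumptions of this sketch, not Oxylabs recommendations.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,   # assumed default, tune for your workload
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll until the job is "done"; raise on "faulted" or on timeout."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "done":
            return
        if status == "faulted":
            raise RuntimeError("job faulted")
        if clock() >= deadline:
            raise TimeoutError(f"job not done after {timeout_s} seconds")
        sleep(interval_s)
```

In practice `fetch_status` would wrap the `requests.get(...)` call from the loop above and return its `"status"` field.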

Batch processing (up to 5,000 items):

python
payload = {
    "source": "universal",
    "url": ["https://example1.com", "https://example2.com", "https://example3.com"],
    "geo_location": "United States"
}

response = requests.post(
    "https://data.oxylabs.io/v1/queries/batch",
    auth=("USERNAME", "PASSWORD"),
    json=payload
)
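
When a URL list exceeds the 5,000-item limit stated above, it has to be split across several batch requests. A minimal sketch of that split follows; the helper name `batch_payloads` is hypothetical, and each yielded payload would be posted to the batch endpoint as in the example above.

```python
from typing import Any, Dict, Iterator, Sequence

MAX_BATCH_SIZE = 5000  # limit stated in this document


def batch_payloads(
    urls: Sequence[str],
    source: str = "universal",
    batch_size: int = MAX_BATCH_SIZE,
) -> Iterator[Dict[str, Any]]:
    """Yield one batch payload per chunk of at most `batch_size` URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield {"source": source, "url": list(urls[start:start + batch_size])}
```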

Result types: ?type=raw (HTML), ?type=parsed (JSON), ?type=png, ?type=markdown
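
The `type` query parameter is appended to the results URL used in the Push-Pull example. A small sketch of building that URL, restricted to the four types listed above; the helper name `results_url` is hypothetical.

```python
from urllib.parse import urlencode

# Result types listed in this document.
RESULT_TYPES = ("raw", "parsed", "png", "markdown")


def results_url(job_id: str, result_type: str) -> str:
    """Build the Push-Pull results URL for one result type."""
    if result_type not in RESULT_TYPES:
        raise ValueError(f"unsupported result type: {result_type!r}")
    base = f"https://data.oxylabs.io/v1/queries/{job_id}/results"
    return f"{base}?{urlencode({'type': result_type})}"
```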

Proxy Endpoint

Use like a standard proxy. GET requests only.

python
import requests

proxies = {
    "http": "http://USERNAME:PASSWORD@realtime.oxylabs.io:60000",
    "https": "http://USERNAME:PASSWORD@realtime.oxylabs.io:60000"
}

response = requests.get(
    "https://example.com",
    proxies=proxies,
    verify=False,  # Required
    headers={
        "x-oxylabs-geo-location": "Germany",
        "x-oxylabs-render": "html"
    }
)
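
Credentials embedded in a proxy URL break if they contain characters such as `@` or `:`. The sketch below percent-encodes them before building the `proxies` mapping, assuming the HTTP client decodes them again (as `requests` does, to the best of my knowledge). The helper name `build_proxies` is hypothetical.

```python
from typing import Dict
from urllib.parse import quote


def build_proxies(
    username: str,
    password: str,
    host: str = "realtime.oxylabs.io",
    port: int = 60000,
) -> Dict[str, str]:
    """Return a `proxies` mapping with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    # The same proxy URL serves both schemes, as in the example above.
    return {"http": proxy_url, "https": proxy_url}
```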

Sources

Universal Source

Scrape any website. Use source: "universal" with a URL.

python
payload = {
    "source": "universal",
    "url": "https://example.com",
    "geo_location": "United States",
    "render": "html",
    "parse": True
}

Amazon Sources

| Source | Purpose | Query Type |
| --- | --- | --- |
| amazon_product | Product page | ASIN |
| amazon_search | Search results | Search term |
| amazon_pricing | Offer listings | ASIN |
| amazon_sellers | Seller info | Seller ID |
| amazon_bestsellers | Best sellers | Category |

python
# Product by ASIN
payload = {
    "source": "amazon_product",
    "query": "B07FZ8S74R",
    "geo_location": "90210",
    "parse": True
}

# Search
payload = {
    "source": "amazon_search",
    "query": "laptop",
    "geo_location": "United States",
    "parse": True
}

Google Sources

| Source | Purpose |
| --- | --- |
| google_search | Web, Image, News SERPs |
| google_ads | Ad-optimized SERPs |
| google_shopping_search | Shopping results |
| google_shopping_product | Product pages |
| google_maps | Local search |
| google_trends_explore | Trend data |
| google_travel_hotels | Hotel search |
| google_lens | Image recognition |

python
payload = {
    "source": "google_search",
    "query": "web scraping",
    "geo_location": "California,United States",
    "parse": True
}

Other Sources

  • E-commerce: walmart, ebay, etsy, alibaba, aliexpress
  • Travel: airbnb, zillow
  • Video: youtube_search, tiktok_shop

See references/sources.md for complete list with parameters.

Key Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| source | Scraper type (required) | "universal" |
| url | Target URL | "https://example.com" |
| query | Search term or ID | "laptop" or "B07FZ8S74R" |
| geo_location | Proxy location | "United States", "90210" |
| render | JS rendering | "html" or "png" |
| parse | Enable parsing | true |
| user_agent_type | Browser type | "desktop_chrome" |
| callback_url | Webhook URL | "https://your-site.com/hook" |
| session_id | Maintain same IP | "session123" |
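
The parameters above can be assembled with a small builder that drops unset options. This is a sketch under one assumption drawn from the examples in this document: each payload carries exactly one of `url` or `query`. The helper name `build_payload` is hypothetical.

```python
from typing import Any, Dict, Optional


def build_payload(
    source: str,
    *,
    url: Optional[str] = None,
    query: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build a request payload, omitting options left as None."""
    if not source:
        raise ValueError("source is required")
    # Assumption: a job targets either a URL or a query, never both.
    if (url is None) == (query is None):
        raise ValueError("provide exactly one of url or query")
    payload: Dict[str, Any] = {"source": source}
    if url is not None:
        payload["url"] = url
    else:
        payload["query"] = query
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

For example, `build_payload("amazon_product", query="B07FZ8S74R", geo_location="90210", parse=True)` reproduces the Amazon product payload shown earlier.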

Custom Parser

Extract structured data using XPath/CSS selectors.

python
payload = {
    "source": "universal",
    "url": "https://example.com/products",
    "parse": True,
    "parsing_instructions": {
        "product_title": {
            "_fns": [
                {"_fn": "xpath_one", "_args": ["//h1[@class='title']/text()"]}
            ]
        },
        "price": {
            "_fns": [
                {"_fn": "xpath_one", "_args": ["//span[@class='price']/text()"]},
                {"_fn": "amount_from_string"}
            ]
        },
        "items": {
            "_fns": [
                {"_fn": "xpath", "_args": ["//li[@class='item']/text()"]},
                {"_fn": "length"}
            ]
        }
    }
}

Functions:

  • xpath_one - Extract single element
  • xpath - Extract multiple elements
  • amount_from_string - Convert text to number
  • length - Count items

Parser Presets: Save parsers for reuse via parser_preset parameter.

See references/custom-parser.md for detailed syntax.
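
Hand-writing nested `_fns` lists is error-prone, so the structure above can be generated with two small helpers. This sketch only reproduces the dictionary shape shown in the example; the helper names `fn` and `field` are hypothetical, and nothing here validates function names against the API.

```python
from typing import Any, Dict


def fn(name: str, *args: Any) -> Dict[str, Any]:
    """One pipeline step: a function name plus optional arguments."""
    step: Dict[str, Any] = {"_fn": name}
    if args:
        step["_args"] = list(args)
    return step


def field(*steps: Dict[str, Any]) -> Dict[str, Any]:
    """One output field: an ordered pipeline of steps."""
    return {"_fns": list(steps)}


parsing_instructions = {
    "product_title": field(fn("xpath_one", "//h1[@class='title']/text()")),
    "price": field(
        fn("xpath_one", "//span[@class='price']/text()"),
        fn("amount_from_string"),
    ),
    "items": field(fn("xpath", "//li[@class='item']/text()"), fn("length")),
}
```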

Browser Instructions

Automate interactions before scraping (clicks, scrolling, typing).

python
payload = {
    "source": "universal",
    "url": "https://example.com",
    "render": "html",
    "browser_instructions": [
        {"type": "wait", "wait_time_s": 2},
        {"type": "click", "selector": {"type": "css", "value": "#load-more"}},
        {"type": "wait_for_element", "selector": {"type": "css", "value": ".results"}},
        {"type": "scroll", "y": 500},
        {"type": "input", "selector": {"type": "css", "value": "#search"}, "value": "query"}
    ]
}
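
The example above pairs `browser_instructions` with `render: "html"`. Assuming that pairing is needed for the instructions to run, the sketch below attaches a list of instructions to a payload, sets `render` if it is missing, and rejects steps without a `type`. The helper name `with_browser_instructions` is hypothetical.

```python
from typing import Any, Dict, List


def with_browser_instructions(
    payload: Dict[str, Any],
    instructions: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a copy of `payload` with browser instructions attached."""
    for index, step in enumerate(instructions):
        if "type" not in step:
            raise ValueError(f"instruction {index} has no 'type'")
    updated = dict(payload)  # shallow copy; the caller's payload is untouched
    # Assumption: instructions run in a rendered page, so default to "html".
    updated.setdefault("render", "html")
    updated["browser_instructions"] = list(instructions)
    return updated
```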

Error Handling

Job statuses:

  • pending - Processing
  • done - Complete
  • faulted - Error (no charge)

Parse status codes:

  • 12000 - Success
  • 12005 - Parsed with warnings
  • 12002/12006/12007 - Error (no charge)

Connection timeout: open connections have a TTL of 150 seconds.
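
The parse status codes listed above can be mapped to an action in one place. This sketch covers only the codes named in this document and treats anything else as unknown; the function name `classify_parse_status` is hypothetical, and where the code appears in a response is not specified here.

```python
PARSE_SUCCESS = {12000}
PARSE_WARNING = {12005}
PARSE_ERROR = {12002, 12006, 12007}


def classify_parse_status(code: int) -> str:
    """Map a parse status code to "success", "warning", "error", or "unknown"."""
    if code in PARSE_SUCCESS:
        return "success"
    if code in PARSE_WARNING:
        return "warning"
    if code in PARSE_ERROR:
        return "error"
    return "unknown"
```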

Best Practices

  1. Use Push-Pull for volume - More reliable for large datasets
  2. Enable render: "html" - When pages load content via JavaScript
  3. Use parse: true - Get structured JSON instead of raw HTML
  4. Set geo_location - Match target audience location
  5. Use session_id - Maintain same IP for multi-page sessions
  6. Handle rate limits - Implement exponential backoff
  7. Store results - Configure storage_type and storage_url for S3/GCS
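
Item 6 can be sketched as a generic retry wrapper with exponentially growing delays. Which exceptions or status codes indicate rate limiting is not stated in this document, so the `retry_on` tuple is a placeholder to narrow in real code; the function names and default values are assumptions of this sketch.

```python
import time
from typing import Any, Callable, List, Tuple, Type


def backoff_delays(
    retries: int, base_s: float = 1.0, factor: float = 2.0, cap_s: float = 60.0
) -> List[float]:
    """Delays before each retry: base, base*factor, ... capped at cap_s."""
    return [min(cap_s, base_s * factor ** attempt) for attempt in range(retries)]


def call_with_backoff(
    func: Callable[[], Any],
    retries: int = 5,
    base_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),  # placeholder
) -> Any:
    """Call `func`, retrying up to `retries` times with growing delays."""
    for delay in backoff_delays(retries, base_s):
        try:
            return func()
        except retry_on:
            sleep(delay)
    return func()  # final attempt; any exception propagates to the caller
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting; adding random jitter to each delay is a common refinement when many workers retry at once.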

Resources

  • references/sources.md - Complete source list with parameters
  • references/custom-parser.md - Parser syntax and functions
  • scripts/scraper.py - Reusable scraper class