scraper-development

Create property listing scrapers with proxy IP support and fault tolerance

SKILL.md
--- frontmatter
name: scraper-development
description: Create property listing scrapers with proxy support and error resilience

Scraper Development Skill

Overview

This skill guides you through creating new property scrapers for the brickston-ai competition analysis system.

File Locations

  • Scrapers: apps/api/app/data/scrapers/
  • Provider Base: apps/api/app/data/scrapers/providers/
  • Orchestrator: apps/api/app/data/scrapers/scraper_orchestrator.py
  • Factory: apps/api/app/data/scrapers/scraper_factory.py
  • Config: apps/api/app/core/scraper_config.py

Creating a New Provider

Step 1: Create Provider File

Location: apps/api/app/data/scrapers/providers/<source>_provider.py

python
from typing import List, Optional

import httpx
from httpx import ProxyError

from app.data.scrapers.providers.base_provider import BaseProvider
from app.data.scrapers.models import ListingData

class NewSourceProvider(BaseProvider):
    """Scraper for newsource.com listings."""
    
    SOURCE_NAME = "newsource"
    BASE_URL = "https://newsource.com"
    
    async def scrape(self, proxy_url: Optional[str] = None) -> List[ListingData]:
        """Scrape listings from the source."""
        listings = []
        
        try:
            async with httpx.AsyncClient(proxy=proxy_url, timeout=30.0) as client:
                response = await client.get(f"{self.BASE_URL}/api/listings")
                response.raise_for_status()
                data = response.json()
                
                for item in data.get("listings", []):
                    listings.append(self._parse_listing(item))
                    
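        except ProxyError as e:
            if proxy_url is None:
                # No explicit proxy was in use, so there is nothing to fall back to
                raise
            # Retry once without the proxy
            self.logger.warning(f"Proxy failed, retrying without: {e}")
            return await self.scrape(proxy_url=None)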
        except Exception as e:
            self.logger.error(f"Scrape failed: {e}")
            raise
            
        return listings
    
    def _parse_listing(self, raw: dict) -> ListingData:
        """Parse raw listing data into standardized format."""
        return ListingData(
            source=self.SOURCE_NAME,
            external_id=raw.get("id"),
            address=raw.get("address"),
            city=raw.get("city"),
            state=raw.get("state"),
            zip_code=raw.get("zip"),
            price=raw.get("rent"),
            bedrooms=raw.get("beds"),
            bathrooms=raw.get("baths"),
            sqft=raw.get("sqft"),
            url=raw.get("url"),
        )

Step 2: Register in Factory

Edit apps/api/app/data/scrapers/scraper_factory.py:

python
from app.data.scrapers.providers.newsource_provider import NewSourceProvider

PROVIDERS = {
    # ... existing providers
    "newsource": NewSourceProvider,
}
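
How the factory turns a name into a provider instance is not shown here. The following is a minimal sketch of one way such a registry could be consumed. The `get_provider` helper and the stand-in classes are assumptions for illustration; the real scraper_factory.py may expose a different interface.

```python
from typing import Dict, Type


class BaseProvider:
    """Stand-in for the project's BaseProvider (assumption)."""

    SOURCE_NAME = "base"


class NewSourceProvider(BaseProvider):
    """Stand-in for the provider created in Step 1."""

    SOURCE_NAME = "newsource"


# Registry mapping a source name to its provider class.
PROVIDERS: Dict[str, Type[BaseProvider]] = {
    "newsource": NewSourceProvider,
}


def get_provider(name: str) -> BaseProvider:
    """Hypothetical helper: instantiate the provider registered under `name`."""
    try:
        provider_cls = PROVIDERS[name]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown provider {name!r}; registered: {known}") from None
    return provider_cls()
```

Failing loudly on an unknown name makes a missing registration visible as soon as the provider is requested.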

Step 3: Add to Orchestrator (Optional)

If the scraper should run as part of the nightly jobs, add it to the orchestrator config.

Error Handling Patterns

Proxy Error Recovery

Always implement proxy fallback:

python
except ProxyError:
    if proxy_url is None:
        raise  # no proxy in use, nothing to fall back to
    self.logger.warning(f"Proxy failed for {self.SOURCE_NAME}, retrying without proxy")
    return await self.scrape(proxy_url=None)
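
The fallback can be exercised without any network access. The sketch below uses a stand-in `ProxyError` and a hypothetical `fake_fetch` that fails whenever a proxy is supplied; in a real provider these would be `httpx.ProxyError` and an actual HTTP call. It also shows why the fallback should only trigger when a proxy was in use: otherwise a persistent failure would recurse forever.

```python
import asyncio
from typing import List, Optional


class ProxyError(Exception):
    """Stand-in for httpx.ProxyError so the sketch runs without httpx."""


async def fake_fetch(proxy_url: Optional[str]) -> List[dict]:
    """Hypothetical fetch: fails whenever a proxy is used, succeeds otherwise."""
    if proxy_url is not None:
        raise ProxyError(f"cannot connect through {proxy_url}")
    return [{"id": "1", "rent": 1500}]


async def scrape(proxy_url: Optional[str] = None) -> List[dict]:
    try:
        return await fake_fetch(proxy_url)
    except ProxyError:
        if proxy_url is None:
            # Already running without a proxy: nothing to fall back to.
            raise
        # Fall back exactly once to a direct connection.
        return await scrape(proxy_url=None)
```

Calling `asyncio.run(scrape("http://proxy.invalid:8080"))` takes the fallback path and returns the listing from the direct connection.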

Rate Limiting

python
import asyncio
from typing import List

async def scrape_with_rate_limit(self, urls: List[str]):
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(1.0)  # 1 second delay between requests
        # ... scrape logic
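
A self-contained version of the same pattern might look like the sketch below. The `fetch` callable and `fake_fetch` are hypothetical stand-ins for the real request logic, and the delay is a parameter so tests can shorten it.

```python
import asyncio
from typing import Awaitable, Callable, List


async def scrape_with_rate_limit(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    delay: float = 1.0,
) -> List[str]:
    """Fetch URLs one at a time, waiting `delay` seconds between requests."""
    results: List[str] = []
    for i, url in enumerate(urls):
        if i > 0:
            # No wait before the first request, only between requests.
            await asyncio.sleep(delay)
        results.append(await fetch(url))
    return results


async def fake_fetch(url: str) -> str:
    """Hypothetical fetch used only for this demo; no network access."""
    return f"body of {url}"
```

Requests run sequentially by design here; a concurrent scraper would need a shared limiter rather than a per-loop sleep.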

Retry Logic

python
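import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def fetch_with_retry(self, url: str) -> httpx.Response:
    # Any exception raised here triggers a retry, up to 3 attempts in total
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response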
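
To make the retry behaviour concrete, here is a dependency-free sketch of the same idea: a bounded number of attempts with an exponentially growing, capped wait. It illustrates the technique only and does not claim to reproduce tenacity's exact wait schedule. The `retry_async` name and its parameters are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 10.0,
) -> T:
    """Call `func` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Waits grow as base_delay, 2*base_delay, ... capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")  # The loop always returns or raises.
```

In practice the `except` clause would usually be narrowed to transient errors such as timeouts and connection failures, so that programming errors are not retried.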

Testing

bash
# Test single scraper
cd apps/api
python -m pytest tests/scrapers/test_newsource.py -v

# Test with scripts
python scripts/test_scrapers.py --provider newsource
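
The contents of tests/scrapers/test_newsource.py are not shown here. As a rough sketch, parser tests could look like the following pytest-style functions. `map_fields` is a hypothetical stand-in that mirrors part of the raw-to-standard key mapping in `_parse_listing`, returning a plain dict so the example runs without the project installed.

```python
from typing import Any, Dict


def map_fields(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for part of _parse_listing: rename raw keys to standard names."""
    return {
        "source": "newsource",
        "external_id": raw.get("id"),
        "zip_code": raw.get("zip"),
        "price": raw.get("rent"),
        "bedrooms": raw.get("beds"),
        "bathrooms": raw.get("baths"),
    }


def test_maps_source_fields() -> None:
    listing = map_fields({"id": "abc", "zip": "02139", "rent": 1500, "beds": 2})
    assert listing["external_id"] == "abc"
    assert listing["zip_code"] == "02139"
    assert listing["price"] == 1500
    assert listing["bedrooms"] == 2


def test_missing_fields_become_none() -> None:
    listing = map_fields({})
    assert listing["source"] == "newsource"
    assert listing["external_id"] is None
    assert listing["price"] is None
```

Testing the parser separately from the HTTP call keeps these tests fast and independent of the source site's availability.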

Data Model

Ensure listings match the ListingData schema:

  • source: Provider name
  • external_id: Unique ID from source
  • address, city, state, zip_code: Location
  • price: Monthly rent
  • bedrooms, bathrooms, sqft: Unit specs
  • url: Link to original listing
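
For orientation, the field list above could be expressed as a dataclass along these lines. The field names come from the list; the types, defaults, and the `missing_fields` helper are assumptions, and the real definition in app.data.scrapers.models may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListingData:
    """Hypothetical shape; types are assumptions, not the project's definition."""

    source: str                       # provider name
    external_id: str                  # unique ID from the source
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    zip_code: Optional[str] = None
    price: Optional[float] = None     # monthly rent
    bedrooms: Optional[float] = None
    bathrooms: Optional[float] = None
    sqft: Optional[int] = None
    url: Optional[str] = None         # link to the original listing


def missing_fields(listing: ListingData) -> List[str]:
    """Return the names of fields that are still None, in declaration order."""
    return [name for name, value in vars(listing).items() if value is None]
```

A helper like `missing_fields` can be logged per listing to spot sources that routinely omit data.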

Checklist

  • Provider class created in providers/
  • Registered in scraper_factory.py
  • Proxy error handling implemented
  • Rate limiting for respectful scraping
  • Standardized ListingData output
  • Unit tests written