Scraper Development Skill

Overview

This skill guides you through creating new property scrapers for the brickston-ai competition analysis system.

File Locations

• Scrapers: apps/api/app/data/scrapers/
• Provider Base: apps/api/app/data/scrapers/providers/
• Orchestrator: apps/api/app/data/scrapers/scraper_orchestrator.py
• Factory: apps/api/app/data/scrapers/scraper_factory.py
• Config: apps/api/app/core/scraper_config.py

Creating a New Provider

Step 1: Create Provider File

Location: apps/api/app/data/scrapers/providers/<source>_provider.py

python

from typing import List, Optional
import httpx
from app.data.scrapers.providers.base_provider import BaseProvider
from app.data.scrapers.models import ListingData
from httpx import ProxyError

class NewSourceProvider(BaseProvider):
    """Scraper for newsource.com listings."""
    
    SOURCE_NAME = "newsource"
    BASE_URL = "https://newsource.com"
    
    async def scrape(self, proxy_url: Optional[str] = None) -> List[ListingData]:
        """Scrape listings from the source."""
        listings = []
        
        try:
            async with httpx.AsyncClient(proxy=proxy_url, timeout=30.0) as client:
                response = await client.get(f"{self.BASE_URL}/api/listings")
                response.raise_for_status()
                data = response.json()
                
                for item in data.get("listings", []):
                    listings.append(self._parse_listing(item))
                    
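        except ProxyError as e:
            if proxy_url is None:
                # Already running without an explicit proxy: re-raise instead of recursing forever
                raise
            # Retry once without the proxy
            self.logger.warning(f"Proxy failed, retrying without: {e}")
            return await self.scrape(proxy_url=None)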
        except Exception as e:
            self.logger.error(f"Scrape failed: {e}")
            raise
            
        return listings
    
    def _parse_listing(self, raw: dict) -> ListingData:
        """Parse raw listing data into standardized format."""
        return ListingData(
            source=self.SOURCE_NAME,
            external_id=raw.get("id"),
            address=raw.get("address"),
            city=raw.get("city"),
            state=raw.get("state"),
            zip_code=raw.get("zip"),
            price=raw.get("rent"),
            bedrooms=raw.get("beds"),
            bathrooms=raw.get("baths"),
            sqft=raw.get("sqft"),
            url=raw.get("url"),
        )

Step 2: Register in Factory

Edit apps/api/app/data/scrapers/scraper_factory.py:

python

from app.data.scrapers.providers.newsource_provider import NewSourceProvider

PROVIDERS = {
    # ... existing providers
    "newsource": NewSourceProvider,
}
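
To illustrate what registration buys you, here is a minimal sketch of how a name-to-class registry can be resolved at runtime. The `create_provider` helper and the stand-in classes are hypothetical; the real lookup logic in scraper_factory.py may look different.

```python
from typing import Dict, Type


class BaseProvider:
    """Stand-in for the project's BaseProvider (assumed interface)."""

    SOURCE_NAME = "base"


class NewSourceProvider(BaseProvider):
    SOURCE_NAME = "newsource"


# Mirrors the PROVIDERS mapping: source name -> provider class.
PROVIDERS: Dict[str, Type[BaseProvider]] = {
    "newsource": NewSourceProvider,
}


def create_provider(name: str) -> BaseProvider:
    """Hypothetical helper: instantiate the provider registered under `name`."""
    try:
        provider_cls = PROVIDERS[name]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown provider {name!r}; registered: {known}") from None
    return provider_cls()


provider = create_provider("newsource")
print(type(provider).__name__)  # NewSourceProvider
```

Failing loudly on an unknown name makes a forgotten registration show up as a clear error rather than a silent no-op.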

Step 3: Add to Orchestrator (Optional)

If the scraper should run in the nightly jobs, add it to the orchestrator config.
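
The orchestrator's actual interface is not shown in this skill, so the following is only a sketch of the general idea: run several providers concurrently and keep one provider's failure from aborting the rest. `FakeProvider` and `run_nightly` are hypothetical names, not part of the codebase.

```python
import asyncio
from typing import Dict, List


class FakeProvider:
    """Stand-in provider; real providers would subclass BaseProvider."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail

    async def scrape(self) -> List[dict]:
        if self.fail:
            raise RuntimeError(f"{self.name} is down")
        return [{"source": self.name, "external_id": "1"}]


async def run_nightly(providers: List[FakeProvider]) -> Dict[str, object]:
    """Run all providers concurrently; exceptions are returned, not raised."""
    results = await asyncio.gather(
        *(p.scrape() for p in providers), return_exceptions=True
    )
    return {p.name: r for p, r in zip(providers, results)}


summary = asyncio.run(
    run_nightly([FakeProvider("newsource"), FakeProvider("other", fail=True)])
)
for name, outcome in summary.items():
    status = "failed" if isinstance(outcome, Exception) else f"{len(outcome)} listings"
    print(f"{name}: {status}")
```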

Error Handling Patterns

Proxy Error Recovery

Always implement proxy fallback:

python

except ProxyError as e:
    if proxy_url is None:
        raise  # No proxy to drop, so retrying would loop forever
    self.logger.warning(f"Proxy failed for {self.SOURCE_NAME}, retrying without proxy: {e}")
    return await self.scrape(proxy_url=None)
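
The fallback can be exercised without any network access. The sketch below uses a stand-in `ProxyError` class and a hypothetical `flaky_fetch` function so it runs with the standard library only; in a real provider the exception would be `httpx.ProxyError` and the fetch would be the HTTP call.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class ProxyError(Exception):
    """Stand-in for httpx.ProxyError so the sketch runs without httpx."""


FetchFn = Callable[[Optional[str]], Awaitable[List[dict]]]


async def scrape_with_fallback(fetch: FetchFn, proxy_url: Optional[str]) -> List[dict]:
    """Try through the proxy first; on ProxyError, retry exactly once without it."""
    try:
        return await fetch(proxy_url)
    except ProxyError:
        if proxy_url is None:
            raise  # Already direct: nothing left to fall back to
        return await fetch(None)


async def flaky_fetch(proxy_url: Optional[str]) -> List[dict]:
    """Hypothetical fetcher that fails whenever a proxy is configured."""
    if proxy_url is not None:
        raise ProxyError(f"cannot connect to {proxy_url}")
    return [{"id": "a1"}]


listings = asyncio.run(scrape_with_fallback(flaky_fetch, "http://proxy.example:8080"))
print(listings)  # [{'id': 'a1'}]
```

The `proxy_url is None` guard is what bounds the retry to a single attempt.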

Rate Limiting

python

import asyncio
from typing import List

async def scrape_with_rate_limit(self, urls: List[str]):
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(1.0)  # 1 second delay between requests
        # ... scrape logic
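
A complete, runnable variant of the same pattern might look like the sketch below. The injected `fetch` callable and the `delay` parameter are assumptions added here so the delay can be shortened in tests; they are not part of the original method.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def scrape_with_rate_limit(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    delay: float = 1.0,
) -> List[str]:
    """Fetch URLs one at a time, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(delay)  # No sleep before the first request
        results.append(await fetch(url))
    return results


async def fake_fetch(url: str) -> str:
    """Hypothetical fetcher standing in for a real HTTP call."""
    return f"body of {url}"


start = time.monotonic()
pages = asyncio.run(scrape_with_rate_limit(["/a", "/b", "/c"], fake_fetch, delay=0.05))
elapsed = time.monotonic() - start
print(pages)
```

Three URLs incur two delays, so the run takes roughly `2 * delay` seconds in total.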

Retry Logic

python

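import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def fetch_with_retry(self, url: str) -> httpx.Response:
    # Up to 3 attempts, with exponential backoff bounded to 2-10 seconds
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()  # Raise on 4xx/5xx so the call is retried
        return response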
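
To make the retry behaviour concrete, here is a hand-rolled sketch of bounded exponential backoff using only the standard library. It is not tenacity's implementation, and the delay formula in `backoff_delay` is an assumption, so tenacity's exact schedule may differ between versions. `retry_async`, `flaky`, and `record_sleep` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, multiplier: float = 1.0,
                  min_s: float = 2.0, max_s: float = 10.0) -> float:
    """Delay after failed attempt number `attempt` (1-based), clamped to [min_s, max_s]."""
    return max(min_s, min(multiplier * 2 ** (attempt - 1), max_s))


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call `func` up to `attempts` times, backing off between failures."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            await sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")


calls = 0
delays: List[float] = []


async def flaky() -> str:
    """Fails twice, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError("temporary failure")
    return "ok"


async def record_sleep(seconds: float) -> None:
    """Records the requested delay instead of actually sleeping."""
    delays.append(seconds)


result = asyncio.run(retry_async(flaky, attempts=3, sleep=record_sleep))
print(result, delays)  # ok [2.0, 2.0]
```

Injecting `sleep` keeps the demonstration instant while still showing which delays would have been applied.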

Testing

bash

# Test single scraper
cd apps/api
python -m pytest tests/scrapers/test_newsource.py -v

# Test with scripts
python scripts/test_scrapers.py --provider newsource
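
A test file such as tests/scrapers/test_newsource.py could start with checks like the ones sketched below. `parse_listing` is a stand-in that copies part of the key mapping from `_parse_listing`; a real test would import `NewSourceProvider` and call its method instead.

```python
from types import SimpleNamespace


def parse_listing(raw: dict) -> SimpleNamespace:
    """Stand-in for NewSourceProvider._parse_listing with the same key mapping."""
    return SimpleNamespace(
        source="newsource",
        external_id=raw.get("id"),
        zip_code=raw.get("zip"),
        price=raw.get("rent"),
        bedrooms=raw.get("beds"),
    )


def test_maps_source_keys_to_schema_fields():
    listing = parse_listing({"id": "42", "zip": "78701", "rent": 1850, "beds": 2})
    assert listing.source == "newsource"
    assert listing.external_id == "42"
    assert listing.zip_code == "78701"
    assert listing.price == 1850
    assert listing.bedrooms == 2


def test_missing_keys_become_none():
    # raw.get() returns None for absent keys rather than raising KeyError
    listing = parse_listing({})
    assert listing.external_id is None
    assert listing.price is None
```

pytest collects any function whose name starts with `test_`, so no extra scaffolding is needed.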

Data Model

Ensure listings match the ListingData schema:

• source: Provider name
• external_id: Unique ID from source
• address, city, state, zip_code: Location
• price: Monthly rent
• bedrooms, bathrooms, sqft: Unit specs
• url: Link to original listing
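
As a rough picture of the schema, the fields above could be modelled as in the sketch below. The field types, the defaults, and the use of a dataclass are assumptions; the authoritative definition is the one in app.data.scrapers.models.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ListingData:
    """Assumed shape only; check app.data.scrapers.models for the real model."""

    source: str
    external_id: str
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    zip_code: Optional[str] = None
    price: Optional[float] = None  # Monthly rent
    bedrooms: Optional[float] = None
    bathrooms: Optional[float] = None
    sqft: Optional[int] = None
    url: Optional[str] = None


listing = ListingData(
    source="newsource",
    external_id="42",
    city="Austin",
    price=1850.0,
    bedrooms=2,
    bathrooms=1.5,
)
record = asdict(listing)
print(record["city"], record["price"])  # Austin 1850.0
```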

Checklist

• Provider class created in providers/
• Registered in scraper_factory.py
• Proxy error handling implemented
• Rate limiting for respectful scraping
• Standardized ListingData output
• Unit tests written