Scraper Development Skill
Overview
This skill guides you through creating new property scrapers for the brickston-ai competition analysis system.
File Locations
- Scrapers: apps/api/app/data/scrapers/
- Provider Base: apps/api/app/data/scrapers/providers/
- Orchestrator: apps/api/app/data/scrapers/scraper_orchestrator.py
- Factory: apps/api/app/data/scrapers/scraper_factory.py
- Config: apps/api/app/core/scraper_config.py
Creating a New Provider
Step 1: Create Provider File
Location: apps/api/app/data/scrapers/providers/<source>_provider.py
```python
from typing import List, Optional

import httpx
from httpx import ProxyError

from app.data.scrapers.providers.base_provider import BaseProvider
from app.data.scrapers.models import ListingData


class NewSourceProvider(BaseProvider):
    """Scraper for newsource.com listings."""

    SOURCE_NAME = "newsource"
    BASE_URL = "https://newsource.com"

    async def scrape(self, proxy_url: Optional[str] = None) -> List[ListingData]:
        """Scrape listings from the source."""
        listings = []
        try:
            async with httpx.AsyncClient(proxy=proxy_url, timeout=30.0) as client:
                response = await client.get(f"{self.BASE_URL}/api/listings")
                response.raise_for_status()
                data = response.json()
                for item in data.get("listings", []):
                    listings.append(self._parse_listing(item))
        except ProxyError as e:
            if proxy_url is None:
                # No explicit proxy was in use, so there is nothing to fall back to.
                raise
            # Retry once without the proxy
            self.logger.warning(f"Proxy failed, retrying without: {e}")
            return await self.scrape(proxy_url=None)
        except Exception as e:
            self.logger.error(f"Scrape failed: {e}")
            raise
        return listings

    def _parse_listing(self, raw: dict) -> ListingData:
        """Parse raw listing data into standardized format."""
        return ListingData(
            source=self.SOURCE_NAME,
            external_id=raw.get("id"),
            address=raw.get("address"),
            city=raw.get("city"),
            state=raw.get("state"),
            zip_code=raw.get("zip"),
            price=raw.get("rent"),
            bedrooms=raw.get("beds"),
            bathrooms=raw.get("baths"),
            sqft=raw.get("sqft"),
            url=raw.get("url"),
        )
```
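The `raw.get(...)` calls in `_parse_listing` pass values through unchanged. If a source returns numeric fields as display strings such as `"$1,850"` (an assumption; check the actual payload), a small coercion helper can keep `price`, `bedrooms`, and `sqft` numeric. `to_number` below is a hypothetical helper, not part of the existing codebase:

```python
import re
from typing import Optional, Union


def to_number(value: Union[str, int, float, None]) -> Optional[float]:
    """Coerce a raw field such as "$1,850" or "2.5" to a float, or None if unparseable."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Take the first run of digits, allowing thousands separators and a decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", value)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

Inside `_parse_listing`, this could be applied as `price=to_number(raw.get("rent"))`, assuming the real `ListingData` accepts floats for that field.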
Step 2: Register in Factory
Edit apps/api/app/data/scrapers/scraper_factory.py:
```python
from app.data.scrapers.providers.newsource_provider import NewSourceProvider

PROVIDERS = {
    # ... existing providers
    "newsource": NewSourceProvider,
}
```
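How the factory resolves a name from `PROVIDERS` is not shown in this skill. One plausible shape is a lookup that fails loudly on unknown names. In this sketch `get_provider` and both stand-in classes are hypothetical, so check `scraper_factory.py` for the real interface:

```python
from typing import Dict, Type


class BaseProvider:
    """Stand-in for the real BaseProvider, for illustration only."""


class NewSourceProvider(BaseProvider):
    SOURCE_NAME = "newsource"


# Registry keyed by source name, mirroring the PROVIDERS dict.
PROVIDERS: Dict[str, Type[BaseProvider]] = {
    "newsource": NewSourceProvider,
}


def get_provider(name: str) -> BaseProvider:
    """Instantiate the provider registered under `name` (hypothetical helper)."""
    try:
        provider_cls = PROVIDERS[name]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown provider {name!r}; registered: {known}") from None
    return provider_cls()
```

Raising on an unknown name makes a typo in a job config fail when the provider is looked up, not later as a missing attribute.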
Step 3: Add to Orchestrator (Optional)
If the scraper should run in the nightly jobs, add it to the orchestrator config.
Error Handling Patterns
Proxy Error Recovery
Always implement proxy fallback:
```python
except ProxyError:
    if proxy_url is None:
        raise  # no proxy in use, nothing to fall back to
    self.logger.warning(f"Proxy failed for {self.SOURCE_NAME}, retrying without proxy")
    return await self.scrape(proxy_url=None)
```
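The fallback should run at most once: a `ProxyError` raised while `proxy_url` is already `None` must propagate, not trigger another retry. Here is a minimal, stdlib-only sketch of that control flow. `ProxyError` is a local stand-in for `httpx.ProxyError`, and `scrape_with_fallback` and `flaky_fetch` are hypothetical names:

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class ProxyError(Exception):
    """Stand-in for httpx.ProxyError so the sketch runs without httpx."""


async def scrape_with_fallback(
    fetch: Callable[[Optional[str]], Awaitable[List[dict]]],
    proxy_url: Optional[str],
) -> List[dict]:
    """Call fetch(proxy_url); on a proxy failure, retry exactly once without a proxy."""
    try:
        return await fetch(proxy_url)
    except ProxyError:
        if proxy_url is None:
            raise  # already running direct, nothing left to fall back to
        return await fetch(None)


async def flaky_fetch(proxy_url: Optional[str]) -> List[dict]:
    """Hypothetical fetcher that fails whenever a proxy is used."""
    if proxy_url is not None:
        raise ProxyError(f"cannot connect to {proxy_url}")
    return [{"id": "1"}]
```

Calling `asyncio.run(scrape_with_fallback(flaky_fetch, "http://proxy.example:8080"))` fails on the proxied attempt and succeeds on the direct one.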
Rate Limiting
```python
import asyncio
from typing import List

async def scrape_with_rate_limit(self, urls: List[str]):
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(1.0)  # 1 second delay between requests
        # ... scrape logic
```
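To show the same delay pattern end to end, the sketch below takes the fetch step as a parameter so it can run without network access. `fetch_all_rate_limited` and its `fetch` callback are hypothetical names, and the one-second default mirrors the delay used in this section:

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all_rate_limited(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    delay: float = 1.0,
) -> List[str]:
    """Fetch URLs one at a time, sleeping `delay` seconds between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Skip the delay before the first request only.
            await asyncio.sleep(delay)
        results.append(await fetch(url))
    return results
```

Because requests are sequential, total runtime grows linearly with the number of URLs. That is the intended trade-off for respectful scraping.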
Retry Logic
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def fetch_with_retry(self, url: str):
    ...  # fetch logic
```
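For readers who want to see what this policy amounts to, here is a stdlib-only sketch of bounded retries with clamped exponential backoff. It illustrates the idea and is not tenacity's implementation: `retry_async` is a hypothetical helper, and its exact wait values may differ from those produced by `wait_exponential`:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    multiplier: float = 1.0,
    min_wait: float = 2.0,
    max_wait: float = 10.0,
) -> T:
    """Await func() up to `attempts` times, backing off exponentially between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles with each attempt, clamped to [min_wait, max_wait].
            wait = min(max(multiplier * 2 ** attempt, min_wait), max_wait)
            await asyncio.sleep(wait)
    raise ValueError("attempts must be at least 1")
```

In a real provider it is usually better to retry only on transient errors such as timeouts or 5xx responses. This sketch catches `Exception` for brevity.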
Testing
```bash
# Test single scraper
cd apps/api
python -m pytest tests/scrapers/test_newsource.py -v

# Test with scripts
python scripts/test_scrapers.py --provider newsource
```
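The contents of `tests/scrapers/test_newsource.py` are not shown here. As a starting point, parsing can be tested without any network access. The sketch below uses reduced stand-ins for `ListingData` and `_parse_listing` so it runs on its own; a real test would import both from the app:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListingData:
    """Reduced stand-in for the real ListingData model (illustration only)."""
    source: str
    external_id: Optional[str]
    price: Optional[float]


def parse_listing(raw: dict) -> ListingData:
    """Stand-in for NewSourceProvider._parse_listing."""
    return ListingData(source="newsource", external_id=raw.get("id"), price=raw.get("rent"))


def test_parse_listing_maps_fields():
    listing = parse_listing({"id": "abc-1", "rent": 1850})
    assert listing.source == "newsource"
    assert listing.external_id == "abc-1"
    assert listing.price == 1850


def test_parse_listing_tolerates_missing_fields():
    # Missing keys should become None, not raise KeyError.
    listing = parse_listing({})
    assert listing.external_id is None
    assert listing.price is None
```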
Data Model
Ensure listings match the ListingData schema:
- source: Provider name
- external_id: Unique ID from source
- address, city, state, zip_code: Location
- price: Monthly rent
- bedrooms, bathrooms, sqft: Unit specs
- url: Link to original listing
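A quick way to spot incomplete mappings is to list which schema fields a parsed listing left empty. The dataclass below is an illustrative stand-in built from the field names above. The real model lives in `app.data.scrapers.models`, and the types and defaults here are assumptions:

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ListingData:
    """Illustrative stand-in; field types and defaults are assumed."""
    source: str
    external_id: Optional[str] = None
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    zip_code: Optional[str] = None
    price: Optional[float] = None
    bedrooms: Optional[float] = None
    bathrooms: Optional[float] = None
    sqft: Optional[float] = None
    url: Optional[str] = None


def empty_fields(listing: ListingData) -> List[str]:
    """Return the names of fields the provider left as None, in declaration order."""
    return [f.name for f in fields(listing) if getattr(listing, f.name) is None]
```

Logging `empty_fields(listing)` for a sample of scraped listings shows at a glance which source keys are mapped incorrectly or are absent from the payload.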
Checklist
- Provider class created in providers/
- Registered in scraper_factory.py
- Proxy error handling implemented
- Rate limiting for respectful scraping
- Standardized ListingData output
- Unit tests written