Validate Scraper Output
Validates scraped store data for quality issues using the project's validation utilities.
Usage
| Command | Description |
|---|---|
/validate-output verizon | Validate Verizon store data |
/validate-output all | Validate all retailers |
/validate-output target --strict | Strict mode (treat warnings as errors) |
What Gets Validated
Required Fields (errors if missing)
- •
store_id- Unique store identifier - •
name- Store name - •
street_address- Street address - •
city- City name - •
state- State code (2 letters for US)
Recommended Fields (warnings if missing)
- •
latitude/longitude- Geographic coordinates - •
phone- Contact phone number - •
url- Store page URL
Additional Checks
- •Coordinates within valid bounds (-90 to 90 lat, -180 to 180 lon)
- •Not (0,0) coordinates (common scraping error)
- •Duplicate store_id detection
- •Empty string vs null field values
Execution Steps
- •Load store data from
data/{retailer}/output/stores_latest.json - •Run
validate_stores_batch()fromsrc/shared/utils.py - •Generate quality report with:
- •Total stores and valid count
- •Field completeness percentages
- •List of specific errors/warnings (first 10)
- •Critical issues requiring attention
Example Output
code
=== Validation Report: verizon === Total stores: 1,247 Valid stores: 1,189 (95.3%) Field Completeness: store_id: 100.0% name: 100.0% street_address: 100.0% city: 100.0% state: 100.0% latitude: 98.2% longitude: 98.2% phone: 87.4% Issues Found: - ERROR: Store 'VZW-1234' missing latitude/longitude - WARNING: Store 'VZW-5678' has coordinates (0.0, 0.0) ... (22 more issues)
Implementation
python
import json
from pathlib import Path
from src.shared.utils import validate_stores_batch
# Load data
data_path = Path(f"data/{retailer}/output/stores_latest.json")
with open(data_path) as f:
stores = json.load(f)
# Validate
summary = validate_stores_batch(stores, strict=strict_mode, log_issues=True)
# Report completeness
fields = ['store_id', 'name', 'street_address', 'city', 'state', 'latitude', 'longitude', 'phone', 'url']
for field in fields:
present = sum(1 for s in stores if s.get(field))
pct = (present / len(stores)) * 100
print(f" {field}: {pct:.1f}%")
Error Handling
- •If no
stores_latest.jsonexists, suggest running the scraper first - •If file is empty or malformed, report JSON parse error
- •If retailer doesn't exist, list available retailers