Lead Scraping & Verification
Goal
Scrape leads using Apify (code_crafter/leads-finder), verify their relevance (industry match > 80%), and save them to a Google Sheet. For large scrapes (1000+ leads), use parallel scraping for 3-5x faster performance.
Inputs
- •Industry: The target industry (e.g., "Plumbers", "Software Agencies")
- •Location: The target location (e.g., "Texas", "United States", "California"). Scripts auto-format to Apify's required format (US states get ", us" suffix automatically).
- •Total Count: The total number of leads desired
Scripts
All scripts are in ./scripts/:
- •
scrape_apify.py- Single scrape, for <1000 leads - •
scrape_apify_parallel.py- Parallel scraping, for 1000+ leads - •
classify_leads_llm.py- LLM-based lead classification - •
enrich_emails.py- Email enrichment via AnyMailFinder - •
update_sheet.py- Batch sheet updates - •
read_sheet.py- Read data from Google Sheets
Process
Small Scrapes (<1000 leads)
- •
Test Scrape
bashpython3 ./scripts/scrape_apify.py --query "INDUSTRY" --location "LOCATION" --max_items 25 --no-email-filter --output .tmp/test_leads.json
- •
Verification
- •Read
.tmp/test_leads.json - •Check if at least 20/25 (80%) leads match the Industry
- •Pass: Proceed to step 3
- •Fail: Stop and ask user to refine keywords
- •Read
- •
Full Scrape
bashpython3 ./scripts/scrape_apify.py --query "INDUSTRY" --location "LOCATION" --max_items TOTAL_COUNT --no-email-filter --output .tmp/leads.json
- •
[Optional] LLM Classification (for complex niches)
bashpython3 ./scripts/classify_leads_llm.py .tmp/leads.json --classification_type product_saas --output .tmp/classified_leads.json
- •
Upload to Google Sheet
bashpython3 ./scripts/update_sheet.py .tmp/leads.json --title "Leads - INDUSTRY"
- •
Enrich Missing Emails
bashpython3 ./scripts/enrich_emails.py SHEET_URL
Large Scrapes (1000+ leads)
- •
Test Scrape (same as above with 25 items)
- •
Parallel Full Scrape
bashpython3 ./scripts/scrape_apify_parallel.py \ --query "INDUSTRY" \ --total_count 4000 \ --location "United States" \ --strategy regions \ --no-email-filter
Geographic partitioning is automatic:
- •United States: 4-way (Northeast, Southeast, Midwest, West)
- •EU/Europe: 4-way (Western, Southern, Northern, Eastern)
- •UK: 4-way (SE England, N England, Scotland/Wales, SW England)
- •Canada: 4-way (Ontario, Quebec, West, Atlantic)
- •Australia: 4-way (NSW, VIC/TAS, QLD, WA/SA)
- •
Continue with steps 4-6 from small scrapes
Outputs
The ONLY deliverable is the Google Sheet URL. Local JSON files in .tmp/ are temporary intermediates.
Edge Cases
- •No leads found: Ask user to broaden search
- •API Error: Check credentials in
.env - •Low quality classifications: If >80% "unclear", improve scrape keywords
Environment
Requires in .env:
code
APIFY_API_TOKEN=your_token GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json ANTHROPIC_API_KEY=your_key ANYMAILFINDER_API_KEY=your_key