Invoice Processing
Workflow
code
Invoice Processing:
- [ ] Step 1: Log start time
- [ ] Step 2: Extract PDF text
- [ ] Step 3: Parse invoice fields
- [ ] Step 4: Validate (run validate_invoice.py)
- [ ] Step 5: Fix errors and re-validate if needed
- [ ] Step 6: Save final output AND eval log
Step 1: Log start time
Record the start time for eval tracking:
python
from datetime import datetime

start_time = datetime.now().isoformat()
Step 2: Extract PDF text
python
from pypdf import PdfReader

reader = PdfReader("invoice.pdf")
text = ""
for page in reader.pages:
    page_text = page.extract_text()
    if page_text:
        text += page_text + "\n"
Step 3: Parse fields
Extract from text:
- vendor: Company name (usually at document top, in larger font)
- invoice_number: Look for "Invoice #", "Invoice No.", "INV-", "#"
- date: Invoice/billing date -> convert to YYYY-MM-DD
- total: Final amount ("Total:", "Amount Due:", "Balance:")
- currency: Default USD if not specified
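The field rules above can be sketched with a few regular expressions. This is a minimal illustration, not a complete parser: the patterns, the list of date formats, the currency codes, and the helper names `normalize_date` and `parse_invoice_fields` are all assumptions made for the example. Font size is not available in extracted text, so the vendor is guessed from the first non-empty line.

```python
import re
from datetime import datetime

# Illustrative patterns only; real invoices vary and will need more cases.
INVOICE_NO_RE = re.compile(
    r"Invoice\s*(?:#|No\.)\s*:?\s*([A-Za-z0-9-]+)", re.IGNORECASE
)
INV_PREFIX_RE = re.compile(r"\b(INV-[A-Za-z0-9-]+)")
DATE_RE = re.compile(r"\bDate\s*:?\s*([^\n]+)", re.IGNORECASE)
# \b keeps "Subtotal" from matching as "Total".
TOTAL_RE = re.compile(
    r"\b(?:Total|Amount Due|Balance)\s*:?\s*[$€£]?\s*(\d[\d,]*(?:\.\d{1,2})?)",
    re.IGNORECASE,
)
CURRENCY_RE = re.compile(r"\b(USD|EUR|GBP)\b")  # assumed short list of codes
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%b %d, %Y")


def normalize_date(raw):
    """Return raw as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def parse_invoice_fields(text):
    """Best-effort extraction; fields that cannot be found are None."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    number = INVOICE_NO_RE.search(text) or INV_PREFIX_RE.search(text)
    date = DATE_RE.search(text)
    totals = TOTAL_RE.findall(text)
    currency = CURRENCY_RE.search(text)
    return {
        # Heuristic: the vendor name is usually the first line of the document.
        "vendor": lines[0] if lines else None,
        "invoice_number": number.group(1) if number else None,
        "date": normalize_date(date.group(1)) if date else None,
        # Use the last labelled amount, on the assumption that it is the final one.
        "total": float(totals[-1].replace(",", "")) if totals else None,
        "currency": currency.group(1) if currency else "USD",
    }
```

Any field that comes back as None, or any date that fails to normalize, should be re-read from the PDF text by hand rather than guessed.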
Step 4: Validate
Run: python scripts/validate_invoice.py output.json
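If the validator needs to be called from Python rather than from the shell, a wrapper along these lines may help. It assumes that validate_invoice.py exits with status 0 on success and prints its error messages to stdout or stderr; the source does not confirm either point, and `run_validator` is a hypothetical helper name.

```python
import subprocess
import sys


def run_validator(output_path, script="scripts/validate_invoice.py"):
    """Run the validator script and return (passed, combined_output)."""
    result = subprocess.run(
        [sys.executable, script, output_path],
        capture_output=True,
        text=True,
    )
    # Assumption: exit code 0 means the invoice passed validation.
    passed = result.returncode == 0
    return passed, (result.stdout + result.stderr).strip()
```

Usage would look like `passed, message = run_validator("output.json")`, with `message` holding whatever the script printed.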
Step 5: Fix and re-validate
If validation fails:
- Read the specific error message
- Re-examine PDF text for that field
- Update JSON with corrected value
- Run validation again
- Repeat until validation passes
Common issues: See TROUBLESHOOTING.md
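The fix-and-re-validate cycle above can be sketched as a bounded loop. The `validate` and `fix` callables are hypothetical hooks supplied by the caller, and the `max_attempts` cap is an addition for safety that the original steps do not specify.

```python
def validate_until_pass(validate, fix, max_attempts=5):
    """Validate, and on failure hand the error text to fix() and retry.

    validate() must return (passed, message); fix(message) should correct
    the field named in the message. Returns the attempt number that passed.
    """
    for attempt in range(1, max_attempts + 1):
        passed, message = validate()
        if passed:
            return attempt
        if attempt < max_attempts:
            fix(message)  # hypothetical hook: update the JSON for that field
    raise RuntimeError(f"validation still failing after {max_attempts} attempts")
```

Raising after a fixed number of attempts is a design choice: it surfaces a stuck field instead of looping forever on an invoice that cannot be repaired automatically.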
Step 6: Save results
Save two files:
- Output file (requested by user):
json
{
  "vendor": "...",
  "invoice_number": "...",
  "date": "YYYY-MM-DD",
  "total": 0.00
}
- Eval log (always append to eval_results/all_evals.jsonl):
bash
python scripts/collect_eval.py "<task_id>" "<original_task_prompt>" "<output_file>" "<notes>"
Example:
bash
python scripts/collect_eval.py "invoice-basic" "Extract invoice data from invoice.pdf" "output.json" "validation passed on first attempt"
Always append to eval_results/all_evals.jsonl, one JSON object per line. If the file already exists, add to it rather than overwriting it.
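For readers unfamiliar with the JSONL convention, the sketch below shows what appending one record per line looks like. It is not the implementation of collect_eval.py: the record keys and the `append_eval_record` helper are hypothetical, chosen to mirror the four command-line arguments shown above.

```python
import json
from datetime import datetime


def append_eval_record(log_path, task_id, prompt, output_file, notes):
    """Append a single JSON object as one line; earlier lines are untouched."""
    record = {
        # Hypothetical keys; the real schema is defined by collect_eval.py.
        "task_id": task_id,
        "prompt": prompt,
        "output_file": output_file,
        "notes": notes,
        "logged_at": datetime.now().isoformat(),
    }
    # Mode "a" appends to an existing file and creates it if it is missing.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because every line is a complete JSON object, the log can be read back line by line with `json.loads` without parsing the whole file at once.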
Output format
json
{
  "vendor": "Company Name",
  "invoice_number": "INV-2025-001",
  "date": "2025-01-15",
  "total": 1250.00,
  "currency": "USD",
  "line_items": []
}
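One way to assemble a record in this shape from parsed fields is sketched below. Treating the first four keys as required is an assumption made for the example (VALIDATION.md holds the actual rules), and `build_output` and `save_output` are hypothetical helper names.

```python
import json

# Assumed to be mandatory for this sketch; check VALIDATION.md for the real rules.
REQUIRED_KEYS = ("vendor", "invoice_number", "date", "total")


def build_output(fields):
    """Return a dict in the output format, filling the documented defaults."""
    missing = [key for key in REQUIRED_KEYS if fields.get(key) is None]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return {
        "vendor": fields["vendor"],
        "invoice_number": fields["invoice_number"],
        "date": fields["date"],
        "total": round(float(fields["total"]), 2),
        "currency": fields.get("currency") or "USD",  # default from Step 3
        "line_items": list(fields.get("line_items") or []),
    }


def save_output(record, path="output.json"):
    """Write the record as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
```

This is only a shape check before writing the file; it does not replace running validate_invoice.py.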
Validation rules
See VALIDATION.md for complete rules.