Receipt Parser Engineer

Name: receipt-parser-engineer
Rating: 92
Author: ThomasMcCrossin

Overview

Build and maintain vendor parsers that turn extracted receipt text into ReceiptNormalized with high accuracy, backed by golden fixtures. Keep parsing deterministic where possible and treat Claude Vision as the safety net for unknown vendors.

Workflow Decision Tree

•Start from the file type
- PDF: use embedded text when possible; if the PDF has little/no embedded text, OCR it.
- Image: OCR it.
- Email HTML/text: parse directly (no OCR).
•Decide whether to write/extend a deterministic parser
- Write/refine a deterministic parser when the vendor is high-volume, has structured line items, or needs reliable tax/subtotal/total.
- Prefer Claude Vision fallback when the vendor is low-volume, highly variable, or you lack enough samples to stabilize patterns.
•Choose the OCR/text-extraction path
- Rule: pdfplumber is for text-based PDFs; AWS Textract is for images and anything pdfplumber can’t extract meaningfully from a PDF.
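
The file-type and extraction-path rules above can be sketched as a single routing function. This is an illustration only, not the code in packages/parsers/ocr/factory.py: the function name, the suffix sets, and the MIN_EMBEDDED_CHARS threshold are all assumptions, and the real factory may decide "little/no embedded text" differently.

```python
from pathlib import Path

# Hypothetical threshold: below this many non-whitespace characters a PDF is
# treated as having "little/no embedded text". Tune against real samples.
MIN_EMBEDDED_CHARS = 50

# Hypothetical suffix sets, chosen for illustration.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
EMAIL_SUFFIXES = {".html", ".htm", ".txt"}


def choose_extraction_path(filename: str, embedded_text: str = "") -> str:
    """Return "direct", "pdfplumber", or "textract" for a receipt file.

    `embedded_text` is whatever text was already pulled from a PDF; it is
    ignored for non-PDF inputs.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in EMAIL_SUFFIXES:
        return "direct"  # email HTML/text: parse directly, no OCR
    if suffix in IMAGE_SUFFIXES:
        return "textract"  # images always go through OCR
    if suffix == ".pdf":
        # Count non-whitespace characters so blank pages do not look like text.
        meaningful = len("".join(embedded_text.split()))
        return "pdfplumber" if meaningful >= MIN_EMBEDDED_CHARS else "textract"
    raise ValueError(f"Unsupported receipt file type: {suffix or filename}")
```

Counting non-whitespace characters is one simple proxy for "meaningful" text; a scanned PDF with a few stray embedded characters would still route to Textract under this rule.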

Core Invariants

•Do not introduce any local OCR-binary wrapper or dependency; use Textract for OCR.
•Keep vendor parsers deterministic: parse from ocr_text (and pdf_path only when table extraction is required).
•Every parser change ships with a golden fixture test that reproduces the bug and prevents regressions.
•Avoid “magic balancing” lines: prefer validation_warnings for missing/faded items rather than inventing data.
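
The last invariant can be illustrated with a small consistency check that records a warning and leaves the line items untouched. This is a sketch under assumptions: ParsedReceipt is a hypothetical stand-in for ReceiptNormalized (whose real fields are not shown here), and check_line_sum and its tolerance are invented for the example.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class ParsedReceipt:
    """Hypothetical stand-in for ReceiptNormalized; the real model may differ."""
    subtotal: Decimal
    line_totals: list[Decimal]
    validation_warnings: list[str] = field(default_factory=list)


def check_line_sum(
    receipt: ParsedReceipt, tolerance: Decimal = Decimal("0.01")
) -> ParsedReceipt:
    """Record a warning when line items do not add up; never invent a line."""
    line_sum = sum(receipt.line_totals, Decimal("0"))
    gap = receipt.subtotal - line_sum
    if abs(gap) > tolerance:
        # Report the gap instead of appending a synthetic "balancing" item.
        receipt.validation_warnings.append(
            f"line items sum to {line_sum} but subtotal is {receipt.subtotal} "
            f"(gap {gap}); items may be missing or faded"
        )
    return receipt
```

The point of the design is that downstream consumers can see the receipt was inconsistent, rather than trusting a line item that never appeared on paper.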

Workflow: Add a New Vendor Parser (Deterministic)

•Collect samples
- Target 3–10 real receipts/invoices with known-good totals.
- Prefer multiple layouts (thermal vs letter, refunds vs purchases, discounts/deposits).
•Generate OCR text for fixtures
- Use the OCR factory (pdfplumber → Textract fallback) from the worker container:
  - docker compose exec worker python scripts/test_vendor_parsers.py /path/to/receipt.pdf
- Save to tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt.
•Create expected outputs
- Copy a known-good parse (or fill by hand) into:
  - tests/fixtures/golden_receipts/<vendor>/<name>_expected.json
- Keep expected JSON minimal: only assert the fields you truly want stable.
•Implement the parser
- Add packages/invoice_parsers/vendors/<vendor>_parser.py:
  - detect_format(ocr_text) -> bool should be strict enough to avoid false positives.
  - parse(ocr_text, entity, pdf_path=...) -> ReceiptNormalized should be resilient to OCR noise.
- If table extraction is required, accept pdf_path and use pdfplumber inside the parser.
•Register the parser
- Update packages/invoice_parsers/vendor_dispatcher.py:
  - import your parser
  - add it to the parser list (before GenericParser)
  - add the vendor key to GOLDEN_VENDORS when it has golden coverage
•Add golden tests
- Update tests/unit/test_invoice_parsers.py with a new @pytest.mark.golden class for the vendor.
- Assert at minimum: vendor_guess, totals, purchase date, invoice number (if applicable), and line count.
•Run tests
- make test-golden
- If you touch shared parsing behavior, run make test-unit too.
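
As a rough illustration of the "Implement the parser" step, the sketch below shows one possible shape for a vendor parser with a strict detect_format and a noise-tolerant parse. Everything vendor-specific is made up: AcmeParser, the "ACME" patterns, and ReceiptSketch (a stand-in for ReceiptNormalized) are hypothetical, and the project's real base class and return model may look different.

```python
import re
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional


@dataclass
class ReceiptSketch:
    """Hypothetical stand-in for ReceiptNormalized."""
    vendor_guess: str
    total: Optional[Decimal] = None
    validation_warnings: list[str] = field(default_factory=list)


class AcmeParser:
    """Hypothetical parser for an imaginary 'ACME SUPPLY CO' receipt layout."""

    # Strict detection: require the vendor name AND a vendor-specific marker.
    _NAME = re.compile(r"ACME\s+SUPPLY\s+CO", re.IGNORECASE)
    _MARKER = re.compile(r"ACME\s+INVOICE\s*#", re.IGNORECASE)
    # Noise tolerance: optional ':' or '.', optional '$', and either ',' or '.'
    # as the decimal separator. Anchored to line start so SUBTOTAL is skipped.
    _TOTAL = re.compile(
        r"^[ \t]*TOTAL[ \t]*[:.]?[ \t]*\$?[ \t]*(\d[\d,]*[.,]\d{2})[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )

    def detect_format(self, ocr_text: str) -> bool:
        return bool(self._NAME.search(ocr_text) and self._MARKER.search(ocr_text))

    def parse(
        self, ocr_text: str, entity: str, pdf_path: Optional[str] = None
    ) -> ReceiptSketch:
        # `entity` and `pdf_path` are accepted to match the signature described
        # above; this sketch does not use them.
        receipt = ReceiptSketch(vendor_guess="acme")
        match = self._TOTAL.search(ocr_text)
        if match is None:
            receipt.validation_warnings.append("total not found in OCR text")
            return receipt
        # Drop every separator, then treat the last two digits as cents.
        digits = re.sub(r"[.,]", "", match.group(1))
        receipt.total = Decimal(f"{digits[:-2]}.{digits[-2:]}")
        return receipt
```

Requiring two independent signals in detect_format is one way to keep a generic word on another vendor's receipt from triggering this parser; a missing total produces a warning rather than a guessed value.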

Workflow: Fix/Refine an Existing Parser

•Add a failing fixture first (new <name>_ocr.txt + <name>_expected.json).
•Make the smallest parser change that fixes the issue.
•Re-run make test-golden until green.
•Add/adjust validation_warnings when the receipt can’t be made internally consistent.
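
Because expected JSON is meant to list only the fields worth pinning, a golden check can compare the fixture as a subset of the parser output. The helper below is a minimal sketch of that idea; assert_expected_subset and the sample field names are hypothetical, and the project's actual golden tests may assert fields one by one instead.

```python
import json
from typing import Any


def assert_expected_subset(expected: Any, actual: Any, path: str = "$") -> None:
    """Assert every field in `expected` is present in `actual` with an equal value.

    Extra keys in `actual` are ignored, so the expected JSON can stay minimal.
    Lists must match in length, which also pins the line count.
    """
    if isinstance(expected, dict):
        assert isinstance(actual, dict), f"{path}: expected an object"
        for key, value in expected.items():
            assert key in actual, f"{path}.{key}: missing from parser output"
            assert_expected_subset(value, actual[key], f"{path}.{key}")
    elif isinstance(expected, list):
        assert isinstance(actual, list), f"{path}: expected a list"
        assert len(expected) == len(actual), (
            f"{path}: expected {len(expected)} items, got {len(actual)}"
        )
        for index, (exp_item, act_item) in enumerate(zip(expected, actual)):
            assert_expected_subset(exp_item, act_item, f"{path}[{index}]")
    else:
        assert expected == actual, f"{path}: expected {expected!r}, got {actual!r}"


# Illustrative data only; a real test would load <name>_expected.json from disk.
expected = json.loads('{"vendor_guess": "acme", "total": "45.67"}')
actual = {"vendor_guess": "acme", "total": "45.67", "currency": "CAD"}
assert_expected_subset(expected, actual)  # passes: "currency" is not pinned
```

The path argument makes a failure message point at the exact field that drifted, which helps when a fixture is added first and the parser fix follows.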

Debugging Playbook

•One-shot OCR + parse (local workbench):
- python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --file /path/to/receipt.pdf --entity corp
- python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --ocr-text tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt --entity corp
•OCR routing: packages/parsers/ocr/factory.py
•Textract provider: packages/parsers/ocr/provider_textract.py
•PDF embedded text extraction: packages/parsers/ocr/pdf_text_extractor.py
•Dispatcher + golden vendor list: packages/invoice_parsers/vendor_dispatcher.py
•Worker parse routing (incl. Claude Vision heuristic): services/worker/tasks/pipeline/parse.py
•Golden fixtures: tests/fixtures/golden_receipts/

References (load as needed)

•Agent prompt (Claude/Codex): references/agent-prompt.md
•Repo/pipeline map: references/pipeline-map.md