AgentSkillsCN

receipt-parser-engineer

在 curlys-books 中创建、修复并优化确定性的收据/发票解析器,包括供应商识别、OCR 文本提取路由(对于文本 PDF 使用 pdfplumber → AWS Textract 作为备用;对于图像则使用 Textract)、黄金基准的创建,以及对供应商调度器/注册表的更新。适用于新增供应商解析器、调试误检问题、改进总额/税额/日期/行项提取,或在面对尚未经过测试的供应商解析器时,决定何时依赖 Claude Vision 作为备用方案。

SKILL.md
--- frontmatter
name: receipt-parser-engineer
description: Create, fix, and refine deterministic receipt/invoice parsers in curlys-books, including vendor detection, OCR text extraction routing (pdfplumber for text PDFs → AWS Textract fallback; images → Textract), golden fixture creation, and updates to the vendor dispatcher/registry. Use when adding a new vendor parser, debugging mis-detections, improving totals/tax/date/line extraction, or deciding when to rely on Claude Vision fallback for vendors without a tested parser.

Receipt Parser Engineer

Overview

Build and maintain vendor parsers that turn extracted receipt text into ReceiptNormalized with high accuracy, backed by golden fixtures. Keep parsing deterministic where possible and treat Claude Vision as the safety net for unknown vendors.

Workflow Decision Tree

  1. Start from the file type

    • PDF: use embedded text when possible; if the PDF has little/no embedded text, OCR it.
    • Image: OCR it.
    • Email HTML/text: parse directly (no OCR).
  2. Decide whether to write/extend a deterministic parser

    • Write/refine a deterministic parser when the vendor is high-volume, has structured line items, or needs reliable tax/subtotal/total.
    • Prefer Claude Vision fallback when the vendor is low-volume, highly variable, or you lack enough samples to stabilize patterns.
  3. Choose the OCR/text-extraction path

    • Rule: pdfplumber is for text-based PDFs; AWS Textract is for images and anything pdfplumber can’t extract meaningfully from a PDF.

Core Invariants

  • Do not introduce any local OCR-binary wrapper or dependency; use Textract for OCR.
  • Keep vendor parsers deterministic: parse from ocr_text (and pdf_path only when table extraction is required).
  • Every parser change ships with a golden fixture test that reproduces the bug and prevents regressions.
  • Avoid “magic balancing” lines: prefer validation_warnings for missing/faded items rather than inventing data.

Workflow: Add a New Vendor Parser (Deterministic)

  1. Collect samples

    • Target 3–10 real receipts/invoices with known-good totals.
    • Prefer multiple layouts (thermal vs letter, refunds vs purchases, discounts/deposits).
  2. Generate OCR text for fixtures

    • Use the OCR factory (pdfplumber → Textract fallback) from the worker container:
      • docker compose exec worker python scripts/test_vendor_parsers.py /path/to/receipt.pdf
    • Save to tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt.
  3. Create expected outputs

    • Copy a known-good parse (or fill by hand) into:
      • tests/fixtures/golden_receipts/<vendor>/<name>_expected.json
    • Keep expected JSON minimal: only assert the fields you truly want stable.
  4. Implement the parser

    • Add packages/invoice_parsers/vendors/<vendor>_parser.py:
      • detect_format(ocr_text) -> bool should be strict enough to avoid false positives.
      • parse(ocr_text, entity, pdf_path=...) -> ReceiptNormalized should be resilient to OCR noise.
    • If table extraction is required, accept pdf_path and use pdfplumber inside the parser.
  5. Register the parser

    • Update packages/invoice_parsers/vendor_dispatcher.py:
      • import your parser
      • add it to the parser list (before GenericParser)
      • add the vendor key to GOLDEN_VENDORS when it has golden coverage
  6. Add golden tests

    • Update tests/unit/test_invoice_parsers.py with a new @pytest.mark.golden class for the vendor.
    • Assert at minimum: vendor_guess, totals, purchase date, invoice number (if applicable), and line count.
  7. Run tests

    • make test-golden
    • If you touch shared parsing behavior, run make test-unit too.

Workflow: Fix/Refine an Existing Parser

  1. Add a failing fixture first (new <name>_ocr.txt + <name>_expected.json).
  2. Make the smallest parser change that fixes the issue.
  3. Re-run make test-golden until green.
  4. Add/adjust validation_warnings when the receipt can’t be made internally consistent.

Debugging Playbook

  • One-shot OCR + parse (local workbench):

    • python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --file /path/to/receipt.pdf --entity corp
    • python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --ocr-text tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt --entity corp
  • OCR routing: packages/parsers/ocr/factory.py

  • Textract provider: packages/parsers/ocr/provider_textract.py

  • PDF embedded text extraction: packages/parsers/ocr/pdf_text_extractor.py

  • Dispatcher + golden vendor list: packages/invoice_parsers/vendor_dispatcher.py

  • Worker parse routing (incl. Claude Vision heuristic): services/worker/tasks/pipeline/parse.py

  • Golden fixtures: tests/fixtures/golden_receipts/

References (load as needed)

  • Agent prompt (Claude/Codex): references/agent-prompt.md
  • Repo/pipeline map: references/pipeline-map.md