Document Extraction Domain Model
Core Definition: Parsing financial statements using vision models with confidence scoring.
Data Flow
```mermaid
flowchart TB
    A[Upload PDF/Image/CSV] --> S[Store to Object Storage]
    S --> P[Create PARSING Statement]
    P --> B{File Type}
    B -->|PDF/Image| C["OpenRouter Vision Model"]
    B -->|CSV| D[Structured Parser]
    C --> E[Extract JSON]
    D --> E
    E --> F{Confidence Score}
    F -->|≥85| G[Auto-Accept]
    F -->|60-84| H[Review Queue]
    F -->|<60| I[Manual Entry]
    G --> J[(PostgreSQL)]
    H --> J
```
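The File Type branch in the flow above could be sketched as a simple dispatch on file extension. This is a minimal illustration, not the project's actual code: the function name `choose_extractor`, the returned labels, and the exact set of image extensions are assumptions, since the source only names PDF, image, and CSV uploads.

```python
from pathlib import Path

# Hypothetical extension sets; the source only says "PDF/Image" and "CSV".
VISION_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}
STRUCTURED_EXTENSIONS = {".csv"}


def choose_extractor(filename: str) -> str:
    """Return which branch of the flowchart should handle the file."""
    suffix = Path(filename).suffix.lower()
    if suffix in VISION_EXTENSIONS:
        return "vision_model"        # PDF/Image -> vision model
    if suffix in STRUCTURED_EXTENSIONS:
        return "structured_parser"   # CSV -> structured parser
    raise ValueError(f"Unsupported file type: {filename!r}")
```

Raising on unknown extensions is one possible design choice; the real service may instead validate the upload's content type before it ever reaches this step.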
Confidence Scoring
| Factor | Weight | Criteria |
|---|---|---|
| Balance Check | 40% | opening + Σtxn ≈ closing (±0.1) |
| Field Completeness | 30% | Required fields present |
| Format Consistency | 20% | Valid date/amount formats |
| Transaction Count | 10% | Reasonable (1-500) |
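As a rough sketch of how the weights in the table could combine into a single score, the example below treats each factor as pass/fail and sums the weights of the factors that pass. That all-or-nothing treatment is an assumption (the real validator may award partial credit), and the names `WEIGHTS`, `balance_check`, `transaction_count_ok`, and `confidence_score` are hypothetical.

```python
from decimal import Decimal

# Weights from the table, expressed as points out of 100.
WEIGHTS = {
    "balance_check": 40,
    "field_completeness": 30,
    "format_consistency": 20,
    "transaction_count": 10,
}


def balance_check(opening: Decimal, amounts: list[Decimal], closing: Decimal,
                  tolerance: Decimal = Decimal("0.1")) -> bool:
    """True when opening + sum(transactions) is within the tolerance of closing."""
    return abs(opening + sum(amounts, Decimal("0")) - closing) <= tolerance


def transaction_count_ok(count: int) -> bool:
    """True when the count falls in the 'reasonable' range of 1 to 500."""
    return 1 <= count <= 500


def confidence_score(checks: dict[str, bool]) -> int:
    """Sum the weights of the factors that passed (assumed all-or-nothing)."""
    return sum(weight for name, weight in WEIGHTS.items() if checks.get(name, False))


# Example: a statement whose balances reconcile but whose formats are inconsistent.
checks = {
    "balance_check": balance_check(
        Decimal("100.00"), [Decimal("-25.50"), Decimal("10.00")], Decimal("84.50")
    ),
    "field_completeness": True,
    "format_consistency": False,
    "transaction_count": transaction_count_ok(2),
}
score = confidence_score(checks)  # 40 + 30 + 10 = 80
```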
Thresholds:
- ≥85: Auto-accept
- 60-84: Review queue
- <60: Manual entry required
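The thresholds above map a score to one of three outcomes. A minimal sketch follows; the function name and the returned labels are hypothetical, and a non-integer score between 84 and 85 is assumed to fall into the review queue.

```python
def route_by_confidence(score: float) -> str:
    """Map a 0-100 confidence score to a handling outcome."""
    if score >= 85:
        return "auto_accept"
    if score >= 60:
        return "review_queue"
    return "manual_entry"
```

Checking the highest threshold first keeps each branch to a single comparison and leaves no gap between the bands.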
Supported Institutions
| Institution | Format | Tier |
|---|---|---|
| DBS/POSB | | v1 |
| CMB (China Merchants Bank) | | v1 |
| Maybank | | v1 |
| Wise | PDF/CSV | v1 |
| Brokerage (generic) | PDF/CSV | v1 |
| Insurance (generic) | | v1 |
| OCBC | | Extended |
| MariBank | | Extended |
| GXS | | Extended |
Data Integrity
To prevent floating-point errors:
- AI Output: The LLM prompt requests monetary values as numbers or strings.
- Pydantic Validation: NEVER use `float` for amount fields. MUST use `Decimal`.
- Database Storage: Stored as `DECIMAL(18,2)`.
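To illustrate the no-float rule end to end, the sketch below parses model output with `json.loads(..., parse_float=Decimal)` so that JSON numbers never pass through a binary float, then normalizes each amount to two decimal places to match `DECIMAL(18,2)`. It uses only the standard library rather than the project's Pydantic schemas. The helper names, the payload shape, and the `ROUND_HALF_UP` rounding mode are all assumptions.

```python
import json
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

CENTS = Decimal("0.01")  # two decimal places, matching DECIMAL(18,2)


def to_amount(raw: object) -> Decimal:
    """Convert a number or string from the model output to a 2-dp Decimal."""
    if isinstance(raw, float):
        # Floats are rejected outright; parse JSON with parse_float=Decimal instead.
        raise TypeError("float amounts are not allowed")
    if isinstance(raw, bool) or not isinstance(raw, (int, str, Decimal)):
        raise TypeError(f"Unsupported amount type: {type(raw).__name__}")
    try:
        value = Decimal(raw.strip()) if isinstance(raw, str) else Decimal(raw)
        if not value.is_finite():
            raise ValueError(f"Not a finite amount: {raw!r}")
        # Rounding mode is an assumption; the source does not specify one.
        return value.quantize(CENTS, rounding=ROUND_HALF_UP)
    except InvalidOperation as exc:
        raise ValueError(f"Not a valid amount: {raw!r}") from exc


def parse_amounts(payload: str) -> list[Decimal]:
    """Read amounts from a hypothetical {"transactions": [{"amount": ...}]} payload."""
    data = json.loads(payload, parse_float=Decimal)  # JSON numbers become Decimal
    return [to_amount(txn["amount"]) for txn in data["transactions"]]


amounts = parse_amounts(
    '{"transactions": [{"amount": 12.10}, {"amount": "-3.455"}, {"amount": 7}]}'
)
```

In a Pydantic model, the equivalent would be to annotate amount fields as `Decimal`; the conversion logic here only shows what that rule protects against.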
Parsing Resilience
- Bucket auto-create: storage ensures the bucket exists before upload
- Orphan cleanup: if DB persistence fails after upload, the uploaded object is deleted
- Stuck job supervisor: statements stuck in `parsing` longer than 30 minutes are marked `rejected`
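The orphan cleanup and stuck job supervisor behaviors above might look roughly like the following. This is a sketch under stated assumptions: the `Statement` dataclass, its `started_at` field, and the `upload`/`persist`/`delete` callables are hypothetical stand-ins for the real model and for the storage and database layers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

STUCK_AFTER = timedelta(minutes=30)


@dataclass
class Statement:
    id: str
    status: str           # e.g. "parsing" or "rejected"
    started_at: datetime  # hypothetical field: when parsing began


def reject_stuck(statements: list[Statement], now: datetime) -> list[str]:
    """Mark statements stuck in 'parsing' for over 30 minutes as 'rejected'."""
    rejected = []
    for statement in statements:
        if statement.status == "parsing" and now - statement.started_at > STUCK_AFTER:
            statement.status = "rejected"
            rejected.append(statement.id)
    return rejected


def upload_then_persist(upload: Callable[[], str],
                        persist: Callable[[str], None],
                        delete: Callable[[str], None]) -> str:
    """Upload a file, then persist its record; delete the object if persisting fails."""
    key = upload()
    try:
        persist(key)
    except Exception:
        delete(key)  # orphan cleanup: do not leave an object with no DB row
        raise        # re-raise so the caller still sees the original failure
    return key
```

Passing `now` in explicitly keeps the supervisor logic easy to test; a scheduled job would call it with the current UTC time.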
Source Files
- Models: `apps/backend/src/models/statement.py`
- Schemas: `apps/backend/src/schemas/extraction.py`
- Logic: `apps/backend/src/services/extraction.py`
- Validation: `apps/backend/src/services/validation.py`
- Storage: `apps/backend/src/services/storage.py`