Batch Ingestion Skill

Purpose

Process large volumes of bank statements (50+ PDFs) in manageable batches with:

•Progress tracking via TodoWrite
•User verification between batches
•Resume capability using the database ingestion_log table
•Error recovery without losing progress

When to Use

•You are processing 50+ PDFs at once
•You want to review progress between batches
•You need the ability to pause and resume
•You want checkpointing in case of errors

Prerequisites

•Database initialized (uv run python scripts/init_db.py)
•PDFs in staging folder (data/statements/staging/)
•Dashboard stopped (docker compose down)

Workflow

Phase 1: Discover and Plan

•Count PDFs in staging:

bash

ls data/statements/staging/*.pdf | wc -l

•Ask user for batch size:

•Recommend 10-15 PDFs per batch
•Smaller batches = more checkpoints
•Larger batches = faster but less granular

•Create TodoWrite plan:

python

total_pdfs = 73  # from ls count
batch_size = 15
num_batches = (total_pdfs + batch_size - 1) // batch_size

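# Each todo states the real size of its batch; the last batch may be smaller
todos = [
    {"content": f"Process batch {i+1} of {num_batches} "
                f"({min(batch_size, total_pdfs - i * batch_size)} PDFs)",
     "status": "pending",
     "activeForm": f"Processing batch {i+1}"}
    for i in range(num_batches)
]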
# Add final todo
todos.append({
    "content": "Restart dashboard after completion",
    "status": "pending",
    "activeForm": "Restarting dashboard"
})

# Use TodoWrite tool to create the plan
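The counting and batch arithmetic above can also be done in Python. This is a minimal sketch, not part of the project: `plan_batches` and `count_staged_pdfs` are hypothetical helper names, and the default path is the staging folder named in the prerequisites.

```python
from pathlib import Path


def plan_batches(total_pdfs: int, batch_size: int) -> list[int]:
    """Return the number of PDFs in each batch; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, remainder = divmod(total_pdfs, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def count_staged_pdfs(staging_dir: str = "data/statements/staging") -> int:
    """Count *.pdf files in the staging folder (0 if the folder is missing)."""
    folder = Path(staging_dir)
    if not folder.is_dir():
        return 0
    return len(list(folder.glob("*.pdf")))


sizes = plan_batches(73, 15)
print(len(sizes), sizes)  # 5 [15, 15, 15, 15, 13]
```

Returning the per-batch sizes, rather than only the batch count, makes it easy to label the final, smaller batch correctly in the todo list.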

Phase 2: Process Batches

For each batch:

•Mark batch as in_progress using TodoWrite
•Get next batch of PDFs (the ingestion skill archives processed PDFs, so the files left in staging are the ones still to do):

bash

ls data/statements/staging/*.pdf | head -n 15

•Run ingestion skill on this batch:

•Use the standard skills/ingestion/SKILL.md
•Process all PDFs in the batch
•The ingestion skill will handle PDF reading and parsing, database insertion, categorization, archiving, and error handling

•Mark batch as completed using TodoWrite
•Show progress summary:

python

from src.database.models import Database
from src.config import get_config

config = get_config()
db = Database(config["database_path"])

# Get most recent ingestion log
log = db.get_last_ingestion_log()
if log:
    print("✓ Batch completed")
    print(f"  PDFs processed: {log.pdfs_processed}")
    print(f"  Transactions added: {log.transactions_added}")
    print(f"  Status: {log.status}")

•Ask user to continue:

code

Batch X of Y completed. Continue to next batch? (yes/no)

If the user says no, stop and remind them that they can resume later.

Phase 3: Resume Capability

To resume an interrupted batch ingestion:

•Check remaining PDFs:

bash

ls data/statements/staging/*.pdf | wc -l

•Check the TodoWrite list to see which batches are pending
•Continue from where you left off by processing the remaining batches
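The two resume checks above can be cross-checked against each other. This is a minimal sketch, assuming todos are dicts with the "content" and "status" keys used in the Phase 1 plan; `resume_summary` is a hypothetical helper, not an existing function.

```python
import math


def resume_summary(todos: list[dict], remaining_pdfs: int, batch_size: int) -> dict:
    """Summarize what is left: unfinished todos and batches still needed."""
    unfinished = [t["content"] for t in todos if t["status"] != "completed"]
    batches_needed = math.ceil(remaining_pdfs / batch_size)
    return {"unfinished_todos": unfinished, "batches_needed": batches_needed}


# Example: 73 PDFs in batches of 15, interrupted after two batches,
# so 43 PDFs remain in staging.
todos = [
    {"content": f"Process batch {i + 1} of 5",
     "status": "completed" if i < 2 else "pending"}
    for i in range(5)
]
summary = resume_summary(todos, remaining_pdfs=43, batch_size=15)
print(summary["batches_needed"], len(summary["unfinished_todos"]))  # 3 3
```

If the two numbers disagree, the todo list and the staging folder have drifted apart (for instance, failed PDFs may still be sitting in staging), and the plan is worth rebuilding before continuing.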

Error Handling

If a PDF fails during batch processing:

•The ingestion skill logs the error
•Processing continues with the remaining PDFs in the batch
•Failed PDFs are summarized at the end of the batch
•The user can review and retry failed PDFs separately
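The error-isolation behavior described above can be sketched as a per-file try/except loop. This is an illustration only, not the ingestion skill's actual code: `run_batch` is a hypothetical name, and `process_pdf` stands in for whatever performs the real ingestion.

```python
from typing import Callable


def run_batch(pdfs: list[str], process_pdf: Callable[[str], None]) -> dict:
    """Process each PDF, recording failures instead of aborting the batch."""
    succeeded: list[str] = []
    failed: dict[str, str] = {}
    for pdf in pdfs:
        try:
            process_pdf(pdf)
            succeeded.append(pdf)
        except Exception as exc:  # one bad PDF must not stop the others
            failed[pdf] = str(exc)
    return {"succeeded": succeeded, "failed": failed}
```

The `failed` mapping doubles as the end-of-batch summary: each key is a PDF to retry and each value is the reason it failed.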

Progress Tracking

The ingestion skill automatically updates the ingestion_log table:

•Tracks PDFs processed
•Tracks transactions added/updated
•Records errors
•Stores summary

Query progress:

python

from src.database.models import Database

db = Database("/path/to/finance.db")

# Get all ingestion runs
conn = db.get_connection()
logs = conn.execute("""
    SELECT started_at, completed_at, status, pdfs_processed, transactions_added
    FROM ingestion_log
    ORDER BY started_at DESC
    LIMIT 10
""").fetchall()

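# Unpack each row by column name instead of indexing by position
for started_at, completed_at, status, pdfs, transactions in logs:
    print(f"{started_at}: {pdfs} PDFs, {transactions} transactions ({status})")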
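Per-run rows can also be rolled up into totals for the whole ingestion. This is a self-contained sketch against an in-memory SQLite database, with three loud assumptions: that the real database is SQLite, that the table has only the five columns queried above (the real schema may differ), and that the "completed" and "failed" status strings exist. The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ingestion_log (
        started_at TEXT, completed_at TEXT, status TEXT,
        pdfs_processed INTEGER, transactions_added INTEGER
    )
""")
# Illustrative rows only, not real results
conn.executemany(
    "INSERT INTO ingestion_log VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", "2024-01-01T10:05:00", "completed", 15, 287),
        ("2024-01-01T10:10:00", "2024-01-01T10:16:00", "completed", 15, 301),
        ("2024-01-01T10:20:00", None, "failed", 4, 52),
    ],
)


def totals_by_status(conn: sqlite3.Connection) -> list[tuple]:
    """Return (status, runs, pdfs, transactions) per status, sorted by status."""
    return conn.execute("""
        SELECT status, COUNT(*), SUM(pdfs_processed), SUM(transactions_added)
        FROM ingestion_log
        GROUP BY status
        ORDER BY status
    """).fetchall()


for status, runs, pdfs, transactions in totals_by_status(conn):
    print(f"{status}: {runs} runs, {pdfs} PDFs, {transactions} transactions")
```

Grouping by status keeps failed runs visible rather than folding them into a single grand total.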

Advantages Over Single Run

•Checkpoint progress - Resume anytime
•User control - Review between batches
•Memory management - Process in chunks
•Error isolation - One bad PDF doesn't stop everything

Example Session

code

User: "I have 73 PDFs to process. Run batch ingestion."

Claude: I'll process them in batches of up to 15 PDFs (5 batches total; the last batch has 13).

[Creates TodoWrite plan with 5 batch todos]

Processing batch 1 of 5...
[Runs ingestion skill on first 15 PDFs]
✓ Batch 1 completed: 15 PDFs, 287 transactions

Continue to batch 2?

User: yes

Processing batch 2 of 5...
[Continues...]

Notes

•Uses the standard skills/ingestion/SKILL.md for actual processing
•No complex Python scripts required
•All state tracked in database + TodoWrite
•Can stop and resume anytime
•Simpler and more maintainable than the old batch system