I wound up with a pipeline: pdftotext -> configurable regexes that capture the transactions within their respective sections (banks list credits and debits separately, without indicating the sign in the amount field) -> a BNF parser that turns transaction lines into data. A final step checks that start balance + transactions = end balance.
PITB but works well.
Over the winter I'll be standing up a local model to see whether a sophisticated prompt can reliably accomplish the same thing.
I'm not going to base any workflow involving my transaction data on hosted models.
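For anyone wanting to try the same idea, here is a rough Python sketch of the pipeline described above, not the actual code. The section headers, the line layout, and every function name are made up for illustration, and a plain regex stands in for the BNF parser. The only external piece is the pdftotext CLI from poppler-utils.

```python
import re
import subprocess
from decimal import Decimal

# Hypothetical section headers; real statements differ per bank, which is
# why these patterns would live in per-bank configuration.
SECTIONS = {
    "credit": re.compile(r"^Deposits and Credits$(.*?)^Total", re.M | re.S),
    "debit": re.compile(r"^Withdrawals and Debits$(.*?)^Total", re.M | re.S),
}

# Assumed line layout: MM/DD, description, unsigned amount.
# A regex stands in here for the BNF parser mentioned above.
LINE = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s+([\d,]+\.\d{2})$", re.M)


def pdf_to_text(path):
    """Run pdftotext with layout preserved; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_transactions(text):
    """Return (date, description, signed amount) tuples.

    The sign comes from the section a line was found in, because the
    amount field itself carries no sign.
    """
    txns = []
    for kind, section_re in SECTIONS.items():
        match = section_re.search(text)
        if match is None:
            continue
        sign = 1 if kind == "credit" else -1
        for date, desc, amount in LINE.findall(match.group(1)):
            txns.append((date, desc, sign * Decimal(amount.replace(",", ""))))
    return txns


def reconciles(start, end, txns):
    """Check that start balance + transactions = end balance."""
    return start + sum(amount for _, _, amount in txns) == end
```

Decimal is used instead of float so the balance check is exact. If reconciles() returns False, a line was most likely missed or landed in the wrong section, which is what makes the check useful.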
1. Tabula (https://tabula.technology): a free and open-source tool.
2. Parsio (https://parsio.io): uses pre-trained AI models for data extraction from PDFs, emails, and other formats.
3. Airparser (https://airparser.com): uses a GPT-based approach, similar to ChatGPT, for data extraction from PDFs, emails, and other formats.