I want to build a system that can keep a running tab of my purchases by item, price, and store. I need to find a library that can effectively scan a receipt, recognize the store (usually name, number, address and logo at the top), and differentiate each item label and its price. I plan to manually tag each item label from a store's receipt with the item's barcode the first time it is seen.
I have been googling sporadically for the past 6 months but am still unsure which OCR library or libraries I should invest my time in, or how low-level I should start. Should I grab a library like Tesseract and do my own feature extraction, or use libraries that spit out semi-structured objects with text and hope they return something similar enough across store receipts to make sense of consistently?
I'm ok with this being an extended project, but I would like some input on choosing a solid library with accurate OCR and advice on how to approach training/parsing from someone with more experience.
Other solutions and advice are also welcome++
In my opinion, Tesseract is the most sophisticated "free" OCR solution out there. The problem with Tesseract is not its recognition capabilities but rather the preprocessing steps:
- thresholding
- deskewing
- segmentation
- ...
There is a C# library (non-free) that improves recognition a lot just by providing these abilities: https://www.vintasoft.com/vsocr-dotnet-index.html
If you find a good open source solution, I would be interested, too...
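Of the preprocessing steps listed above, thresholding is the easiest to illustrate. Here is a minimal sketch of Otsu's method in plain Python, assuming the receipt image has already been loaded as rows of 0-255 grayscale integers. The function names `otsu_threshold` and `binarize` are made up for this example and are not part of Tesseract or any other library; in practice an image library would do this for you, and deskewing and segmentation would still be needed afterwards.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat iterable of 0-255 grayscale ints."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """image: list of rows of grayscale ints. Returns rows of 0 (ink) / 255 (paper)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny hypothetical "image": dark ink values next to bright paper values.
    sample = [[30, 40, 200], [35, 210, 220]]
    for row in binarize(sample):
        print(row)
```

A single global threshold like this tends to struggle with unevenly lit phone photos, which is part of why dedicated preprocessing makes such a difference.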
slides: http://slides.com/rolisz/receiptbudget#/1
code: https://github.com/rolisz/receipt_budget
research article: https://www.authorea.com/users/6050/articles/6335-a-novel-ma...
Based on that, your best bet might be https://github.com/ReceiptManager/receipt-parser-legacy, which is a Python library built on top of the Tesseract OCR engine. You can use it containerized, in Android/iOS applications, or via your own Python scripts.
To some extent, all this is solved by some modern APIs, such as what GCP or AWS offer, for doing OCR for you. But as far as I know, there is still one more challenge: interpreting the text. Inferring what each line is, what's the price for which item (some receipts have the price on the same line, some on the next line, some above) is quite hard. I tried to do it with rules (regexes and lots of ifs), but even a 95% accuracy of the OCR engine will trip you up.
You can probably frame this as an ML problem as well, but I don't think you'll find any datasets for this.
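To make the rule-based approach concrete, here is a minimal sketch of the "regexes and lots of ifs" idea, assuming the OCR step has already produced a list of text lines. It handles only two of the layouts mentioned above (price on the same line, and price alone on the following line); the price-above layout is not handled. The names `PRICE_RE` and `parse_items` and the sample lines are made up for illustration.

```python
import re
from decimal import Decimal

# A price at the end of a line, optionally followed by a single tax-flag letter.
# Accepts "." or "," as the decimal separator.
PRICE_RE = re.compile(r"(-?\d+[.,]\d{2})\s*[A-Z]?\s*$")


def parse_items(lines):
    """Pair item labels with prices from OCR'd receipt lines.

    Returns a list of (label, Decimal price) tuples.
    """
    items = []
    pending = None  # label seen on a previous line that is still waiting for a price
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = PRICE_RE.search(line)
        if m is None:
            # No price here: remember the text in case the price is on the next line.
            pending = line
            continue
        price = Decimal(m.group(1).replace(",", "."))
        label = line[:m.start()].strip()
        if label:
            items.append((label, price))    # label and price on the same line
        elif pending is not None:
            items.append((pending, price))  # price alone, label on the line before
        pending = None
    return items


if __name__ == "__main__":
    ocr_lines = [
        "ORGANIC BANANAS   1.29 F",
        "WHOLE MILK 1GAL",
        "  3,99",
        "",
        "TOTAL 5.28",
    ]
    for label, price in parse_items(ocr_lines):
        print(label, price)
```

Note that totals, tax lines, and store headers come out looking like items here, and a single misread digit or dropped decimal point silently breaks the match, which is exactly where imperfect OCR trips up rules like these.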
However, if you are looking for a project, picking one grocery store with one receipt format and generally limited/consistent product coding schemes is a reasonable thing to plug away on. Speaking personally, I did this with Whole Foods receipts for a while and was able to get to almost, kinda usable. But then the pandemic hit and I started ordering delivery, which obviates the whole receipt ingestion thing because I can get all those details directly from Amazon (modulo doing some data scraping).
Analytics on food purchases are a tremendously interesting and deeply underexplored space in which there is lots of future commercial potential.
Unless there's some pressure through government regulation to implement this, it won't happen though ... because who's least interested in customers comparing prices and having transparency in their spending? The retailers, obviously.
If you want to chat, feel free to reach out; I could talk all day about this stuff.
discl@imer - I verk 4 not-Macro-Hard. But, I have no connection to this team.
edit: this might be terribly extra for personal use.
Having said that, I am sure there must be some existing accounting software with built-in OCR? Probably even an app?
Google's MLKit is very accurate for on-device recognition. You can even feed frames straight from the camera with almost real-time results. Your bigger problem will be parsing the results and handling very inconsistent receipts.
If time is of the essence, simply use AWS Textract and be done; it has a free tier.
I used it for years to scan our bank statements (before our bank could export data).
It was the only thing I ever found that handled tabular data properly.