I have used many OCR solutions: Tesseract (4 and 5), EasyOCR, TrOCR (not document-level), DocTR and PaddlePaddle (both self-hostable on GPUs), and lastly Textract (the best of these).
Some are just about fast enough to be useful in production on long documents, but they all have one thing in common: you need to do so much preprocessing!
Why, in this day and age, do they all tend to output bare lines or words of text, leaving things like working out which text belongs to which column, or whether a bullet point starts a new sentence, entirely to the user?
I know tools like GROBID solve this for papers by correctly handling columns and so on, but for general documents it seems largely unsolved.
Are there good, maintained solutions to this? On a team I'm part of, we spent a long time on an internal solution, which works well, and the difference between raw OCR output and properly post-processed output (reconstructed formatting and other improvements) has been *night and day*.
So why don't OCR providers add post-processing steps that tidy up output for generic document formats?
PS: I haven't found GPT APIs to be great for this, because the location and size of the text are often crucial for detecting columns and subheaders.
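To be concrete about why coordinates matter: Tesseract (via pytesseract) will already give you word-level boxes, and even a crude grouping step on top of them changes the reading order completely. A minimal sketch, assuming pytesseract and Pillow are installed; the confidence filter, the midpoint column split, and the file name are illustrative only, not something I'd ship:

```python
import pytesseract
from pytesseract import Output
from PIL import Image


def words_with_boxes(path):
    """Return word-level text plus bounding boxes from Tesseract."""
    img = Image.open(path)
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    words = []
    for i, text in enumerate(data["text"]):
        # Skip empty tokens and entries Tesseract itself has no confidence in.
        if text.strip() and float(data["conf"][i]) > 0:
            words.append({
                "text": text,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
            })
    return words, img.width


def naive_two_column_order(words, page_width):
    """Crude reading order: split at the page midpoint, read each column top-down."""
    mid = page_width / 2
    left = [w for w in words if w["left"] + w["width"] / 2 < mid]
    right = [w for w in words if w["left"] + w["width"] / 2 >= mid]

    def by_position(ws):
        return sorted(ws, key=lambda w: (w["top"], w["left"]))

    return [w["text"] for w in by_position(left) + by_position(right)]


words, width = words_with_boxes("two_column_page.png")  # hypothetical input file
print(" ".join(naive_two_column_order(words, width)))
```

This is exactly the kind of geometric information that gets thrown away when you only keep the flat text, which is why plain text-in/text-out LLM calls struggle with layout.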
Some papers of relevance:
- X. Zhong, J. Tang, and A. Jimeno Yepes, "PubLayNet: largest dataset ever for document layout analysis," Aug. 2019. Preprint: https://arxiv.org/abs/1908.07836 Code/data: https://github.com/ibm-aur-nlp/PubLayNet
- B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar, "DocLayNet: a large human-annotated dataset for document-layout analysis," Aug. 2022. Available: https://developer.ibm.com/exchanges/data/all/doclaynet/
- S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, "DocFormer: end-to-end transformer for document understanding," in International Conference on Computer Vision (ICCV), 2021.
The first one is for publications. From the abstract: "...the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated".

The second is for general documents. It contains 80K manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement.
I achieved pretty good results with a few simple steps before running Tesseract (a rough sketch follows the list):
- Sauvola adaptive thresholding (there are many better algorithms today, but Sauvola is still pretty good)
- Creating histogram-based patches to analyse which parts are text and which are images (similar to JBIG2 segmentation)
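For reference, here is a minimal sketch of those two steps, assuming scikit-image, NumPy, Pillow, and pytesseract are installed; the window size, k, tile size, variance cut-off, and file name are illustrative values, not tuned defaults:

```python
import numpy as np
from PIL import Image
from skimage import color, io
from skimage.filters import threshold_sauvola
import pytesseract


def sauvola_binarize(gray, window_size=25, k=0.2):
    """Sauvola adaptive thresholding: a local threshold per pixel."""
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    return gray > thresh  # True = background, False = ink


def classify_patches(gray, tile=64, var_cutoff=0.02):
    """Very rough text/image segmentation on a tile grid.

    Text tiles tend to have a bimodal, high-contrast histogram (paper + ink),
    photographic tiles a broad mid-tone spread; the per-tile variance here is
    just a cheap stand-in for a full histogram test.
    """
    h, w = gray.shape
    labels = {}
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = gray[y:y + tile, x:x + tile]
            labels[(y, x)] = "text" if patch.var() > var_cutoff else "flat/image"
    return labels


def ocr_page(path):
    img = io.imread(path)
    # Normalize to a float grayscale image in [0, 1], dropping any alpha channel.
    gray = color.rgb2gray(img[..., :3]) if img.ndim == 3 else img.astype(float) / 255.0
    binary = sauvola_binarize(gray)
    page = Image.fromarray((binary * 255).astype(np.uint8))
    return pytesseract.image_to_string(page)


print(ocr_page("scan_001.png"))  # hypothetical input file
```

The per-tile variance is only a crude proxy for a proper histogram analysis, but even that is enough to mask out photographic regions before handing the binarized page to Tesseract.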
I once even found a paper describing an algorithm for detecting text-line slopes on geographical maps that was simple, fast, and pure genius for curved text lines, and I implemented a pixel mapper to straighten those curved lines. Unfortunately the whole project got lost somewhere on the NAS. Maybe I still have it somewhere, but Java was not the best language to implement it in :-)
However, even though I found a simple solution for some of my use cases, the whole OCR topic is hard to generalize. Algorithms that work for specific use cases in specific countries don't work for others, and it is a lot of hard work to cover all the fonts, typography, edge cases, and performance problems in one piece of software.
These days it looks like ABBYY has pivoted towards cloud services and SDKs though, with the standalone software (now called FineReader PDF) de-emphasized. I am not sure if the new versions and services still offer column separation.