Best way to perform a complex OCR task in 2023?

Question

Our desktop app needs to be able to detect ALL the text on screen while other software or web apps are open. It needs to "see the screen", to "see your software".

We naively thought this would be a piece of cake (OCR is like a 50-year-old technology, right?), but unfortunately it's not!

Anyway, we tried Tesseract alone (50% accuracy). With Tesseract plus a bunch of preprocessing techniques (grayscale, 4x upscaling, Otsu thresholding, noise reduction, binarization...) we get up to 80-90% accuracy depending on the software, but then any new preprocessing technique leads to different results depending on which software is open.

Finally, we tried OpenCV EAST and EasyOCR, but they either do nothing or are way too slow.

Google Cloud Vision seems to be able to do the job, but we do not have the money. GPU-accelerated detectors also work, but we cannot force our users to have a GPU.

Last thing: since it is text inside software that interests us, we also tried Windows UI, but unfortunately it only works with a limited number of applications.

I never knew that not being able to get all the text on screen would spell the end of my startup.

Anyway, thanks for reading this. If you've ever encountered a problem like this, let me know what you tried.
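For readers unfamiliar with the preprocessing chain mentioned above, here is a rough, dependency-free sketch of three of its steps: grayscale conversion, integer upscaling, and Otsu binarization. It is only an illustration, not the pipeline actually used in the app. The function names (`to_grayscale`, `upscale_nearest`, `otsu_threshold`, `preprocess`) are made up for this example, nearest-neighbor upscaling is an assumption (the post does not say which interpolation was used), and in practice an image library such as OpenCV would do the same work far faster.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def to_grayscale(rgb: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB rows to 8-bit luma using the common BT.601 weights."""
    return [
        [min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b))) for (r, g, b) in row]
        for row in rgb
    ]


def upscale_nearest(gray: List[List[int]], factor: int) -> List[List[int]]:
    """Enlarge the image by an integer factor by repeating pixels (assumed method)."""
    out: List[List[int]] = []
    for row in gray:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def preprocess(rgb: List[List[Pixel]], factor: int = 4) -> List[List[int]]:
    """Grayscale, upscale, then binarize: pixels above the Otsu threshold become 255."""
    gray = upscale_nearest(to_grayscale(rgb), factor)
    t = otsu_threshold(gray)
    return [[255 if v > t else 0 for v in row] for row in gray]
```

A single global threshold like this behaves differently on light and dark UI themes, which may be one reason the same preprocessing gives different results from one application to the next.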

robertknight · Accepted Answer

Other than EasyOCR and Tesseract, PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) is probably the most well-known open-source OCR solution.

What are you planning to do with the text after detecting / recognizing it? How fast does the detection / recognition need to be in order to be useful?