HACKER Q&A
📣 vivzkestrel

Recommendations for self-hostable OCR to extract code from images


- Requirements

- You are not paying per inference; you can self-host the model

- It can run inside AWS EC2

- It extracts code from images with very high accuracy

- What are some of the most accurate OCR models out there that can extract code from images?


  👤 vivzkestrel Accepted Answer ✓
- As you know, most OCR models are trained on PDFs, receipts, ordinary prose, etc.

- That training, however, does not carry over well to structured text like code

- What are some state-of-the-art, self-hostable OCR models capable of extracting code from images with very high accuracy?

- So far I have tried Tesseract, and it is not very good at this. Even if you are not familiar with any other model, perhaps you can suggest a Tesseract pipeline that I can follow to improve extraction accuracy

- Currently, my pipeline looks like this:

- For every input image, check whether it is light text on a dark background or dark text on a light background

- As you know, Tesseract is trained mostly on dark text on a light background, so I invert dark-background images before processing them with Tesseract

- Are there other steps you think I need to include?
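
For reference, the polarity check and inversion step described above could be sketched roughly like this. It is a minimal illustration, not the exact code in use: it assumes the image has already been decoded into rows of 8-bit grayscale values, and the function names, the median heuristic, and the `threshold` of 128 are all assumptions made for the example.

```python
from statistics import median

# Rows of 8-bit grayscale pixels: 0 = black, 255 = white.
Gray = list[list[int]]


def is_dark_background(pixels: Gray, threshold: int = 128) -> bool:
    """Guess the polarity of a screenshot.

    Assumption: background pixels outnumber text pixels, so the median
    brightness approximates the background colour.
    """
    flat = [p for row in pixels for p in row]
    return median(flat) < threshold


def invert(pixels: Gray) -> Gray:
    """Flip every pixel so light text on dark becomes dark text on light."""
    return [[255 - p for p in row] for row in pixels]


def normalize_polarity(pixels: Gray) -> Gray:
    """Return an image that is always dark text on a light background."""
    return invert(pixels) if is_dark_background(pixels) else pixels
```

The median is used instead of the mean so that a few bright syntax-highlighted tokens on a dark theme do not tip the decision. In a real pipeline the pixel rows would come from an image library and the normalized result would be handed to the OCR engine.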