Or is it more sensible, from an accuracy-vs-cost standpoint, to just run a transformer model like TrOCR after identifying the bounding boxes that contain text with something like CRAFT or EAST?
Short list of examples:
* Fixed-font printed text on a blank background; human handwriting set against a busy city-street background
* Converting a non-text "font" image to a text description (e.g., a collage of images forming the illusion of lettering)
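
To make the detect-then-recognize option concrete, here is a minimal sketch of that two-stage pipeline. It rests on several assumptions: `read_text_regions` and `make_trocr_recognizer` are hypothetical helper names, `detect` is a placeholder for whatever wraps CRAFT or EAST and returns `(left, top, right, bottom)` pixel boxes, the image is a PIL image, and `microsoft/trocr-base-handwritten` is just one possible checkpoint.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def read_text_regions(
    image: Any,
    detect: Callable[[Any], List[Box]],
    recognize: Callable[[Any], str],
) -> List[Tuple[Box, str]]:
    """Run a text detector, then a recognizer on each detected crop."""
    results = []
    # Sort top-to-bottom, then left-to-right, as a rough reading order.
    for box in sorted(detect(image), key=lambda b: (b[1], b[0])):
        crop = image.crop(box)  # PIL's Image.crop takes a (l, t, r, b) tuple
        results.append((box, recognize(crop)))
    return results


def make_trocr_recognizer(
    checkpoint: str = "microsoft/trocr-base-handwritten",  # assumed checkpoint
) -> Callable[[Any], str]:
    """Build a recognizer callable backed by TrOCR (loads the model once)."""
    # Imported lazily so the pipeline above can be tried without transformers.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    def recognize(crop: Any) -> str:
        # TrOCR expects an RGB image of a single text line.
        inputs = processor(images=crop.convert("RGB"), return_tensors="pt")
        ids = model.generate(inputs.pixel_values)
        return processor.batch_decode(ids, skip_special_tokens=True)[0]

    return recognize


# Usage (my_detector is a hypothetical CRAFT/EAST wrapper):
# from PIL import Image
# lines = read_text_regions(Image.open("street.jpg"), my_detector,
#                           make_trocr_recognizer())
```

In this layout the recognizer runs once per detected box, so its cost grows with the number of boxes the detector returns. The second example (a collage that only looks like lettering) may produce no boxes at all, in which case this pipeline returns an empty list.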