HACKER Q&A
📣 jbverschoor

OCR for 100 year old (German) handwritten cursive script?


I'm looking for an OCR solution for about 200 pages of text. It's handwritten German script, about 100 years old, and I can barely read the handwriting myself. Google Translate sometimes manages to OCR certain parts, but nothing useful (I don't need the translation part of GT). Which solutions out there would be able to recognize old handwritten script?


  👤 sandreas Accepted Answer ✓
You can train Tesseract to recognize handwriting[1], but the first and most important step would be preprocessing your documents. I would recommend starting with a local adaptive thresholding algorithm[2] like Sauvola for binarization. The preprocessing steps would be[3]:

  1) Binarization
  2) Skew Correction
  3) Noise Removal
  4) Thinning and Skeletonization
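
To make step 1 concrete, here is a minimal pure-Python sketch of Sauvola binarization. It is not taken from the linked pages. The function name, the list-of-rows image format, and the default `window`, `k`, and `r` values are assumptions for illustration. For real scans a library implementation such as scikit-image's `threshold_sauvola` will be far faster.

```python
from math import sqrt

def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows (values 0..255).

    Each pixel is compared against a local threshold
        T = m * (1 + k * (s / r - 1))
    where m and s are the mean and standard deviation of the window
    centred on that pixel (clipped at the image border).
    """
    h, w = len(gray), len(gray[0])
    # Integral images of the values and of the squared values,
    # padded with a leading zero row and column.
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    ii2 = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = row_sum2 = 0
        for x in range(w):
            v = gray[y][x]
            row_sum += v
            row_sum2 += v * v
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
            ii2[y + 1][x + 1] = ii2[y][x + 1] + row_sum2

    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            n = (y1 - y0) * (x1 - x0)
            s1 = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            s2 = ii2[y1][x1] - ii2[y0][x1] - ii2[y1][x0] + ii2[y0][x0]
            mean = s1 / n
            var = max(s2 / n - mean * mean, 0.0)
            threshold = mean * (1 + k * (sqrt(var) / r - 1))
            # White (255) for paper, black (0) for ink.
            out[y][x] = 255 if gray[y][x] > threshold else 0
    return out
```

Because the threshold follows the local mean, a dim or yellowed page (say, paper at 100 and ink at 30) still separates into paper and ink, whereas a fixed global cut at 128 would turn the whole page black.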
You are probably facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.

In your case (only 200 pages) it might be easier to use template matching[5] to identify similar characters and just "transliterate" matches into modern printed letters (like an overlay over the original text). This way you would have a quick solution while still being accurate enough to just read it.
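
A rough sketch of the template matching idea, in plain Python so it runs without dependencies. The function names, the list-of-rows image format, and the `min_score` cutoff are all assumptions for illustration. OpenCV's `cv2.matchTemplate` with a normalized method does the same job much faster on real scans.

```python
from math import sqrt

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two same-size patches.

    Returns a score in [-1, 1]; 1 means the patch matches the template
    up to brightness and contrast.
    """
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def find_matches(image, template, min_score=0.9):
    """Slide the template over the image and return (row, col, score)
    for every top-left position scoring at least min_score."""
    th, tw = len(template), len(template[0])
    hits = []
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(patch, template)
            if score >= min_score:
                hits.append((y, x, score))
    return hits
```

In practice you would crop one clean example of each letter as its template, run `find_matches` once per letter, and draw the modern printed letter at each hit position. Handwriting varies a lot, so expect to lower `min_score` and to keep several templates per letter.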

[1]: https://tesseract-ocr.github.io/tessdoc/#training-for-tesser...

[2]: https://brandonmpetty.github.io/Doxa/WebAssembly/

[3]: https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...

[4]: https://de.wikipedia.org/wiki/S%C3%BCtterlinschrift

[5]: https://docs.opencv.org/3.4/d4/dc6/tutorial_py_template_matc...


👤 sneed_chucker
You probably want to put it in front of an actual person and get them to transcribe it for you. I don't think there's any off-the-shelf OCR that will work particularly well for it.

I have a close family member who is a historian and frequently read and transcribed mid 19th to early 20th century German handwriting for his work.

Many historians and archivists in Germany would have the ability to transcribe this for you if you reached out to them and paid for their time.


👤 herbst
I've had surprisingly good results with Transkribus (https://readcoop.eu/transkribus/). I was going back in time with family research until I couldn't identify a single word anymore. The 'AI' could.

👤 ebbes
That's exactly what https://transkribus.ai/ was built for - works quite well in my experience, mainly transcribing Deutsche Kurrentschrift, c. 1980.

👤 huijzer
I threw some German medical handwriting images into ChatGPT a while back and asked it to transcribe them, and it worked pretty well. ChatGPT knows a lot about language, so that helped in filling in the gaps.


👤 freosam
As others have mentioned, Transkribus works pretty well for handwritten text recognition. You can also train your own model if you have enough source material.

If the documents you have are able to be made public, you could upload them to Wikimedia Commons and use https://ocr.wmcloud.org/ — you can use Transkribus via that. (Disclosure: I'm an engineer working on the Wikimedia OCR project.)


👤 robertknight
You could try something like https://aws.amazon.com/textract/ or https://cloud.google.com/vision/docs/handwriting. Both have support for modern handwriting. I don't know if it will work with a script written a century ago though.

👤 dimatura
I don't know if it will be significantly different from what Google Translate does, but I would give the OCR services of the major cloud vendors (Google, Amazon, Microsoft and I guess OpenAI/ChatGPT) a shot. It's pretty simple and cheap to do (like, about a dollar for the whole thing). Last time I compared them, Google's OCR came out ahead, but it's task-dependent, so in your case it might be different.

General-purpose open-source OCR solutions like Tesseract, TrOCR, etc. will probably not be as good as the cloud ones, based on my experience.

There's some specialized research work out there for antique manuscripts, but that will require some digging on your part with an uncertain outcome. I think at that point, I would also look into manual transcription - for 200 pages, it might be reasonably affordable.


👤 BenoitP
I worked in the same space as a company that does this with ML (and charges for it), using some form of Recurrent Neural Network IIRC. Maybe LSTMs?

They had a contract to index historical French archives composed of handwritten Latin documents in Elasticsearch.

Depending on the historical relevance of your documents (read: some academic funds), they may be able to help. Doesn't hurt to contact them:

https://teklia.com/


👤 josefritz
I've paid for manual transcription before. It's not that expensive. Technical solutions are cool, but that option is available today.

👤 rolltrunhert
A social option is to look around in ancestry study circles on Facebook for your country. I know we have one or two pretty good groups in Swedish, with mostly old folks helping younger people decipher old handwriting.

👤 AJJB_alt
GPT-4 Vision. I have seen some examples of it being tried on medieval-looking pages.

👤 lainga
Does it look like Sütterlin? Are you familiar with it?

https://en.wikipedia.org/wiki/S%C3%BCtterlin


👤 sneak
For only 200 pages, I’d farm it out to humans.

👤 weinzierl
Low-hanging fruit when reading these old German scripts is to get used to distinguishing the different forms of the letter s. That alone will get you far. Same for OCR: it needs to be capable of that. Otherwise the result will read as if someone without front teeth has written how they speak.
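
A small sketch of what handling the s forms could look like after OCR. It assumes two failure modes: the engine either emits the long s character (ſ, U+017F) or misreads it as "f". The function names and the `vocabulary` word set are hypothetical. A real setup would plug in a proper German word list.

```python
from itertools import product

LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S

def modernize_s(text):
    """Replace the long s with the round s so the text is searchable."""
    return text.replace(LONG_S, "s")

def repair_long_s(word, vocabulary):
    """Return the vocabulary words reachable from `word` by turning some
    'f' into 's', undoing a likely long-s misreading.

    `vocabulary` is a set of known lowercase words (hypothetical input).
    A word that is already known is returned unchanged.
    """
    if word in vocabulary:
        return [word]
    slots = [i for i, ch in enumerate(word) if ch == "f"]
    found = []
    # Try every combination of keeping 'f' or swapping it for 's'.
    for choice in product("fs", repeat=len(slots)):
        chars = list(word)
        for i, ch in zip(slots, choice):
            chars[i] = ch
        candidate = "".join(chars)
        if candidate in vocabulary and candidate not in found:
            found.append(candidate)
    return found
```

For example, with a vocabulary containing "wasser", the misread "waffer" comes back as `["wasser"]`. The search is exponential in the number of f's per word, which is fine for single words but is a reason to run it only on words the dictionary rejects.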

👤 jackhack
I ran a sample through my Apple Newton Messagepad: Iss Martha auf.