HACKER Q&A
📣 weitzj

Which document management system with OCR do you use?


I am looking for a way to get rid of my manual paperwork. Ideally I'd like a nice scan workflow engine where I can scan my documents, run them through OCR, manually tag them appropriately, and store them somewhere on a NAS. It would also be nice if I could integrate accounting into this solution. Since I am probably not alone with this, I am just curious what others do.


  👤 johntash Accepted Answer ✓
I've tried some tools like Mayan and Paperless, but am always disappointed with them. I wanted a super organized, auto-tagged, etc. document store. I ended up caring less each year and have settled on a directory of scans organized by date that I can search through. Occasionally I'll rename the files or move them somewhere else, but that's not too often anymore.

My current setup is:

- Scanner hooked up to a Raspberry Pi

- When I push the button on the scanner, the Pi scans the document and saves one .tif file per page to my NAS

- A script running in a k8s pod monitors that folder and performs some steps on each .tif file, like increasing the contrast, cutting off the edges, detecting blank pages, etc.

- The same script then converts those .tif files to a PDF and runs Tesseract on it for OCR

- That PDF gets uploaded to a folder in my Nextcloud instance
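
For anyone who wants to try something similar, here is a minimal sketch of the watch-and-OCR step. It is not the actual script from the setup above: the function names are made up for illustration, the image cleanup steps (contrast, cropping, blank-page detection) are left out, and it assumes the `tesseract` CLI is installed and on the PATH. It relies on Tesseract accepting a text file that lists image paths and writing a single searchable PDF when given the `pdf` config.

```python
from __future__ import annotations

import subprocess
import tempfile
from pathlib import Path


def pending_scans(scan_dir: Path, done: set[str]) -> list[Path]:
    """Return the .tif pages that have not been processed yet, in filename order."""
    return sorted(p for p in scan_dir.glob("*.tif") if p.name not in done)


def tesseract_command(page_list: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build a Tesseract call that turns a list of images into one searchable PDF.

    page_list is a text file with one image path per line; the trailing
    "pdf" config makes Tesseract write <out_base>.pdf.
    """
    return ["tesseract", str(page_list), str(out_base), "-l", lang, "pdf"]


def process_batch(pages: list[Path], out_base: Path, run=subprocess.run) -> Path:
    """OCR one batch of pages into a single PDF and return the PDF path."""
    # Write the page list to a temporary file that Tesseract can read.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(str(p) for p in pages))
        list_file = Path(f.name)
    try:
        run(tesseract_command(list_file, out_base), check=True)
    finally:
        list_file.unlink()  # always clean up the temporary list
    return Path(str(out_base) + ".pdf")
```

A polling loop would call `pending_scans` every few seconds, hand the result to `process_batch`, and add the processed filenames to `done`. Uploading the PDF to Nextcloud is a separate step that this sketch does not cover.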

It's not great, but I can either use my local file explorer to search through the OCR'd PDFs or (more slowly) search inside Nextcloud's web UI using the fulltextsearch plugin.


👤 djvdorp
There is apparently also paperless-ng: https://github.com/jonaswinkler/paperless-ng

And this slightly older FileBasedMiniDMS: https://github.com/stweiss/FileBasedMiniDMS

I tried both Mayan and Paperless (regular) myself to replace my Evernote Premium setup, but they haven't convinced me yet.

I am currently trying out https://github.com/jbarlow83/OCRmyPDF (I had to fork it to add my own language to the Dockerfile). After that I will either let my NAS index the results or maybe use Dropbox/Nextcloud. Apparently the files get indexed very well locally by GNOME (Linux), Finder (Mac), or Explorer (Windows).
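
For reference, a rough sketch of how an OCRmyPDF run could be wrapped in a script, assuming the `ocrmypdf` CLI is installed along with the Tesseract language packs you need. The helper name and the example language codes are illustrative only; `-l` takes Tesseract language codes joined with `+`, and `--skip-text` leaves pages that already contain text untouched.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def ocrmypdf_command(src: Path, dst: Path, languages: list[str]) -> list[str]:
    """Build an ocrmypdf call that adds a text layer to src and writes dst."""
    return [
        "ocrmypdf",
        "-l", "+".join(languages),  # e.g. ["eng", "deu"] becomes "eng+deu"
        "--skip-text",              # do not re-OCR pages that already have text
        str(src),
        str(dst),
    ]


def ocr_pdf(src: Path, dst: Path, languages: list[str]) -> None:
    """Run OCRmyPDF and raise CalledProcessError if it fails."""
    subprocess.run(ocrmypdf_command(src, dst, languages), check=True)
```

Once the output PDF has a text layer, the desktop indexers mentioned above can pick it up like any other searchable PDF.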


👤 maartenhendrix

👤 ankrgyl
I’m the founder of a company called Impira (https://www.impira.com/) which specifically solves this problem. We don’t scan the documents for you, but we do solve the other steps you mentioned. Feel free to sign up on our website or reach out to me (info in profile) if I can help think through a solution!

👤 rexelhoff
Evernote.

I use Swiftscan Pro on iOS to take pics of receipts or single page documents. It OCRs and pushes them into Evernote.

I use PDFPen Pro on Mac to OCR longer documents scanned in using the office scanner/printer. This is triggered when I drop a file into a monitored folder: my AppleScript launches PDFPen, performs OCR, and then imports the result into Evernote.

I have another monitored folder for PDFs that don't require OCR; those are simply imported into Evernote.

And lastly, if I get an email with a PDF attachment, I forward it to my special Evernote email address, where it's automatically imported.

The main reason I haven't moved away from Evernote is that I want access to my files on all my devices at any moment, and I want the service I use to outlive me. So far Evernote hasn't failed on either of these promises (though the latter does worry me somewhat).