How do you digitize documents?

Question

Any recommendations on scanning documents, bills, articles, etc.? Advice on scanners, software and workflows would be greatly appreciated.

simonblack · Accepted Answer

I store most digitised documents as .PDFs.

Very often you can obtain original .PDFs from companies by downloading from websites, as well as (or instead of) the paper documentation they send you.

For local scanning, I use an HP MFP. If I need to scan individual pages, I can then merge them, if necessary, with a 'merge.pdf' type of software utility.

Store the scanned/downloaded documents in some type of tree-structured directory format. This greatly reduces the time taken to find a specific document.

I keep financial documents separate from other documents. Financial documents are also segregated into separate tax-year 'trees'.
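
A layout like this can be sketched as a function that builds the destination path for each document. This is only an illustration: the folder names ("financial", "general"), the July start of the tax year and both function names are assumptions, not details from the setup described above.

```python
from datetime import date
from pathlib import PurePosixPath


def tax_year_label(d: date, start_month: int = 7) -> str:
    """Return a label such as '2019-20' for the tax year containing d.

    start_month is an assumption: 7 models a July-to-June tax year,
    1 models a tax year that matches the calendar year.
    """
    first = d.year if d.month >= start_month else d.year - 1
    if start_month == 1:
        return str(first)
    return f"{first}-{(first + 1) % 100:02d}"


def archive_path(root: str, d: date, category: str, filename: str,
                 financial: bool = False) -> PurePosixPath:
    """Build the destination path for one document (hypothetical layout)."""
    if financial:
        # Financial documents get their own tree, split by tax year.
        return PurePosixPath(root, "financial", tax_year_label(d),
                             category, filename)
    # Everything else is grouped by category, then by calendar year.
    return PurePosixPath(root, "general", category, str(d.year), filename)


print(archive_path("/docs", date(2020, 1, 13), "electricity",
                   "bill.pdf", financial=True))
```

Keeping the rule in one function means every scanned or downloaded file lands in a predictable place, which is what makes the tree quick to search later.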

Documents are backed up month by month, and also daily. The monthly backups are stored indefinitely, and separately from the daily backups, which are thinned out in reverse chronological 'exponential' order: the older a daily backup is, the larger the gap to its nearest surviving neighbour.

These are the daily backups remaining at the moment. Day 0000 was back on 23rd June 2012. Each name is the day number, then the date as YYMMDD, then the server name. Note how there are more recent backups than earlier backups:

     0000-120623nullius
     1024-150401nullius
     2048-180131centrepoint
     2304-181014centrepoint
     2560-190627centrepoint
     2688-191102centrepoint
     2720-191204centrepoint
     2736-191220centrepoint
     2752-200105centrepoint
     2756-200109centrepoint
     2758-200111centrepoint
     2759-200112centrepoint
     2760-200113centrepoint
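
The post does not show the script that does the thinning, so the following is only a guess at a rule that fits the list above. The assumption is that a daily backup survives while its age in days is less than twice the largest power of two that divides its day number, and that day 0 is never deleted. The function names are made up for the sketch.

```python
def keep(day: int, today: int) -> bool:
    """Decide whether the daily backup numbered `day` survives on `today`.

    Assumed rule: keep the backup while its age is less than twice the
    largest power of two dividing its day number. Day 0 is always kept.
    """
    if day == 0:
        return True
    largest_power_of_two = day & -day  # lowest set bit of the day number
    return today - day < 2 * largest_power_of_two


def surviving(today: int) -> list:
    """Day numbers of all daily backups still present on `today`."""
    return [day for day in range(today + 1) if keep(day, today)]


print(surviving(2760))
```

For day 2760 this rule yields the same thirteen day numbers as the list above. Because a backup's age only grows, a backup that fails the test once never passes it again, so the rule can be applied as a nightly delete pass.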

throwaway78678 · Answer

I've got a decent Brother scanner like this one: https://www.ebay.com/p/13030519316. When I scan a document, it ends up in a folder on my NAS.
I've built a small web app that lists the contents of this folder as untagged documents. Tagging a document moves it to its proper folder, and it then shows up in a tree view.
It is relatively robust and low-maintenance. At some point I might work on download + OCR scripts to fetch and auto-tag bills and the like that are already in PDF. To be honest, I'm not sure at this point whether that would really be useful.
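
The web app itself is not shown, but the core of the tagging step (list what is in the scan folder, then move a file into a folder named after its tag) could look roughly like this. The function names and the one-folder-per-tag layout are assumptions made for the sketch.

```python
import shutil
from pathlib import Path


def untagged(inbox: Path) -> list:
    """List the scanned files waiting in the inbox folder, sorted by name."""
    return sorted(p for p in inbox.iterdir() if p.is_file())


def tag_document(doc: Path, archive: Path, tag: str) -> Path:
    """Move `doc` into archive/<tag>/ and return its new location."""
    target_dir = archive / tag
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / doc.name
    if target.exists():
        # Refuse to overwrite an earlier scan that has the same file name.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(doc), str(target))
    return target
```

A tag such as "bills/electricity" creates nested folders, so the tree view can simply mirror the directory structure under the archive root.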

rfmw19 · Answer

My method was more specific to bills and finance documents. I used a generic photo scanner. It's not as automatic as the purpose-built document scanners that have automatic feeders and support multiple pages, but I wanted something that I could use for photography as well.
I coupled this with some very hacked-together Perl scripts using Tesseract OCR[1] that fed data into ledger-cli[2] for handling bills. I put other generic documents into folders by date.
It worked pretty well, and I was able to generate some pretty graphs from data that was fully reconciled with financial institutions like my bank, credit card, investments, etc., but it still took too much time. So what do I do now? Nothing!
This was years ago. I assume there is now better support from financial institutions for extracting data and this coupled with improved OCR/machine learning might make things more robust and make it worthwhile to try again.
[1] https://en.wikipedia.org/wiki/Tesseract_(software)
[2] https://www.ledger-cli.org/
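
The Perl scripts are not shown, so here is a rough Python sketch of the last step only: turning text that has already come out of OCR into a ledger-cli transaction. The two regular expressions assume a bill that prints "Total due: $84.20" and a date like "30 November 2020"; real bills will need their own patterns, and the account names are placeholders.

```python
import re
from datetime import datetime

# Assumed bill layout; adjust both patterns to the documents at hand.
AMOUNT_RE = re.compile(r"total\s+due[:\s]*\$?\s*(\d+\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"(\d{1,2} [A-Z][a-z]+ \d{4})")


def ledger_entry(ocr_text: str, payee: str, expense: str, source: str) -> str:
    """Turn the OCR text of one scanned bill into a ledger-cli transaction."""
    amount = AMOUNT_RE.search(ocr_text)
    when = DATE_RE.search(ocr_text)
    if amount is None or when is None:
        raise ValueError("could not find a total and a date in the OCR text")
    day = datetime.strptime(when.group(1), "%d %B %Y").date()
    # ledger-cli needs at least two spaces between account and amount;
    # the second posting has no amount, so ledger balances it itself.
    return (
        f"{day:%Y/%m/%d} {payee}\n"
        f"    {expense}  ${amount.group(1)}\n"
        f"    {source}\n"
    )


sample = "ACME Power\nInvoice date 30 November 2020\nTotal due: $84.20\n"
print(ledger_entry(sample, "ACME Power",
                   "Expenses:Utilities:Electricity", "Assets:Checking"))
```

Raising an error when a field is missing, rather than guessing, keeps bad OCR reads out of the ledger and leaves those bills for manual entry.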

clintonb · Answer

What's your goal? I haven't received a paper bill in years. They are already digitized. Same for most news/magazine articles. Aside from older/historical documents, nearly every piece of paper I encounter has a digital counterpart that I can access in some form.

2rsf · Answer

With bills, scan quality is secondary and indexing is more important. I scan using Microsoft Office Lens and email the result to myself, adding a few keywords to the title, e.g. "Electricity bill for November 2020".