Companies will very often let you download the original PDFs from their websites, as well as (or instead of) the paper documentation they send you.
For local scanning, I use an HP MFP. If I have to scan pages individually, I can merge them afterwards with a 'merge PDF' type of utility.
Store the scanned/downloaded documents in some type of tree-structured directory format. This greatly reduces the time taken to find a specific document.
I keep financial documents separate from other documents. Financial documents are also segregated into separate tax-year 'trees'.
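A minimal sketch of how such a layout could be computed, not the actual setup described here. The folder names (`financial`, `general`), the category argument, and the 6 April tax-year start (the UK convention) are all assumptions made for illustration; adjust `start_month` and `start_day` for other jurisdictions.

```python
from datetime import date
from pathlib import PurePosixPath


def tax_year_label(d: date, start_month: int = 4, start_day: int = 6) -> str:
    """Label the tax year containing `d`, e.g. '2019-20'.

    The 6 April default is an assumption (UK convention), not from the post.
    """
    on_or_after_start = (d.month, d.day) >= (start_month, start_day)
    start_year = d.year if on_or_after_start else d.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"


def document_path(root: str, d: date, category: str, filename: str,
                  financial: bool) -> PurePosixPath:
    """Build a path in a hypothetical tree: financial docs get a tax-year level."""
    base = PurePosixPath(root)
    if financial:
        return base / "financial" / tax_year_label(d) / category / filename
    return base / "general" / category / filename
```

For example, a bank statement dated 13 January 2020 would land under `docs/financial/2019-20/bank/`.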
Documents are backed up monthly and also daily. The monthly backups are kept indefinitely, and stored separately from the daily backups, which are thinned out 'exponentially': the older a daily backup is, the wider the gap to its nearest surviving neighbour.
These are the daily backups remaining at the moment. Each name is the day number, then the date as YYMMDD, then the server name. Day 0000 was 23rd June 2012. Note how many more recent backups there are than older ones:
0000-120623nullius
1024-150401nullius
2048-180131centrepoint
2304-181014centrepoint
2560-190627centrepoint
2688-191102centrepoint
2720-191204centrepoint
2736-191220centrepoint
2752-200105centrepoint
2756-200109centrepoint
2758-200111centrepoint
2759-200112centrepoint
2760-200113centrepoint
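One rule that happens to reproduce the day numbers in the listing above is: for every power of two, keep the two most recent days that are multiples of it. This is inferred from the numbers alone and is only a sketch; the actual backup script is not shown, and the function name `days_to_keep` is made up for illustration.

```python
def days_to_keep(today: int) -> list[int]:
    """Day numbers retained on day `today` under a power-of-two thinning rule.

    For each step 1, 2, 4, ... up to `today`, keep the most recent multiple
    of that step and the multiple just before it. Everything else is deleted.
    """
    keep = set()
    step = 1
    while step <= max(today, 1):
        latest = (today // step) * step  # most recent multiple of `step`
        keep.add(latest)
        if latest >= step:
            keep.add(latest - step)      # the multiple before it
        step *= 2
    return sorted(keep)


if __name__ == "__main__":
    print(days_to_keep(2760))
```

Under this rule the number of daily backups grows only logarithmically with the age of the archive, and day 0 is always retained.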
I've built a small webapp that reads the content of this folder as untagged documents. Tagging a document moves it to its proper folder, where it finally becomes visible in a tree view.
It is relatively robust and low maintenance. At some point I might work on download + OCR scripts to fetch and auto-tag bills and the like that are already in PDF. To be honest, I'm not sure at this point whether that would really be useful.
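The tag-then-move step could look roughly like the sketch below. This is not the webapp's code: the function name, the `library/<tag>/` layout, and the refusal to overwrite an existing file are all assumptions.

```python
import shutil
from pathlib import Path


def file_under_tag(doc: Path, tag: str, library: Path) -> Path:
    """Move an untagged document into library/<tag>/ and return its new path.

    A tag containing '/' (e.g. 'utilities/2020') creates nested folders.
    """
    target_dir = library / tag
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / doc.name
    if target.exists():
        # Assumption: never silently overwrite an already-filed document.
        raise FileExistsError(target)
    shutil.move(str(doc), str(target))
    return target
```

Because the tag maps directly to a folder, the tree view needs no separate database: the directory structure is the index.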
I coupled this with some very hacked-together Perl scripts that used Tesseract OCR[1] to feed data into ledger-cli[2] for handling bills. I put other generic documents into folders by date.
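To give an idea of the last step of such a pipeline, here is a rough sketch in Python (not the original Perl). It takes text that is assumed to have already come out of an OCR run and turns it into a ledger-cli journal entry. The DD/MM/YYYY date pattern, the "Total due" wording, the GBP commodity, and the account names are all assumptions; real bills vary a lot, which is much of why this approach takes so much time.

```python
import re
from datetime import datetime
from decimal import Decimal


def bill_to_ledger(ocr_text: str, payee: str, expense_account: str,
                   paying_account: str = "Assets:Checking") -> str:
    """Format one ledger-cli transaction from the OCR text of a bill."""
    # Assumed bill layout: a DD/MM/YYYY date and a "Total" or "Total due" line.
    date_match = re.search(r"\b(\d{2})/(\d{2})/(\d{4})\b", ocr_text)
    total_match = re.search(r"Total\s*(?:due)?\s*:?\s*(\d+\.\d{2})",
                            ocr_text, re.IGNORECASE)
    if not date_match or not total_match:
        raise ValueError("could not find a date and a total in the OCR text")

    day, month, year = date_match.groups()
    when = datetime(int(year), int(month), int(day))
    amount = Decimal(total_match.group(1))

    # The second posting has no amount; ledger infers the balancing value.
    return (
        f"{when:%Y/%m/%d} {payee}\n"
        f"    {expense_account}  {amount} GBP\n"
        f"    {paying_account}\n"
    )
```

Raising an error when a field is missing, rather than guessing, keeps bad OCR output from ending up in the journal unnoticed.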
It worked pretty well, and I was able to generate some pretty graphs from data that was fully reconciled with my financial institutions (bank, credit card, investments, etc.), but it still took too much time. So what do I do now? Nothing!
This was years ago. I assume financial institutions now offer better support for extracting data, and that, coupled with improved OCR/machine learning, might make things robust enough to be worth trying again.