The hope with digitization was that it would preserve all of these artifacts. However, AI is approaching the point where imagery and textual data can be doctored at large scale for pennies, and if data is stored in only a few places controlled by the same entities, altering it becomes trivial. Today I can obtain a scan of an old newspaper from a state library website and trust that it is authentic (because who has the time to professionally photoshop it?), but soon there will be no way to be sure, since one ill-intentioned person with access could tamper with it.
Decentralization seems like a modern solution to this problem, but I wonder if it is sustainable. To run some numbers: if we take the US as an example and consider just the Library of Congress, we would need to distribute 21 petabytes among 2,000-3,000 people with 10 TB drives each. Something like this seems feasible with a couple of thousand volunteers, $1-2 million for hard drives to start ($50-100 per TB), and then $100-200k per year for replacements (assuming a 10-year hard drive lifespan). But that's only one archive, and quite a lot of people to organize.
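A quick back-of-envelope sketch of those numbers in Python; the drive size, price per TB, and lifespan are just the assumptions stated above, not hard figures:

```python
# Back-of-envelope math for mirroring the Library of Congress among volunteers.
# All inputs below are rough assumptions from the post, not measured figures.

ARCHIVE_TB = 21_000           # ~21 PB expressed in terabytes
DRIVE_TB = 10                 # storage each volunteer contributes
PRICE_PER_TB_USD = (50, 100)  # rough retail range for hard drives
DRIVE_LIFETIME_YEARS = 10     # optimistic lifespan before replacement

volunteers = ARCHIVE_TB / DRIVE_TB
upfront = tuple(ARCHIVE_TB * p for p in PRICE_PER_TB_USD)
yearly = tuple(cost / DRIVE_LIFETIME_YEARS for cost in upfront)

print(f"volunteers needed (single copy): {volunteers:,.0f}")
print(f"upfront hardware cost: ${upfront[0]:,.0f} - ${upfront[1]:,.0f}")
print(f"replacement cost per year: ${yearly[0]:,.0f} - ${yearly[1]:,.0f}")
```

Running it gives roughly 2,100 volunteers, $1-2 million upfront, and $100-200k per year, which is where the figures above come from.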
To me, something like this seems essential to preserve some form of trust in historical truth. And now is the time to act, because otherwise there will soon always be a second thought that what we are looking at is not real.
- Is there a smart way to solve this problem more easily and cheaply?
- Or am I crazy and we should not be worried about it?
Microfilms. Maybe find a solution to make them even denser so they can hold even more data.
The information saved on them can be anything: images, text, encoded data, whatever.
Of course, it's an old technology. But it can be copied, distributed, and decentralized. Just the visual file format needs to be invented haha
The archives and libraries storing backups of our past are usually well planned by librarians and other people in the field. I don't think it's really necessary to worry about these problems :) The other thing is: what is worth archiving? What will happen with the massively generated content? How do we decide? I see a poisoning of archives in the future :)
Are Wikipedia and the Internet Archive good models?
Technically, something like a benevolent version of Web 3.0 might solve this, if it were not motivated solely by profit. IPFS and its variants are the landscape at the moment. The consideration (reward) for participation is access to the whole corpus. You need to solve the abuse problem, which means Wikipedia-like entry guards and curation of content insertion. Do you trust them? We'd need at least 10x redundancy, so imagine more like a few million people with 10TB to spare. That's not unreasonable in the next few years. But it needs to be maintained, or else swathes of data will be lost forever if not enough copies are kept.
That means you need at least a few million people who seriously give a shit, and those are getting harder to find each day.
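To put the maintenance point in numbers, here is a rough sketch using the same 21 PB example from the original post; the churn rate is purely an illustrative assumption, and every archive beyond the Library of Congress multiplies these figures:

```python
# Rough sketch: how many 10 TB volunteers a replicated archive needs,
# and how many new recruits per year just to stand still.
# The churn rate is an illustrative assumption, not a figure from the thread.

ARCHIVE_TB = 21_000   # Library of Congress, ~21 PB
DRIVE_TB = 10
REPLICATION = 10      # "at least 10x redundancy"
ANNUAL_CHURN = 0.15   # assumed fraction of volunteers who drop out each year

total_tb = ARCHIVE_TB * REPLICATION
volunteers = total_tb / DRIVE_TB
replacements_per_year = volunteers * ANNUAL_CHURN

print(f"total storage at {REPLICATION}x: {total_tb / 1000:,.0f} PB")
print(f"volunteers needed: {volunteers:,.0f}")
print(f"new volunteers needed per year just to hold steady: {replacements_per_year:,.0f}")
```

Even for this one archive that is tens of thousands of drives plus a constant stream of replacements, and the whole exercise collapses the moment recruitment falls behind churn.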
Also, with advances in compression, that could take a lot less space if you're prepared to trade off access time. Like going to the library to find and scan a newspaper: imagine making a request that takes the distributed system several minutes or hours to look up, collect, assemble, and decode all the pieces. That doesn't sit well with a "give it to me now!" culture.
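As a toy illustration of what that "look up, collect, decode" round trip involves, here is a sketch of content-addressed chunk storage and reassembly in the spirit of IPFS; the in-memory dict and the chunk size are stand-ins, not anything from a real implementation:

```python
# Toy illustration of content-addressed chunk storage and reassembly,
# i.e. the "look up, collect, decode" steps described above. The in-memory
# dict stands in for a real distributed network; the chunk size is arbitrary.
import hashlib
import zlib

CHUNK_SIZE = 64 * 1024  # 64 KiB, arbitrary for the example

def store(data: bytes, network: dict) -> list[str]:
    """Compress, split, and index chunks by their hash; return the manifest."""
    compressed = zlib.compress(data)
    manifest = []
    for i in range(0, len(compressed), CHUNK_SIZE):
        chunk = compressed[i:i + CHUNK_SIZE]
        cid = hashlib.sha256(chunk).hexdigest()
        network[cid] = chunk          # in reality: pinned on many peers
        manifest.append(cid)
    return manifest

def fetch(manifest: list[str], network: dict) -> bytes:
    """Look up every chunk by hash, verify it, reassemble, and decompress."""
    pieces = []
    for cid in manifest:
        chunk = network[cid]          # in reality: a slow lookup across peers
        assert hashlib.sha256(chunk).hexdigest() == cid, "tampered chunk"
        pieces.append(chunk)
    return zlib.decompress(b"".join(pieces))

network: dict[str, bytes] = {}
manifest = store(b"front page, 12 March 1952 ..." * 1000, network)
assert fetch(manifest, network).startswith(b"front page")
```

The hash check is the part that matters for the tampering worry in the original post: a doctored chunk no longer matches its address, so silent alteration at least becomes detectable.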
For a static solution, burying a few petabytes of long-range storage at strategic (maybe secret) locations is insurance against Nineteen Eighty-Four / Fahrenheit 451 scenarios - a library to outlive the next tin-pot "Thousand-Year Reich". But that doesn't give anyone quick/random access to disputed facts and records.
A more political/human solution is to get more people caring about history and truth, and about ad-hoc curation and preservation. That requires not just liberation of data a la Aaron Swartz and Alexandra Elbakyan, but hugely increasing the number of people who will participate in that project. At this point, belief in the preservation of historical human knowledge means fighting the law,