My questions are twofold:
First, what options do I have on the data collection side? Should we hook into something familiar like Dropbox, or are there other seamless / easy-to-manage ways of requesting a local data sync? Making a form on a submission website was our first thought, but data transfer is likely to be slow or impossible for some museums, so some kind of desktop file-watcher seems less brittle, maybe?
Second: I know I haven't given nearly enough detail, but are there existing technologies on the receiving end that could ingest disparate, non-mapped, schema-less data of all flavours? Should we look at something like Node-RED or Parabola to do some mapping, or are there other generic data-receiving technologies we should be aware of that can cope with a multitude of data types, shapes, and sizes?
I realise this isn't enough information for a complete answer but even some links / hints / ideas of what to look out for would be appreciated. Thanks, HN :-)
Traditionally, the answer to this is... Data entry. You hire people to convert the data. Because you don't have any guarantee that the data you are getting is clean, nor do you have any guarantee it is compatible, even when getting it from the same source.
Until you're at a rather large scale, it is cheaper to hire a bunch of people willing to do drudge work, than it is to try and automate this. ISPs have rounds of hiring for data entry every year, right before the end of the financial year, because it's still cheaper to hire people for data entry for two or three weeks than it is to make all their receipt handling systems automated.
Then I would index and transform it into the lowest common denominator formats, probably something like this:
For text, plaintext with header-like fields. For images and video, the most commonly readable formats.
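To illustrate, "plaintext with header-like fields" might look something like this (the field names here are made up, not a standard):

```
Title: Bronze votive figurine
Source: Example Town Museum
Original-File: figurine_scan_03.tiff
Date-Received: 2024-05-12

Free-text description and transcribed catalogue notes go here...
```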
Then I would build index database(s) on top, probably SQLite.
And on top of that, I would make a lightweight web-based system for querying and categorizing.
This way, data flows in only one direction, and it is easy to maintain transparency and clarity of process. You never lose your original data, so if there's a problem, you can go back to the step before it. And you're not relying on proprietary or poorly supported data formats.
As written, the question doesn't quite make sense to me, because the word "data" implies structure. But it does imply that you want to aggregate data from multiple sources into something coherent.
If this is not the case, then you could consider something like a wiki. Or consider it anyway as a starting point. Or maybe Elasticsearch.
With the current information as specified, the exercise seems impossible and pointless.
As you said, more information is required. What specific data would be useful to which specific individuals in those organizations or external to them and how would they use that specific type of information? Start by detailing that and then figure out how to get there.
This sounds like an exercise created by a non-technical person who was neither interested in nor able to understand the requirements. So you have to do requirements engineering, and the primary risk for the project is the incompetence of the person who gave you the vague data-aggregation task, and their ability to waste your time.
Shipping data was the easy part: if it was small enough to attach to an email, they did; otherwise they sent us a USB drive. Like you said, if the org has no in-house tech, even getting them to sync on Dropbox was more trouble than having them send a USB.
For the actual data ingestion, we did not use any existing tech. We wrote a series of API calls that would take a specific data structure and put that data in our system in a common format. We then wrote scripts that would handle different types of incoming data, whether that is CSV, SQL DBs, binary files, or really anything they chose to send us, and transform it to our needed structure. We took the burden on ourselves of reading their files or standing up DB servers and writing SQL commands to pull the data into the shape we needed, and passing it to our scripts.
While that sounds like a lot of one-off work (and it was), there are not that many formats they are going to send you - outside of CSVs, Zip files, and database archives... non-tech folk really aren't going to come up with anything too crazy. (The only outrageous formats we got were from coders.) So write a script for the first one, then adjust the code for the next. The time you spend looking for tech that will slice and dice it just the way you want is probably not all that different from the time spent just writing import scripts.
All that to say - you might want to approach this not as one large data effort, but as 3000 small efforts. Re-use code where you can, streamline the processes as you go, automate it where possible. But don't over-engineer a solution before you even do your first import. Just get rolling, and the opportunities to streamline the efforts will become evident as you progress.
After you've done and seen this a few times you want to give some avuncular advice, but people generally will have none of it. There are a few hints of it here.
Your comment about brittleness. The comment about having them send USB drives. Retaining the unaltered original data. "3000 small efforts".
Typically there won't be that many basic formats, but even different versions of the same software may treat the semantics of a format differently. (Could be a bug!)
This is not a new problem. It's been around at the network layer long enough that there is an early RFC ranting about it in terms of "heffalumps".
If this is not for profit, feel free to stalk me and reach out for that avuncular advice.
A few links if you want to look into it:
- https://en.m.wikipedia.org/wiki/Object_storage
For your second question about data ingest: unless you're enforcing standards on incoming data (for example, validating it as people try to submit and firing back if it's not right), I don't see a way to index except by a set schema. Elasticsearch would allow you to index on dynamic fields; from there you could merge data.
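If you do go the Elasticsearch route, the usual trick is to lean on dynamic mapping: index whatever fields arrive, then reconcile names afterwards. A sketch of the field-flattening step in plain Python; the indexing call via the official client is only stubbed in comments, and the index name and URL are assumptions:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted field names, roughly as Elasticsearch
    dynamic mapping sees them, e.g. {'dims': {'h': 3}} -> {'dims.h': 3}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# With the official Python client, indexing would then be roughly:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")  # assumed local instance
#   es.index(index="museum-items", document=record)
# Elasticsearch creates field mappings on the fly for keys it hasn't seen.
```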
Don't be all things to all people, you need to define scopes and phases. Think like a PM.