HACKER Q&A
📣 dmje

3k museums submitting data. How?


I'm helping with a project that will see ~3k small museums submitting their collections data to a central data lake. It can be assumed that most museums struggle with technology and have little or no in-house technical resource. The data itself varies from an XLS/CSV file to a database dump, and all flavours of thing in between. File sizes range from almost nothing up to maybe 100 MB. The files will change regularly, so the solution will need to cater for this.

My questions are twofold:

First, what options do I have on the data collection side? Should we hook into something familiar like Dropbox, or are there other seamless, easy-to-manage ways of requesting a local data sync? Making a form on a submission website was our first thought, but data transfer is likely to be slow or impossible for some museums, so some kind of desktop file-watcher seems less brittle, maybe?
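The file-watcher idea can be sketched with nothing but the standard library: poll a drop folder on a timer and upload anything new or modified. Everything here (folder names, the notion of a "snapshot") is illustrative, not a specific tool recommendation.

```python
import os

def scan(folder):
    """Return {path: mtime} for every file under folder."""
    snapshot = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            snapshot[path] = os.stat(path).st_mtime
    return snapshot

def changed_files(before, after):
    """Paths that are new, or whose modification time moved, between snapshots."""
    return [p for p, mtime in after.items() if before.get(p) != mtime]
```

A real watcher would call `scan()` every few minutes and hand whatever `changed_files()` reports to an upload step; polling like this is cruder than OS-level change notification but survives reboots and flaky networks better, which matters for low-resource sites.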

Second: I know I haven't given nearly enough information for detail, but are there existing technologies on the receiving end that could ingest disparate, non-mapped, schema-less data of all flavours? Should we look at something like Node-RED or Parabola to do some mapping, or are there other generic data-receiving technologies we should be aware of that can cope with a multitude of data types, shapes and sizes?

I realise this isn't enough information for a complete answer but even some links / hints / ideas of what to look out for would be appreciated. Thanks, HN :-)


  👤 shakna Accepted Answer ✓
> Second: I know I haven't given nearly enough information for detail, but are there existing technologies on the receiving end that could ingest disparate, non-mapped, schema-less data of all flavours? Should we look at something like Node-RED or Parabola to do some mapping, or are there other generic data-receiving technologies we should be aware of that can cope with a multitude of data types, shapes and sizes?

Traditionally, the answer to this is... Data entry. You hire people to convert the data. Because you don't have any guarantee that the data you are getting is clean, nor do you have any guarantee it is compatible, even when getting it from the same source.

Until you're at a rather large scale, it is cheaper to hire a bunch of people willing to do drudge work than it is to try to automate this. ISPs have rounds of hiring for data entry every year, right before the end of the financial year, because it's still cheaper to hire people for two or three weeks than it is to make all their receipt-handling systems automated.


👤 forgotmypw17
I would store the original data unaltered as files.

Then I would index and transform it into the lowest common denominator formats, probably something like this:

For text, plaintext with header-like fields. For images and video, the most commonly readable formats.

Then I would build index database(s) on top, probably SQLite.

And on top of that, I would make a lightweight web-based system for querying and categorizing.

This way, the data modification happens only one way, and it is easy to maintain transparency and clarity of process. You never lose your original data, so if there’s a problem, you can go back one step before the problem. And no relying on proprietary or low-support data formats.
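The layering described above (originals kept byte-for-byte, an SQLite index on top) could look roughly like this. The folder name, table layout, and museum IDs are all made up for illustration:

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path

RAW = Path("raw")  # originals are copied here and never modified

def open_index(db_path=":memory:"):
    """Create (or open) the SQLite index that sits on top of the raw files."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(museum TEXT, sha256 TEXT UNIQUE, path TEXT)"
    )
    return db

def ingest(path, museum_id, db):
    """Step 1: archive the original unaltered. Step 2: record it in the index."""
    RAW.mkdir(exist_ok=True)
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    archived = RAW / f"{museum_id}-{digest}{Path(path).suffix}"
    shutil.copy2(path, archived)
    db.execute(
        "INSERT OR IGNORE INTO files (museum, sha256, path) VALUES (?, ?, ?)",
        (museum_id, digest, str(archived)),
    )
    db.commit()
    return archived
```

Naming archived files by content hash means re-submissions of identical bytes are no-ops, and any downstream transformation can always be re-run from the untouched originals.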


👤 ilaksh
I think the only really effective existing technologies for "ingesting" disparate data are called "data scientist" and/or "software engineer".

As written, the question doesn't quite make sense to me, because the word "data" implies structure.

But it does suggest that you want to aggregate data from multiple sources into something coherent.

If this is not the case, then you could consider something like a wiki. Or consider it anyway as a starting point. Or maybe Elasticsearch.

With the current information as specified, the exercise seems impossible and pointless.

As you said, more information is required. What specific data would be useful to which specific individuals in those organizations or external to them and how would they use that specific type of information? Start by detailing that and then figure out how to get there.

This sounds like an exercise created by a non-technical person who was neither interested in the requirements nor able to understand them. So you have to do requirements engineering, and the primary risk for the project is the incompetence of the person who gave you the vague data-aggregation task, and their ability to waste your time.


👤 codingdave
I've done similar projects with smaller file sizes - pulled a few thousand organizations' worth of data together, transformed it to a single common structure, and published it online.

Shipping data was the easy part - if it is small enough to attach to an email, do so. Otherwise, send a USB drive. Like you said, if the orgs have no in-house tech, even getting them to sync on Dropbox was more trouble than having them send a USB drive.

For the actual data ingestion, we did not use any existing tech. We wrote a series of API calls that would take a specific data structure and put that data in our system in a common format. We then wrote scripts that would handle different types of incoming data, whether that is CSV, SQL DBs, binary files, or really anything they chose to send us, and transform it to our needed structure. We took the burden on ourselves of reading their files or standing up DB servers and writing SQL commands to pull the data into the shape we needed, and passing it to our scripts.

While that sounds like a lot of one-off work (and it was), there are not that many formats they are going to send you - outside of CSVs, Zip files, and database archives... non-tech folk really aren't going to come up with anything too crazy. (The only outrageous formats we got were from coders.) So write a script for the first one, then adjust the code for the next. The time you spend looking for tech that will slice and dice it just the way you want is probably not all that different than just writing import scripts.
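The "write a script for the first one, then adjust" pattern might look like this in outline: map each museum's column names onto one common structure via a small alias table. The field names and aliases below are invented for illustration; in practice the alias table is the part you'd tweak per source.

```python
import csv
import io

# Target shape every import script must produce (illustrative fields)
HEADER_ALIASES = {
    "object_id": {"object_id", "id", "accession no", "accession_no"},
    "title": {"title", "name", "object name"},
    "description": {"description", "desc", "notes"},
}

def normalize_csv(text):
    """Read one museum's CSV export and emit rows in the common structure."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for raw in reader:
        row = {}
        for field, aliases in HEADER_ALIASES.items():
            value = ""
            for key, val in raw.items():
                if key and key.strip().lower() in aliases:
                    value = (val or "").strip()
                    break
            row[field] = value
        rows.append(row)
    return rows
```

Each new submitter then costs you, at most, a few new alias entries or a new `normalize_*` function for a format you haven't seen before, rather than a new pipeline.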

All that to say - you might want to approach this not as one large data effort, but as 3000 small efforts. Re-use code where you can, streamline the processes as you go, automate it where possible. But don't over-engineer a solution before you even do your first import. Just get rolling, and the opportunities to streamline the efforts will become evident as you progress.


👤 m3047
Every unbounded attempt at collecting data in the wild (e.g. scraping) that I've seen, and I've seen a lot, ends up borrowing against future tech ("just push through, we'll solve it at scale") and is left with an increasingly unmaintainable pile of "integrations" when that advantage of scale doesn't arrive.

After you've done and seen this a few times you want to give some avuncular advice, but people generally will have none of it. There are a few hints of it here.

Your comment about brittleness. The comment about having them send USB drives. Retaining the unaltered original data. "3000 small efforts".

Typically there won't be too many basic formats, but even the version of the software may treat the semantics of that format differently. (It could be a bug!)

This is not a new problem. It's been around at the network layer long enough that there is an early RFC ranting about it in terms of "heffalumps".

If this is not for profit, feel free to stalk me and reach out for that avuncular advice.


👤 mastry
If this is a long-term (on-going) project I would consider building an app for the museums to track their collection data. The app would sync to your central storage system. You would need to support a lot of import formats, but the app would eliminate the whole data collection problem for the long-term.

👤 interactivecode
I think traditionally this is done with interns and email or shipping drives.

👤 speedgoose
You may want to look into an object storage solution with an S3 compatible API. Then you have something extremely common and relatively standard. The museums can upload using many many tools, from command line to Software libraries to graphical user interfaces.

A few links if you want to look into it:

- https://en.m.wikipedia.org/wiki/Object_storage

- https://min.io/

- https://docs.ceph.com/en/latest/radosgw/s3/

- https://aws.amazon.com/s3/

- https://cyberduck.io/s3/

- https://winscp.net/eng/docs/s3
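One low-friction variant of the S3 approach: generate presigned upload URLs server-side, so a museum (or a small helper script on their machine) can PUT a file with nothing but plain HTTP, no credentials or S3 tooling required on their end. A stdlib-only sketch of the client side; the URL is a placeholder that your server would mint:

```python
import urllib.request

def build_upload_request(presigned_url, data):
    """Build an HTTP PUT that sends file bytes to a presigned S3-style URL."""
    req = urllib.request.Request(presigned_url, data=data, method="PUT")
    req.add_header("Content-Type", "application/octet-stream")
    return req

# A real uploader would then do: urllib.request.urlopen(build_upload_request(url, data))
```

This keeps the museum-facing surface area tiny while the bucket stays private, which fits the "low or no in-house tech" constraint.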


👤 mattewong
You are in luck: my company is launching a free solution to exactly this problem. You will be able to normalize tables of data (from CSV, TXT, XLSX and more) that have the same subject matter but different formats. Variations in names, scalings, enumerations, derivations and more are handled automatically. You won't need to hand over your data, and you can choose to share the process you've created with anyone else in the world (or keep it private). Email me at info@liquidaty.com; we would love to help you with this as part of our private beta, and I'm certain it will save you a ton of pain (and anyone who comes after you whom you'd like to enable to solve this same problem in the future).

👤 iamwpj
IME people like it when they can submit something and move on with their lives. A small web-based data explorer can go a long way, along with a simple web portal that allows submission of zipped files. That might cover the first 90%; then you can deal with the last 10% that need additional help.

For your second question about data ingest: unless you're enforcing standards on the data you receive (for example, checking it as people try to submit and firing back if it's not right), I don't see a way to index except by a set schema. Elasticsearch would allow you to index on dynamic fields; from there you could merge data.
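The "check it as people submit and fire back" idea can be a very small function sitting behind the portal: reject a file with actionable messages before it ever lands in the lake. The required columns below are an invented minimum schema, purely for illustration:

```python
import csv
import io

REQUIRED_COLUMNS = {"object_id", "title"}  # illustrative minimum schema

def check_submission(csv_text):
    """Return a list of problems; an empty list means the file is accepted."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        problems.append(f"missing required columns: {sorted(missing)}")
        return problems
    # Line 1 is the header, so data rows start at line 2
    for lineno, row in enumerate(reader, start=2):
        if not (row.get("object_id") or "").strip():
            problems.append(f"line {lineno}: empty object_id")
    return problems
```

Feedback at submission time is the cheapest place to enforce a schema: the person who can fix the file is still looking at it.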


👤 edmundsauto
Have you examined common formats, or identified the key partners you need in order to succeed? If you try to solve for everyone, you will need huge amounts of resources. OTOH, if you build for a few high-profile partners, you can leverage that social proof to use their file format as a standard. I.e., get the long tail on board once they see aspirational examples, then influence their choice of file formats (or ask them to invest in data entry).

Don't be all things to all people, you need to define scopes and phases. Think like a PM.


👤 detaro
Also talk to other museums etc. You're unlikely to be the first to have that problem in this specific domain.

👤 ktpsns
Addressing the second question: you might want to look into data warehousing. There is a world to discover on how to manage and merge various databases and schemas within a single "warehouse".

👤 bhakimm
I have significant experience aggregating multiple data types / streams and would be happy to help. Shoot me an email if you're interested - happy to chat. bhakimm at wirepulse dot com

👤 faangiq
People who don’t know data, handing it off to someone who doesn’t know data.

👤 econk
Could I get access to this once it is complete?