HACKER Q&A
📣 SpeakerFrThDead

How to build data pipelines to continuously ingest customer files?


I'm working on onboarding a customer who wants to deliver their data as a weekly dump of CSVs. We’re supposed to ingest that data and get it into our system to provide analytics for their team.

I was initially thinking of just building a one-off ETL script, but I was warned the files may randomly break spec (new/renamed fields, etc.) due to errors in the process that generates them. Is there a standard way to handle this type of thing?


👤 Avalaxy Accepted Answer ✓
Tool-wise, there are many ways to do it, from simple custom scripts to more advanced solutions like a data lakehouse. It depends on your budget and needs, I suppose.

> but I was warned the files may randomly break spec (new/renamed fields, etc) due to errors in the process that generates them. Is there a standard way to handle this type of thing?

This is a political decision. I think you have two options:

- Set a clearly defined schema, and validate against that. Whenever the input doesn't match the schema, put it in a separate 'error folder' and notify someone to edit the data so it can be retried.

- Accept schema changes, either through some sort of automatic migrations or by simply not defining a schema at all (as in a document store).

IMHO, the latter option is not a good one if you want to build analytics on top of it, because you need a cleaned-up, structured version of the data that you can rely on.
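
To make the first option concrete, here is a minimal Python sketch of "validate against a fixed schema, otherwise move the file to an error folder". Everything named here is an assumption for illustration: the columns in `EXPECTED_SCHEMA`, the `ingest` and `validate_csv` helpers, and the folder layout are hypothetical, and the notification step is reduced to writing a `.errors.txt` file next to the quarantined CSV.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical schema: column name -> parser that raises on bad input.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "order_date": str,
    "amount": float,
}

def validate_csv(path, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the file matches the schema."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        if set(header) != set(schema):
            missing = sorted(set(schema) - set(header))
            extra = sorted(set(header) - set(schema))
            problems.append(f"header mismatch: missing={missing}, unexpected={extra}")
            return problems  # no point checking rows against the wrong columns
        # Row numbers assume one record per line (header is line 1).
        for line_no, row in enumerate(reader, start=2):
            for column, parse in schema.items():
                try:
                    parse(row[column])
                except (TypeError, ValueError):
                    problems.append(f"line {line_no}: bad value {row[column]!r} for {column}")
    return problems

def ingest(path, accepted_dir, error_dir):
    """Move a valid file to accepted_dir; quarantine an invalid one in error_dir."""
    path = Path(path)
    problems = validate_csv(path)
    target_dir = Path(error_dir) if problems else Path(accepted_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target_dir / path.name))
    if problems:
        # Stand-in for "notify someone": leave a report next to the bad file.
        report = target_dir / (path.name + ".errors.txt")
        report.write_text("\n".join(problems) + "\n")
    return problems
```

Once someone has fixed a quarantined file, it can be dropped back into the incoming folder and passed through `ingest` again, which is the "retry" part of the first option.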


👤 Someone
- Automate schema detection and reporting of discrepancies (to both the client and you), and make that 100% robust.

- Get a good SLA in place (for example, you can’t promise to load their data within an hour if they can essentially send random crap).

- Add a zero to what you charge them.

Alternatively, walk away from that customer.
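
As a rough illustration of the schema-discrepancy reporting in the first bullet, the sketch below compares a file's header against the expected columns and guesses at renames using `difflib` from the Python standard library. The function names, the example columns, and the `0.6` similarity cutoff are assumptions for illustration, and the rename guesses are only hints for a human to confirm.

```python
import difflib

def diff_header(expected, actual, cutoff=0.6):
    """Compare expected vs. actual column names and guess at likely renames."""
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    likely_renames = {}
    for column in missing:
        # Closest unexpected column by string similarity, if any clears the cutoff.
        match = difflib.get_close_matches(column, unexpected, n=1, cutoff=cutoff)
        if match:
            likely_renames[column] = match[0]
    return {
        "missing": missing,
        "unexpected": unexpected,
        "likely_renames": likely_renames,
    }

def format_report(filename, diff):
    """Render the diff as plain text that could be sent to both parties."""
    lines = [f"Schema report for {filename}"]
    lines.append("Missing columns: " + (", ".join(diff["missing"]) or "none"))
    lines.append("Unexpected columns: " + (", ".join(diff["unexpected"]) or "none"))
    for old, new in diff["likely_renames"].items():
        lines.append(f"Possible rename: {old} -> {new}")
    return "\n".join(lines)

# Example with hypothetical column names.
expected = ["customer_id", "order_date", "amount"]
actual = ["customer_id", "order_dt", "amount", "region"]
print(format_report("weekly_dump.csv", diff_header(expected, actual)))
```

The same report going to both the client and your own team keeps the conversation about a broken delivery grounded in the same facts.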


👤 brudgers
The standard way is to do it manually until there's enough experience to automate some of it away. Understanding the actual problem first is a direct step toward a robust solution to that problem. Building a non-trivial process is itself a process. Good luck.