I was initially thinking of just building a one-off ETL script, but I was warned that the files may randomly break spec (new or renamed fields, etc.) due to errors in the process that generates them. Is there a standard way to handle this type of thing?
> but I was warned that the files may randomly break spec (new or renamed fields, etc.) due to errors in the process that generates them. Is there a standard way to handle this type of thing?
This is a political decision as much as a technical one. I think you have two options:
- Set a clearly defined schema and validate every file against it. Whenever the input doesn't match the schema, move it to a separate 'error folder' and notify someone so the data can be fixed and retried (see the validation sketch below).
- Accept schema changes, either through some sort of auto migration or by simply not defining a schema at all (like a document store); see the auto-migration sketch below.
Imho, the no-schema option is not a good one if you want to build analytics on top of the data, because you need a cleaned-up, structured version of it that you can rely on.
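A minimal sketch of the first option, assuming the files are JSON and using the jsonschema library. The schema fields, the error-folder path, and the `notify_owner` / `load_into_warehouse` hooks are all placeholders for whatever you actually use:

```python
import json
import shutil
from pathlib import Path

from jsonschema import ValidationError, validate  # pip install jsonschema

# Placeholder schema; the real field names come from your spec.
SCHEMA = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "amount": {"type": "number"},
    },
    "required": ["customer_id", "amount"],
    "additionalProperties": False,  # new/renamed fields fail loudly
}

ERROR_DIR = Path("errors")


def notify_owner(filename: str, reason: str) -> None:
    # Stand-in for your real alerting (email, Slack, pager, ...).
    print(f"ALERT: {filename} failed validation: {reason}")


def load_into_warehouse(record: dict) -> None:
    # Stand-in for the normal ETL load path.
    print(f"loaded: {record}")


def process(path: Path) -> None:
    record = json.loads(path.read_text())
    try:
        validate(instance=record, schema=SCHEMA)
    except ValidationError as exc:
        # Quarantine the file and tell a human; retry once it's fixed.
        ERROR_DIR.mkdir(exist_ok=True)
        shutil.move(str(path), ERROR_DIR / path.name)
        notify_owner(path.name, exc.message)
        return
    load_into_warehouse(record)
```

The point is that bad files never reach the warehouse; they sit in the error folder until someone fixes and re-drops them.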
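And a rough sketch of what 'auto migrations' could look like, using sqlite3 purely for illustration (the table and field names are made up): when a record arrives with an unknown field, add a nullable column on the fly instead of rejecting the file.

```python
import sqlite3


def ensure_columns(conn: sqlite3.Connection, table: str, record: dict) -> None:
    # Columns currently on the table; PRAGMA table_info rows are
    # (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for field in record.keys() - existing:
        # Unknown field: add it as a nullable TEXT column.
        # (f-string SQL is fine for a sketch, not for untrusted input.)
        conn.execute(f'ALTER TABLE {table} ADD COLUMN "{field}" TEXT')


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id TEXT, amount REAL)")
ensure_columns(conn, "events", {"customer_id": "c1", "amount": 9.5, "region": "EU"})
print([row[1] for row in conn.execute("PRAGMA table_info(events)")])
# -> ['customer_id', 'amount', 'region']
```

Note that a *renamed* field would silently land in a fresh column while the old one goes NULL, which is exactly why I wouldn't build analytics directly on top of this.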
Whichever option you pick, also:
- Get a good SLA in place (you can't promise to load their data within an hour if they can essentially send you random crap, for example).
- Add a zero to what you charge them.
Alternatively, walk away from that customer.