HACKER Q&A
📣 rawgabbit

What Will Replace ETL?


Everywhere I have worked, we ETL data from source systems and copy it over to data warehouses, reporting platforms, message/orchestration platforms, etc. Keeping all of these platforms in sync is a lot of work and expensive. I have come to the conclusion that having multiple SaaS products, applications, and microservices is not going away. My question is: how can we reduce the amount of ETL needed?


  👤 PaulHoule Accepted Answer ✓
Last week I was showing somebody some old pitch decks from a time when I was pitching a solution for “low-code ETL” and some related things.

I built a prototype for a system that streamed JSON-like structures (implemented in RDF) through various processing “boxes” like you’d see in a tool like LabView or Alteryx or KNIME. The system used production rules to set up and tear down a reactive streaming fabric, the tear down being important if you want to get the right answers in batch mode. The system would download the data dump from
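The "boxes" idea can be sketched as a chain of generator stages, each consuming and emitting a stream of JSON-like records. This is only an illustration of the shape of such a pipeline, not the original RDF-based system; all names here are made up.

```python
# Each "box" is a generator stage over a stream of dict records.

def read_source(records):
    """Source box: emit raw records."""
    yield from records

def clean(stream):
    """Cleanup box: normalize keys, drop empty values."""
    for rec in stream:
        yield {k.strip().lower(): v for k, v in rec.items() if v is not None}

def merge(stream, extra):
    """Merge box: join each record with another data set by id."""
    for rec in stream:
        yield {**rec, **extra.get(rec.get("id"), {})}

def run_pipeline(records, extra):
    return list(merge(clean(read_source(records)), extra))

records = [{"ID ": 1, "Name": "Acme", "note": None}]
extra = {1: {"country": "US"}}
print(run_pipeline(records, extra))
# [{'id': 1, 'name': 'Acme', 'country': 'US'}]
```

Because every stage is lazy, records flow through the fabric one at a time rather than being materialized between steps.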

https://www.gleif.org/en

and do some data cleanup and indexing and merging with other data sets to make a web site for browsing that data. I made an AMI that would build a new copy of the site when it booted up and thus updated the site every day.

The theory behind it was that the data pipelines were more scalable than batch SPARQL queries and easier to create than with many other tools. The goal was a system that looks at the metadata, does some profiling, and automatically builds a draft of the data import script; the script could then be whipped into shape by applying “patches” to it, with facilities for testing the ETL and parts of it built in too.
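The profile-then-patch workflow could look something like this: infer a crude draft from sample rows, then apply explicit overrides on top. This is a hedged sketch, not the original system; the column names and the dict-based "patch" format are illustrative assumptions.

```python
# Draft an import mapping from sample rows, then patch it by hand.

def profile(rows):
    """Infer a crude type per column from sample rows (illustrative)."""
    draft = {}
    for row in rows:
        for col, val in row.items():
            kind = "int" if isinstance(val, int) else "str"
            draft.setdefault(col, kind)
    return draft

def apply_patch(draft, patch):
    """Whip the draft into shape with explicit overrides."""
    return {**draft, **patch}

rows = [{"lei": "5493001KJTIIGC8Y1R12", "year": 2024}]
draft = profile(rows)                        # {'lei': 'str', 'year': 'int'}
final = apply_patch(draft, {"year": "date"})
print(final)                                 # {'lei': 'str', 'year': 'date'}
```

Keeping the hand edits as a separate patch means the draft can be regenerated when the source metadata changes without losing the corrections.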

One bit of feedback we got from people in the data analytics space was that our system wouldn’t support columnar query processing, so it was too slow, and they wanted nothing to do with it.
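The columnar objection comes down to data layout: an analytic aggregate touches one field, so storing each column contiguously avoids scanning whole records. A toy comparison of the two layouts (purely illustrative):

```python
# Row-oriented vs column-oriented layout for a single-column aggregate.

rows = [{"id": i, "amount": i * 2, "note": "x"} for i in range(5)]

# Row-oriented: every record is visited just to pick out one field.
row_sum = sum(r["amount"] for r in rows)

# Column-oriented: the "amount" column is already a contiguous list.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_sum = sum(columns["amount"])

print(row_sum, col_sum)  # 20 20
```

Real columnar engines add compression and vectorized execution on top of this layout, which is where the large speedups for analytics come from.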

Looking back at that time period and how it worked out, I now think low-code development of applications is a better market than low-code support for analytics: the application is the thing that makes money, and the spend on analytics is almost a rounding error compared to operations.

If you want to see my decks look up my profile.


👤 e9
I think the dumbest thing in a lot of ETL is the API:

- a company collects data in a database, then creates an API on top of it
- another company uses this API to put the data into their own database

It's a waste of everyone's time. Just share the relevant database, or a relevant dump in some way; this will make everyone's life easier. Of course, this doesn't work for dynamic data.
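The "just share the dump" approach is as simple as exporting the relevant table straight to a flat file the other side can load, skipping the API hop. A minimal sketch with SQLite and CSV; the table and column names are made up for illustration:

```python
import csv
import sqlite3

# Stand-in for the producer's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

# Dump the relevant table to a file the consumer can bulk-load.
with open("orders_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "total"])
    writer.writerows(conn.execute("SELECT id, total FROM orders"))
```

The consumer bulk-loads the file instead of paging through an API, which is usually faster and simpler for both sides, as long as the data doesn't change out from under the dump.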

👤 cloths
A consistent and forward-compatible data schema across the organization should reduce the amount of work for the T part, at least. But things easily diverge and are difficult to converge. You're going to need some discipline and top-down execution.
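One discipline that makes a schema forward compatible is having consumers read only the fields they know and ignore the rest, so producers can add fields without breaking anyone. A minimal sketch of that convention (field names are illustrative):

```python
# Consumers keep known fields (with defaults) and skip unknown ones.

KNOWN_FIELDS = {"id": 0, "name": ""}

def read_record(raw):
    """Project a raw record onto the fields this consumer understands."""
    return {k: raw.get(k, default) for k, default in KNOWN_FIELDS.items()}

old = {"id": 1, "name": "Acme"}
new = {"id": 1, "name": "Acme", "region": "EU"}  # producer added a field
assert read_record(old) == read_record(new)      # consumer is unaffected
print(read_record(new))  # {'id': 1, 'name': 'Acme'}
```

Serialization formats like Avro and Protocol Buffers bake this rule in, which is a big part of why they show up wherever schemas have to evolve across teams.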