HACKER Q&A
📣 sosilkj

Is SQL a primary tool of choice for doing ETL pipelines in 2019?


Is SQL considered a primary tool of choice for ETL pipelines in 2019? Benefits/drawbacks?


  👤 specialist Accepted Answer ✓
I've used and created numerous ETL stacks. So learn from my mistakes.

First, move the code, not the data. Batch processing is for mainframes. I know, I know, this has been impossible to realize anywhere I've ever worked.

Second, less is more. If command line tools work, use 'em.

Avoid IDLs, maps, schemas, visual programming, workflow engines, event sourcing, blah blah blah. It's all useless abstractions, measured by number of indirections and stack trace depth. It's all wicked hard to debug. It's all abandoned unmaintained obfuscation layers.

Data processing (ETL) is just cutting and pasting strings. Input, processing, output. Sometimes with sanity checks. Sometimes with transformations, like munging date fields or mapping terms ("yes" to "true"). Very rarely with accumulators (aggregators) where you need some local persistent state.
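
As a rough sketch of that shape (input, sanity checks, a couple of transformations, output) - the field names here are invented, not from any real feed:

    import csv
    import sys
    from datetime import datetime

    BOOL_MAP = {"yes": "true", "no": "false"}  # term mapping ("yes" -> "true")

    def transform(row):
        # Sanity check: drop rows missing a required field.
        if not row.get("record_id"):
            return None
        # Munge a date field from MM/DD/YYYY to ISO 8601.
        row["admit_date"] = datetime.strptime(row["admit_date"], "%m/%d/%Y").date().isoformat()
        # Map terms.
        row["consented"] = BOOL_MAP.get(row["consented"].lower(), row["consented"])
        return row

    def main():
        reader = csv.DictReader(sys.stdin)
        writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            cleaned = transform(row)
            if cleaned is not None:
                writer.writerow(cleaned)

    if __name__ == "__main__":
        main()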

Third, and this is pretty rare, use better APIs for data extraction. It's all just scraping. Don't overthink it. I wish I could show the world the APIs I created for HL7 (healthcare) data. For 2.x, I created "fluent" (method-chaining) data wrappers (like a DOM) which could not blow up (they used Null Objects to prevent null pointer exceptions). For 3.x, I used a path-query thingie to drill down into those stupid XML files. This was CODE, not mappings, so it was practically a REPL, meaning fast to code, fast to debug.
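
Not the actual HL7 wrappers, but a bare-bones sketch of the Null Object + fluent-chaining idea over nested data (key names invented):

    class Node:
        """Fluent wrapper over nested dicts; missing keys yield an empty Node, never None."""

        def __init__(self, data=None):
            self._data = data or {}

        def get(self, key):
            child = self._data.get(key)
            if isinstance(child, dict):
                return Node(child)
            if child is None:
                return Node()                  # Null Object: the chain keeps working
            return Node({"__leaf__": child})

        def text(self, default=""):
            return str(self._data.get("__leaf__", default))

    # Deep access never throws, even when segments are missing.
    msg = Node({"PID": {"5": {"1": "DOE"}}})
    print(msg.get("PID").get("5").get("1").text())        # "DOE"
    print(msg.get("PID").get("99").get("1").text("N/A"))  # "N/A"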

Fourth, you control the execution. Be more like Postfix/Qmail, where each task has its own executable. Be less like J2EE or BizTalk, where you ask the runtime to control the lifecycle of your code.
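
A minimal sketch of the task-per-executable style: each stage is a small script you run and chain yourself (shell pipe, cron), rather than code handed over to a runtime:

    #!/usr/bin/env python3
    """One self-contained ETL stage: records in on stdin, transformed records out on stdout.
    You own the execution: chain stages with a pipe (extract | transform | load) and
    debug each one in isolation."""
    import json
    import sys

    def main():
        for line in sys.stdin:
            record = json.loads(line)
            record["status"] = record.get("status", "").strip().lower()
            sys.stdout.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        main()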

Good luck.


👤 davismwfl
There is no one-size-fits-all tool for good ETL processes, IMO.

We just built a pretty involved ETL process over the past couple of months that had two key components: a true batch ETL process, and a real-time synchronization between two Postgres databases on different servers. In our case, real-time basically means less than 2 seconds for this application, so nothing insane; we routinely see it complete in about 300-500ms.

We could've solved it all using just SQL, to be fair, but found that even for the batch ETL process it was better to use SQL only for the data CRUD and leave the business logic to the application code we wrote (in Node.js). This let us take advantage of caching, which reduced database server load dramatically during processing; id lookups and other known values were the core things we cached. It also let us keep our business rules in code, which is where we already have them for data validation and the like, so we could reuse existing code to maintain consistency rather than trying to replicate some pretty sophisticated validations and transforms in SQL.
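
A rough sketch of that pattern (in Python rather than the Node.js used above) - SQL only for CRUD, id lookups cached in-process, validation in application code; table and column names are invented:

    import psycopg2

    _customer_id_cache = {}  # caches external ref -> surrogate id

    def customer_id(cur, external_ref):
        # Hit the database once per key, then serve lookups from the in-process cache.
        if external_ref not in _customer_id_cache:
            cur.execute("SELECT id FROM customers WHERE external_ref = %s", (external_ref,))
            row = cur.fetchone()
            _customer_id_cache[external_ref] = row[0] if row else None
        return _customer_id_cache[external_ref]

    def load_order(cur, order):
        # Business rules live in application code, not in SQL.
        if order["amount"] <= 0:
            raise ValueError("amount must be positive")
        cur.execute(
            "INSERT INTO orders (customer_id, amount) VALUES (%s, %s)",
            (customer_id(cur, order["customer_ref"]), order["amount"]),
        )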

I have also worked on projects in the past where the entire process was done in SQL. It is doable, but I'd say it was less than ideal, and it can put significant, usually unnecessary, overhead on the DB servers.

I had one project where the entire ETL process was in MSSQL with stored procs and SSIS, spread across multiple servers. That system worked really well overall (integrity was super good and reasonably maintainable for SQL people), but again strained the DB servers a lot and was fairly slow. We moved most of that ETL process to C# at the time given it was a MS house, and the difference was dramatic. Load on the DB servers dropped and ETL that would take hours to run on SSIS with stored procs etc took minutes with C# and the utilization of memcached. In fact on that one we even wrote extensions for SQL Server using .NET and had SSIS call our external code which used the caching engine to reduce DB lookups amongst a bunch of other things. Pretty damn cool solution but definitely wasn't simple, lots of smart people worked on that project to make it super reliable and fast.


👤 dalailambda
SQL has definitely become the de facto tool for a lot of data processing. This model of working is generally referred to as ELT, as opposed to ETL.

For small/medium scale environments Fivetran/Stitch with Snowflake/BigQuery using getdbt.com for modelling is an insanely productive way to build an analytics stack. I consider this the default way of building a new data stack unless there's a very good reason not to.

For larger scales Facebook has Presto, Google has Dremel/Procella/others, and a lot of data processing is done using SQL as opposed to writing code.

The only downside really is that it tends to be fairly focussed on batch pipelines (which are fine for 95% of workloads). But even that is becoming less of an issue with Beam/Spark so you can use SQL for both batch and streaming.
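
A minimal PySpark sketch of SQL over a stream (the built-in "rate" source just emits test rows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-over-a-stream").getOrCreate()

    # Built-in test source that continuously emits (timestamp, value) rows.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    events.createOrReplaceTempView("events")

    # Plain SQL over an unbounded table; Spark runs the aggregation incrementally.
    counts = spark.sql("""
        SELECT window(timestamp, '1 minute') AS minute, COUNT(*) AS n
        FROM events
        GROUP BY window(timestamp, '1 minute')
    """)

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()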

Source: Solution Architect at an analytics consultancy.


👤 jstrebel
Coming from a Big Data (Hadoop / AWS) background, I would say that SQL is a good choice for analyzing data once it is in the data warehouse, but it is not needed to get the data there. Most of the ETL code in this environment is based on Apache Spark 2.x and either Scala or Python as programming language. On a Hortonworks HDP cluster, you would use PySpark directly; on AWS, you would use AWS Glue which allows you to set up and schedule the PySpark code. Although PySpark does not stop you from using SQL, there is a preference to write the transformations in terms of native Spark functions. So, to answer the question: no, SQL is not a primary tool of choice for implementing ETL pipelines in 2019 - PySpark is.
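
For illustration, a small sketch of "native Spark functions instead of SQL strings" (paths and column names are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Extract
    orders = spark.read.parquet("s3://example-bucket/raw/orders/")

    # Transform: expressed with DataFrame functions rather than a SQL string
    cleaned = (
        orders
        .filter(F.col("amount") > 0)
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("customer_id", "order_date")
        .agg(F.sum("amount").alias("daily_total"))
    )

    # Load
    cleaned.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")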

👤 IpV8
Generally yes. You can use GUI tools like Matillion or Alteryx to orchestrate, or you can just run SQL as commands or view materializations. With the advent of natively scalable SQL databases such as Snowflake, SQL is rapidly clawing back territory from the 'Big Data' databases. It is way more flexible to throw a ton of hardware at SQL than to pre-conform your data for specific use cases in non-relational databases. Obviously, if you're talking real-time or sensor data, SQL may not be your best strategy, but your question is really broad.

👤 fjp
I've been doing ETL-heavy work as (async) Python microservices.

The most important thing is to separate your infrastructure layer from your business logic, and have the infrastructure layer satisfy interfaces that are defined in the business logic layer.

That way, when my datastore switches from FTP to S3 or MySQL to Postgres, I can swap out the datastore client, make sure the new one satisfies the interface, and the rest of the application works just the same.
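
A bare-bones sketch of that separation (names are invented; the real interfaces would match the actual pipeline):

    from typing import Protocol

    # The business-logic layer defines the interface it needs...
    class RecordStore(Protocol):
        def fetch_batch(self, limit: int) -> list[dict]: ...
        def save(self, records: list[dict]) -> None: ...

    def run_pipeline(source: RecordStore, sink: RecordStore) -> None:
        batch = source.fetch_batch(limit=500)
        cleaned = [r for r in batch if r.get("id")]   # stand-in for real business rules
        sink.save(cleaned)

    # ...and the infrastructure layer supplies implementations (S3, FTP, Postgres, ...)
    class PostgresStore:
        def __init__(self, dsn: str) -> None:
            self.dsn = dsn

        def fetch_batch(self, limit: int) -> list[dict]:
            raise NotImplementedError  # real query goes here

        def save(self, records: list[dict]) -> None:
            raise NotImplementedError

Swapping MySQL for Postgres, or FTP for S3, then only means writing another class that satisfies RecordStore.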

The microservice's database to track and schedule runs is in Postgres, with SELECT FOR UPDATE providing an easy way to treat a Postgres table as a work queue.
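
A sketch of that work-queue pattern with psycopg2 (table and column names invented; SKIP LOCKED is a common addition on top of FOR UPDATE when several workers poll the same table):

    import psycopg2

    # Hypothetical "jobs" table: id, payload, status ('pending' | 'running' | 'done')
    def claim_next_job(conn):
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload
                FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """
            )
            row = cur.fetchone()
            if row is None:
                conn.rollback()
                return None
            job_id, payload = row
            cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (job_id,))
            conn.commit()
            return job_id, payload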


👤 thenanyu
I work at Mode, an analytics tool company. We get questions about data engineering from our customers all the time, and the solution we recommend 90% of the time is DBT.

https://www.getdbt.com/

If you're used to software development, DBT gives you a very similar workflow for managing the T and L portions of the job.


👤 eb0la
If you are doing a serious ETL job, you'll need to handle rejects - I mean, a way to identify which rows of data cannot be handled gracefully by your ETL.

Doing this in plain SQL or Spark is complicated.

This is usually a job for commercial ETL tools like Informatica, Talend, DataStage...
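
A plain-code sketch of the reject-handling idea (nothing tool-specific): rows that fail a transform get routed to a rejects output with the reason, instead of killing the run:

    def run_with_rejects(rows, transform):
        """Apply `transform` to each row; collect failures instead of aborting the batch."""
        loaded, rejects = [], []
        for i, row in enumerate(rows):
            try:
                loaded.append(transform(row))
            except Exception as exc:  # capture the reason alongside the rejected row
                rejects.append({"row_number": i, "row": row, "error": str(exc)})
        return loaded, rejects

    # Write `rejects` to a side table or file for inspection and replay.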


👤 tracker1
Personally, I'm a fan of scripts for this type of work... Node, Python, Perl, etc. That's just me though. I know a lot of people who live and breathe SSIS packages and similar.

It really depends on where you want the data and how you want to use it.


👤 vivan
As with most questions about stack choice: it depends.

👤 llampx
It depends on your sources and whether you want to do real-time or not.