First, move the code, not the data. Batch processing is for mainframes. I know, I know, this has been impossible to realize anywhere I've ever worked.
Second, less is more. If command line tools work, use 'em.
Avoid IDLs, maps, schemas, visual programming, workflow engines, event sourcing, blah blah blah. It's all useless abstraction, measured by the number of indirections and stack trace depth. It's all wicked hard to debug. It's all abandoned, unmaintained obfuscation layers.
Data processing (ETL) is just cutting and pasting strings. Input, processing, output. Sometimes with sanity checks. Sometimes with transformations, like munging date fields or mapping terms ("yes" to "true"). Very rarely with accumulators (aggregators) where you need some local persistent state.
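Roughly the shape of one of those transform steps (a sketch in TypeScript; the field names and formats are made up, not from any real system):

    // Input, processing, output: one record in, one cleaned record out.
    type RawRecord = { dob: string; optIn: string };        // hypothetical input fields
    type CleanRecord = { dateOfBirth: string; optIn: boolean };

    function transform(raw: RawRecord): CleanRecord {
      // Sanity check: reject obviously bad input early.
      if (!/^\d{2}\/\d{2}\/\d{4}$/.test(raw.dob)) {
        throw new Error(`bad date field: ${raw.dob}`);
      }
      // Munge the date field: MM/DD/YYYY -> YYYY-MM-DD.
      const [mm, dd, yyyy] = raw.dob.split("/");
      // Map terms: "yes" -> true, anything else -> false.
      return {
        dateOfBirth: `${yyyy}-${mm}-${dd}`,
        optIn: raw.optIn.trim().toLowerCase() === "yes",
      };
    }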
Third, and this is pretty rare, use better APIs for data extraction. It's all just scraping. Don't overthink it. I wish I could show the world the APIs I created for HL7 (healthcare) data. For 2.x, I created "fluent" (method-chaining) data wrappers (like a DOM) that could not blow up (Null Objects prevented null pointer exceptions). For 3.x, I used a path-query thingie to drill down into those stupid XML files. This was CODE, not mappings, so it was practically a REPL, meaning fast to code, fast to debug.
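I can't share that code, but the Null Object half of the idea is simple enough to sketch (TypeScript here, not the actual HL7 API; segment and field names are illustrative):

    // A fluent wrapper where missing pieces return an empty node instead of null,
    // so a long chain of lookups can never throw a null pointer exception.
    class Node {
      constructor(
        private readonly children: Record<string, Node> = {},
        private readonly text: string = "",
      ) {}

      // Always returns a Node; a missing key yields the shared empty Null Object.
      get(key: string): Node {
        return this.children[key] ?? EMPTY;
      }

      value(fallback = ""): string {
        return this.text || fallback;
      }
    }

    const EMPTY = new Node();

    // The whole chain is safe even when a segment or field is absent.
    const msg = new Node({ PID: new Node({ "5": new Node({}, "DOE^JOHN") }) });
    const name = msg.get("PID").get("5").value("UNKNOWN");     // "DOE^JOHN"
    const missing = msg.get("ZZZ").get("1").value("UNKNOWN");  // "UNKNOWN", no crash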
Fourth, you control the execution. Be more like Postfix/Qmail, where each task has its own executable. Be less like J2EE or BizTalk, where you ask the runtime to control the lifecycle of your code.
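In practice that can be as small as a standalone filter you schedule yourself (cron, a shell pipe, a systemd timer) instead of code handed to a runtime. A sketch for node, with a made-up field:

    // One task, one executable: read newline-delimited JSON on stdin,
    // normalize a field, write the result to stdout. You own the scheduling.
    import * as readline from "node:readline";

    const rl = readline.createInterface({ input: process.stdin });

    rl.on("line", (line) => {
      if (!line.trim()) return;
      const record = JSON.parse(line);
      record.opt_in = String(record.opt_in).toLowerCase() === "yes"; // hypothetical field
      process.stdout.write(JSON.stringify(record) + "\n");
    });

Then it composes like any other command line tool: cat input.ndjson | node normalize.js > output.ndjson.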
Good luck.
We just built a pretty involved ETL process over the past couple of months that had two key components: a true batch ETL process, and a real-time synchronization between two Postgres databases on different servers. In our case "real-time" means less than 2 seconds for this application, so nothing insane; we routinely see it working in about 300-500ms.
To be fair, we could've solved it all using just SQL, but we found it was better, even for the batch ETL process, to use SQL only for the data CRUD and leave the business logic to the application code we wrote (it's in nodejs). That let us take advantage of some caching, which reduced database server load dramatically during processing; id lookups and other known values were the core area we cut down with the cache. It also let us keep our business rules in code, which is where we already have them for data validation etc., so we could reuse existing code to maintain consistency rather than trying to replicate some pretty sophisticated validations and transforms in SQL.
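The caching part is nothing exotic; roughly this shape (sketched with node-postgres, and the table/column names are invented, not theirs):

    // Cache known, slowly-changing lookups (external code -> id) in memory so the
    // batch loop doesn't hit Postgres once per row.
    import { Pool } from "pg";

    const pool = new Pool();                     // connection settings from PG* env vars
    const idCache = new Map<string, number>();

    async function lookupCustomerId(externalCode: string): Promise<number | undefined> {
      const cached = idCache.get(externalCode);
      if (cached !== undefined) return cached;

      const res = await pool.query(
        "SELECT id FROM customers WHERE external_code = $1",
        [externalCode],
      );
      const id: number | undefined = res.rows[0]?.id;
      if (id !== undefined) idCache.set(externalCode, id);
      return id;
    }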
I've also worked on projects in the past where the entire process was done in SQL. It's doable, but I'd say it was less than ideal, and it definitely can cause significant overhead on the DB servers that's unnecessary most of the time.
I had one project where the entire ETL process was in MSSQL with stored procs and SSIS, spread across multiple servers. That system worked really well overall (integrity was super good, and it was reasonably maintainable for SQL people), but again it strained the DB servers a lot and was fairly slow. We moved most of that ETL process to C# at the time, given it was an MS house, and the difference was dramatic: load on the DB servers dropped, and ETL that would take hours to run in SSIS with stored procs took minutes with C# and memcached. In fact, on that one we even wrote extensions for SQL Server in .NET and had SSIS call our external code, which used the caching engine to reduce DB lookups, amongst a bunch of other things. Pretty damn cool solution, but it definitely wasn't simple; lots of smart people worked on that project to make it super reliable and fast.
For small/medium-scale environments, Fivetran/Stitch with Snowflake/BigQuery, using getdbt.com for modelling, is an insanely productive way to build an analytics stack. I consider this the default way of building a new data stack unless there's a very good reason not to.
For larger scales Facebook has Presto, Google has Dremel/Procella/others, and a lot of data processing is done using SQL as opposed to writing code.
The only real downside is that it tends to be fairly focused on batch pipelines (which are fine for 95% of workloads). But even that is becoming less of an issue with Beam/Spark, so you can use SQL for both batch and streaming.
Source: Solution Architect at an analytics consultancy.
The most important thing is to separate your infrastructure layer from your business logic, and have the infrastructure layer satisfy interfaces that are defined in the business logic layer.
That way, when my datastore switches from FTP to S3 or MySQL to Postgres, I can swap out the datastore client, make sure the new one satisfies the interface, and the rest of the application works just the same.
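A minimal sketch of that shape (TypeScript, names purely illustrative):

    // Business logic layer: defines the interface it needs, knows nothing about FTP or S3.
    interface DocumentStore {
      fetch(key: string): Promise<Buffer>;
      put(key: string, data: Buffer): Promise<void>;
    }

    async function importDocument(store: DocumentStore, key: string): Promise<void> {
      const raw = await store.fetch(key);
      // ...validate, transform, load...
      await store.put(`processed/${key}`, raw);
    }

    // Infrastructure layer: one client per backend, each satisfying the interface.
    class FtpStore implements DocumentStore {
      async fetch(key: string): Promise<Buffer> { /* FTP download here */ return Buffer.alloc(0); }
      async put(key: string, data: Buffer): Promise<void> { /* FTP upload here */ }
    }

    class S3Store implements DocumentStore {
      async fetch(key: string): Promise<Buffer> { /* S3 GetObject here */ return Buffer.alloc(0); }
      async put(key: string, data: Buffer): Promise<void> { /* S3 PutObject here */ }
    }

    // Swapping backends is a one-line change at the composition root:
    // await importDocument(new S3Store(), "invoices/2024-01.csv");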
The microservice's database for tracking and scheduling runs is Postgres, with SELECT FOR UPDATE providing an easy way to treat a Postgres table as a work queue.
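The pattern looks roughly like this (sketched with node-postgres, though any client works the same way; the etl_runs table and its columns are invented for the example, and SKIP LOCKED is the usual companion so multiple workers don't block on the same row):

    import { Pool } from "pg";

    const pool = new Pool();

    async function claimAndRunNextJob(): Promise<void> {
      const client = await pool.connect();
      try {
        await client.query("BEGIN");
        // Lock one pending row; other workers skip it and take the next one.
        const { rows } = await client.query(
          `SELECT id, payload FROM etl_runs
            WHERE status = 'pending'
            ORDER BY scheduled_at
            FOR UPDATE SKIP LOCKED
            LIMIT 1`,
        );
        if (rows.length === 0) {
          await client.query("COMMIT");
          return;
        }

        // ...do the actual run here; the row stays locked until COMMIT...

        await client.query("UPDATE etl_runs SET status = 'done' WHERE id = $1", [rows[0].id]);
        await client.query("COMMIT");
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      } finally {
        client.release();
      }
    }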
If you're used to software development, DBT gives you a very similar workflow for managing the T and L portions of the job.
Doing this in plain SQL or Spark is complicated.
This is usually a job for commercial ETL tools like Informatica, Talend, DataStage...
It really depends on where you want the data and how you want to use it.