The factors I'd consider when picking a language:
1. Source datasource
2. Destination datasource
3. Number and type of translations
4. Source language that created the dataset (typically more relevant if you are pulling flat files or custom data structures, but always on the radar).
6. Data size, both in terms of storage size and number of records/transactions.
7. Data location: are my source and destination co-located, or across the state, country, or world from each other?
8. Data growth: do I need to keep running this process to keep up, or is it one and done?
In general, the lower-level the language, the longer it will take to code, but it generally leads to more performant code. I usually opt for the easiest solution first and work from there, because as you try the easy one you'll hit challenges that might push your decision a specific way. Today, I'd also look at Rust if I was working with a large dataset, dealing with binary data, or needed more performance. I'd generally avoid PHP for this since it really isn't the right match, but if I had a small dataset and the data was created via PHP, I'd probably be tempted to just knock it out in PHP and be done (if it wasn't going to be around long).
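To make the binary-data point concrete, here is a minimal sketch of streaming fixed-size binary records in Rust using only the standard library. The file name records.bin and the 16-byte record layout (an 8-byte little-endian id followed by an 8-byte little-endian value) are assumptions purely for illustration, not a real format.

```rust
// Minimal sketch: stream fixed-size binary records from a file.
// Assumed (hypothetical) format: each record is 16 bytes,
// an 8-byte little-endian id followed by an 8-byte little-endian value.
use std::fs::File;
use std::io::{self, BufReader, Read};

const RECORD_SIZE: usize = 16;

fn main() -> io::Result<()> {
    // "records.bin" is a placeholder path for this example.
    let file = File::open("records.bin")?;
    let mut reader = BufReader::new(file);
    let mut buf = [0u8; RECORD_SIZE];
    let mut count: u64 = 0;

    loop {
        // read_exact returns UnexpectedEof once the file is exhausted,
        // which we treat as "no more records".
        match reader.read_exact(&mut buf) {
            Ok(()) => {
                let id = u64::from_le_bytes(buf[0..8].try_into().unwrap());
                let value = u64::from_le_bytes(buf[8..16].try_into().unwrap());
                // Transform/load work would go here; we just count records.
                let _ = (id, value);
                count += 1;
            }
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
    }

    println!("processed {count} records");
    Ok(())
}
```

Most of the performance win for a sequential pass like this comes from the buffered reader keeping syscalls down; the same shape works for much larger record layouts.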
Also, in the last 10 years or so, when I do these types of projects I almost always use a mediator cache/queue like Redis or memcached plus a queue, etc. This gives me quick common lookups, lets me process the data in steps through the queue, survive failures, and separate the code into clean stages, which is nice for a lot of reasons.
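To show what I mean by processing in steps, here is a minimal sketch of the staged-queue pattern. It uses std::sync::mpsc channels as an in-process stand-in for the Redis lists or real queue I'd actually use; the stage boundaries are the point, not the transport.

```rust
// Minimal sketch of the "process data in steps through a queue" pattern.
// Channels stand in for Redis lists / a real queue between stages.
use std::sync::mpsc;
use std::thread;

fn main() {
    // Stage 1 -> Stage 2 queue.
    let (extract_tx, extract_rx) = mpsc::channel::<String>();
    // Stage 2 -> Stage 3 queue.
    let (transform_tx, transform_rx) = mpsc::channel::<String>();

    // Stage 1: extract. Here we just fake a few source records.
    let extractor = thread::spawn(move || {
        for i in 0..5 {
            extract_tx.send(format!("record-{i}")).unwrap();
        }
        // Dropping the sender closes the queue for the next stage.
    });

    // Stage 2: transform. Reads from one queue, writes to the next.
    let transformer = thread::spawn(move || {
        for record in extract_rx {
            transform_tx.send(record.to_uppercase()).unwrap();
        }
    });

    // Stage 3: load. A real pipeline would write to the destination here.
    let loader = thread::spawn(move || {
        for record in transform_rx {
            println!("loading {record}");
        }
    });

    extractor.join().unwrap();
    transformer.join().unwrap();
    loader.join().unwrap();
}
```

Because each stage only talks to a queue, a failed stage can be fixed and restarted without re-running the whole pipeline, which is exactly what you get once the queue is durable (that's where Redis or a proper queue comes in).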
I just wrote all that and didn't really give you a direct answer, but that's because I don't think there is one correct or simple answer. I'd default to the easiest option and work toward the more complex ones based on at least these factors, though many others exist.
There are tools that do a lot of this, but they are usually database-focused, so if you are dealing with zip and text files, etc., it is hard to find COTS software that handles it cleanly.