MapReduce, the framework, is proprietary to Google, and some pipelines are still running inside Google.
MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:
- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...) -- see the sketch after this list
- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can build these on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming platform) is inspired by the general principles of MR, but the use cases and APIs are now quite different.
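To make the first point concrete, here is a minimal PySpark sketch (table and column names are made up for illustration) of the higher-level primitives users actually reach for; the engine decides how to map, shuffle, and reduce underneath:

```python
# Hypothetical example: filtering, joining, and aggregating without ever
# writing a map or reduce function by hand.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("higher-level-primitives").getOrCreate()

# Toy inputs: a table of events and a table of users (names are made up).
events = spark.createDataFrame(
    [(1, "click"), (2, "view"), (1, "click")], ["user_id", "action"])
users = spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["user_id", "name"])

clicks_per_user = (events
    .filter(F.col("action") == "click")  # "filter" instead of a hand-written map
    .join(users, "user_id")              # "join" instead of a hand-rolled shuffle join
    .groupBy("name")
    .count())                            # "count" instead of a hand-written reduce

clicks_per_user.show()
```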
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
It's going to stay because it is useful:
Any operation that you can express with associative behavior is automatically parallelizable. And in both Spark and Torch/JAX this means scalable to a cluster, with the code going to the data. This is the unfair advantage for solving bigger problems.
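A small Python sketch of that claim, using plain multiprocessing rather than any particular framework: because addition is associative, each worker can reduce its own partition independently and the partial results can then be merged.

```python
# Associativity buys parallelism: if combine(a, combine(b, c)) equals
# combine(combine(a, b), c), you can reduce each partition on its own
# and merge the partials afterwards.
from functools import reduce
from multiprocessing import Pool
from operator import add

data = list(range(1_000_000))

def reduce_chunk(chunk):
    # Each worker reduces its own partition; no coordination needed.
    return reduce(add, chunk, 0)

if __name__ == "__main__":
    n_workers = 4
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(reduce_chunk, chunks)
    # Merging per-partition results gives the same answer as one
    # sequential reduce, precisely because addition is associative.
    assert reduce(add, partials, 0) == sum(data)
```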
If you were talking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).
It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.
Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole: merging several steps into one when possible, or adding combiners (a partial reduce before the shuffle). This requires the whole processing pipeline to be defined in terms of more specific operations. Some systems use SQL to formulate the task, but it can be done with other primitives. Given such a pipeline, it is easy to implement these optimizations, which makes the whole system much more user-friendly and efficient than MapReduce, where the user has to think about all the optimizations and implement them inside individual map/reduce/(combine) operations.
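As a rough illustration of the combiner idea, here is a small PySpark RDD sketch (toy data, assuming a local Spark session): reduceByKey does a partial reduce on each partition before the shuffle, while groupByKey ships every record across the network first.

```python
# Toy comparison of map-side combining vs. shuffling everything.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combiner-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical (word, 1) pairs spread over three partitions.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)], 3)

# Map-side combine: each partition pre-sums its local keys, so only one
# small record per key per partition crosses the shuffle.
with_combiner = pairs.reduceByKey(lambda x, y: x + y)

# No combine: all five records are shuffled, then summed on the reduce side.
without_combiner = pairs.groupByKey().mapValues(sum)

print(sorted(with_combiner.collect()))     # [('a', 3), ('b', 2)]
print(sorted(without_combiner.collect()))  # same result, more shuffle traffic
```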
“I have data and I know SQL. What is it about your database that makes retrieving it better?”
Any other paradigm is going to be a niche at best, and will likely fail outright.
The daily batch log-processing jobs will last longer than Fortran. Longer than COBOL. Longer than the Earth itself.
there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and failure strategies.
even hadoop could be hard to debug when hitting a performance ceiling for challenging workloads. the streaming systems took this even further, spark being notorious for "fiddle with the knobs and pray the next job doesn't fail after a few hours, again."
i played around with the thinnest possible distributed data stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of software bureaucracy. turns out modern networks and cpus are really fast when you stop adding random layers like lasagna.
i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now, and the tradeoff for understandability is often worth the cost.
It's definitely not a dead concept; I guess it's just not sexy to talk about, though.
I think it became popular because it looked sorta like an automatic dictionary-to-multi-thread converter. But it's pretty useless unless you know how to split up and process your data.
basically, if you can cut your data up into a queue, you can MapReduce. But most pipelines are more complex than that, so you probably need a proper DAG with dependencies.
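for what it's worth, here is a toy Python sketch of that "cut your data into a queue" shape: split the input across workers, map each chunk to local word counts, then reduce by merging the partial results.

```python
# Toy word count in the map/reduce shape, no framework involved.
from collections import Counter
from multiprocessing import Pool

lines = ["to be or not to be", "that is the question", "to be is to do"]

def map_chunk(chunk):
    # Map step: local word counts for one slice of the input.
    return Counter(word for line in chunk for word in line.split())

if __name__ == "__main__":
    n_workers = 3
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(map_chunk, chunks)
    # Reduce step: merge the per-worker counters into one result.
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```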