HACKER Q&A
📣 bk146

Does (or why does) anyone use MapReduce anymore?


Excluding the Hadoop ecosystem, I see some references to MapReduce in other database and analysis tools (e.g., MATLAB). My perception was that Spark completely superseded MapReduce. Are there just different implementations of MapReduce, and the one that Hadoop implemented was replaced by Spark?


  👤 tjhunter Accepted Answer ✓
(2nd user & developer of Spark here). It depends on what you're asking.

MapReduce the framework is proprietary to Google, and some pipelines are still running inside Google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...); a sketch of the difference follows after this list

- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do them on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.
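As an illustration (not part of the original comment; the data and column names are invented), here is a minimal PySpark sketch of that shift: the same per-key count written once with low-level map/reduceByKey and once with the higher-level DataFrame API.

    # Hypothetical data; both halves compute clicks per user.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("mr-vs-dataframe").getOrCreate()

    events = spark.createDataFrame(
        [("alice", "click"), ("bob", "view"), ("alice", "click")],
        ["user", "action"],
    )

    # MapReduce-style: filter, emit (key, 1) pairs, reduce by key.
    counts_rdd = (
        events.rdd
        .filter(lambda row: row.action == "click")
        .map(lambda row: (row.user, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # Higher-level: filter + groupBy + count, left to the query optimizer.
    counts_df = (
        events
        .filter(F.col("action") == "click")
        .groupBy("user")
        .count()
    )

    print(counts_rdd.collect())
    counts_df.show()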


👤 dtoma
The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.


👤 throwaway5959
I feel like that’s kind of like saying we don’t use Assembly anymore now that we have C. We’ve just built higher level abstractions on top of it.

👤 dehrmann
At a high level, most distributed data systems look something like MapReduce, and that's really just fancy divide-and-conquer. It's hard to reason about, and most data at this size is tabular, so you're usually better off using something where you can write a SQL query and let the query engine do the low-level map-reduce work.
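For a concrete (if toy) illustration of that point, a minimal Spark SQL sketch with an invented table and columns: you write the query, and the engine plans the scan, partial aggregation, shuffle, and final aggregation for you.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-over-mapreduce").getOrCreate()

    orders = spark.createDataFrame(
        [("us", 10.0), ("us", 5.0), ("de", 7.5)],
        ["country", "amount"],
    )
    orders.createOrReplaceTempView("orders")

    # The engine compiles this into a distributed plan (roughly: scan ->
    # partial aggregate -> shuffle by country -> final aggregate).
    spark.sql("""
        SELECT country, SUM(amount) AS revenue
        FROM orders
        GROUP BY country
    """).show()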

👤 BenoitP
The concept is quite alive, and the fancy deep learning frameworks have it: jax.lax.map, jax.lax.reduce.

It's going to stay because it is useful:

Any operation that you can express with an associative operator is automatically parallelizable. And in both Spark and Torch/JAX this means scalable to a cluster, with the code going to the data. This is the unfair advantage for solving bigger problems.
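A small JAX sketch of that idea (the data and function are made up): the map is element-wise, and because the reduce operator is associative, the runtime is free to split and regroup the reduction however it likes.

    import jax
    import jax.numpy as jnp

    xs = jnp.arange(1024, dtype=jnp.float32)

    # "map" phase: apply a function to each element.
    squared = jax.lax.map(lambda x: x * x, xs)

    # "reduce" phase: fold with an associative operator (addition),
    # so the reduction can be partitioned across cores or devices.
    total = jax.lax.reduce(squared, jnp.float32(0), jax.lax.add, (0,))

    print(total)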

If you were talking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).


👤 atombender
That's my understanding. MR is very simplistic, and many problems are awkward or impossible to express in it, whereas dataflow processors like Spark and Apache Beam support creating complex DAGs from a rich set of operators for grouping, windowing, joining, etc. that you just don't have in MR. You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.
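For a flavor of the dataflow style, a minimal Apache Beam sketch in Python (invented data, default local runner): the pipeline is a small DAG of operators rather than one map phase followed by one reduce phase.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([("alice", 3), ("bob", 1), ("alice", 2)])
            | "SumPerKey" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
            | "Print" >> beam.Map(print)
        )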

👤 dikei
MapReduce was basically a very verbose/imperative way to perform a scalable, larger-than-memory aggregate-by-key operation.

It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.
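To make the "verbose" part concrete, here is a toy single-process word count written in the MapReduce shape (plain Python, no framework, made-up documents): the map, shuffle, and reduce phases each have to be spelled out by hand.

    from collections import defaultdict

    docs = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map phase: emit (key, value) pairs.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle phase: group values by key (a real framework does this
    # across the network and spills to disk).
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: fold each key's values into a single result.
    counts = {key: sum(values) for key, values in groups.items()}

    print(counts)  # {'the': 3, 'quick': 1, ...}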


👤 qoega
Now you rarely use the basic MapReduce primitives directly; you have another layer of abstraction that runs on the infrastructure that used to run MR jobs. This infrastructure efficiently allocates compute resources for "long"-running tasks in a large cluster, respecting memory/CPU/network and other constraints. The schedulers of MapReduce jobs and the cluster management tools got that good precisely because the MR methodology had trivial abstractions but required an efficient implementation to make it work seamlessly.

Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole: merging several steps into one when possible, adding combiners (a partial reduce before the shuffle), and so on. This requires the whole processing pipeline to be defined in more specific operations. Some systems propose SQL to formulate the task, but it can be done with other primitives as well. Given such a pipeline, it is easy to implement optimizations that make the whole system much more user-friendly and efficient than raw MapReduce, where the user has to think about all the optimizations and implement them inside individual map/reduce/(combine) operations.
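A toy sketch of the combiner idea (plain Python, made-up data): each mapper partition pre-aggregates its own output, so only small partial results have to cross the shuffle.

    from collections import Counter

    partitions = [
        [("a", 1), ("b", 1), ("a", 1)],   # mapper 1 output
        [("a", 1), ("c", 1), ("c", 1)],   # mapper 2 output
    ]

    # Combine phase (runs on each mapper): partial reduce per partition.
    combined = []
    for part in partitions:
        partial = Counter()
        for key, value in part:
            partial[key] += value
        combined.append(partial)

    # Shuffle + reduce phase: merge the already-small partial results.
    totals = Counter()
    for partial in combined:
        totals.update(partial)

    print(dict(totals))  # {'a': 3, 'b': 1, 'c': 2}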


👤 monero-xmr
The correct language for querying data is, as always, SQL. No one cares about the implementation details.

“I have data and I know SQL. What is it about your database that makes retrieving it better?”

Any other paradigm is going to be a niche at best, and will likely fail outright.


👤 justrealist
You have no idea how long the tail of legacy MR-based daily stat aggregation workflows is in BigCorps.

The daily batch log-processor jobs will last longer than Fortran. Longer than Cobol. Longer than Earth itself.


👤 codr7
It definitely played its role in highlighting what most functional-ish coders already knew: that map/filter/reduce is an awesome model for data processing pipelines.
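For reference, the single-machine version of that model in plain Python (made-up numbers):

    from functools import reduce

    orders = [12.5, 3.0, 47.0, 8.25]
    large = filter(lambda x: x > 10, orders)          # filter
    taxed = map(lambda x: x * 1.2, large)             # map
    total = reduce(lambda a, b: a + b, taxed, 0.0)    # reduce

    print(total)  # ~71.4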

👤 nathants
the idea of map reduce remains a good one.

there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and failure strategies.

even hadoop could be hard to debug when hitting a performance ceiling for challenging workloads. the streaming systems took this even further, spark being notorious for fiddle with knobs and pray the next job doesn’t fail after a few hours, again.

i played around with the thinnest possible distributed data stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of software bureaucracy. turns out modern network and cpu are really fast when you stop adding random layers like lasagna.

i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now, and the tradeoff for understandability is often worth the cost.

1. https://github.com/nathants/s4

2. https://github.com/nathants/bsv


👤 dijit
I feel like a lot of the underlying concepts of MapReduce live on in multi-threaded applications, even on a single machine.

It's definitely not a dead concept, I guess it's just not sexy to talk about though.
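A small single-machine sketch with the Python standard library (made-up workload): map the work across a process pool, then reduce the partial results.

    from concurrent.futures import ProcessPoolExecutor
    from functools import reduce

    def partial_sum(chunk):
        # "map" step: each worker handles its own chunk independently.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(100_000))
        chunks = [data[i::4] for i in range(4)]  # split the work four ways

        with ProcessPoolExecutor(max_workers=4) as pool:
            partials = list(pool.map(partial_sum, chunks))

        # "reduce" step: combine the per-worker results.
        total = reduce(lambda a, b: a + b, partials)
        print(total)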


👤 KaiserPro
From what I recall, MapReduce was basically a weird representation of a DAG-based pipeline.

I think it became popular because it looked sort of like an automatic dictionary-to-multi-thread converter. But it's pretty useless unless you know how to split up and process your data.

Basically, if you can cut your data up into a queue, you can MapReduce. But most pipelines are more complex than that, so you probably need a proper DAG with dependencies.


👤 Dopameaner
The paradigm is very much alive. Look at query_then_fetch in Elasticsearch.