MapReduce?
A friend recently asked me to explain her the concept behind MapReduce. The last time we saw each other, I had no time to do that, so I now use my blog for this purpose.
Map and reduce are particular functions that can be found we in numerous functional languages (such as python, ocaml, etc.).
Map is a function that takes as input an other function, let’s say $latex f$ and a data structure (most of the time a sequence), that calls the function $latex f$ on each of the structure’s items and returns a list of the return values. Here is an example: suppose that I have a sequence $latex S=[1,5,4,2,6,7]$ and that I want to compute the square of each of the integers in this sequence using the shortest line of code possible. I use map for this purpose and call $latex map(x^2,S)$, map then applies $latex x^2$ on each item of the sequence $latex S$ and outputs $latex [1,25,16,4,36,49]$.
Reduce is a function that also takes as input an other function, let’s say $latex f$ and a data structure (most of the time a sequence), but that calls the function $latex f$ not on each item of the structure, but on the whole structure. For instance $latex map(+,S)$ outputs the sum of the integers in $latex S$, e.g. $latex 25$.
I can now explain what is the concept of the famous Google’s MapReduce.
This is a computation framework for large distributable data sets. Suppose you have a very large data set (let’s say an entire crawl of the web), and that you want to apply on it a very quickly a time consuming algorithm using a lot of computers. The first step is to take the input (in our case this huge set of web pages) and to cut it into small problems that can be solved independently by your large set of computers. Each of this computer has the goal to solve one of these small problems, then to send the result back to one specific computer (the master). This first step is the map step of MapReduce. At this point the results given by the computers are all gathered (recombined) in order to obtain the result of the original problem. This is the reduce step.
This approach is nice: computation are done in a distributed way, so the time spent for the computation is very short. The computation is fault-tolerant (if a computer crashes, its sub problem is given to an other one). But unfortunately MapReduce can only be applied to very specific problems (where there are no communication between computers during the computation).
To finish a crappy whiteboard+iphone picture to illustrate this concept:
The goal is of course to « colorize quickly the picture »…