Suppose there are two sets of letters D1 and D2 stored on two nodes A1 and A2:
• D1 = {“a”, “b”, “c”, “b”, “c”, “d”, “a”} on Node A1, and
• D2 = {“a”, “a”, “a”, “d”, “d”, “c”} on Node A2.
There is a MapReduce job running to process D1 and D2. The job is a “get average” application, namely, computing the average number of copies for each letter. In this job, two Map tasks run on Node A1 and Node A2, and one Reduce task runs on another node, say, Node A3. Describe the key-value data transferred from Node A1 and Node A2 to Node A3, respectively, with and without a Combiner. Explain how a Combiner can improve a MapReduce job in this example.
When a MapReduce job runs over very large data sets, such as D1 and D2 in this example, each Mapper produces a large amount of intermediate output that must be shuffled across the network to the Reducer, which can cause network congestion. To improve efficiency, a Combiner can be specified for the job (in Hadoop, by passing a Reducer class to Job.setCombinerClass()); it performs local aggregation of the intermediate output on each map node and thereby minimizes the amount of data transferred from the Mappers to the Reducer. In effect, the Combiner acts like a mini-reducer: it consumes the Mapper's output and aggregates it locally before it is sent to the Reducer.
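As a concrete illustration, here is a minimal sketch of how such a job could be wired up in Hadoop. The class names (LetterAverageJob, LetterMapper, CountCombiner, AverageReducer), the assumption that the letters arrive as whitespace-separated tokens in text files, and the hard-coded number of map nodes are all illustrative and not part of the original problem; the Reducer reads "average number of copies" as the per-node average, which is one possible interpretation of the assignment.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LetterAverageJob {

    // Number of map nodes (A1 and A2); hard-coded here purely for illustration.
    private static final int NUM_MAP_NODES = 2;

    // Mapper (runs on A1 and A2): emits one (letter, 1) pair per letter.
    public static class LetterMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text letter = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    letter.set(token);
                    context.write(letter, ONE);      // e.g. ("a", 1)
                }
            }
        }
    }

    // Combiner (runs locally on A1 and A2): sums the 1s per letter, so only
    // one (letter, partialCount) pair per distinct letter leaves each node.
    public static class CountCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text letter, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(letter, new IntWritable(sum)); // e.g. ("a", 2) on A1
        }
    }

    // Reducer (runs on A3): totals the partial counts for each letter and
    // divides by the number of map nodes, i.e. the average copies per node.
    public static class AverageReducer
            extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text letter, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(letter, new DoubleWritable((double) total / NUM_MAP_NODES));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "letter average with combiner");
        job.setJarByClass(LetterAverageJob.class);
        job.setMapperClass(LetterMapper.class);
        job.setCombinerClass(CountCombiner.class);   // local aggregation on A1/A2
        job.setReducerClass(AverageReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // D1/D2 input files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the Combiner only sums partial counts; the division that produces the average happens once, in the Reducer. This keeps the result correct whether or not the Combiner actually runs, since Hadoop does not guarantee that the Combiner will be applied.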
Example:
D1 = {“a”, “b”, “c”, “b”, “c”, “d”, “a”}
D2 = {“a”, “a”, “a”, “d”, “d”, “c”}
The primary function of the Mappers is to turn their input splits into key-value pairs. Here, the Map task on Node A1 emits one (letter, 1) pair per letter in D1, i.e. 7 pairs: (“a”, 1), (“b”, 1), (“c”, 1), (“b”, 1), (“c”, 1), (“d”, 1), (“a”, 1); the Map task on Node A2 emits 6 pairs: (“a”, 1), (“a”, 1), (“a”, 1), (“d”, 1), (“d”, 1), (“c”, 1). Without a Combiner, all 13 of these key-value pairs are copied over the network to the Reducer on Node A3.
With a Combiner, the pairs are aggregated locally on each map node before the shuffle: Node A1 transfers only (“a”, 2), (“b”, 2), (“c”, 2), (“d”, 1), i.e. 4 pairs, and Node A2 transfers only (“a”, 3), (“c”, 1), (“d”, 2), i.e. 3 pairs. Only 7 key-value pairs instead of 13 cross the network, and the Reducer on Node A3 computes the average number of copies for each letter from these partial counts (see the sketch below).
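To double-check those counts, the following stand-alone sketch (plain Java, not Hadoop code; the class and method names are made up for illustration) simulates the Mapper output and the Combiner's local aggregation for D1 and D2 and prints the pairs each node would transfer:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerTransferDemo {
    public static void main(String[] args) {
        List<String> d1 = List.of("a", "b", "c", "b", "c", "d", "a"); // Node A1
        List<String> d2 = List.of("a", "a", "a", "d", "d", "c");      // Node A2
        show("A1", d1);
        show("A2", d2);
    }

    static void show(String node, List<String> data) {
        // Without a Combiner: one (letter, 1) pair per occurrence is shuffled.
        List<String> raw = new ArrayList<>();
        for (String letter : data) {
            raw.add("(" + letter + ", 1)");
        }
        // With a Combiner: local counts, one pair per distinct letter.
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String letter : data) {
            combined.merge(letter, 1, Integer::sum);
        }
        System.out.println(node + " without Combiner (" + raw.size() + " pairs): " + raw);
        System.out.println(node + " with Combiner (" + combined.size() + " pairs): " + combined);
    }
}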
In addition, when a Combiner is used, the Mapper's output is aggregated before it is spilled to local disk and shuffled, so far fewer key-value pairs are written to disk and sent over the network than without a Combiner.
A Combiner improves the MapReduce job in this example in the following ways:
1. It minimizes the time taken to transfer data between the Mappers and the Reducer.
2. It decreases the amount of data the Reducer has to fetch and process.
3. It improves the overall efficiency of the Reducer, and therefore of the whole job.