For each of the following statements, state whether the statement is true or false AND justify your answer.
Answer:
1. When a map-reduce job is running on Hadoop, the number of map tasks must be equal to the number of reduce tasks because each map task feeds its output to a specific reduce task.
This statement is False.
Explanation:
Hadoop does not let you fix the number of map tasks directly; it is governed by the number of input splits. A value passed through the mapred.map.tasks parameter (mapreduce.job.maps in newer releases) is only a hint to the framework. Hence, in any MapReduce job, regardless of the suggested value, one map task is spawned per input split, so the number of map tasks always equals the number of input splits. The number of reduce tasks is set independently by the user, and each map task's output is partitioned among all of the reduce tasks (the number of partitions equals the number of reduce tasks), so a single map task does not feed one specific reduce task. The driver sketch below shows this asymmetry in the API.
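A minimal driver sketch (the class name and paths are illustrative assumptions, not from the question): the Job API exposes setNumReduceTasks to fix the reducer count explicitly, but there is no corresponding hard setting for mappers on the Job object, because the map count follows from the input splits computed at submission time.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "example");
        job.setJarByClass(JobSetup.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Reducers: fixed explicitly by the user.
        job.setNumReduceTasks(4);

        // Mappers: there is no setNumMapTasks on Job; the framework
        // spawns one map task per input split of the input files.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```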
2. In order to guarantee the scalability of the system, each file block is replicated three times in Hadoop distributed file system (HDFS).
This statement is True.
Explanation:
This can be justified by the following example:
The default block size in Hadoop 2.x is 128 MB, so a file of 514 MB is divided into 5 blocks (514 MB / 128 MB, rounded up): the first four blocks are 128 MB each and the last block holds only the remaining 2 MB. With the default replication factor of 3, each block is replicated three times, so we end up with 15 physical blocks in total: 12 blocks of 128 MB each and 3 blocks of 2 MB each. Note that 3 is only the default; the replication factor is configurable per file via the dfs.replication setting.
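As a quick sanity check, a few lines of plain Java (not an HDFS API, just arithmetic) reproduce the numbers above:

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileMb = 514, blockMb = 128;
        int replication = 3;
        long fullBlocks = fileMb / blockMb;            // 4 blocks of 128 MB
        long lastBlockMb = fileMb % blockMb;           // 2 MB remainder
        long logicalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);
        System.out.println("Logical blocks: " + logicalBlocks);                 // 5
        System.out.println("Physical blocks: " + logicalBlocks * replication);  // 15
    }
}
```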
3. When a map-reduce job is submitted to Hadoop, the first step is to shuffle and sort the input key/value pairs so that each input key is assigned to a map task.
This statement is False.
Explanation:
A MapReduce job goes through four phases of execution: splitting, mapping, shuffling, and reducing. The first step is therefore splitting, not shuffling and sorting. The job splits the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. Only after the map phase does the framework shuffle and sort the map outputs, and that step assigns each intermediate key to a reduce task, not to a map task. Typically both the input and the output of the job are stored in a file system such as HDFS.
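To make the ordering concrete, here is a minimal sketch of the canonical word-count mapper and reducer (modeled on the standard Hadoop example): the mapper emits raw (word, 1) pairs straight from its input split, and only afterwards does the shuffle-and-sort step group those pairs by key for the reducer.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emitted before any shuffle/sort
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // values grouped by the shuffle
            context.write(key, new IntWritable(sum));
        }
    }
}
```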