Introduction to Big Data
1- Explain the two main methods of MapReduce.
2- What are the different nodes that can be found in the Hadoop ecosystem?
3- What are the advantages of using Hadoop?
Answer 1:
MapReduce is a processing technique as well as a programming model for distributed computing. It is used to process large amounts of data in parallel on large clusters in a reliable manner, and it can be structured so that large-scale computations tolerate hardware faults.
The two main methods of MapReduce are as follows:
1) Map
The Map function accepts an input element as its argument and produces zero or more key-value pairs. The types of the keys and values are arbitrary. A Map task can produce several key-value pairs with the same key, even from the same element. The input elements can be anything, such as a tuple or an entire document.
2) Reduce
The Reduce function takes as its argument a pair consisting of a key and that key's list of values. The output of the Reduce function is a sequence of zero or more key-value pairs. A Reduce task executes one or more reducers, and the outputs from all the Reduce tasks are merged into a single file. Both methods are illustrated in the sketch below.
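To make the two methods concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. It is an illustration rather than a definitive implementation: the class names (WordCount, TokenizerMapper, IntSumReducer) are my own, and the input and output paths are assumed to arrive as command-line arguments. The Map method emits a (word, 1) pair for every word it sees; the Reduce method receives each word together with its list of counts and sums them.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line (key = byte offset, value = line text),
    // emit a (word, 1) pair for every word on the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // zero or more pairs per input element
            }
        }
    }

    // Reduce: receives a word and the list of counts emitted for it,
    // and outputs the word with the total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job would typically be launched with something like: hadoop jar wordcount.jar WordCount /input/dir /output/dir.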
Answer 2:
The different nodes that can be found in the Hadoop ecosystem are:
1) NameNode
It is the most important Hadoop daemon. For distributed storage, Hadoop employs a master/slave architecture; this storage system is called the Hadoop Distributed File System (HDFS). The NameNode is the master of HDFS: it directs the slave DataNode daemons to perform the low-level I/O tasks, and it is often described as the bookkeeper of HDFS because it tracks the filesystem metadata (filenames, block locations, permissions). The NameNode's work is memory- and I/O-intensive. The downside is that a single NameNode is a single point of failure for the Hadoop cluster.
2) DataNode
A DataNode daemon runs on each slave machine to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks as actual files on the machine's local filesystem. DataNodes constantly report back to the NameNode through heartbeats and block reports. Because each block is replicated across several DataNodes, the files remain readable even if a DataNode crashes or becomes inaccessible over the network. The division of labour between the two daemons is shown in the sketch below.
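Here is a minimal sketch that writes and reads one file through the HDFS Java client. The NameNode address (hdfs://namenode-host:9000) and the file path are placeholder assumptions; in a real cluster the address would come from core-site.xml. The key point is that the client asks the NameNode only for metadata (which blocks, on which DataNodes), while the bytes themselves flow directly to and from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt"); // illustrative path

            // Write: the NameNode decides where the blocks go;
            // the client then streams the bytes to those DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns the block locations; the data itself
            // is fetched from whichever DataNodes hold the replicas.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}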
Answer 3:
The advantages of using Hadoop are:
1) It is an open-source framework and therefore free to use. It stores and processes data on commodity hardware, so the cost is low.
2) If a node fails, the tasks running on it are automatically redirected to other nodes. Moreover, multiple copies of all data are stored automatically, so data survives node failures (a small sketch of controlling this replication follows the list).
3) Very little administration is required: nodes can be added and removed easily, and failed nodes are detected automatically.
4) Hadoop provides huge, flexible storage because a cluster can scale to thousands of nodes.
5) Hadoop offers high computing power, which comes from spreading computation across the thousands of nodes in the cluster.
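As a small illustration of point 2, the replication factor behind this fault tolerance can be set per file through the HDFS Java API. This is a brief sketch, not a definitive recipe: the file path is hypothetical, and 3 replicas is the usual HDFS default (controlled by the dfs.replication property).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Ask HDFS to keep 3 copies of this (already existing) file's blocks.
            boolean ok = fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
            System.out.println("replication change accepted: " + ok);
        }
    }
}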