In: Computer Science
Explain the Hadoop architecture in detail. Give an example of Hadoop MapReduce using real data.
Hadoop is a distributed framework for storing and processing big data. It uses HDFS (Hadoop Distributed File System) for storage and the MapReduce framework for processing. Hadoop has had three major versions to date: Hadoop 1, 2, and 3.
Hadoop 1.0 Architecture: Hadoop 1.0 uses HDFS as its file system and MapReduce as its processing framework. It has the following components:
1. Job Tracker
2. Task Tracker
3. Name Node
4. Data Nodes
5. Secondary Namenode
Here, a master-slave architecture is followed, where the master is the NameNode and the slaves are the DataNodes. The NameNode keeps track of the data using file system metadata, mapping each file to its blocks and their locations on the DataNodes. For processing, the JobTracker assigns tasks to TaskTrackers running on each data node. The Secondary NameNode periodically merges the NameNode's edit log into its namespace image (checkpointing); it is not a hot standby, but its checkpoint can be used for recovery if the NameNode fails. To track the health of the nodes, a heartbeat (a periodic signal) is exchanged frequently. Rack awareness is used while assigning tasks so that data locality is preserved, making the architecture more efficient.
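The rack-aware, locality-preserving placement described above can be sketched in a few lines. This is a toy illustration, not the real NameNode logic; the block table and node names are made up:

```python
# Hypothetical block-to-replica table: block id -> list of (DataNode, rack)
block_locations = {
    "blk_001": [("dn1", "rack-A"), ("dn3", "rack-B"), ("dn4", "rack-B")],
}

def pick_node(block_id, requesting_rack):
    """Prefer a replica on the same rack as the requester (data locality)."""
    replicas = block_locations[block_id]
    for node, rack in replicas:
        if rack == requesting_rack:
            return node          # rack-local read: no cross-rack traffic
    return replicas[0][0]        # otherwise fall back to any replica

print(pick_node("blk_001", "rack-B"))  # dn3 (first rack-B replica)
print(pick_node("blk_001", "rack-C"))  # dn1 (no local replica exists)
```

The same preference order (same node, same rack, any node) is what makes scheduling map tasks near their data cheap.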
Hadoop 2.0 Architecture: Hadoop 2.0 also uses HDFS as its file system, but introduces YARN (Yet Another Resource Negotiator) for resource management, with MapReduce running on top of it. It has the following components:
1. Resource Manager
2. Application Master
3. Containers
4. Node Manager
It is also a master-slave architecture, but with different components. The Resource Manager takes care of resource scheduling and availability across the cluster. The Application Master negotiates with the Resource Manager for resources and distributes work to the Node Managers running on the data nodes. Containers are the units of resources (memory and CPU) allocated on a node, inside which tasks run. Hadoop 2 is superior to Hadoop 1 because YARN decouples resource management from MapReduce, so other processing engines can share the same cluster.
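The negotiation between the Application Master and the Resource Manager can be sketched as a toy allocator. This is only an illustration of the idea, assuming a container is just a memory reservation on a Node Manager; the names are invented and this is not the real YARN API:

```python
# Free memory (MB) per NodeManager, as the ResourceManager sees it
node_free_mb = {"nm1": 4096, "nm2": 2048}

def allocate_container(mem_mb):
    """ResourceManager side: grant a container on any node with capacity."""
    for node, free in node_free_mb.items():
        if free >= mem_mb:
            node_free_mb[node] = free - mem_mb
            return (node, mem_mb)   # the "container": a node + reserved memory
    return None                     # request waits until resources free up

# The ApplicationMaster negotiates two 1.5 GB containers for its tasks:
print(allocate_container(1536))  # ('nm1', 1536)
print(allocate_container(1536))  # ('nm1', 1536)
print(allocate_container(4096))  # None - no node has 4 GB free any more
```

Real YARN schedulers (Capacity, Fair) are far more sophisticated, but the shape of the exchange, request resources and receive containers on specific nodes, is the same.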
How does MapReduce work?
Data is divided into blocks. A block is the unit of storage: the default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2.
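The split arithmetic is simple, as a quick sketch (assuming the Hadoop 2 default of 128 MB; the actual block size is configurable):

```python
BLOCK_SIZE_MB = 128  # Hadoop 2 default; configurable per cluster/file

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks a file would be stored as."""
    full, remainder = divmod(file_size_mb, BLOCK_SIZE_MB)
    blocks = [BLOCK_SIZE_MB] * full
    if remainder:
        blocks.append(remainder)  # the final block may be smaller
    return blocks

print(split_into_blocks(300))  # [128, 128, 44]
```

So a 300 MB file occupies three blocks, and the last one only uses the space it needs.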
These blocks are then assigned to task trackers on various nodes; this distribution is the map phase. The map tasks perform their transformations, after which the shuffle-and-sort phase begins: intermediate records are sorted by key, which requires shuffling data between nodes so that all values for a given key end up together. Once these operations are complete, the reduce phase collects the grouped data from the various nodes and produces the output.
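The three phases above can be sketched in memory with the classic word-count job. Real Hadoop runs each phase across many nodes; here each phase is just a plain function, so this is a conceptual sketch rather than actual Hadoop code:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input split
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle/sort: bring all values for the same key together, sorted by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    # Reduce: aggregate the value list for each key
    return {key: sum(values) for key, values in grouped}

data = ["big data big", "data tools"]
print(reduce_phase(shuffle_sort(map_phase(data))))
# {'big': 2, 'data': 2, 'tools': 1}
```

Each function maps directly onto a phase: `map_phase` runs where the data lives, `shuffle_sort` is the cross-node exchange, and `reduce_phase` writes the final result.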
A real-life example of the MapReduce framework:
For example, suppose our production query calculates the maximum salary per department, considering only active departments. This involves:
- Grouping on department
- A filter on the active flag (the HAVING clause in the query)
- A WHERE clause for other predicates
- The aggregation function MAX on the result set
So these are the steps taken in the MapReduce phases:
1. The input splitter splits the data into blocks, and the blocks are assigned to data nodes based on the number-of-mappers property. This is where the map phase runs.
2. Once the data is mapped, the WHERE-clause predicates are applied (filtering can happen map-side); during the shuffle-and-sort phase, the records are grouped by department along with the other operations.
3. In the reduce phase, the MAX aggregation is performed to find the maximum value per department, and the output is finally written to HDFS.
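The query above can be sketched as a mapper and a reducer over a small in-memory dataset. The rows, column layout, and function names are made up for illustration, in the same spirit as a Hadoop Streaming job:

```python
from collections import defaultdict

# Hypothetical rows: (department, active_flag, salary)
rows = [
    ("sales", True,  70000),
    ("sales", True,  85000),
    ("hr",    True,  60000),
    ("hr",    False, 90000),   # inactive: filtered out in the map phase
    ("it",    True,  95000),
]

def mapper(row):
    dept, active, salary = row
    if active:                   # filter predicates run map-side
        yield (dept, salary)     # key = department (the GROUP BY key)

def reducer(dept, salaries):
    return (dept, max(salaries)) # MAX aggregation per group

# Shuffle/sort: group the mapped pairs by department
grouped = defaultdict(list)
for row in rows:
    for dept, salary in mapper(row):
        grouped[dept].append(salary)

result = dict(reducer(d, s) for d, s in sorted(grouped.items()))
print(result)  # {'hr': 60000, 'it': 95000, 'sales': 85000}
```

Note that the inactive HR row with the 90000 salary never reaches the reducer, so HR's maximum is computed only over its active rows, exactly as the query intends.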