Question

In: Computer Science

Choose any distributed system (DB, file system or anything else) and explain its architecture from fault tolerance, consensus, recovery and logging point of view.

Solutions

Expert Solution

Distributed Systems:

Distributed systems are composed of multiple hardware and software components that communicate with each other only by passing messages. These components are spread across machines connected by a network, so the main architectural task is to ensure that they can communicate with one another reliably over that communication network. Five architectural patterns are commonly used in distributed computing:

1. Layered Pattern

2. Client-Server Pattern

3. Master-Slave Pattern

4. Peer-to-Peer Pattern

5. Model-View-Controller Pattern

Now I will explain one such system, the Hadoop Distributed File System (HDFS).

Hadoop Distributed File System:

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. HDFS stores very large files on a cluster of commodity hardware. It employs a master-slave architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Architecture:

HDFS NameNode:

The NameNode in the HDFS architecture is also known as the master node. It stores metadata such as the number of data blocks, their locations, replicas, and other details. This metadata is kept in memory on the master for fast retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Tasks of the NameNode

  • Manages the file system namespace.
  • Regulates clients' access to files.
  • Executes file system operations such as naming, opening, and closing files and directories (see the API sketch after this list).
  • Receives a heartbeat and a block report from every DataNode in the Hadoop cluster. The heartbeat confirms that the DataNode is alive; the block report lists all blocks stored on that DataNode.
  • Maintains the replication factor of all blocks.
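
These namespace operations are exposed to clients through the Hadoop FileSystem API. Below is a minimal sketch, assuming a hypothetical NameNode address (hdfs://namenode:9000) and example paths; it only illustrates metadata operations that the NameNode handles.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Namespace operations are metadata-only and are handled by the NameNode.
        fs.mkdirs(new Path("/user/demo"));                            // create a directory
        fs.rename(new Path("/user/demo"), new Path("/user/demo2"));   // rename it

        // List the namespace entries under a directory.
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}
```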

HDFS Datanode:

The DataNode in the HDFS architecture is also known as a slave. DataNodes store the actual data in HDFS and perform read and write operations at the request of clients. DataNodes can be deployed on commodity hardware.
Tasks of the DataNode

  • Creates, deletes, and replicates block replicas according to the instructions of the NameNode.
  • Manages the data storage of the system.
  • Sends heartbeats to the NameNode to report the health of HDFS. By default, the heartbeat interval is 3 seconds (a configuration sketch follows this list).
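
The heartbeat interval, like most HDFS behaviour, is driven by configuration. The sketch below reads (and, purely for illustration, overrides) the dfs.heartbeat.interval property through the Hadoop Configuration API; in practice this value is set in hdfs-site.xml, and 3 seconds is the documented default.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // dfs.heartbeat.interval: seconds between DataNode heartbeats (default 3).
        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("DataNode heartbeat interval: " + heartbeatSeconds + " s");

        // Normally configured in hdfs-site.xml; set here only to show the API.
        conf.setLong("dfs.heartbeat.interval", 3);
    }
}
```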

Secondary NameNode:

  • When the NameNode starts, it reads the HDFS state from an image file (fsimage) and then replays the edit log of subsequent changes. The Secondary NameNode periodically downloads the fsimage and edit log from the NameNode and merges them into a new checkpoint, which keeps the edit log small and makes NameNode restart and recovery faster. Despite its name, it is not an automatic hot standby for a failed NameNode; automatic failover requires a separately configured standby NameNode.

Additional Nodes:

Checkpoint Node:

The Checkpoint node is a node that periodically creates checkpoints of the namespace by merging the fsimage and edit log and uploading the new image back to the active NameNode.

Backup Node:

A Backup node provides the same checkpointing functionality as the Checkpoint node. In addition, it keeps an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state.

Blocks in HDFS Architecture:

HDFS splits huge files into small chunks known as blocks; these are the smallest unit of data in the file system. Clients and administrators have no control over block placement; the NameNode decides such things.
The default HDFS block size is 128 MB, and it can be configured as needed. All blocks of a file are the same size except the last one, which can be the same size or smaller.
If the remaining data is smaller than the block size, the block only occupies as much space as the data it holds. For example, a 129 MB file is stored as 2 blocks: one block of the default 128 MB, and one of only 1 MB rather than 128 MB, so the remaining 127 MB is not wasted. The main advantage of such a large block size is that it reduces disk seek time.
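
The block layout of a stored file can be inspected through the FileSystem API. The sketch below assumes a hypothetical file /user/demo/big.dat already exists in HDFS; it prints the block size, the number of blocks, and the DataNodes hosting each block, using only metadata supplied by the NameNode.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big.dat"); // hypothetical existing file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("File length: " + status.getLen() + " bytes");
        System.out.println("Block size : " + status.getBlockSize() + " bytes"); // 128 MB by default

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("Number of blocks: " + blocks.length); // e.g. 2 for a 129 MB file
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```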

HDFS Operations:

Write Operation:
When a client wants to write a file to HDFS, it first contacts the NameNode for metadata. The NameNode responds with the number of blocks, their target locations, the replication factor, and other details. Based on this information, the client splits the file into blocks and starts sending them to the first DataNode. The client first sends block A to DataNode 1 along with the details of the other two DataNodes. When DataNode 1 receives block A from the client, it copies the same block to DataNode 2 in the same rack; because both DataNodes are in the same rack, the block is transferred through the rack switch. DataNode 2 then copies the block to DataNode 3; because these two DataNodes are in different racks, the block is transferred through an out-of-rack switch. Once the DataNodes have received the block, write acknowledgments are sent back and the DataNodes report the new block to the NameNode. The same process is repeated for each block of the file.
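
A minimal client-side sketch of this write path is shown below, using the standard Hadoop FileSystem API; the NameNode address and target path are assumptions for illustration, and the DataNode pipeline replication happens transparently underneath the create/write calls.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode for target DataNodes; the data then flows
        // through the DataNode pipeline, which replicates each block.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```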

Read Operation:

To read from HDFS, the client first contacts the NameNode for metadata. The NameNode responds with the list of blocks that make up the file, their locations, replicas, and other details.
The client then communicates with the DataNodes directly and reads the blocks in parallel, based on the information received from the NameNode. Once the client or application has received all the blocks of the file, it combines them back into the original file.
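
The matching read path, as a minimal sketch: open() fetches block locations from the NameNode, after which the bytes are streamed directly from the DataNodes holding the replicas. The file path is again an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() contacts the NameNode for block locations; the stream then
        // reads each block from a nearby DataNode that holds a replica.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/output.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // copy file contents to stdout
        }
        fs.close();
    }
}
```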

Features:

Fault Tolerance:

Fault tolerance in HDFS refers to the system's ability to keep working under unfavorable conditions and how it handles such situations. HDFS is highly fault-tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster (the number of replicas is configurable). So if any machine in the cluster goes down, a client can still access the data from another machine that holds a copy of the same blocks. HDFS also maintains the replication factor by placing replicas of blocks on a different rack, so even if a machine or an entire rack suddenly fails, users can still access the data from slaves in another rack.

High Availability:

HDFS is a highly available file system: data is replicated among the nodes in the HDFS cluster by creating replicas of the blocks on other slaves in the cluster. Whenever users want to access this data, they can read it from whichever node holds its blocks, usually the nearest one in the cluster. During unfavorable situations such as the failure of a node, users can still access their data from other nodes, because duplicate copies of the blocks containing their data exist elsewhere in the HDFS cluster.

Data Reliability:

HDFS is a distributed file system that provides reliable data storage and can hold data in the range of hundreds of petabytes. It stores data reliably across a cluster of nodes: data is divided into blocks, the blocks are stored on nodes in the HDFS cluster, and a replica of every block is created on other nodes, which provides fault tolerance. If a node containing data goes down, users can access that data from other nodes that hold a copy of it. By default, HDFS creates 3 copies of every block, so data remains quickly available and users do not face data loss. Hence HDFS is highly reliable.

Replication:

Data replication is one of the most important and distinctive features of HDFS. Replication addresses the problem of data loss under unfavorable conditions such as node crashes and hardware failures: data is split into blocks, and those blocks are replicated across a number of machines in the cluster. HDFS maintains the replication factor continuously, re-creating replicas of user data on other machines whenever copies are lost. So whenever any machine in the cluster crashes, users can access their data from other machines that hold the blocks of that data, and the chance of losing user data is very small.
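
The replication factor is a per-file property with a cluster-wide default of 3 (the dfs.replication setting). The sketch below checks and changes it for a hypothetical file; when the value changes, the NameNode schedules the creation or deletion of replicas in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt"); // hypothetical file

        // Cluster-wide default replication factor (dfs.replication, default 3).
        System.out.println("Default replication: " + conf.getInt("dfs.replication", 3));

        // Current replication factor of this particular file.
        System.out.println("File replication: " + fs.getFileStatus(file).getReplication());

        // Request an extra replica; the NameNode re-replicates the blocks asynchronously.
        fs.setReplication(file, (short) 4);
        fs.close();
    }
}
```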

Scalability:

As HDFS stores data on multiple nodes in the cluster, the cluster can be scaled when requirements grow. Two scaling mechanisms are available: vertical scalability, i.e. adding more resources (CPU, memory, disk) to the existing nodes of the cluster, and horizontal scalability, i.e. adding more machines to the cluster. The horizontal approach is preferred, since the cluster can grow from tens of nodes to hundreds of nodes on the fly, without downtime.

Distributed Storage:

In HDFS, all of the above features are achieved through distributed storage and replication. Data is stored in a distributed manner across the nodes of the HDFS cluster: it is divided into blocks, the blocks are stored on nodes in the cluster, and replicas of every block are created and stored on other nodes. So if a single machine in the cluster crashes, the data can still be accessed from the other nodes that hold its replicas.

