In: Computer Science
Choose any distributed system (a database, a file system, or anything else) and explain its architecture from the fault tolerance, consensus, recovery, and logging points of view.
Distributed Systems:
Distributed systems are composed of various hardware and software components that communicate with each other only by passing messages. These components are connected by a network, so when we think about architectural styles for distributed computing, the main task is to ensure that the components can communicate with each other over that communication network. Five types of architecture are commonly used in distributed computing environments:
1. Layered Pattern
2. Client - Server Pattern
3. Master - Slave Pattern
4. Peer to Peer Pattern
5. Model View Controller Pattern
Now I am going to explain HDFS, which follows the master-slave pattern.
Hadoop Distributed File System:
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. HDFS stores very large files across a cluster of commodity hardware. It employs a master-slave architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
Architecture:
HDFS NameNode:
The NameNode in the HDFS architecture is also known as the master node. The NameNode stores metadata, i.e. the number of data blocks, their replicas, and other details. This metadata is kept in memory on the master for fast retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Tasks of the NameNode:
1. Manage the file system namespace and the metadata of all files and blocks.
2. Regulate client access to files; every read and write starts with a request to the NameNode.
3. Assign block creation, replication, and deletion work to the DataNodes.
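Since the NameNode serves all metadata from memory, a client can inspect that metadata without touching any DataNode. Below is a minimal sketch using the standard Hadoop Java client API (org.apache.hadoop.fs.FileSystem); the path /data/sample.txt is a hypothetical example, and the snippet assumes the cluster configuration (core-site.xml/hdfs-site.xml) is available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadata {
    public static void main(String[] args) throws Exception {
        // A default Configuration picks up the cluster settings from the
        // classpath, including the address of the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // This call is answered entirely from the NameNode's in-memory
        // metadata; no DataNode is contacted.
        FileStatus st = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path
        System.out.println("length      = " + st.getLen());
        System.out.println("replication = " + st.getReplication());
        System.out.println("block size  = " + st.getBlockSize());
    }
}
```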
HDFS Datanode:
The DataNode in the HDFS architecture is also known as a slave. DataNodes store the actual data in HDFS and perform read and write operations as requested by clients. DataNodes can be deployed on commodity hardware.
Tasks of the DataNode:
1. Store the actual data blocks.
2. Serve read and write requests from clients.
3. Send periodic heartbeats and block reports to the NameNode.
Secondary NameNode:
Despite its name, the Secondary NameNode in HDFS is not a standby that automatically takes over when the primary NameNode fails. Its role is checkpointing: it periodically downloads the NameNode's namespace image (fsimage) and edit log (edits), merges them into a new fsimage, and returns the result to the NameNode, which keeps the edit log from growing without bound. When the NameNode starts, it first reads the HDFS state from the fsimage file and then replays the transactions recorded in the edit log, so a recent checkpoint makes restart and recovery much faster.
Additional Nodes :
Checkpoint Node:
The Checkpoint node is a node that periodically creates checkpoints of the namespace: it fetches the current fsimage and edit log from the active NameNode, merges them locally, and uploads the new image back to the active NameNode.
Backup Node:
A Backup node provides the same checkpointing functionality as the Checkpoint node, but in addition it maintains an in-memory, up-to-date copy of the file system namespace. Because it receives a stream of edits from the NameNode, it is always synchronized with the active NameNode state.
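The recovery-and-logging idea behind all three of these nodes is the same: persist the namespace as a checkpoint image, log every subsequent change, and recover by replaying the log on top of the image. The following is a minimal, self-contained sketch of that idea; the class, the map-based namespace, and the string-array log entries are illustrative stand-ins, not Hadoop's actual fsimage/edits implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointDemo {
    // "fsimage": a checkpoint of the namespace (file name -> block count here)
    static Map<String, Integer> image = new HashMap<>();
    // "edits": every operation logged since the last checkpoint
    static List<String[]> editLog = new ArrayList<>();

    static void create(String file, int blocks) {
        editLog.add(new String[]{"CREATE", file, Integer.toString(blocks)});
    }

    static void delete(String file) {
        editLog.add(new String[]{"DELETE", file, ""});
    }

    // Replay the edit log on top of the image, then truncate the log.
    // This is conceptually what a checkpoint (and a NameNode restart) does.
    static void checkpoint() {
        for (String[] e : editLog) {
            if (e[0].equals("CREATE")) image.put(e[1], Integer.parseInt(e[2]));
            else if (e[0].equals("DELETE")) image.remove(e[1]);
        }
        editLog.clear();
    }

    public static void main(String[] args) {
        create("/data/a.txt", 2);
        create("/data/b.txt", 1);
        delete("/data/a.txt");
        checkpoint();                // merges the three logged operations
        System.out.println(image);   // prints {/data/b.txt=1}
    }
}
```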
Blocks in HDFS Architecture:
HDFS splits huge files into small chunks known as blocks. These are the smallest unit of data in the filesystem. We (clients and admins) do not have any control over the blocks, such as block location; the NameNode decides all such things.
The default size of an HDFS block is 128 MB, which we can configure as needed. All blocks of a file are the same size except the last block, which can be the same size or smaller.
If the data is smaller than the block size, the block only occupies as much space as the data it holds. For example, a 129 MB file is stored as 2 blocks: one block of the default 128 MB size, and a second block of only 1 MB rather than 128 MB, since padding it out would waste space. Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data. The major advantage of such a large block size is that it reduces disk seek time relative to the amount of data read.
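The block layout can be observed from the client side with the standard Hadoop Java API: FileSystem.getFileBlockLocations returns each block's offset, length, and the DataNodes holding its replicas. A minimal sketch follows; the path is hypothetical and a cluster configuration is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/data/big.bin")); // hypothetical path
        // For a 129 MB file with the default 128 MB block size, this prints
        // two blocks: one of 128 MB and one of 1 MB.
        BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}
```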
HDFS Operations:
Write Operation:
When a client wants to write a file to HDFS, it first contacts the NameNode for metadata. The NameNode responds with the number of blocks, their locations, replicas, and other details. Based on this information, the client splits the file into multiple blocks and starts sending them to the first DataNode. The client sends block A to DataNode 1 together with the details of the other two DataNodes. When DataNode 1 receives block A from the client, it copies the same block to DataNode 2 on the same rack; since both DataNodes are in the same rack, the block is transferred via the rack switch. DataNode 2 then copies the block to DataNode 3; since these two DataNodes are in different racks, the block is transferred via an out-of-rack switch. When a DataNode receives the block, it sends a write confirmation to the NameNode. The same process is repeated for each block of the file.
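From a client program, this whole pipeline is hidden behind a single output stream: FileSystem.create asks the NameNode to allocate blocks, and the bytes written to the stream flow through the DataNode replication pipeline described above. A minimal sketch, assuming a reachable cluster and a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() contacts the NameNode for block allocation; the data is
        // then streamed block by block through the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) { // hypothetical path
            out.writeBytes("hello hdfs\n");
        }
    }
}
```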
Read Operation:
To read from HDFS, the client first contacts the NameNode for metadata. The NameNode responds with the names of the requested files, the number of blocks, their locations, replicas, and other details.
The client then communicates with the DataNodes directly. It reads data in parallel from the DataNodes, based on the information received from the NameNode. When the client or application has received all the blocks of the file, it combines them back into the original file.
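The read path looks symmetrical in client code: FileSystem.open asks the NameNode for the block locations, and the returned stream fetches each block from a DataNode that holds a replica. A minimal sketch under the same assumptions as the write example:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() fetches the block locations from the NameNode; the stream
        // then reads each block from a DataNode holding a replica.
        try (FSDataInputStream in = fs.open(new Path("/data/out.txt")); // hypothetical path
             BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```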
Features:
Fault Tolerance:
Fault tolerance in HDFS refers to the system's ability to keep working in unfavorable conditions. HDFS is highly fault-tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster (the number of replicas is configurable). So if any machine in the cluster goes down, a client can still access its data from another machine that holds a copy of the same blocks. HDFS also maintains the replication factor by placing replicas of blocks on a different rack, so even if a whole rack fails, a user can access the data from slaves in another rack.
High Availability:
HDFS is a highly available file system: data is replicated among the nodes of the cluster by creating copies of blocks on the other slaves present in the cluster. Whenever users want to access their data, they can read its blocks from the node holding a replica that is nearest to them in the cluster. And during unfavorable situations, such as the failure of a node, users can still access their data from the other nodes, because duplicate copies of the blocks containing their data exist elsewhere in the cluster.
Data Reliability:
HDFS is a distributed file system that provides reliable data storage; it can store data in the range of hundreds of petabytes across a cluster of nodes. HDFS divides the data into blocks, stores the blocks on the nodes of the cluster, and creates a replica of each block on other nodes, which is what provides its fault tolerance. If a node containing data goes down, users can access that data from the other nodes that hold a copy of it. By default, HDFS creates 3 copies of every block, so data remains quickly available and users do not face data loss. Hence HDFS is highly reliable.
Replication:
Data replication is one of the most important and distinctive features of Hadoop HDFS. In HDFS, data is replicated to guard against data loss in unfavorable conditions such as a node crash or a hardware failure: the blocks of each file are copied to a number of machines in the cluster. HDFS maintains the replication factor continuously, creating new replicas of user data on other machines whenever the count drops. Hence, whenever any machine in the cluster crashes, users can access their data from other machines that hold the blocks of that data, and user data is very unlikely to be lost.
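The replication factor can be controlled per cluster and per file through the public client API: the dfs.replication property sets the default for new files, and FileSystem.setReplication changes it for an existing file, after which the NameNode schedules DataNodes to create or remove replicas. A minimal sketch with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client
        // (HDFS ships with a default of 3).
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one existing file to 4; the
        // NameNode asynchronously brings the replica count up to the target.
        fs.setReplication(new Path("/data/out.txt"), (short) 4); // hypothetical path
    }
}
```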
Scalability:
Because HDFS stores data on multiple nodes in the cluster, the cluster can be scaled when requirements increase. Two scalability mechanisms are available: vertical scalability, i.e. adding more resources (CPU, memory, disk) to the existing nodes of the cluster, and horizontal scalability, i.e. adding more machines to the cluster. The horizontal approach is preferred, since the cluster can be scaled from tens of nodes to hundreds of nodes on the fly, without any downtime.
Distributed Storage:
In HDFS, all of the features above are achieved via distributed storage and replication. Data is stored in a distributed manner across the nodes of the HDFS cluster: it is divided into blocks, the blocks are stored on the nodes, and replicas of each block are created and stored on other nodes in the cluster. So if a single machine in the cluster crashes, we can easily access our data from the other nodes that contain its replicas.