But if you were given the responsibility of installing a 100+ node Hadoop cluster, what would you do? There are plenty of solutions in the marketplace, both commercial and open source, so you may want to look around. Let's say that you successfully installed a 100+ node Hadoop cluster.
During the operation of the cluster, things are bound to fail. What failures do you think may happen in the cluster, and what solutions/tools would you employ to monitor and discover those failures?
There can be three types of failures in a Hadoop cluster:
NameNode Failure:
The NameNode is a SPOF (single point of failure) in an HDFS cluster. Each
cluster has a single NameNode, and if that machine goes down, the whole
cluster becomes unavailable.
The NameNode can go down in two cases:
1) An unplanned crash or hardware failure.
2) Planned maintenance.
How to Overcome?
Run a standby NameNode (HDFS High Availability) that can take over when
the active NameNode fails. Note that the Secondary NameNode by itself is
not a failover node; HA requires a dedicated standby.
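As a minimal sketch (assuming an HA-enabled cluster where the two NameNodes are registered under the hypothetical service IDs nn1 and nn2 in hdfs-site.xml), you can check which NameNode is active and trigger a manual failover with the hdfs haadmin tool:

    # Check whether each NameNode is currently active or standby.
    # "nn1" and "nn2" are example service IDs, not fixed names.
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Manually fail over from nn1 to nn2, e.g. before planned maintenance.
    hdfs haadmin -failover nn1 nn2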
DataNode Failure:
Each DataNode sends a Heartbeat message to the NameNode
periodically. A network partition can cause a subset of DataNodes
to lose connectivity with the NameNode. The NameNode detects this
condition by the absence of a Heartbeat message.
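As a rough illustration of how this detection is tuned (the property names are standard HDFS settings; the values quoted are the usual defaults), you can inspect the heartbeat and recheck intervals with hdfs getconf:

    # Interval in seconds at which each DataNode sends a heartbeat (default 3).
    hdfs getconf -confKey dfs.heartbeat.interval

    # Recheck interval in milliseconds used by the NameNode (default 300000).
    # A DataNode is declared dead after roughly
    # 2 * recheck-interval + 10 * heartbeat-interval, about 10.5 minutes by default.
    hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval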
How to Overcome?
There is generally more than one DataNode in the cluster, so a single
DataNode failure does not critically impact the cluster. However, the
NameNode has to re-replicate the blocks that were stored on the failed
node, which adds replication overhead. Keeping a few spare (buffer)
DataNodes can help alleviate the problem.
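To see the effect of a DataNode failure in practice, the standard HDFS admin commands below can be used (just a sketch; run them as the HDFS superuser):

    # Summary of live and dead DataNodes, capacity, and per-node usage.
    hdfs dfsadmin -report

    # Scan the filesystem for missing, corrupt, or under-replicated blocks
    # after a DataNode has gone down.
    hdfs fsck /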
Secondary NameNode Failure:
The Hadoop cluster keeps running when the Secondary NameNode crashes. You
can even run a Hadoop cluster without it; it only performs periodic
checkpointing of the NameNode metadata and is not used for high
availability.
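If it does crash, restarting it is usually enough; a sketch (assuming a Hadoop 3.x installation, where daemons are managed with hdfs --daemon) might look like:

    # Restart the checkpointing daemon on the node where it normally runs.
    hdfs --daemon start secondarynamenode

    # How often checkpoints are taken, in seconds (default 3600, i.e. hourly).
    hdfs getconf -confKey dfs.namenode.checkpoint.period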
To simulate any of these failures, run "sudo jps" to get the process IDs and names of the running Hadoop daemons, then run "sudo kill -9 {process-id}" on the daemon you want to kill. Afterwards, try to read/write data in HDFS or from a Pig/Hive shell to see how the cluster behaves.
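An illustrative session (the process IDs shown are made up) might look like this, killing the DataNode daemon and then checking that HDFS still serves reads and writes:

    sudo jps                 # lists Hadoop daemons with their PIDs, e.g.:
                             #   4321 NameNode
                             #   4567 DataNode
                             #   4789 SecondaryNameNode
    sudo kill -9 4567        # kill the DataNode process to simulate a failure

    # Verify that the cluster still accepts writes and reads.
    hdfs dfs -put test.txt /tmp/test.txt
    hdfs dfs -cat /tmp/test.txt

Here test.txt is just an arbitrary local file used for the check.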