A Hadoop cluster is a collection of independent machines working together to store and analyze large amounts of unstructured data in a distributed computing environment. The cluster runs the open source Hadoop software and is often described as a "shared nothing" system, because the nodes share nothing except the network that connects them.
Hadoop Cluster Architecture
Hadoop follows a master-slave model made up of three types of node. Within a Hadoop cluster, one machine is designated as the NameNode and another as the JobTracker; these are the masters. The rest of the machines in the cluster act as DataNodes and TaskTrackers; these are the slaves. The masters coordinate the work of the many slaves. The list below provides more information on each component.
- Master Node — The master node in a Hadoop cluster oversees the storage of data in the Hadoop Distributed File System (HDFS) and the computation on that data using MapReduce, the data processing framework. It runs three daemons: the NameNode, the Secondary NameNode, and the JobTracker. The NameNode manages the HDFS namespace and tracks where data blocks live; the Secondary NameNode periodically checkpoints the NameNode's metadata (it is not a hot backup); and the JobTracker schedules and monitors the parallel processing of data with MapReduce.
- Slave/Worker Node — The slave/worker nodes perform the actual storage and computation. Each runs a DataNode and a TaskTracker: the DataNode stores HDFS blocks and reports their status to the NameNode, while the TaskTracker executes the map and reduce tasks assigned to it by the JobTracker.
- Client Node — Client nodes load data into the Hadoop cluster, submit MapReduce jobs that describe how the data should be processed, and retrieve the output once a job completes; a minimal job-submission sketch follows this list.
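To make the client node's role concrete, here is a minimal sketch of the canonical WordCount job written against the original org.apache.hadoop.mapred API from the JobTracker/TaskTracker generation of Hadoop. The input and output arguments are hypothetical HDFS paths, and a real job would first be packaged into a JAR.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map phase: runs on the TaskTrackers, close to the HDFS blocks;
  // emits a (word, 1) pair for every word it sees.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reduce phase: sums the counts emitted for each distinct word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    // Input and output locations live in HDFS.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // runJob submits the job to the JobTracker and blocks until it finishes.
    JobClient.runJob(conf);
  }
}
```

A client node would then submit the packaged job with something like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output` (illustrative paths), after which the JobTracker schedules the map and reduce tasks across the TaskTrackers.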
Advantages of a Hadoop Cluster
Hadoop clusters are highly scalable: adding nodes to the cluster increases both storage capacity and processing throughput, which speeds up data applications. Hadoop is also highly resistant to failure, because HDFS replicates each block of data onto several cluster nodes. If one node fails, the copies on other nodes remain available for analysis.
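The sketch below, assuming the standard HDFS Java client API, shows how replication drives this fault tolerance: it asks the NameNode to keep three copies of a file's blocks, then reads the factor back. The file path and the factor of 3 (HDFS's default) are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings from core-site.xml / hdfs-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/events.log"); // hypothetical HDFS path

    // Ask the NameNode to replicate this file's blocks onto 3 DataNodes,
    // so the loss of any single node still leaves two live copies.
    fs.setReplication(file, (short) 3);

    short actual = fs.getFileStatus(file).getReplication();
    System.out.println("Replication factor for " + file + ": " + actual);
    fs.close();
  }
}
```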
Hadoop clusters are also well suited to deploying and managing cloud data processing services. Because resources can be scaled easily when needed, the software streamlines large-scale data work for companies across many industries.
Experience Hadoop clusters in action with servers purpose-selected for Hadoop and for our Ceph Object Storage, available soon within our Hosted Private Cloud. Look out for an upcoming article on Spark as well: with our new high-RAM servers, we now have a standard bare metal box for Spark.