Hadoop Distributed File System (HDFS) Architectural Documentation - Overview


3         Overview of the HDFS Architecture

This section provides a quick overview of the architecture of HDFS. The material in here is elaborated in other sections. The figure below gives a run-time view of the architecture showing three types of address spaces: the application, the NameNode and the DataNode. An essential portion of HDFS is that there are multiple instances of DataNode.

HDFS Runtime View

The application incorporates the HDFS client library into its address space. The client library manages all communication from the application to the NameNode and the DataNode. An HDFS cluster consists of a single NameNode—a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per computer node in the cluster, which manage storage attached to the nodes that they run on.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

3.1       HDFS Files

There is a distinction between an HDFS file and a native (Linux) file on the host computer. A computer in an HDFS installation is (typically) allocated to one NameNode or one DataNode. Each computer has its own file system and information about an HDFS file—the metadata—is managed by the NameNode and persistent information is stored in the NameNode’s host file system. The information contained in an HDFS file is managed by a DataNode and stored on the DataNode’s host computer file system.

HDFS exposes a file system namespace and allows user data to be stored in HDFS files. An HDFS file consists of a number of blocks. Each block is typically 64MByes. Each block is replicated some specified number of times. The replicas of the blocks are stored on different DataNodes chosen to reflect loading on a DataNode as well as to provide both speed in transfer and resiliency in case of failure of a rack. See Block Allocation for a description of the allocation algorithm.

A standard directory structure is used in HDFS. That is, HDFS files exist in directories that may in turn be sub-directories of other directories, and so on. There is no concept of a current directory within HDFS. HDFS files are referred to by their fully qualified name which is a parameter of many of the elements of the interaction between the Client and the other elements of the HDFS architecture.        

The NameNode executes HDFS file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The list of HDFS files belonging to each block, the current location of the block replicas on the DataNodes, the state of the file, and the access control information is the metadata for the cluster and is managed by the NameNode.

The DataNodes are responsible for serving read and write requests from the HDFS file system’s clients. The DataNodes also perform block replica creation, deletion, and replication upon instruction from the NameNode. The DataNodes are the arbiter of the state of the replicates and they report this to the NameNode.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The client sends data directly to and reads directly from DataNodes so that client data never flows through the NameNode.

3.2       Block Allocation

Each block is replicated some number of times—the default replication factor for HDFS is three. When addBlock() is invoked, space is allocated for each replica. Each replica is allocated on a different DataNode. The algorithm for performing this allocation attempts to balance performance and reliability. This is done by considering the following factors:

·         The location of the DataNodes. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

·         For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of a node failure; therefore this co-location policy does not adversely impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node on some rack; the other two thirds of replicas are on distinct nodes one a different rack. This policy improves write performance without compromising data reliability or read performance.

The figure below shows how blocks are replicated on different DataNodes.

HDFS Block Replication

Blocks are linked to the file through INode. Each block is given a timestamp that is used to determine whether a replica is current. We discuss this further in the Block and Replica Management section.