Hadoop Distributed File System (HDFS) Architectural Documentation - Module View

 

1        Module View

In this section we describe the modular structure of the HDFS, version 0.21.

This structure is derived from a static analysis of the source code, specifically focusing on the dependencies between classes and groups of classes (modules). This structure is shown in the figure below.

[Module view]

1.1      Modularity Risks

As software is developed over time, all source code inevitably drifts away from its initial "as-designed" structure. This section identifies four signals (characteristics of the code) that suggest that the source code has evolved to a less modular structure.

When these signals are applied to the module structure, it appears that the HDFS could be made more modular. Each of these refactoring opportunities is now discussed, presented per module.

1.1.1.1     hdfs

Because the hdfs package now contains code that is used by both the client and the server, it should be split into two packages: hdfs.client and hdfs.common. The hdfs.common package would contain all code shared by the client and server modules, while hdfs.client would contain only the code needed by the client. This division could look as follows:

hdfs.client:

  • BlockMissingException.java
  • DFSClient.java
  • DFSInputStream.java
  • DFSOutputStream.java
  • ByteRangeInputStream.java
  • HftpFileSystem.java
  • HsftpFileSystem.java

hdfs.common:

  • BlockReader.java
  • DeprecatedUTF8.java
  • DFSConfigKeys.java
  • DFSUtil.java
  • DistributedFileSystem.java
  • HdfsConfiguration.java
  • HDFSPolicyProvider.java

 

Currently, the default port numbers used by the NameNode and DataNode are stored in the server.namenode and server.datanode packages, respectively. If they were stored in hdfs.common instead, servers that need to communicate with either the namenode or the datanode would not need a dependency on that server's package.
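A minimal sketch of what shared port constants could look like is given here. The class names CommonPortsSketch and HdfsPorts, the parse helper, and the port values are all assumptions made for illustration; they are not taken from the HDFS source.

```java
import java.net.InetSocketAddress;

// Hypothetical sketch; none of these names exist in HDFS.
final class CommonPortsSketch {

    /** Constants that would live in hdfs.common, visible to clients and all servers. */
    static final class HdfsPorts {
        // Illustrative values, assumed for this example only.
        static final int NAMENODE_DEFAULT_PORT = 8020;
        static final int DATANODE_DEFAULT_PORT = 50010;

        private HdfsPorts() {
        }
    }

    /** Parses "host" or "host:port", falling back to the given default port. */
    static InetSocketAddress parse(String target, int defaultPort) {
        int colon = target.lastIndexOf(':');
        if (colon < 0) {
            return InetSocketAddress.createUnresolved(target, defaultPort);
        }
        String host = target.substring(0, colon);
        int port = Integer.parseInt(target.substring(colon + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

With this arrangement, a server such as the balancer could call parse(address, HdfsPorts.NAMENODE_DEFAULT_PORT) without importing anything from server.namenode.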

1.1.1.2     hdfs.security

security.token.delegation.DelegationTokenSecretManager depends on server.namenode.FSNameSystem, while the security code is also used by servers other than the namenode. This could be refactored so that FSNameSystem is called from within the namenode rather than from the security module, which removes the dependency of security on namenode.
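One common way to invert such a dependency is a callback interface owned by the security module and implemented on the namenode side. The sketch below assumes that approach; MasterKeyListener, SecretManager, NamesystemStub and their methods are hypothetical stand-ins, not the actual HDFS classes or their APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of dependency inversion between security and namenode.
final class SecurityCallbackSketch {

    /** Would live in the security package; it knows nothing about the namenode. */
    interface MasterKeyListener {
        void masterKeyUpdated(int keyId);
    }

    /** Stand-in for the secret manager in the security package. */
    static final class SecretManager {
        private final MasterKeyListener listener;
        private int currentKeyId = 0;

        SecretManager(MasterKeyListener listener) {
            this.listener = listener;
        }

        /** Creates a new key id and notifies whoever registered the listener. */
        int rollMasterKey() {
            currentKeyId++;
            listener.masterKeyUpdated(currentKeyId);
            return currentKeyId;
        }
    }

    /** Stand-in for the namenode side, which records the update in its own state. */
    static final class NamesystemStub implements MasterKeyListener {
        final List<Integer> loggedKeyIds = new ArrayList<>();

        @Override
        public void masterKeyUpdated(int keyId) {
            loggedKeyIds.add(keyId);
        }
    }
}
```

The security code now compiles without any namenode class on its classpath; the namenode passes itself in when it constructs the secret manager.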

1.1.1.3     hdfs.protocol

The class BlockListAsLongs depends on ReplicaInfo in the server.datanode module, which looks like an unhealthy dependency, given that hdfs.protocol is used by all servers rather than just the datanode server. Building the block list is a task that is better performed in the hdfs.server.datanode module.
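The split suggested above could be sketched as follows: the protocol side only wraps an array of longs, and the datanode side performs the encoding. The names BlockList, Replica and encode, as well as the three-longs-per-block layout, are assumptions made for this example and may not match the real classes.

```java
import java.util.List;

// Hypothetical sketch; BlockList and Replica stand in for the real classes.
final class BlockListSketch {

    /** Protocol side: wraps encoded longs and references no datanode types. */
    static final class BlockList {
        // Assumed layout for this sketch: id, length, generation stamp per block.
        static final int LONGS_PER_BLOCK = 3;
        private final long[] encoded;

        BlockList(long[] encoded) {
            if (encoded.length % LONGS_PER_BLOCK != 0) {
                throw new IllegalArgumentException("incomplete block entry");
            }
            this.encoded = encoded.clone();
        }

        int numberOfBlocks() {
            return encoded.length / LONGS_PER_BLOCK;
        }

        long blockId(int index) {
            return encoded[index * LONGS_PER_BLOCK];
        }

        long blockLength(int index) {
            return encoded[index * LONGS_PER_BLOCK + 1];
        }

        long generationStamp(int index) {
            return encoded[index * LONGS_PER_BLOCK + 2];
        }
    }

    /** Datanode side: simplified stand-in for the datanode's replica record. */
    static final class Replica {
        final long id;
        final long numBytes;
        final long generationStamp;

        Replica(long id, long numBytes, long generationStamp) {
            this.id = id;
            this.numBytes = numBytes;
            this.generationStamp = generationStamp;
        }
    }

    /** Datanode side: the only step that needs to know about replicas. */
    static BlockList encode(List<Replica> replicas) {
        long[] encoded = new long[replicas.size() * BlockList.LONGS_PER_BLOCK];
        int offset = 0;
        for (Replica replica : replicas) {
            encoded[offset++] = replica.id;
            encoded[offset++] = replica.numBytes;
            encoded[offset++] = replica.generationStamp;
        }
        return new BlockList(encoded);
    }
}
```

In this arrangement encode and Replica would sit in hdfs.server.datanode, while BlockList stays in hdfs.protocol and can be read by any server.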

1.1.1.4     hdfs.server.protocol

The server.protocol package depends on two classes in server.common, which it uses only for the constants they define. It would be better to store these constants in the server.protocol package, since their use there shows that they define the communication between servers. There are also dependencies from hdfs.server.protocol on hdfs.server.datanode (in protocol.DataNodeRegistration) and on hdfs.server.namenode (in protocol.CheckpointCommand). These dependencies exist because hdfs.server.protocol contains code that fills its protocol messages from classes in those packages. If datanode and namenode were themselves responsible for filling in the protocol messages, the dependencies of protocol on these packages would disappear.

1.1.1.5     server.common

IncorrectVersionException and InconsistentFSStateException would probably fit better in server.protocol. JspHelper depends on namenode; the method that introduces this dependency (JspHelper.sortNodeList) can be moved to the namenode package, since it is not relevant for other servers.

1.1.1.6     hdfs.server.namenode

server.namenode depends on hdfs.DFSClient to create servlets. It appears that this code could be moved into hdfs.common. The class namenode.FSNameSystem is involved in multiple cyclic dependencies. It has direct cyclic dependencies with namenode.NameNode, namenode.FSNameSystemMetrics and namenode.LeaseManager, and there are indirect cyclic dependencies involving more classes (for example UpgradeObjectNamenode and UpgradeManagerNamenode).
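To make the notion of direct and indirect cyclic dependencies concrete, the following sketch checks whether a class can reach itself by following dependency edges. It is an illustration only, not the tool used for the analysis in this document; the class name CycleCheckSketch and the method isOnCycle are hypothetical, and the dependency map is assumed to have been extracted from the source beforehand.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a cycle check over a class dependency graph.
final class CycleCheckSketch {

    /**
     * Returns true if 'start' is reachable from one of its own dependencies,
     * that is, if it takes part in a direct or indirect cyclic dependency.
     */
    static boolean isOnCycle(Map<String, Set<String>> deps, String start) {
        Deque<String> pending =
                new ArrayDeque<>(deps.getOrDefault(start, Collections.emptySet()));
        Set<String> seen = new HashSet<>();
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (current.equals(start)) {
                return true;
            }
            // Expand each class only once so the search terminates on cyclic graphs.
            if (seen.add(current)) {
                pending.addAll(deps.getOrDefault(current, Collections.emptySet()));
            }
        }
        return false;
    }
}
```

For a map in which FSNameSystem depends on LeaseManager and LeaseManager depends on FSNameSystem, isOnCycle reports a cycle for both classes. A longer path back to the starting class is detected in the same way, which corresponds to an indirect cycle.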

1.1.1.7     hdfs.server.datanode

server.datanode also depends on hdfs.DFSClient. Moving this code to hdfs.common would be a good refactoring opportunity.

1.1.1.8     hdfs.server.balancer

server.balancer also depends on hdfs.DFSClient. Input from the community on whether moving this code to hdfs.common would be a good refactoring opportunity is greatly appreciated. Another refactoring possibility for the balancer is to remove its dependency on the namenode. The classes from namenode that the balancer depends on are:

  • namenode.UnsupportedActionException, which could be moved to protocol, since it's a shared message between namenode and balancer
  • namenode.NameNode, on which it only depends to get the namenode's port number, which could be stored in the common package.
  • namenode.BlockPlacementPolicy, on which it depends to check if the block placement policy of the balancer matches the policy of the namenode. This check could be done through a protocol message in server.protocol as well.
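The last item could be sketched as follows, assuming the namenode reports the name of its block placement policy through a protocol call. The interface NamenodeInfo, its method blockPlacementPolicyName, and the BalancerCheck class are hypothetical; no such protocol message is described in the source.

```java
// Hypothetical sketch of a policy check done through a protocol call.
final class PolicyCheckSketch {

    /** Assumed addition to a server-to-server protocol interface. */
    interface NamenodeInfo {
        String blockPlacementPolicyName();
    }

    /** Balancer side: compares policy names without importing namenode classes. */
    static final class BalancerCheck {
        private final String expectedPolicy;

        BalancerCheck(String expectedPolicy) {
            this.expectedPolicy = expectedPolicy;
        }

        /** Throws if the namenode reports a different policy than the balancer expects. */
        void checkCompatible(NamenodeInfo namenode) {
            String actual = namenode.blockPlacementPolicyName();
            if (!expectedPolicy.equals(actual)) {
                throw new IllegalStateException(
                        "balancer expects policy " + expectedPolicy + " but namenode uses " + actual);
            }
        }
    }
}
```

The balancer would obtain a NamenodeInfo proxy over RPC and call checkCompatible once at startup, so the comparison relies only on types in the protocol package.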

Removing the dependency on namenode would make the balancer a fully separate server, and would allow it to operate at the same level as datanode and namenode. The class server.balancer.Balancer contains several cyclic dependencies, but they are all between classes in the same source file. This means the effect of the dependencies is likely less severe, but refactoring the dependency structure of this class could still be an opportunity to increase modularity.

1.1.1.9     hdfs.tools

tools consists of a couple of different components that have low coupling between them. But because they all provide functionality that falls somewhat outside the main domain of a filesystem (debugging and administrative tools), it makes sense to keep them together in one package for the user's convenience.