Hadoop Distributed File System (HDFS) Architectural Documentation - Module View
In this section we describe the modular structure of the HDFS, version 0.21.
This structure is derived from a static
analysis of the source code, specifically focusing on the
dependencies between classes and groups of classes (modules).
This structure is shown in the figure below.
As software is developed over time, its source code inevitably drifts away from its initial "as-designed" structure. This section identifies four signals, characteristics of the code, that suggest that the source code has evolved to a less modular structure:
When these signals are applied to the
module structure, it appears that the HDFS could be made
more modular. Each of these refactoring opportunities is now
discussed, presented per module.
Because the hdfs package now contains
code that is used by both the client and the server, the
package hdfs should be split into two: hdfs.client and
hdfs.common. The hdfs.common package can contain all code
that is shared by both the client and server modules, while
the client would contain just the code necessary for the
client. This division could look as follows:
hdfs.client | hdfs.common
BlockMissingException.java | BlockReader.java
Currently, the default port numbers that the NameNode and DataNode run with are stored in the server.namenode and server.datanode packages respectively. If they were stored in hdfs.common instead, servers that want to communicate with either the namenode or datanode server would not need a dependency on that server's package.
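As a minimal sketch of this idea, the default ports could live in one shared constants class that any caller imports without touching the server packages. The class name HdfsCommonPorts, the helper nameNodeAddress, and the port values are assumptions for illustration, not code from the HDFS source tree.

```java
import java.net.InetSocketAddress;

// Hypothetical shared constants class; under the proposal above it would
// live in hdfs.common rather than in server.namenode or server.datanode.
final class HdfsCommonPorts {
    // Illustrative values only; verify them against the release in use.
    static final int NAMENODE_DEFAULT_PORT = 8020;
    static final int DATANODE_DEFAULT_PORT = 50010;

    private HdfsCommonPorts() {}

    // Builds an unresolved namenode address, falling back to the default
    // port when the caller passes a non-positive port number.
    static InetSocketAddress nameNodeAddress(String host, int port) {
        int effectivePort = port > 0 ? port : NAMENODE_DEFAULT_PORT;
        return InetSocketAddress.createUnresolved(host, effectivePort);
    }
}
```

A caller such as the balancer could then resolve a namenode address through HdfsCommonPorts alone, with no import from the namenode package.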
security.token.delegation.DelegationTokenSecretManager depends on server.namenode.FSNameSystem, even though the security code is used by servers other than the namenode. This could be refactored so that FSNameSystem is called from the namenode rather than from the security module, removing a dependency.
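One common way to remove a dependency like this is to invert it: the security module declares a small callback interface, and the namenode implements it. The sketch below is only an illustration of that pattern. All names in it (KeyUpdateListener, SecretManagerSketch, NamesystemSketch) are hypothetical, and key-update notification is an assumed example of what the security code might need from the namenode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface, owned by the security module.
interface KeyUpdateListener {
    void onMasterKeyUpdated(int keyId);
}

// Simplified stand-in for the security-side secret manager. It only knows
// the interface above and has no reference to any namenode class.
final class SecretManagerSketch {
    private final KeyUpdateListener listener;
    private int currentKeyId = 0;

    SecretManagerSketch(KeyUpdateListener listener) {
        this.listener = listener;
    }

    // Rolls to a new key id and reports it through the callback.
    int rollMasterKey() {
        currentKeyId++;
        listener.onMasterKeyUpdated(currentKeyId);
        return currentKeyId;
    }
}

// Simplified stand-in for the namenode side, which implements the callback
// and records the updates it receives.
final class NamesystemSketch implements KeyUpdateListener {
    final List<Integer> journal = new ArrayList<>();

    @Override
    public void onMasterKeyUpdated(int keyId) {
        journal.add(keyId);
    }
}
```

With this arrangement the compile-time dependency points from the namenode to the security module only, so other servers can reuse the security code by supplying their own listener.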
The class BlockListAsLongs depends on
ReplicaInfo in the server.datanode module, which looks like
an unhealthy dependency, given that hdfs.protocol is used by
all servers rather than just the datanode server. Building
the block list is a task that is better performed in the
hdfs.server.datanode module.
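The split suggested here could look roughly like the sketch below: the protocol-side class only wraps an array of longs, while a datanode-side builder converts replica records into that array. The class names are hypothetical, and the encoding of three longs per block (id, length, generation stamp) is an assumption made for the example.

```java
import java.util.List;

// Protocol side: a plain wrapper around longs, unaware of datanode classes.
final class BlockListSketch {
    static final int LONGS_PER_BLOCK = 3; // assumed: id, length, generation stamp

    private final long[] encoded;

    BlockListSketch(long[] encoded) {
        if (encoded.length % LONGS_PER_BLOCK != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + LONGS_PER_BLOCK);
        }
        this.encoded = encoded.clone();
    }

    int getNumberOfBlocks() { return encoded.length / LONGS_PER_BLOCK; }
    long getBlockId(int i) { return encoded[i * LONGS_PER_BLOCK]; }
    long getBlockLen(int i) { return encoded[i * LONGS_PER_BLOCK + 1]; }
    long getGenerationStamp(int i) { return encoded[i * LONGS_PER_BLOCK + 2]; }
}

// Datanode side: hypothetical stand-in for the replica record.
final class ReplicaSketch {
    final long blockId;
    final long numBytes;
    final long generationStamp;

    ReplicaSketch(long blockId, long numBytes, long generationStamp) {
        this.blockId = blockId;
        this.numBytes = numBytes;
        this.generationStamp = generationStamp;
    }
}

// Datanode side: the only place that knows both the replica type and the
// encoding, so the dependency points from datanode to protocol.
final class DatanodeBlockListBuilder {
    static BlockListSketch build(List<ReplicaSketch> replicas) {
        long[] out = new long[replicas.size() * BlockListSketch.LONGS_PER_BLOCK];
        int i = 0;
        for (ReplicaSketch r : replicas) {
            out[i++] = r.blockId;
            out[i++] = r.numBytes;
            out[i++] = r.generationStamp;
        }
        return new BlockListSketch(out);
    }
}
```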
The server.protocol package depends on
two classes in server.common that protocol just uses for
defined constants. It seems that it would be better to store
these constants in the server.protocol package, as they
(proven by their use in server.protocol) define the
communication between servers. There are also dependencies
from hdfs.server.protocol to hdfs.server.datanode (in
protocol.DataNodeRegistration) and hdfs.server.namenode (in
protocol.CheckpointCommand). These dependencies exist
because hdfs.server.protocol contains code to fill its
protocol messages from these classes. Making datanode and namenode themselves responsible for filling in the protocol messages would remove the dependencies from protocol on these classes.
IncorrectVersionException and
InconsistentFSStateException would probably fit better in
server.protocol. JspHelper depends on namenode; the function
that uses it (JspHelper.sortNodeList) can be moved to the
namenode package, since it's not relevant for other servers.
server.namenode depends on hdfs.DFSClient to create servlets. It appears that this code could be refactored and moved into hdfs.common. The class namenode.FSNameSystem is involved in multiple cyclic dependencies. It has direct cyclic dependencies with namenode.NameNode, namenode.FSNameSystemMetrics and namenode.LeaseManager, and there are indirect cyclic dependencies involving more classes (for example UpgradeObjectNamenode and UpgradeManagerNamenode).
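The difference between direct and indirect cyclic dependencies can be made concrete with a small reachability check over a class dependency graph. The sketch below is not the analysis tool used for this document; the class DependencyCycles and its methods are hypothetical, and the graph is assumed to be given as a map from each class name to the set of classes it depends on.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DependencyCycles {
    private DependencyCycles() {}

    // True if 'to' can be reached from 'from' by following one or more
    // dependency edges (iterative depth-first search).
    static boolean reaches(Map<String, Set<String>> deps, String from, String to) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(from);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (!seen.add(current)) {
                continue; // already expanded
            }
            for (String next : deps.getOrDefault(current, Collections.emptySet())) {
                if (next.equals(to)) {
                    return true;
                }
                stack.push(next);
            }
        }
        return false;
    }

    // Direct cycle: a depends on b and b depends on a.
    static boolean directCycle(Map<String, Set<String>> deps, String a, String b) {
        return deps.getOrDefault(a, Collections.emptySet()).contains(b)
            && deps.getOrDefault(b, Collections.emptySet()).contains(a);
    }

    // Any cycle, direct or indirect: the class can reach itself again.
    static boolean inCycle(Map<String, Set<String>> deps, String a) {
        return reaches(deps, a, a);
    }
}
```

In such a graph, a pair like FSNameSystem and LeaseManager would satisfy directCycle, whereas a class that only lies on a longer loop back to FSNameSystem would be reported by inCycle alone.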
server.datanode also depends on hdfs.DFSClient. Putting this code in common would be a good refactoring opportunity.
server.balancer also depends on hdfs.DFSClient; again, input from the community is greatly appreciated on whether putting this code in common would be a good refactoring opportunity. Another refactoring possibility for the balancer is to remove its dependency on the namenode. The classes from namenode that the balancer depends on are:
Removing the dependency on namenode would
make balancer a fully separate server, and would allow it to
perform at the same level as datanode and namenode. The class server.balancer.Balancer contains several cyclic dependencies, but they are all between classes in the same source file. This means the effect of the dependencies is likely less severe, but refactoring the dependency structure of this class could still be an opportunity to increase modularity.
tools consists of a couple of different components that have low coupling between them. But because they all provide functionality that falls somewhat outside the main domain of a filesystem (debugging and administrative tools), it makes sense to keep them together in one package for the user's convenience.