Cause
DataNode decommissioning is the act of preventing new data from being written to a DataNode while safely moving
its current data to other DataNodes within the cluster.
This article contains solutions for common problems with DataNode decommissioning:
- Low bandwidth
- Long delays when decommissioning open files
- Hangs when decommissioning files whose replication factor exceeds the number of remaining nodes in the cluster
Each of these problems and its solution is covered below.
Decommissioning takes a long time, while minimal bandwidth is being used.
- The bandwidth is limited by default in order to prevent decommissioning from overloading DataNodes.
- Cloudera recommends changing the defaults to allow for a faster decommission.
Solution: Increase default bandwidth for decommissioning
- This requires higher memory usage at the DataNodes due to tracking more parallel block copies.
- Increasing the decommissioning bandwidth also requires changing the configuration and restarting HDFS.
If using Cloudera Manager:
- Navigate to: HDFS > Configuration (View/Edit).
- Using the left-sidebar, navigate to: NameNode > Advanced > "NameNode Configuration Safety Valve for hdfs-site.xml" and paste in the XML properties shown below.
- Using the left-sidebar, navigate to: DataNode > Advanced > "Java Heap Size of DataNode in Bytes" and make sure the value is set to at least 4 GB.
- Save the changes.
- Restart HDFS (NameNodes and DataNodes) to put the change into effect (a rolling restart may also be performed to reduce downtime).
If not using Cloudera Manager:
- Add the following XML properties to the NameNodes' hdfs-site.xml and save the changes.
- In each DataNode's hadoop-env.sh file, ensure that the HADOOP_DATANODE_OPTS or HADOOP_HEAPSIZE setting provides at least 4 GB of heap. If it does not, add "-Xmx4g" to the end of the HADOOP_DATANODE_OPTS value and save the change (a sketch appears after the XML properties below). This should be done on every DataNode in the cluster.
- Restart the HDFS cluster to put the change into effect (a rolling restart can be performed to reduce downtime).
  <property>
    <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
    <value>10</value>
  </property>
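A minimal sketch of the heap setting described in the hadoop-env.sh step above, assuming a stock hadoop-env.sh; the existing options on your cluster will differ:
  # hadoop-env.sh on each DataNode: append -Xmx4g so the DataNode JVM gets at least a 4 GB heap
  export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -Xmx4g"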
Open Files
Because writes to a DataNode do not involve the NameNode, blocks that belong to open files on a decommissioning DataNode are not relocated until those files are closed.
This commonly occurs with:
- Clusters with HBase
- Open Flume files
- Long running tasks
Solution: Locate and close any open files.
- Find the NameNode log directory:
  cd /var/hadoop-hdfs/
- Verify that the logs provide the needed information:
  grep "Is current datanode" *NAME* | head
- The sixth column shows the block id, and the message should be relevant to DataNode decommissioning:
  grep "Is current datanode" *NAME* | awk '{print $6}' | sort -u > blocks.open
- Provide a list of open files, their blocks, and the locations of those blocks:
  hadoop fsck / -files -blocks -locations -openforwrite 2>&1 > openfiles.out
- Check openfiles.out for the blocks listed in blocks.open, and verify that the DataNode IP address is correct.
- Now that the names of the open files are known, take the appropriate action to restart the processes that hold them open so the files are closed. For example, a major compaction will close all files in a region for HBase (see the sketch after this list).
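To illustrate the HBase case, a major compaction can be requested from the HBase shell. This is a hedged sketch only; 'my_table' is a placeholder for whichever table holds the open files:
  # Placeholder table name; trigger a major compaction so HBase closes its open files
  echo "major_compact 'my_table'" | hbase shell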
A block cannot be relocated because there are not enough nodes to satisfy the block placement policy
For example, on a 10-node cluster, if mapred.submit.replication is left at its default of 10 while one
DataNode is being decommissioned, there will be difficulty relocating blocks associated with MapReduce jobs.
This condition leads to errors similar to the following in the NameNode logs:
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 3 to reach 3
Solution: Find the files whose blocks have a replication factor higher than the number of DataNodes remaining in the cluster.
Use the following steps to find the number of files with a replication factor equal to or above your current cluster size:
- Provide a listing of open files, their blocks, and the locations of those blocks.
- You can reuse the output from a previous fsck for this solution:
  hadoop fsck / -files -blocks -locations -openforwrite 2>&1 > openfiles.out
- The following command will give a list of how many files have a given replication factor:
  grep repl= openfiles.out | awk '{print $NF}' | sort | uniq -c
- Using the example of 10 nodes and decommissioning one:
  egrep -B4 "repl=10" openfiles.out | awk '/^\//{print $1}'
- Examine the paths, and decide whether to reduce the replication factor of the files (see the sketch after this list) or remove them from the cluster.
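If lowering the replication factor is the right call, the standard hadoop fs -setrep command can be used; the target factor and path below are placeholders:
  # Placeholder path and replication factor; -w waits until the new factor is applied
  hadoop fs -setrep -w 3 /path/to/file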
Shutting down the DataNode before decommissioning finishes
- HDFS is designed to be resilient to node/device failures. If data replication recommendations are followed, shutting down one DataNode will not affect the data within the cluster.
- Voluntarily removing a DataNode does create a window of time during which replication is lower than recommended, increasing the potential risk of data loss.
Solution: Determine whether any files are at risk of data loss before abruptly shutting down the DataNode.
- List the open files:
  hadoop fsck / -files -blocks -locations -openforwrite 2>&1 > openfiles.out
- Examine the paths and decide whether to remove them from the cluster:
  egrep -B4 "repl=1 " openfiles.out | awk '/^\//{print $1}'
- Look at the listing of files to see if they have blocks located on the DataNodes being decommissioned (a hedged grep example follows this list).
- Decide whether those files are required.
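A quick, hedged way to see whether any of those blocks live on the node you are about to shut down is to search the fsck output for that DataNode's IP address; the address below is a placeholder:
  # Placeholder IP of the DataNode being decommissioned
  grep "10.0.0.15" openfiles.out | head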