Sangala Shekhar Reddy: Hadoop NameNode down or crash randomly

Monday, July 31, 2017

Hadoop NameNode down or crash randomly

Error/Symptoms:
2017-07-27 18:50:32,405 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 5605ms to send a batch of 12 edits (841 bytes) to remote journal 206.46.37.113:8485
2017-07-27 18:50:38,693 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 6212ms to send a batch of 18 edits (1245 bytes) to remote journal 206.46.37.113:8485
2017-07-27 18:50:48,318 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 9492ms to send a batch of 4 edits (1049 bytes) to remote journal 206.46.37.112:8485
2017-07-27 18:51:16,909 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 11283ms to send a batch of 8 edits (1672 bytes) to remote journal 206.46.37.112:8485
2017-07-27 18:51:22,765 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 206.46.37.114:8485 failed to write txns 32683796-32683796. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 34 is less than the last promised epoch 35
…..
…..
2017-07-27 18:51:30,993 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [206.46.37.112:8485, 206.46.37.113:8485, 206.46.37.114:8485], stream=QuorumOutputStream starting at txid 32683792))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
206.46.37.114:8485: IPC's epoch 34 is less than the last promised epoch 35
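
To see how slow the journal writes have been, the "Took ...ms to send a batch" warnings can be pulled out of the NameNode log and sorted. This is only a minimal sketch: it assumes each warning sits on a single line in the log file, and the function name slow_journal_writes is made up for illustration, not part of any Hadoop tool.

```shell
# Read a NameNode log on standard input and print
# "<milliseconds> <journal host:port>" for every slow-write warning,
# slowest first. Lines that are not slow-write warnings are ignored.
slow_journal_writes() {
  sed -n 's/.*QuorumJournalManager: Took \([0-9][0-9]*\)ms to send a batch of .* to remote journal \([^ ]*\).*/\1 \2/p' |
    sort -rn
}
```

Usage would look like `slow_journal_writes < namenode.log`, where namenode.log stands for wherever your NameNode log lives. Durations that creep toward the write timeout are the ones to watch.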

Workaround:
dfs.qjournal.write-txns.timeout.ms is the write timeout, in milliseconds, used when writing to a quorum of remote journals. Its default value is 20000.

Increase dfs.qjournal.write-txns.timeout.ms from the default 20000 to 30000.
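
Before and after the change, it can help to confirm which value the node actually resolves. The sketch below relies on the `hdfs getconf -confKey` command; the two function names are hypothetical helpers, and the check assumes the command prints a plain integer.

```shell
# Print the write timeout (in ms) that the local Hadoop configuration
# resolves to, using the standard "hdfs getconf" command.
effective_write_timeout() {
  hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms
}

# Succeed only when the effective timeout is at least the value
# given as the first argument (in ms).
write_timeout_at_least() {
  current=$(effective_write_timeout) || return 1
  [ "$current" -ge "$1" ]
}
```

For example, `write_timeout_at_least 30000 && echo "timeout raised"` should print the message only once the new value has been picked up from hdfs-site.xml on that host.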

Procedure to apply the dfs.qjournal.write-txns.timeout.ms property:
1. Add the property below to the hdfs-site.xml file on each host where a NameNode is installed (mostly core worker 1 and core worker 2).

<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>30000</value>
</property>

2. Restart the Standby NameNode. It should come back up and be running.
service hadoop-hdfs-namenode stop
service hadoop-hdfs-namenode start
service hadoop-hdfs-namenode status

3. Restart the Active NameNode once the Standby is running.
service hadoop-hdfs-namenode stop
service hadoop-hdfs-namenode start
service hadoop-hdfs-namenode status
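
Since the order of the restarts matters, a small guard can check the HA state of the other NameNode before the active one is touched. This is a sketch under assumptions: it uses the `hdfs haadmin -getServiceState` command, the function name namenode_state_is is invented here, and nn1/nn2 in the usage line are placeholder NameNode IDs (yours are listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml).

```shell
# Succeed only when the NameNode with the given ID (first argument)
# reports the expected HA state (second argument: active or standby).
namenode_state_is() {
  state=$(hdfs haadmin -getServiceState "$1") || return 1
  [ "$state" = "$2" ]
}
```

On the active node, step 3 could then be guarded as `namenode_state_is nn2 standby && service hadoop-hdfs-namenode stop`, so the active NameNode is only stopped when the other one answers as standby.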
