Tuesday, August 29, 2017

Set and Get data from etcd using java api

pom.xml:


<dependency>
    <groupId>org.mousio</groupId>
    <artifactId>etcd4j</artifactId>
    <version>2.13.0</version>
</dependency>


java code:
import java.net.URI;
import mousio.etcd4j.EtcdClient;
import mousio.etcd4j.responses.EtcdKeysResponse;

// Connect to the local etcd endpoint
EtcdClient etcd = new EtcdClient(URI.create("http://localhost:2379"));

// Get the key "sai" and print its value
EtcdKeysResponse getResponse = etcd.get("sai").send().get();
System.out.println(getResponse.node.value);

// Put key "foo" = "bar" and print the value stored in the response
EtcdKeysResponse putResponse = etcd.put("foo", "bar").send().get();
System.out.println(putResponse.node.value);

Read data from ETCD using java api

pom.xml:

               
<dependency>
    <groupId>org.mousio</groupId>
    <artifactId>etcd4j</artifactId>
    <version>2.13.0</version>
</dependency>

java code to read data:

import java.net.URI;
import java.util.List;
import mousio.etcd4j.EtcdClient;
import mousio.etcd4j.promises.EtcdResponsePromise;
import mousio.etcd4j.responses.EtcdKeysResponse;
import mousio.etcd4j.responses.EtcdKeysResponse.EtcdNode;

// Read the directory /sai1/ram recursively and print every key/value pair under it
EtcdClient etcd = new EtcdClient(URI.create("http://localhost:2379"));
EtcdResponsePromise<EtcdKeysResponse> rs = etcd.getDir("/sai1/ram").recursive().send();
EtcdKeysResponse response1 = rs.get();
EtcdNode en = response1.node;
List<EtcdNode> len = en.getNodes();
for (EtcdNode etcdNode : len) {
    System.out.println("key=" + etcdNode.key + " : value=" + etcdNode.value);
}

Thursday, August 17, 2017

Datanode maintenance without triggering HDFS re-balancing



By default, if a DataNode does not send a heartbeat for 10.5 minutes [1], the NameNode marks it dead and the following steps are taken [2]:
* The NameNode determines which blocks were on the failed DataNode.
* The NameNode locates other DataNodes that hold copies of those blocks.
* The DataNodes holding copies are instructed to replicate those blocks to other DataNodes, so the configured replication factor (replica=3, as mentioned above) is maintained. Depending on how many replica blocks were held on that DataNode, this can generate a large amount of re-replication traffic.

The 10.5-minute timeout works out to 2 * dfs.namenode.heartbeat.recheck-interval (300000 ms by default) + 10 * dfs.heartbeat.interval (3 s by default) = 630 s. You can increase dfs.namenode.heartbeat.recheck-interval in the "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml" in Cloudera Manager and then rolling restart the NameNodes for the change to take effect (an example snippet is shown after the links below). Increasing this value is not usually recommended, as it adds risk to maintaining the replica count while a DataNode is not available.

A better solution would be the HDFS Maintenance State [3], but for this you would need to upgrade your CDH version, as the feature was introduced in CDH 5.11.

I would also recommend returning the initial node back to service prior to moving on to the next node, to limit the amount of cluster I/O and network I/O spent replicating blocks from the node in maintenance.

LINKS:
[1] dfs.namenode.stale.datanode.interval => Default time interval for marking a DataNode as "stale", i.e., if the NameNode has not received a heartbeat message from a DataNode for more than this interval, the DataNode is marked and treated as "stale". The stale interval cannot be too small, since otherwise it may cause too-frequent changes of stale state; a minimum stale interval (3 times the heartbeat interval by default) is therefore enforced. A stale DataNode is avoided during lease/block recovery, and can be conditionally avoided for reads (see dfs.namenode.avoid.read.stale.datanode) and for writes (see dfs.namenode.avoid.write.stale.datanode).
[2] https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_dn.html#concept_y12_knh_m4
[3] https://blog.cloudera.com/blog/2017/05/hdfs-maintenance-state/
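For reference, a rough sketch of what the safety-valve snippet could look like; the 1800000 ms value below is purely an illustrative placeholder, not a recommendation:

<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <!-- example value only: 1800000 ms (30 minutes); the default is 300000 ms (5 minutes) -->
  <value>1800000</value>
</property>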

Wednesday, August 16, 2017

Procedure to Stop and Start MariaDB Galera cluster


a. Determine which MariaDB instance (cw01-03) has the most advanced node state (i.e. the largest seqno value) and start this node first
i. For each node in the Galera cluster (cw01-03), get the value of the attribute seqno from the file /var/lib/mysql/grastate.dat (see the example commands after this procedure)
ii. If no file has seqno > 0:
1. For each node in the Galera cluster, run the following command (as root) to get the last valid sequence:
#  mysqld_safe --wsrep-recover
2. Locate the value of the parameter seqno in the output
iii. Determine the node with the largest seqno value

b. Start/bootstrap the ‘mysql’ service on the node with the most advanced state (i.e. the largest seqno value)
#  Log onto CW0?
#  mv /etc/init.d/mysql.ORIG  /etc/init.d/mysql
#  service  mysql  bootstrap

c. Start the ‘mysql’ service for the other 2 CWs
#  Log onto CW0?
#  mv /etc/init.d/mysql.ORIG  /etc/init.d/mysql
#  service  mysql  start
#
#  Log onto CW0?
#  mv /etc/init.d/mysql.ORIG  /etc/init.d/mysql
#  service mysql  start
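
A minimal sketch of the seqno check from step a, run as root on each of the three nodes; mysqld_safe --wsrep-recover normally reports the recovered seqno in a "Recovered position" line of its output/log:

#  Check the saved state on each Galera node (cw01-03)
#  grep seqno /var/lib/mysql/grastate.dat
#
#  Only if no node shows seqno > 0: recover the last committed seqno
#  mysqld_safe --wsrep-recover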

Monday, August 14, 2017

Procedure to update Journal node timeout


1. Add the property below to the hdfs-site.xml file on the hosts where the NameNode is installed (usually core worker 1 and core worker 2)

<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>30000</value>
</property>

2. Restart the Standby NameNode; it should come back up and running
service hadoop-hdfs-namenode stop
service hadoop-hdfs-namenode start
service hadoop-hdfs-namenode status

3. Restart the Active NameNode once the Standby is running
service hadoop-hdfs-namenode stop
service hadoop-hdfs-namenode start
service hadoop-hdfs-namenode status
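
Optionally, confirm which NameNode is Active and which is Standby after each restart. A sketch, assuming the NameNode IDs are nn1 and nn2 (the real IDs are listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2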

Monday, August 7, 2017

How to cleanup flume checkpoints

Instructions on how to clean up Flume checkpoints:

1. stop Flume
2. check the process and make sure it is stopped (ps -eaf | grep flume)
3. clean the logs or move them to a backup-log directory
4. delete the checkpoint files from the /data/dcm0[1-5]/flume directories; run the following from inside each directory (a per-directory sketch is shown after this list)
                find . -type f -print | xargs rm -rf
5. start the Flume service
6. monitor the Flume logs
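
A minimal sketch of step 4, assuming the checkpoint directories are /data/dcm01/flume through /data/dcm05/flume and that Flume is already stopped (step 2):

# remove all checkpoint files under each Flume data directory
for d in /data/dcm0{1..5}/flume; do
    find "$d" -type f -print | xargs rm -f
done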