Wednesday, November 15, 2017

How many blocks are allocated for a file in HDFS (Hadoop)

Run the command below to find out how many blocks are allocated for a file in HDFS. In this example, the HDFS block size is 64 MB.

$ sudo -u hdfs hdfs fsck /path/filename -files -blocks

Example: the file /shekhar/tab4.csv is 320.1 MB in HDFS.
$ sudo -u hdfs hdfs fsck /shekhar/tab4.csv -files -blocks

[root@shekhar-server2 tmp]# sudo -u hdfs hdfs fsck /shekhar/tab4.csv -files -blocks
Connecting to namenode via http://shekhar-server2.openstacklocal:50070
FSCK started by hdfs (auth:SIMPLE) from /10.194.10.14 for path /shekhar/tab4.csv at Thu Nov 16 05:40:35 IST 2017
/shekhar/tab4.csv 335600820 bytes, 6 block(s):  OK
0. BP-1971872654-10.194.10.14-1504721645808:blk_1073752536_11792 len=67108864 repl=3
1. BP-1971872654-10.194.10.14-1504721645808:blk_1073752537_11793 len=67108864 repl=3
2. BP-1971872654-10.194.10.14-1504721645808:blk_1073752538_11794 len=67108864 repl=3
3. BP-1971872654-10.194.10.14-1504721645808:blk_1073752539_11795 len=67108864 repl=3
4. BP-1971872654-10.194.10.14-1504721645808:blk_1073752540_11796 len=67108864 repl=3
5. BP-1971872654-10.194.10.14-1504721645808:blk_1073752541_11797 len=56500 repl=3

Status: HEALTHY
 Total size:    335600820 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      6 (avg. block size 55933470 B)
 Minimally replicated blocks:   6 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Thu Nov 16 05:40:35 IST 2017 in 1 milliseconds

The filesystem under path '/shekhar/tab4.csv' is HEALTHY
[root@shekhar-server2 tmp]#
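
The block count follows directly from the numbers above: with a 64 MB (67,108,864-byte) block size, a 335,600,820-byte file needs ceil(335,600,820 / 67,108,864) = 6 blocks. That is five full blocks of 67,108,864 bytes each (5 x 67,108,864 = 335,544,320 bytes) plus one final block holding the remaining 335,600,820 - 335,544,320 = 56,500 bytes, which matches the len= values in the fsck output.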


Tuesday, November 14, 2017

Java Volatile versus synchronized

Volatile variables

If a volatile variable is updated such that, under the hood, the value is read, modified, and then assigned a new value, the result is a non-thread-safe operation performed between two synchronized operations. You can then decide whether to use explicit synchronization or to rely on the JRE's support for automatically synchronizing volatile variables. The better approach depends on your use case: if the value assigned to the volatile variable depends on its current value (as in an increment operation), then you must use synchronization if you want the operation to be thread safe.


To fully understand what the volatile keyword does, it's first helpful to understand how threads treat non-volatile variables.
In order to enhance performance, the Java language specification permits the JRE to maintain a local copy of a variable in each thread that references it. You could consider these "thread-local" copies of variables to be similar to a cache, helping the thread avoid checking main memory each time it needs to access the variable's value.
But consider what happens in the following scenario: two threads start; the first reads variable A as 5 and caches it locally, and then A changes from 5 to 10, so the second thread reads A as 10. The first thread is not aware of the change, so it keeps working with a stale value of A. If variable A were marked volatile, however, then any time a thread read A it would go back to the master copy and read its current value.
If the variables in your applications are not going to change, then a thread-local cache makes sense. Otherwise, it's very helpful to know what the volatile keyword can do for you.
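
As a minimal sketch of this visibility guarantee (the class and field names below are illustrative, not from the original post), a volatile boolean is the classic way to signal a worker thread to stop:

public class StopFlagDemo {
    // Without volatile, the worker could keep using a cached copy of this
    // flag and might never observe the main thread's write.
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // Each check re-reads the master copy because the field is volatile.
            while (running) {
                // busy work
            }
            System.out.println("Worker observed the update and stopped.");
        });
        worker.start();

        Thread.sleep(1000);
        running = false; // this write becomes visible to the worker
        worker.join();
    }
}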
Volatile versus synchronized:
If a variable is declared as volatile, it means that it is expected to be modified by multiple threads. Naturally, you would expect the JRE to impose some form of synchronization for volatile variables. As luck would have it, the JRE does implicitly provide synchronization when accessing volatile variables, but with one very big caveat: reading a volatile variable is synchronized and writing to a volatile variable is synchronized, but non-atomic operations are not.
What this means is that the following code is not thread safe:
myVolatileVar++;
Conceptually, the previous statement behaves as follows (here lock stands in for the implicit synchronization the JRE applies to volatile reads and writes; it is not part of the original one-line snippet):

int temp = 0;

synchronized (lock) {   // the volatile read is synchronized
  temp = myVolatileVar;
}

temp++;                 // but the increment itself happens outside any lock

synchronized (lock) {   // the volatile write is synchronized
  myVolatileVar = temp;
}
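
If you do need the increment itself to be thread safe, guard the whole read-modify-write with a single monitor instead; here is a minimal sketch (the Counter class is illustrative). Once the operation is synchronized, the field no longer needs to be volatile, because synchronized provides both atomicity and visibility:

public class Counter {
    private int count = 0;

    // The entire read-modify-write runs while holding the monitor, so
    // concurrent increments cannot interleave and lose updates.
    public synchronized void increment() {
        count++;
    }

    // Reads take the same monitor, which also guarantees they see the
    // latest write.
    public synchronized int get() {
        return count;
    }
}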


Hadoop YARN Resource Manager Tuning (Add a new queue to the YARN Resource Manager)

Why YARN queues are needed: if you have priority jobs and you don't want other jobs to affect their execution, create a new queue, give it an appropriate weight, and assign your priority jobs to it.

This is the procedure for adding a new queue to the YARN Resource Manager and assigning jobs to it.

Step 1: Add the lines below to /etc/hadoop/conf/fair-scheduler.xml on the core nodes where hadoop-yarn-resourcemanager is running.

<allocations>
  .....
  <queue name="newqueue">
    <maxRunningApps>50</maxRunningApps>
    <weight>6</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
  </queue>
  .....
</allocations>
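
If the Resource Manager does not pick up the file, check that yarn-site.xml points at it. The property name below is the standard Fair Scheduler setting; the path is simply the one used in this post:

<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>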



Step 2: Restart the hadoop-yarn-resourcemanager service on all applicable nodes:
service hadoop-yarn-resourcemanager restart

Step 3 (optional): A <minResources> element, for example <minResources>10000 mb,10 vcores</minResources>, can also be added to the queue definition to set the minimum resources guaranteed to this queue. These resources cannot be used by other queues.
Step 4: For Hive jobs, set the property below so they use the new queue:
set mapreduce.job.queuename=newqueue;
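
For plain MapReduce jobs, the same property can be passed on the command line, assuming the job's driver uses ToolRunner so that generic -D options are parsed (the jar, class, and paths here are hypothetical):

hadoop jar my-job.jar com.example.MyJob -Dmapreduce.job.queuename=newqueue /input /output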

You can then see the new queue in the Resource Manager web UI (the scheduler page, typically http://<resourcemanager-host>:8088/cluster/scheduler).

Tuesday, November 7, 2017

Procedure to change ZooKeeper dataDir and dataLogDir

Symptoms: if you see "fsync write ahead log took long time" messages in the ZooKeeper logs, or ZooKeeper client timeouts in the logs of services such as ZKFC, NameNode, Hive, or JournalNode, you should put dataDir on a dedicated disk for better ZooKeeper performance.

Steps to change the ZooKeeper directories:

1. Shut down one ZooKeeper server at a time.

2. Copy the contents of the current dataDir and dataLogDir into the new directory paths (see the sketch after this list).

3. Ensure that the ownership of the copied contents is still zookeeper:zookeeper.

4. Update the dataDir and dataLogDir paths in the /etc/zookeeper/conf/zoo.cfg file:
vi /etc/zookeeper/conf/zoo.cfg
dataDir=
dataLogDir=

5. Start only the Zookeeper that was stopped in step 1.

6. Wait for it to become a "follower". For example, check the mode with:
# echo stat | nc localhost 2181 | grep Mode
Mode: follower
7. Repeat these steps for the other two ZooKeeper hosts.
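
A hedged sketch of steps 2 through 4, using illustrative paths that are not from the original post (adjust them to your own layout):

# Step 2: copy the current contents into the new locations
cp -a /var/lib/zookeeper/data/. /data1/zookeeper/
cp -a /var/lib/zookeeper/datalog/. /data2/zookeeper/

# Step 3: ensure ownership is still zookeeper:zookeeper
chown -R zookeeper:zookeeper /data1/zookeeper /data2/zookeeper

The corresponding zoo.cfg entries for step 4 would then be:
dataDir=/data1/zookeeper
dataLogDir=/data2/zookeeper

Putting dataLogDir on its own disk (/data2 here) is what addresses the "fsync took long time" symptom, since the transaction log is the fsync-heavy component.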