Wednesday, November 15, 2017

How many blocks are allocated for a file in HDFS (Hadoop)

Run the command below to find out how many blocks are allocated for a file in HDFS. In this example the HDFS block size is 64 MB.

$ sudo -u hdfs hdfs fsck /path/filename -files -blocks

Example: the HDFS file /shekhar/tab4.csv is 320.1 MB.
$ sudo -u hdfs hdfs fsck /shekhar/tab4.csv -files -blocks

[root@shekhar-server2 tmp]#
[root@shekhar-server2 tmp]# sudo -u hdfs hdfs fsck /shekhar/tab4.csv -files -blocks
Connecting to namenode via http://shekhar-server2.openstacklocal:50070
FSCK started by hdfs (auth:SIMPLE) from /10.194.10.14 for path /shekhar/tab4.csv at Thu Nov 16 05:40:35 IST 2017
/shekhar/tab4.csv 335600820 bytes, 6 block(s):  OK
0. BP-1971872654-10.194.10.14-1504721645808:blk_1073752536_11792 len=67108864 repl=3
1. BP-1971872654-10.194.10.14-1504721645808:blk_1073752537_11793 len=67108864 repl=3
2. BP-1971872654-10.194.10.14-1504721645808:blk_1073752538_11794 len=67108864 repl=3
3. BP-1971872654-10.194.10.14-1504721645808:blk_1073752539_11795 len=67108864 repl=3
4. BP-1971872654-10.194.10.14-1504721645808:blk_1073752540_11796 len=67108864 repl=3
5. BP-1971872654-10.194.10.14-1504721645808:blk_1073752541_11797 len=56500 repl=3

Status: HEALTHY
 Total size:    335600820 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      6 (avg. block size 55933470 B)
 Minimally replicated blocks:   6 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Thu Nov 16 05:40:35 IST 2017 in 1 milliseconds

The filesystem under path '/shekhar/tab4.csv' is HEALTHY
[root@shekhar-server2 tmp]#
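
Sanity check on the block count: 335600820 bytes is 5 full blocks of 67108864 bytes (64 MB) plus a 56500-byte tail, i.e. 6 blocks, which matches the fsck output above. The same information can also be read programmatically through the HDFS Java API; below is a minimal sketch, assuming core-site.xml/hdfs-site.xml are on the client classpath (the class name is illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCount {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration from the classpath (assumption)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/shekhar/tab4.csv");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation entry per block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println(file + ": " + status.getLen() + " bytes, "
                + blocks.length + " block(s)");
    }
}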


Tuesday, November 14, 2017

Java Volatile versus synchronized

Volatile variables

If a volatile variable is updated in a way that, under the hood, reads the value, modifies it, and then assigns the new value, the result is a non-thread-safe operation performed between two synchronized operations. You can then decide whether to use explicit synchronization or to rely on the JRE's support for automatically synchronizing volatile variable access. The better approach depends on your use case: if the value being assigned to the volatile variable depends on its current value (such as during an increment operation), then you must use synchronization if you want that operation to be thread safe.


To fully understand what the volatile keyword does, it's first helpful to understand how threads treat non-volatile variables.
In order to enhance performance, the Java language specification permits the JRE to maintain a local copy of a variable in each thread that references it. You could consider these "thread-local" copies of variables to be similar to a cache, helping the thread avoid checking main memory each time it needs to access the variable's value.
But consider what happens in the following scenario: two threads start and the first reads variable A as 5 and the second reads variable A as 10. If variable A has changed from 5 to 10, then the first thread will not be aware of the change, so it will have the wrong value for A. If variable A were marked as being volatile, however, then any time a thread read the value of A, it would refer back to the master copy of A and read its current value.
If the variables in your applications are not going to change, then a thread-local cache makes sense. Otherwise, it's very helpful to know what the volatile keyword can do for you.
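
As a quick illustration of the visibility guarantee described above, here is a minimal sketch (the class and field names are illustrative, not from any particular application): without volatile, the worker thread may keep using its cached copy of the flag and loop forever; with volatile, the write from the main thread is guaranteed to become visible.

public class StopFlagDemo {
    // volatile ensures the worker always reads the latest value written by main
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) {
                // busy work
            }
            System.out.println("Worker observed running = false and stopped");
        });
        worker.start();

        Thread.sleep(1000);
        running = false;   // visible to the worker because the field is volatile
        worker.join();
    }
}
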
Volatile versus synchronized:
If a variable is declared as volatile, it means that it is expected to be modified by multiple threads. Naturally, you would expect the JRE to impose some form of synchronization for volatile variables. As luck would have it, the JRE does implicitly provide synchronization when accessing volatile variables, but with one very big caveat: reading a volatile variable is synchronized and writing to a volatile variable is synchronized, but non-atomic operations are not.
What this means is that the following code is not thread safe:
myVolatileVar++;
The previous statement could also be written as follows:
int temp = 0;
synchronized( myVolatileVar ) {
  temp = myVolatileVar;
}

temp++;   // the increment itself happens outside any synchronization

synchronized( myVolatileVar ) {
  myVolatileVar = temp;
}
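
If you actually need a thread-safe counter, the usual fixes are to guard the whole read-modify-write in a single synchronized block/method, or to use the java.util.concurrent.atomic classes. A minimal sketch (class and field names are illustrative):

import java.util.concurrent.atomic.AtomicInteger;

public class Counters {
    private int plainCounter = 0;
    private final AtomicInteger atomicCounter = new AtomicInteger(0);

    // Option 1: make the entire read-modify-write atomic with synchronized
    public synchronized void incrementPlain() {
        plainCounter++;
    }

    // Option 2: AtomicInteger performs the increment atomically without an explicit lock
    public void incrementAtomic() {
        atomicCounter.incrementAndGet();
    }
}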


Hadoop Yarn Resource Manager Tuning (Add new queue to Yarn Resource Manager)

Why Yarn queues are needed: if you have priority jobs and you don't want their execution to be affected by other jobs, create a new queue, give that queue the right weight, and assign the queue to your jobs.

This is a procedure for adding a new queue to the Yarn Resource Manager and assigning jobs to it.

Step 1: add the lines below to /etc/hadoop/conf/fair-scheduler.xml on the core nodes where the hadoop-yarn-resourcemanager service is running.

<allocations>
  .....
  <queue name="newqueue">
    <maxRunningApps>50</maxRunningApps>
    <weight>6</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
  </queue>
  .....
</allocations>



Step 2: restart the hadoop-yarn-resourcemanager service on all applicable nodes:
service hadoop-yarn-resourcemanager restart

Step 3 (optional): a minimum resource guarantee, e.g. the Fair Scheduler's <minResources>10000 mb,10 vcores</minResources> element, can also be added to the queue definition to set the minimum amount of resources allocated to this queue. These resources cannot be used by other queues.
Step 4: for Hive jobs, set the property below to use the new queue:
set mapreduce.job.queuename=newqueue;
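
For MapReduce jobs submitted from Java code, the same property can be set on the job configuration. A minimal sketch, assuming a standard Hadoop client classpath (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitToQueue {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route this job to the queue defined in fair-scheduler.xml above
        conf.set("mapreduce.job.queuename", "newqueue");

        Job job = Job.getInstance(conf, "example-job-on-newqueue");
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }
}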

You should now see the new queue in the Resource Manager scheduler UI.




Tuesday, November 7, 2017

Procedure to change zookeeper dataDir and dataLogDir

Symptoms: if you see "fsync write ahead log took long time" messages in the ZooKeeper logs, or ZooKeeper client timeouts in the logs of ZKFC, NameNode, Hive, JournalNode, etc. services, then you should use a dedicated disk for dataDir for better ZooKeeper performance.

Steps to change Zookeeper directory.

1. Shut down one ZooKeeper server at a time.

2. Copy the dataDir and dataLogDir contents from the current directories into the new directory paths.

3. Ensure that the ownership of the contents is still zookeeper:zookeeper

4. Update dataDir and dataLogDir path in /etc/zookeeper/conf/zoo.cfg file
vi /etc/zookeeper/conf/zoo.cfg
dataDir=
dataLogDir=

5. Start only the Zookeeper that was stopped in step 1.

6. Wait for it to become "FOLLOWER"
e.g. check using the command: echo stat | nc localhost 2181 | grep Mode
#  echo stat | nc localhost 2181 | grep Mode
Mode: follower
7. Repeat the Steps for the other two Zookeeper hosts.

Sunday, October 22, 2017

Alter Kafka topic partition

Steps to change Kafka Partitions

1. Create topic demo_topic13 with 6 partitions:

/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 6 --topic demo_topic13

2. Describe the topic before altering it:
[root@node1 ~]# /opt/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic demo_topic13
Topic:demo_topic13      PartitionCount:6        ReplicationFactor:2     Configs:
        Topic: demo_topic13     Partition: 0    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
        Topic: demo_topic13     Partition: 1    Leader: 2812    Replicas: 2812,3297     Isr: 2812,3297
        Topic: demo_topic13     Partition: 2    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
        Topic: demo_topic13     Partition: 3    Leader: 2812    Replicas: 2812,3297     Isr: 2812,3297
        Topic: demo_topic13     Partition: 4    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
        Topic: demo_topic13     Partition: 5    Leader: 2812    Replicas: 2812,3297     Isr: 2812,3297

3. Alter the topic to 7 partitions (note: you can't decrease the partition count):
[root@node1 ~]# /opt/kafka/bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic demo_topic13 --partitions 7
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!

4. Describe the topic again and verify:
[root@node1 ~]# /opt/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic demo_topic13
Topic:demo_topic13      PartitionCount:7        ReplicationFactor:2     Configs:
        Topic: demo_topic13     Partition: 0    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
        Topic: demo_topic13     Partition: 1    Leader: 2812    Replicas: 2812,3297     Isr: 2812,3297
        Topic: demo_topic13     Partition: 2    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
        Topic: demo_topic13     Partition: 3    Leader: 2812    Replicas: 2812,3297     Isr: 2812,3297
        Topic: demo_topic13     Partition: 4    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
        Topic: demo_topic13     Partition: 5    Leader: 2812    Replicas: 2812,3297     Isr: 2812,3297
        Topic: demo_topic13     Partition: 6    Leader: 3297    Replicas: 3297,2812     Isr: 3297,2812
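
The steps above use the kafka-topics.sh CLI against ZooKeeper, which matches the Kafka version in this post. On newer Kafka client libraries the same increase can be done programmatically with the AdminClient; a minimal sketch, where the bootstrap server address is an assumption:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "node1:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow demo_topic13 to 7 partitions (the count can only increase, never decrease)
            admin.createPartitions(
                    Collections.singletonMap("demo_topic13", NewPartitions.increaseTo(7))
            ).all().get();
        }
    }
}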


Saturday, October 14, 2017

Hadoop Namenode and ZKFC tuning


Use a separate port (8022) for internal service communication, and change the handler count, for better NameNode performance:

Set the parameters below in hdfs-site.xml:

<property>
  <name>dfs.namenode.servicerpc-address.mas.NomeNodeHost1-com</name>
  <value>NomeNodeHost1.com:8022</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mas.NomeNodeHost2-com</name>
  <value>NomeNodeHost2.com:8022</value>
</property>
Handler count for 15 DataNodes: ln(number of DNs) * 20 = ln(15) * 20 = 2.708 * 20 ≈ 54
<property>
  <name>dfs.namenode.handler.count</name>
  <value>55</value>
</property>
<property>
  <name>dfs.namenode.service.handler.count</name>
  <value>55</value> <!-- for internal service communication on port 8022 -->
</property>

Steps to refresh the cluster with the above changes:
1. Stop hadoop-hdfs-zkfc on both NameNode hosts:
service hadoop-hdfs-zkfc stop
2. Remove /hadoop-ha in ZooKeeper:
$ zookeeper-client
> rmr /hadoop-ha
3. Re-create /hadoop-ha using the command below:
sudo -u hdfs hdfs zkfc -formatZK
4. Start both zkfc services:
service hadoop-hdfs-zkfc start
5. Start the NameNodes.

Tuesday, September 26, 2017

Apache Kafka with SSL Configuration

Kafka with SSL:

- Generate SSL key and certificate for each Kafka broker
keytool -keystore kafka.server.keystore.jks -alias localhost -validity 365 -genkey
output: kafka.server.keystore.jks

- Creating your own CA
openssl req -new -x509 -keyout ca-key -out ca-cert -days 365
output: ca-cert, ca-key

- The next step is to add the generated CA to the clients' truststore so that the clients can trust this CA:
keytool -keystore kafka.client.truststore.jks -alias CARoot -import -file ca-cert

- If you configure the Kafka brokers to require client authentication by setting ssl.client.auth to requested or required in the broker config, then you must also provide a truststore for the Kafka brokers, and it should contain all the CA certificates that clients' keys were signed by:
keytool -keystore kafka.server.truststore.jks -alias CARoot -import -file ca-cert


- The next step is to sign all certificates in the keystore with the CA we generated.
  1. First, you need to export the certificate from the keystore:
    keytool -keystore kafka.server.keystore.jks -alias localhost -certreq -file cert-file
  2. Then sign it with the CA:
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days 365 -CAcreateserial -passin pass:test1234

- Finally, you need to import both the certificate of the CA and the signed certificate into the keystore:
keytool -keystore kafka.server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore kafka.server.keystore.jks -alias localhost -import -file cert-signed


server.properties:
listeners=PLAINTEXT://node1.openstacklocal:9092,SSL://node1.openstacklocal:9094
security.protocol=SSL
ssl.keystore.location=/home/kafka/ssl/kafka.server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/home/kafka/ssl/kafka.server.truststore.jks
ssl.truststore.password=test1234
security.inter.broker.protocol=SSL


producer.properties:
bootstrap.servers=node1.openstacklocal:9092
security.protocol=SSL
ssl.truststore.location=/home/kafka/ssl/kafka.client.truststore.jks
ssl.truststore.password=test1234
ssl.keystore.location=/home/kafka/ssl/kafka.client.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234


consumer.properties:
security.protocol=SSL
ssl.truststore.location=/home/kafka/ssl/kafka.client.truststore.jks
ssl.truststore.password=test1234
ssl.keystore.location=/home/kafka/ssl/kafka.client.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
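
The same client-side SSL settings can be used from a Java producer. Below is a minimal sketch that reuses the paths, passwords, hostname, and SSL port (9094) from the configs above; the topic name is illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SslProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node1.openstacklocal:9094"); // SSL listener from server.properties
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/home/kafka/ssl/kafka.client.truststore.jks");
        props.put("ssl.truststore.password", "test1234");
        props.put("ssl.keystore.location", "/home/kafka/ssl/kafka.client.keystore.jks");
        props.put("ssl.keystore.password", "test1234");
        props.put("ssl.key.password", "test1234");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "test-topic" is illustrative; use any topic that exists on the cluster
            producer.send(new ProducerRecord<>("test-topic", "hello over SSL"));
        }
    }
}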

Kafka Cluster Installation Steps


During Kafka server installation, don't use the bundled local ZooKeeper; use the ZooKeeper ensemble of Cloudera or your Big Data distribution.
The Kafka version should be the same on all nodes. Use the Cloudera Manager ZooKeeper for the Kafka cluster.
Make sure the Java version is the same on all nodes.

Installation steps:
Step 1: start Kafka broker 0 on node-1 (assuming you already have Kafka installed).
Go to the directory where Kafka is installed:
ex: cd /usr/share/kafka_2.10-0.8.2.0
Verify server.properties; it should contain broker.id=0 (the default).

Step 2: log in to node-2.
Go to the directory where Kafka is installed:
ex: cd /usr/share/kafka_2.10-0.8.2.0
Modify the server.properties file to define Kafka broker id 1 as below:
config/server.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1

Step 3: log in to node-3.
Go to the directory where Kafka is installed:
ex: cd /usr/share/kafka_2.10-0.8.2.0
Modify the server.properties file to define Kafka broker id 2 as below:
config/server.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2

Step 4: if you have a ZooKeeper cluster, then on every node of the cluster modify the zookeeper.connect property in the file kafka/config/server.properties:
zookeeper.connect=zNode01:2181,zNode02:2181,zNode03:2181

Run the command below on all nodes to start the Kafka servers:
/kafka/bin/kafka-server-start.sh /kafka/config/server.properties
----------------------
To confirm the brokers are running correctly, use this command on each node: ps -ef |grep server.properties

Struts Action Classes

Struts
Action classes
--------------
1) Action
2) DispatchAction
3) EventDispatchAction
4) LookupDispatchAction
5) MappingDispatchAction
6) DownloadAction
7) LocaleAction
8) ForwardAction
9) IncludeAction
10) SwitchAction
EventDispatchAction
-EventDispatchAction is a subclass of DispatchAction
-The developer must derive the Action class from EventDispatchAction and implement functions just as with a DispatchAction class
-In DispatchAction, the parameter attribute in the struts-config.xml file contains a request parameter name (like "method"); each form submitting to Struts must send that request parameter, with the name of the function to execute in the DispatchAction class as its value
-In EventDispatchAction, after implementing all the functions in the Action class, the developer registers the function names in the parameter attribute, separated by commas. This indicates that the Action class is registering all of its functions as events in struts-config.xml. JSPs requesting the same action path must use a request parameter name that matches one of the event names registered in struts-config.xml; JSPs that don't satisfy this condition will get an error
-Configure the EventDispatchAction subclasses in struts-config.xml accordingly (a sketch follows this list)
-During a client request, if no event name is supplied as a request parameter, the default event is taken
-Otherwise, the function registered for the supplied event name is executed
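
A minimal sketch of an EventDispatchAction subclass, assuming the Struts 1.x API; the class name, event methods, and forward names below are illustrative and not taken from the original notes:

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionForward;
import org.apache.struts.action.ActionMapping;
import org.apache.struts.actions.EventDispatchAction;

public class OrderAction extends EventDispatchAction {
    // Invoked when the request carries the "save" event parameter
    public ActionForward save(ActionMapping mapping, ActionForm form,
                              HttpServletRequest request, HttpServletResponse response)
            throws Exception {
        // business logic for the "save" event
        return mapping.findForward("success");
    }

    // Invoked when the request carries the "cancel" event parameter
    public ActionForward cancel(ActionMapping mapping, ActionForm form,
                                HttpServletRequest request, HttpServletResponse response)
            throws Exception {
        return mapping.findForward("cancelled");
    }
}
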
LookupDispatchAction
--------------------
-The developer must subclass LookupDispatchAction and implement one function called getKeyMethodMap()
-The getKeyMethodMap() function returns a Map object
-The Map object maps message keys to function names
-During a client request, the form must submit a request parameter whose name (e.g. "function") is configured in the struts-config.xml file, and whose value must match one of the message values in the ApplicationResources.properties file
-The moment the RequestProcessor receives the client request, it reads the request parameter value, looks up that message value in ApplicationResources.properties, and retrieves the corresponding message key
-The RequestProcessor then instantiates the Action class, invokes the getKeyMethodMap() function, looks up the message key in the Map object, retrieves the corresponding value, and treats that value as the business logic function to invoke on the Action class
public class UserAction extends LookupDispatchAction
{
    protected Map getKeyMethodMap()
    {
        Map m = new HashMap();
        m.put("button.add", "insert");      // message key -> method name
        m.put("button.update", "update");
        return m;
    }
    // insert() and update() functions are implemented as usual
}
ApplicationResources.properties
-------------------------------
button.add=Add User
button.update=Update User
s-c.xml
-------
InsertUser.jsp
MappingDispatchAction
-The Action class must subclass MappingDispatchAction, with functions implemented as before
-The same Action class is mapped in the struts-config.xml file with different action paths, and each such mapping's parameter attribute holds the name of the function to execute on the Action class
UserInsert.jsp
UserUpdate.jsp
ForwardAction
-------------
-This Action is used to connect to Java web components and web documents through the Struts framework
-Its functioning is the same as the RequestDispatcher.forward() operation in Servlets and the <jsp:forward> tag of JSP
-When using this Action class, the developer does not need to derive any class from ForwardAction
-When the client requests the action path, the path must be mapped to org.apache.struts.actions.ForwardAction in the struts-config.xml file, and its parameter can point to a JSP, a Servlet (old web components, if any exist), or an action path of another Struts module within the same web application
IncludeAction
-------------
Same as ForwardAction, except that the include() operation is used on the JSP, Servlet, or action paths of other modules
LocaleAction
------------
-This Action class is used to change the user's locale
-When a request comes to an action path in the struts-config.xml file, map it to the LocaleAction class
-This class implicitly reads two request parameters, "language" and "country", and changes the Locale object information at session level

Hibernate (Model Framework / ORM Framework)


Hibernate Framework
(Model Framework / ORM Framework)
ORM = Object Relational Mapping
------------------------------
Features
1) Makes persistence (INSERT/UPDATE/DELETE) operations transparent (invisible) to the developer
In a traditional JDBC program the developer had to
--obtain a connection via the DriverManager.getConnection() method or via connection pooling
--decide whether to use a Statement or a PreparedStatement object
--manage SQL statements
--consume the ResultSet
--ensure atomicity and consistency
In Hibernate
--we must write one XML document named "hibernate.cfg.xml" that contains the JDBC driver information
--write one JavaBean class for each table, whose properties match the table columns and their types
--optionally, write one XML document that maps each JavaBean to its table and properties, saved e.g. as Student.hbm.xml
--feed both the cfg.xml and hbm.xml files to the Hibernate classes; Hibernate internally creates the Connection, Statement, etc. objects for the SQL operations
--if we instantiate the JavaBean, assign values to all of its instance variables, and invoke a single function on the Hibernate class, save(bean), Hibernate reads the JavaBean values, prepares a dynamic SQL statement, and inserts the record into the DB (see the sketch at the end of this section)
--atomicity and consistency are managed by Hibernate implicitly
--this way Hibernate reduces the JDBC code needed for SQL operations on the DB
--in short, the developer only has to map a Java class to a table; Hibernate maps the Java object to an entity/record in the DB
-Hibernate does object-entity relation management
-Hibernate doesn't just perform persistence operations; it also caches all objects stored via Hibernate, and as and when a record is modified in the DB, Hibernate updates the state of the JavaBean as well, implicitly avoiding inconsistency problems
1) Transparent persistence operations
2) Object-level relationships instead of maintaining relationships at the DB level; this facilitates portable relationships across all DBs
3) Instead of fetching records from the DB using SQL and operating on ResultSet, we can fetch objects directly from Hibernate using HQL
4) Caching
-Memory level caching

-Disk level caching
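
A minimal sketch of the save(bean) flow described above, assuming the classic Hibernate Configuration/SessionFactory API; the Student bean and its property names are illustrative:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.cfg.Configuration;

public class SaveStudentExample {
    public static void main(String[] args) {
        // Reads hibernate.cfg.xml (JDBC details) and the registered hbm.xml mappings
        SessionFactory factory = new Configuration().configure().buildSessionFactory();

        Session session = factory.openSession();
        Transaction tx = session.beginTransaction();

        Student student = new Student();   // illustrative mapped JavaBean
        student.setId(1);
        student.setName("Ravi");

        session.save(student);   // Hibernate builds and executes the INSERT
        tx.commit();

        session.close();
        factory.close();
    }
}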

Spring AOP (Aspect Oriented Programming)

- The importance of AOP is to separate secondary / cross-cutting concerns (middleware service implementations - TX, security, logging, session management, state persistence, etc.) from primary / core concerns (the business logic implementation).
- AOP recommends writing one Advice for each service implementation
-The types of Advices are (a sketch of a before advice follows this list):
i) Method before Advice - implements MethodBeforeAdvice
ii) After Advice - implements AfterReturningAdvice
iii) Around Advice - implements MethodInterceptor (from the third-party AOP Alliance API)
iv) Throws Advice - implements ThrowsAdvice
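
A minimal sketch of a "method before" advice from the list above, assuming the classic Spring AOP MethodBeforeAdvice interface; the class name and logging body are illustrative:

import java.lang.reflect.Method;
import org.springframework.aop.MethodBeforeAdvice;

public class LoggingBeforeAdvice implements MethodBeforeAdvice {
    // Runs before every advised method call; here it only logs the call
    public void before(Method method, Object[] args, Object target) throws Throwable {
        System.out.println("About to call " + method.getName()
                + " on " + target.getClass().getSimpleName());
    }
}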