Friday, February 17, 2017

How to upgrade Spark to 2.1.0

Steps to upgrade Spark to 2.1.0:

1. Download the required tag from https://github.com/apache/spark/tags and extract it to a spark folder.
2. Go to the spark folder and run the command below:
./dev/make-distribution.sh --name custom-spark --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.8 -Dhbase.version=1.0.0-cdh5.4.8 -Dflume.version=1.5.0-cdh5.4.8 -Dzookeeper.version=3.4.5-cdh5.4.8 -Phive -Phive-thriftserver

3. The output of the above command is spark-2.1.0-bin-custom-spark.tgz.
- Copy spark-2.1.0-bin-custom-spark.tgz to /usr/lib/spark/, extract it, and then delete the archive:
scp -r root@:/spark_2.1.0/spark-2.1.0/shekhar/spark-2.1.0-bin-custom-spark.tgz /usr/lib/spark/.
tar xvzf spark-2.1.0-bin-custom-spark.tgz
rm -r spark-2.1.0-bin-custom-spark.tgz

(for backup) hdfs dfs -get /user/spark/share/lib /root/shekhar (spark:spark)

4. Create soft links:
ln -s /var/run/spark/work /usr/lib/spark/work
ln -s /etc/spark/conf /etc/alternatives/spark-conf
ln -s /usr/lib/spark/jars /usr/lib/spark/lib

5. Update /usr/lib/spark/bin/spark-submit so it exports the Hadoop and YARN configuration directories:
vi /usr/lib/spark/bin/spark-submit
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/yarn/conf

6. Run the commands below:
chmod -R 777 /tmp/hive
export JAVA_HOME=/usr/
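
(Optional, not one of the original steps) A quick sanity check that the rebuilt distribution is the one being picked up:

/usr/lib/spark/bin/spark-submit --version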

Steps to create a web service producer (using annotation and bean config) and consume the web service

Steps to create a web service (producer)

First download the required JAR files:

a. Go to http://xfire.codehaus.org/Download
b. Download the xfire-distribution-1.2.6.zip file
c. Extract it and go to the lib folder
d. Copy all the JARs into your application.


1. Add the xfire.xml file to your Spring configuration.
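
The configuration snippet itself is missing from this post; a minimal sketch, assuming the xfire.xml bundled with XFire is imported from the classpath into the application context:

<!-- Import XFire's predefined beans (xfire, xfire.serviceFactory, ...) -->
<import resource="classpath:org/codehaus/xfire/spring/xfire.xml"/>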

2. Configure org.codehaus.xfire.spring.remoting.XFireExporter as a Spring bean.
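
The original bean definition did not survive; a sketch along the lines of the standard XFire-Spring remoting example (the bean ids citationService and citationServiceBean are assumptions):

<bean id="citationServiceBean" class="com.optrasystems.mvc.Test"/>

<bean id="citationService" class="org.codehaus.xfire.spring.remoting.XFireExporter">
    <property name="serviceFactory" ref="xfire.serviceFactory"/>
    <property name="xfire" ref="xfire"/>
    <property name="serviceBean" ref="citationServiceBean"/>
    <property name="serviceClass" value="com.optrasystems.mvc.TestInterface"/>
</bean>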

3. Configure the handler mapping so that requests for citationService.xfire are routed to the exporter bean.
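
The mapping XML is also missing; a plausible version using Spring's SimpleUrlHandlerMapping (only the citationService.xfire fragment survives from the original, so the exact URL pattern is an assumption):

<bean class="org.springframework.web.servlet.handler.SimpleUrlHandlerMapping">
    <property name="mappings">
        <props>
            <prop key="/citationService.xfire">citationService</prop>
        </props>
    </property>
</bean>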

4. Write the proxy interface.

TestInterface:

package com.optrasystems.mvc;

public interface TestInterface {
    public String sayHello();
}

5. Write a class that implements the exposed methods.

Test.java:

package com.optrasystems.mvc;

public class Test implements TestInterface {
    public String sayHello() {
        String string = "old String";
        return string;
    }
}



Consuming the above web service
===============================
First generate the Java client classes from the WSDL file:

Select the project, right-click ---> New --> Other --> select "Web Service Client" (under Web Services) ---> click the 'Next' button --> provide the WSDL path in the 'Service definition' input box and move the slider down to 'Develop client' ---> click Finish.

1. Using XFire:

// Inside a controller/handler method where 'request' is the HttpServletRequest
XFireClientFactoryBean client = new XFireClientFactoryBean();
try {
    client.setServiceClass(Class.forName("com.optrasystems.mvc.TestInterface"));
    client.setWsdlDocumentUrl("http://192.168.100.151:8089/thorlabs/app/citationService?WSDL");
    client.afterPropertiesSet();
    TestInterface lTestInterface = (TestInterface) client.getObject();
    String str = lTestInterface.sayHello();
    request.getSession().setAttribute("webservice", str);
} catch (java.lang.Exception e) {
    e.printStackTrace();
}


==================================================
Creating a web service using Spring annotations

step 1: Create the proxy interface

@WebService
public interface PilotSystemWebservice {
//declare all exposed methods here.
}

step 2: Create an implementation class for the above interface

@WebService(serviceName="shekhar",endpointInterface="com.optrasystems.service.PilotSystemWebservice")
public class PilotSystemWebserviceImpl implements PilotSystemWebservice {
// implement all the exposed methods here.
}

step 3: Add the following bean configuration to the Spring context.
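
The XML did not survive in this post; given the /services/ prefix in step 4, it was presumably XFire's JSR-181 handler mapping, roughly like the sketch below (bean arrangement is an assumption):

<import resource="classpath:org/codehaus/xfire/spring/xfire.xml"/>

<!-- Exposes every @WebService-annotated bean under /services/<serviceName> -->
<bean class="org.codehaus.xfire.spring.remoting.Jsr181HandlerMapping">
    <property name="typeMappingRegistry" ref="xfire.typeMappingRegistry"/>
    <property name="xfire" ref="xfire"/>
    <property name="webAnnotations">
        <bean class="org.codehaus.xfire.annotations.jsr181.Jsr181WebAnnotations"/>
    </property>
</bean>

<bean class="com.optrasystems.service.PilotSystemWebserviceImpl"/>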

step 4: The WSDL file URL is
http://localhost:8089/thorlabs/app/services/shekhar?wsdl

thorlabs : application name
app : url-pattern of the DispatcherServlet configured in web.xml
services : constant URL segment for all services; this standard prefix is provided by the Spring/XFire integration
shekhar : service name provided in the @WebService annotation (serviceName="shekhar")


step 5: The client is the same as the XFire client shown above for this application.

How to write and use a Hive UDAF

How to write a Hive UDAF:

1. Create a Java class which extends org.apache.hadoop.hive.ql.exec.UDAF.
2. Create an inner class which implements UDAFEvaluator.
3. Implement the following five methods:
init() – initializes the evaluator and resets its internal state. We use new Column() in the code below to indicate that no values have been aggregated yet.
iterate() – called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation (we are doing a sum – see below). We return true to indicate that the input was valid.
terminatePartial() – called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
merge() – called when Hive decides to combine one partial aggregation with another.
terminate() – called when the final result of the aggregation is needed.
4. Compile and package the JAR.
5. CREATE TEMPORARY FUNCTION in the Hive CLI.
6. Run the aggregation query and verify the output.

package org.hardik.letsdobigdata;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.ql.metadata.HiveException;

@Description(name = "Mean", value = "_FUNC_(double) - computes mean", extended = "select col1, MeanFunc(value) from table group by col1;")
public class MeanUDAF extends UDAF {

    // Define logging
    static final Log LOG = LogFactory.getLog(MeanUDAF.class.getName());

    public static class MeanUDAFEvaluator implements UDAFEvaluator {

        /**
         * The Column class serializes the intermediate computation:
         * the running sum and count for one group.
         */
        public static class Column {
            double sum = 0;
            int count = 0;
        }

        private Column col = null;

        public MeanUDAFEvaluator() {
            super();
            init();
        }

        // A - Initialize the evaluator, indicating that no values have been
        // aggregated yet.
        public void init() {
            LOG.debug("Initializing evaluator");
            col = new Column();
        }

        // B - Called every time there is a new value to be aggregated.
        public boolean iterate(double value) throws HiveException {
            LOG.debug("Iterating over each value for aggregation");
            if (col == null)
                throw new HiveException("Item is not initialized");
            col.sum = col.sum + value;
            col.count = col.count + 1;
            return true;
        }

        // C - Called when Hive wants partially aggregated results.
        public Column terminatePartial() {
            LOG.debug("Returning partially aggregated results");
            return col;
        }

        // D - Called when Hive decides to combine one partial aggregation with another.
        public boolean merge(Column other) {
            LOG.debug("Merging/combining partial aggregations");
            if (other == null) {
                return true;
            }
            col.sum += other.sum;
            col.count += other.count;
            return true;
        }

        // E - Called when the final result of the aggregation is needed.
        public double terminate() {
            LOG.debug("At the end of the last record of the group - returning final result");
            return col.sum / col.count;
        }
    }
}

CREATE TABLE IF NOT EXISTS orders (order_id int, order_date string, customer_id int, amount int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/root/utaf.txt' INTO TABLE orders;

101,2016-01-01,7,3540
102,2016-03-01,1,240
103,2016-03-02,6,2340
104,2016-02-12,3,5000
105,2016-02-12,3,5500
106,2016-02-14,9,3005
107,2016-02-14,1,20
108,2016-02-29,2,2000
109,2016-02-29,3,2500
110,2016-02-27,1,200


add jar /root/HiveUDFs-master-0.0.1-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION MeanFunc AS 'org.hardik.letsdobigdata.MeanUDAF';

select customer_id, MeanFunc(amount) from orders group by customer_id;

Hive UDF and UDAF Example

How to write a UDF in Hive?

1. Create a Java class for the user-defined function which extends org.apache.hadoop.hive.ql.exec.UDF, implement one or more evaluate() methods with your desired logic, and you are almost there.
2. Package your Java class into a JAR file (I am using Maven).
3. Go to the Hive CLI, ADD your JAR, and verify that your JAR is on the Hive CLI classpath.
4. CREATE TEMPORARY FUNCTION in Hive which points to your Java class.
5. Use it in Hive SQL and have fun!


package org.hardik.letsdobigdata;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {

    private Text result = new Text();

    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }

    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }
}


ADD JAR /root/HiveUDFs-master-0.0.1-SNAPSHOT.jar;
list jars;
CREATE TEMPORARY FUNCTION STRIP AS 'org.hardik.letsdobigdata.Strip';
CREATE TABLE IF NOT EXISTS dummy (Value1 VARCHAR(64), Value2 VARCHAR(64)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/root/ajay.txt' INTO TABLE dummy;

select strip('hadoop','ha') from dummy;

OK
_c0
doop
doop
Time taken: 0.148 seconds, Fetched: 2 row(s)

How to use Hive buckets: an example

How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a 0x7FFFFFFF in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int and there were 10 buckets, we would expect all user_ids that end in 0 to be in bucket 1, all user_ids that end in 1 to be in bucket 2, etc. For other datatypes it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT, and the hash of a string or a complex datatype will be some number derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_ids in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.
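
To make the hashing concrete, here is a small added illustration (not from the original post) that approximates the bucket assignment for the 32-bucket table created below, using Hive's built-in hash() and pmod() functions on the temp_user staging table:

-- Rough bucket number each state value would map to with 32 buckets
SELECT DISTINCT state, pmod(hash(state), 32) AS expected_bucket
FROM temp_user;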

set hive.mapred.mode=nonstrict;  -- to access a partitioned table without a WHERE condition
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.enforce.bucketing = true;

DROP TABLE IF EXISTS temp_user;
CREATE TEMPORARY TABLE temp_user(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
country VARCHAR(64),
city VARCHAR(64),
state VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

vi /root/data_hive.csv
Rebbecca,Didio,171 E 24th St,AU,Leith,TA,7315,03-8174-9123,0458-665-290,rebbecca.didio@didio.com.au,http://www.brandtjonathanfesq.com.au
Stevie,Hallo,22222 Acoma St,AU,Proston,QL,4613,07-9997-3366,0497-622-620,stevie.hallo@hotmail.com,http://www.landrumtemporaryservices.com.au
Mariko,Stayer,534 Schoenborn St #51,AU,Hamel,WA,6215,08-5558-9019,0427-885-282,mariko_stayer@hotmail.com,http://www.inabinetmacreesq.com.au
Gerardo,Woodka,69206 Jackson Ave,AU,Talmalmo,NS,2640,02-6044-4682,0443-795-912,gerardo_woodka@hotmail.com,http://www.morrisdowningsherred.com.au
Chun,Richrdson,3 Aiea Heights #660,CA,Regina,SK,S4T 3L1,306-245-2534,306-697-2337,chun_richrdson@richrdson.org,http://www.hoytrobertfesq.com
Lelia,Thiemann,440 Town Center Dr,CA,Kamloops,BC,V2B 7W6,250-671-3851,250-798-7786,lelia.thiemann@yahoo.com,http://www.kleemandenaaesq.com
Cordell,Zinda,91 Argyle Rd,CA,Sherbrooke,QC,J1H 6E3,819-508-6057,819-313-7350,cordell_zinda@cox.net,http://www.kayejeffreyaesq.com
Dorothy,Aitken,4 Hanover Pike,CA,Mississauga,ON,L5V 1E5,905-554-3838,905-355-9556,dorothy.aitken@cox.net,http://www.mcmillonwendyaesq.com
Nobuko,Halsey,8139 I Hwy 10 #92,US,New Bedford,MA,2745,508-855-9887,508-897-7916,nobuko.halsey@yahoo.com,http://www.goemanwoodproductsinc.com
Lavonna,Wolny,5 Cabot Rd,US,Mc Lean,VA,22102,703-483-1970,703-892-2914,lavonna.wolny@hotmail.com,http://www.linhareskennethaesq.com
Lashaunda,Lizama,3387 Ryan Dr,US,Hanover,MD,21076,410-678-2473,410-912-6032,llizama@cox.net,http://www.earnhardtprinting.com
Mariann,Bilden,3125 Packer Ave #9851,US,Austin,TX,78753,512-223-4791,512-742-1149,mariann.bilden@aol.com,http://www.hpgindustrysinc.com


LOAD DATA LOCAL INPATH '/root/data_hive.csv' INTO TABLE temp_user;
DROP TABLE IF EXISTS bucketed_user;
CREATE TABLE bucketed_user(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
state VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
) COMMENT 'A bucketed sorted user table' PARTITIONED BY (country VARCHAR(64)) CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS STORED AS TEXTFILE;

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_user PARTITION (country) SELECT firstname ,lastname , address , city , state , post , phone1 , phone2 , email , web , country FROM temp_user;
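
As an added illustration of what bucketing buys you (not part of the original post), once the data is loaded you can sample a single bucket instead of scanning the whole table:

-- Read only bucket 1 of the 32 buckets, sampling on the bucketing column
SELECT firstname, city, state
FROM bucketed_user TABLESAMPLE(BUCKET 1 OUT OF 32 ON state)
WHERE country = 'US';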