Sunday, April 29, 2018

Accessing HBase using Spark Application

The exact errors displayed can differ depending on the CDH version. Typical examples include:
  • ZooKeeper issues (a quick configuration check follows this list):
14/12/01 23:54:51 INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
14/12/01 23:54:51 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

  • HTrace class compatibility issue (CDH 5.4 and later only):

ERROR yarn.ApplicationMaster: User class threw exception: java.lang.reflect.InvocationTargetException
java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
...
Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace

  • Kerberos-related issue in Spark cluster mode:
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126) 
... 4 more 
Caused by: java.lang.RuntimeException: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'. 
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:686) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:415) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796) 
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:644) 
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:752) 
... 17 more 
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException
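
The ZooKeeper failure is usually the easiest of these to diagnose: the client falls back to the default quorum address (localhost:2181) when hbase-site.xml is not on its classpath. A quick sanity check, assuming the standard CDH configuration path, is to compare the configured quorum against what the job logs show:

# If the Spark logs show localhost/127.0.0.1:2181 instead of the hosts listed
# here, /etc/hbase/conf/ is missing from the application classpath.
grep -A1 'hbase.zookeeper.quorum' /etc/hbase/conf/hbase-site.xml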

Cause
  • The client configuration stored in /etc/hbase/conf/ is not included on the classpath
  • JARs containing the dependent code are not on the classpath
Fix:
These errors share a root cause: the default Spark setup does not include any HBase classes or configuration, because Spark has no dependency on HBase. Any HBase classes or configuration pulled into the execution environment are the result of a secondary dependency. For instance, Hive, which is a Spark dependency, pulls in some of the HBase classes because it is capable of querying HBase data. A Spark application therefore needs to provide all of its dependencies itself rather than rely on them being available transitively.
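
A quick way to verify this in your own environment is to search the Spark assembly for the class named in the ClassNotFoundException above (the parcel path is the CDH default and may differ in your install):

# Look for the htrace class inside the Spark assembly; no output confirms
# that it is not bundled and has to be added to the classpath at submit time.
jar tf /opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar | grep 'org/apache/htrace/Trace'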


  1. Add the /etc/hbase/conf/ directory and the dependencies (JARs or classes) to the classpath when the application is submitted. As described above, the htrace dependency is required for CDH 5.4 and later. Other dependencies might be required as well, so test the exact classpath additions in your environment when submitting an application (a combined example follows this list):
  • Spark driver class path change using the command line:
--driver-class-path /etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar
  • Spark executor class path change using the command line:
--conf "spark.executor.extraClassPath=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
  • or set the executor class path in the spark-defaults.conf file (see below)
  • Use the --files option to pass the configuration files to the cluster nodes (for example, --files /etc/hbase/conf/hbase-site.xml; --files expects a list of files rather than a directory).
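
Putting these options together, a client-mode submission might look like the following sketch. The application JAR name and its main class are placeholders, and the htrace JAR version may differ in your CDH release:

spark-submit --master yarn-client \
  --driver-class-path /etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
  --conf "spark.executor.extraClassPath=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.MyHBaseApp myapp.jar
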
  2. Add the HBase JARs and configuration to the executor classpath using the Spark defaults configuration file (a verification step follows this list):
  • From Cloudera Manager, navigate to Spark on YARN > Configuration
  • Type defaults in the search box
  • Select Gateway in the scope (this opens Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf)
  • Add the entry:
spark.executor.extraClassPath=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar
  • Save the change; an icon to deploy the client configuration appears (it can take up to 30 seconds to show)
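
After the client configuration has been deployed, you can confirm on a gateway host that the entry landed in the deployed file (the path below is the usual CDH client configuration location):

grep extraClassPath /etc/spark/conf/spark-defaults.conf
# expected output:
# spark.executor.extraClassPath=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar
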
  3. If you run a Spark application in cluster mode, you must also set the HADOOP_CONF_DIR variable. Setting HADOOP_CONF_DIR adds the HBase configuration to the Spark launcher classpath (which is not the same as the driver classpath in cluster mode); this is needed for Spark to obtain HBase delegation tokens. Add the HADOOP_CONF_DIR environment variable using the Spark client spark-env.sh configuration file (a verification step follows this list):
    • From Cloudera Manager, navigate to Spark on YARN > Configuration
    • Type spark-env.sh in the search box
    • Select Gateway in the scope (this opens Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh)
    • Add the entry:
    export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf:/etc/hbase/conf
    
    • Save the change; an icon to deploy the client configuration appears (it can take up to 30 seconds to show)
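
As with the previous step, the deployed file can be checked on a gateway host; spark-submit sources spark-env.sh, so the HBase configuration directory must appear in this value:

grep HADOOP_CONF_DIR /etc/spark/conf/spark-env.sh
# expected output:
# export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf:/etc/hbase/conf
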
  4. Deploy the client configuration as directed by the GUI
  5. Run the test by executing the following:
spark-submit --master yarn-cluster --driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
--class org.apache.spark.examples.HBaseTest /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar test
NOTE 1: The example HBase table used here is named test (a snippet to create and populate it follows).
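
The table can be created and populated from the HBase shell before the test is run, so the example has some rows to count. This is a minimal sketch: the column family name cf and the sample row are assumptions, and any names will do for a smoke test:

hbase shell
hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):003:0> quit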

NOTE 2: If you see this error message in the YARN container logs (a command to retrieve these logs follows the example below):
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions for user 'username' (table=test, action=READ)
then you need to grant the proper HBase permissions to your user. For example:
# sudo -u hbase kinit -kt hbase.service.keytab hbase/hbase_fqdn_host@HADOOP.COM
# sudo -u hbase hbase shell
hbase(main):002:0> grant 'username', 'RWCA'
0 row(s) in 3.5870 seconds 
hbase(main):003:0> quit
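
The YARN container logs mentioned in NOTE 2 can be retrieved once the application has finished. The application ID below is a placeholder; use the one printed by spark-submit or shown in the ResourceManager UI:

yarn logs -applicationId application_1234567890123_0001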


