
Several postings on Stack Overflow have responses with partial information about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information about how to do that:

  1. In the Spark SQL app, do we need to use HiveContext to register tables? Or can we use just SQLContext?

  2. Where and how do we use HiveThriftServer2.startWithContext?

  3. When we run start-thriftserver.sh as in

/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001

besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?

  4. Are there any other things we need to do?

Thanks.

  • Note that the question is not about exposing Hive tables. It is about how to expose the RDD tables / DataFrames of a Spark SQL program through the Thrift server. For example, say my Spark SQL program provides its own RDD tables / DataFrames and registers them with DataFrame.registerTempTable. How does it expose those RDD tables / DataFrames to the Thrift server, so that external applications can access them via JDBC? Commented Jul 19, 2015 at 0:06

2 Answers


To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you write and run a simple application; you do not need to run start-thriftserver.sh.

To your questions:

  1. HiveContext is needed; in spark-shell, sqlContext is already implicitly a HiveContext.

  2. Write a simple application, for example:

    import org.apache.spark.sql.hive.thriftserver._
    val hiveContext = new HiveContext(sparkContext)
    hiveContext.parquetFile(path).registerTempTable("my_table1")
    HiveThriftServer2.startWithContext(hiveContext)
  3. There is no need to run start-thriftserver.sh; run your own application instead, e.g.:

spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar

  4. Nothing else is needed on the server side; the Thrift server should start on the default port 10000. You can verify by connecting to the server with beeline.
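The beeline check in step 4 can be sketched as follows. The host, username, and table name here are assumptions based on the example above; adjust them to wherever your Spark application is running.

```shell
# Connect to the Thrift server embedded in your Spark application
# (port 10000 is the default; change it if you set hive.server2.thrift.port).
beeline -u jdbc:hive2://localhost:10000 -n yourUsername

# At the beeline prompt, the temp table registered by the app is visible:
#   show tables;
#   select * from my_table1 limit 10;
```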

3 Comments

Haiying, your solution works! I was able to connect via beeline to query the registered temp table. Thank you very much for your help.
@michael I was just wondering why this answer was not accepted by the owner? Is there any specific reason?
@RamGhadiyaram Sorry. I missed your question. I accepted this answer.

In Java, I was able to expose a DataFrame as a temp table and read the table content via beeline (just like a regular Hive table).

I haven't posted the entire program (assuming you already know how to create DataFrames):

import org.apache.spark.sql.hive.thriftserver.*;

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);

orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains the entire class values (rows fetched from HBase).

OrgMaster is a serializable Java bean class.

orgDf.registerTempTable("spark_org_master_table");

HiveThriftServer2.startWithContext(sqlContext);

I submitted the program locally (since no Hive Thrift server was already running on port 10000 on that machine):

hadoop_classpath=$(hadoop classpath)
HBASE_CLASSPATH=$(hbase classpath)

spark-1.5.2/bin/spark-submit --name tempSparkTable \
  --class packageName.SparkCreateOrgMasterTableFile \
  --master local[4] --num-executors 4 --executor-cores 4 --executor-memory 8G \
  --conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}:${hadoop_classpath}" \
  --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
  --jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
  /path/programName-SNAPSHOT.jar

In another terminal, start beeline pointing to the Thrift service started by this Spark program:

/opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername

show tables -> will display the table that you registered in Spark.

You can also run describe; in this example:

describe spark_org_master_table;

Then you can run regular queries in beeline against this table (until you kill the Spark program).
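Beyond beeline, the original question asked about external applications reading these temp tables over JDBC. A minimal standalone client might look like the sketch below; it assumes the Hive JDBC driver is on the classpath, the Spark program above is still running, and the host, username, and table name are placeholders for your environment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempTableJdbcClient {
    public static void main(String[] args) throws Exception {
        // URL assumes the Thrift server embedded in the Spark app
        // is listening on the default port 10000.
        String url = "jdbc:hive2://localhost:10000";
        try (Connection conn = DriverManager.getConnection(url, "anyUsername", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select * from spark_org_master_table limit 10")) {
            while (rs.next()) {
                // Print the first column of each returned row
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Any JDBC-capable tool (reporting software, SQL IDEs, other JVM services) can connect the same way while the Spark program keeps the Thrift server alive.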

