
Several postings on Stack Overflow have responses with partial information about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information about how to do that:

  1. In the Spark SQL app, do we need to use HiveContext to register tables? Or can we use just SQLContext?

  2. Where and how do we use HiveThriftServer2.startWithContext?

  3. When we run start-thriftserver.sh as in

/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001

besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?

  4. Are there any other things we need to do?

Thanks.

  • Note that the question is not about exposing Hive tables. It is about how to expose the RDD tables / DataFrames of a Spark SQL program through the Thrift server. For example, say my Spark SQL program provides its own RDD tables / DataFrames and registers them with DataFrame.registerTempTable. How does it expose those RDD tables / DataFrames to the Thrift server, so that external applications can access them via JDBC? Commented Jul 19, 2015 at 0:06

2 Answers


To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you write and run a simple application; you do not need to run start-thriftserver.sh.

To your questions:

  1. HiveContext is needed; in spark-shell, sqlContext is already implicitly a HiveContext.

  2. Write a simple application, for example:

    import org.apache.spark.sql.hive.thriftserver._
    val hiveContext = new HiveContext(sparkContext)
    hiveContext.parquetFile(path).registerTempTable("my_table1")
    HiveThriftServer2.startWithContext(hiveContext)
  3. There is no need to run start-thriftserver.sh; run your own application instead, e.g.:

spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar

  4. Nothing else is needed on the server side; the Thrift server should start on the default port 10000. You can verify by connecting to the server with beeline.
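The beeline check in step 4 can be sketched as follows. The host, username, and table name here are assumptions based on the example above; adjust them to wherever your Spark application is running.

```shell
# Connect to the Thrift server embedded in your Spark application
# (port 10000 is the default; change it if you set hive.server2.thrift.port).
beeline -u jdbc:hive2://localhost:10000 -n yourUsername

# At the beeline prompt, the temp table registered by the app is visible:
#   show tables;
#   select * from my_table1 limit 10;
```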

3 Comments

Haiying, your solution works! I was able to connect via beeline to query the registered temp table. Thank you very much for your help.
@michael I was just wondering why this answer was not accepted by the owner? Is there any specific reason?
@RamGhadiyaram Sorry. I missed your question. I accepted this answer.

In Java, I was able to expose a DataFrame as a temp table and read the table content via beeline (just like a regular Hive table).

I haven't posted the entire program (assuming you already know how to create DataFrames):

import org.apache.spark.sql.hive.thriftserver.*;

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);

orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains the entire class values (rows fetched from HBase).

OrgMaster is a serializable Java bean class.

orgDf.registerTempTable("spark_org_master_table");

HiveThriftServer2.startWithContext(sqlContext);

I submitted the program locally (since no Hive Thrift server was already running on port 10000 on that machine):

hadoop_classpath=$(hadoop classpath)
HBASE_CLASSPATH=$(hbase classpath)

spark-1.5.2/bin/spark-submit --name tempSparkTable \
  --class packageName.SparkCreateOrgMasterTableFile \
  --master local[4] --num-executors 4 --executor-cores 4 --executor-memory 8G \
  --conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}:${hadoop_classpath}" \
  --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
  --jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
  /path/programName-SNAPSHOT.jar

In another terminal, start beeline pointing to the Thrift service started by this Spark program:

/opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername

show tables -> will display the table that you registered in Spark.

You can also run describe; in this example:

describe spark_org_master_table;

Then you can run regular queries in beeline against this table (until you kill the Spark program).
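Beyond beeline, the original question asked about external applications reading these temp tables over JDBC. A minimal standalone client might look like the sketch below; it assumes the Hive JDBC driver is on the classpath, the Spark program above is still running, and the host, username, and table name are placeholders for your environment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempTableJdbcClient {
    public static void main(String[] args) throws Exception {
        // URL assumes the Thrift server embedded in the Spark app
        // is listening on the default port 10000.
        String url = "jdbc:hive2://localhost:10000";
        try (Connection conn = DriverManager.getConnection(url, "anyUsername", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select * from spark_org_master_table limit 10")) {
            while (rs.next()) {
                // Print the first column of each returned row
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Any JDBC-capable tool (reporting software, SQL IDEs, other JVM services) can connect the same way while the Spark program keeps the Thrift server alive.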

