
I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts a SQL statement as a command-line argument and submits it to Spark as Spark SQL; Spark runs that SQL against Cassandra and writes the output to a CSV file.
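For context, a job along those lines might look like the sketch below (this uses the current SparkSession API and the spark-cassandra-connector; the host, keyspace, table, and output path are placeholders, not my actual values):

import org.apache.spark.sql.SparkSession

object SqlToCsv {
  def main(args: Array[String]): Unit = {
    val query = args(0) // the SQL statement passed on the command line

    val spark = SparkSession.builder()
      .appName("sql-to-csv")
      .config("spark.cassandra.connection.host", "192.168.1.17") // placeholder host
      .getOrCreate()

    // Register a Cassandra table so the ad-hoc SQL has something to query
    spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testks", "table" -> "testtable"))
      .load()
      .createOrReplaceTempView("testtable")

    // Run the query and write the result out as CSV
    spark.sql(query).write.option("header", "true").csv("/tmp/query-output")

    spark.stop()
  }
}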

Now I feel like I'm going round in circles trying to figure out whether it's possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from Squirrel SQL). The Spark SQL documentation says

Connect through JDBC or ODBC.

A server mode provides industry standard JDBC and ODBC connectivity for business intelligence tools.

The Spark SQL Programming Guide says

Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.

So I can run the Thrift Server and submit SQL to it. But what I can't figure out is how to get the Thrift Server to connect to Cassandra. Do I simply pop the DataStax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and port of my Cassandra cluster? Has anyone done this already and can give me some pointers?

2 Answers


Configure these properties in the spark-defaults.conf file:

spark.cassandra.connection.host    192.168.1.17,192.168.1.19,192.168.1.21
# if you have configured security in your Cassandra cluster
spark.cassandra.auth.username   smb
spark.cassandra.auth.password   bigdata@123

Start your Thrift Server with the spark-cassandra-connector dependencies (plus any JDBC driver dependencies you need, e.g. mysql-connector) on the classpath, bound to a port that you can connect to via JDBC or Squirrel:

sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
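If you prefer not to edit spark-defaults.conf, the same Cassandra properties can be passed on the command line instead, e.g. --conf spark.cassandra.connection.host=192.168.1.17, since start-thriftserver.sh accepts the standard spark-submit options.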

To expose a Cassandra table, run a Spark SQL statement like:

CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
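Note that a TEMPORARY table is scoped to the JDBC session that created it, so run the CREATE statement and any follow-up queries (e.g. SELECT * FROM mytable LIMIT 10;) over the same connection.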

Comments

The "USING" could be the missing link I couldn't figure out - I'll give it a go and see if it works!
With the Thrift Server turned on, can I connect to it directly using a JDBC UI, e.g. Squirrel SQL? Do I need a specific client jar on my JDBC UI's classpath in order to connect to the Thrift Server? (A minimal client sketch follows these comments.)
You can connect with Beeline: spark/bin/beeline -u jdbc:hive2://192.168.1.14:10000
Sorry, I was travelling for most of December so have only just had a chance to try this out. Beeline connects perfectly! Adding the jars was annoying as there are a number of dependencies (cassandra-connector, guava, cassandra-core, etc) so I created a big shaded bundle with maven, and voila!
Now to see if I can connect to it from Squirrel SQL :)
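On the client-jar question above: Beeline itself talks to the Thrift Server through the Hive JDBC driver, so any JDBC client (Squirrel SQL included) should work once hive-jdbc and its transitive dependencies are on the client classpath. A minimal sketch, assuming the host and port from the answer above:

import java.sql.DriverManager

object ThriftJdbcClient {
  def main(args: Array[String]): Unit = {
    // Older hive-jdbc versions may need the driver registered explicitly
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Host and port are the ones used in the start-thriftserver.sh example
    val conn = DriverManager.getConnection("jdbc:hive2://192.168.1.17:10003", "", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT * FROM mytable LIMIT 10")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}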

Why don't you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your Spark context, and then you can read from and write to Cassandra using SQL.
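A minimal sketch of that approach (the host, keyspace, and table names reuse the examples from the other answer; note this still runs inside a Spark application, which is what the comment below gets at):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("direct-cassandra")
  .config("spark.cassandra.connection.host", "192.168.1.17")
  .getOrCreate()

// Expose a Cassandra table to Spark SQL and query it
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "testks", "table" -> "testtable"))
  .load()
  .createOrReplaceTempView("testtable")

spark.sql("SELECT count(*) FROM testtable").show()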

1 Comment

I have done this in a job jar, but I still need to use the spark-submit script to submit my SQL job to Spark, which requires command-line access. Is it possible to run SQL directly from a PC, connecting to Spark/Cassandra?
