
I am a newbie in Spark. I have an application that runs Spark SQL queries by invoking spark-shell: it generates a set of queries like the one below and invokes the spark-shell command to process them one by one.

val Query = spark.sql("""SELECT userid AS userid, rating AS rating, movieid AS movieid FROM default.movieTable""")

Now I want to run this application using spark-submit instead of spark-shell. Can anybody tell me how to do that?

1 Answer


If you are using Scala, spark-submit takes a jar file, so you will have to create a Scala project with sbt as the dependency/build tool; sbt can take all your code and bundle it into a jar file. You can follow this guide. Similar approaches exist for Python and Java.
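For instance, a minimal build.sbt might look like the following sketch (the project name, Scala version, and Spark version are assumptions; adjust them to your environment):

// build.sbt - minimal sketch; name and versions are assumptions
name := "awesomeApp"
version := "0.1"
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"

Running sbt package then produces a jar under target/scala-2.12/, which you pass to spark-submit, roughly like this (the class name and jar path are assumptions):

spark-submit --class com.example.AwesomeApp --master local[*] target/scala-2.12/awesomeapp_2.12-0.1.jar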

Update 1: spark-shell is intended for quick experiments; when it is invoked, it comes with a SparkSession instantiated automatically. When you want to achieve the same thing programmatically, you need to create that SparkSession yourself.

For example:

import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession =
  SparkSession.builder.appName("awesomeApp").getOrCreate()

// This import is needed to use the $-notation; it is imported automatically in `spark-shell` by default
import sparkSession.implicits._

...
//code to generate/import/build your `movieTable` view/table
...

val queryOutputDf = sparkSession.sql("""SELECT userid AS userid, rating AS rating, movieid AS movieid FROM default.movieTable""")

// The above output is a DataFrame; it needs to be written to a file
queryOutputDf.rdd.map(_.toString()).saveAsTextFile("/path/to/a/file/with/good/name")

This would achieve your intention for a single query; you would have to loop through your queries and pass each one to the code above.
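For instance, if the queries live in a Map keyed by name, such a loop could look roughly like this (the map contents and output base path are placeholders, not part of your actual application):

// Sketch: run every query in a map and write each result to its own path
val queries: Map[String, String] = Map(
  "ratings" -> """SELECT userid AS userid, rating AS rating, movieid AS movieid FROM default.movieTable"""
)

queries.foreach { case (name, sql) =>
  val df = sparkSession.sql(sql)
  df.rdd.map(_.toString()).saveAsTextFile(s"/path/to/output/$name")
}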


3 Comments

Thank you for the response. Currently my application takes each query and passes it to spark-shell, just like typing queries interactively. When I switch to spark-submit, my current application won't be able to run each query that way. I want to bundle these queries and invoke spark-submit. How can I do that? How should these queries be organised? Any idea on that? Correct me if I am going in the wrong direction.
Thank you. So, in my case I have to dynamically create the class and add the queries to it. The queries are stored in a map. I have to get each query and add it to the class. Once the class is created, I am going to trigger spark-submit using the Java ProcessBuilder. Is there any better idea?
@far2c You could drive everything with a property file; it totally depends on numerous parameters, like volume of data, parallelism, scheduling, etc. You can look up more about Spark scheduling.
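For instance, a rough sketch of driving the queries from a property file (the file name, keys, and output path are assumptions):

// Sketch: read name=SQL pairs from a property file and run each query
import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
props.load(new FileInputStream("queries.properties"))

props.stringPropertyNames().forEach { name =>
  val sql = props.getProperty(name)
  sparkSession.sql(sql).rdd.map(_.toString()).saveAsTextFile(s"/path/to/output/$name")
}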
