
I need to submit a .py file to Apache Spark's hidden REST API. I followed arturmkrtchyan's tutorial, but I couldn't find any example or documentation on how to submit a .py file.

Does anyone have any idea? Is it possible to pass a .py file instead of a jar, for example:

curl -X POST http://spark-cluster-ip:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "file:/path/to/py/file/file.py",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.mycompany.MyJob",
  "sparkProperties" : {
    "spark.submit.pyFiles" : "/path/to/py/file/file.py",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:6066"
  }
}'

Or is there any other way to do it?

1 Answer

The approach is actually similar to the one described in the link that you have shared.

Here is an example:

Let's first define the Python script that we need to run. I took the Spark Pi example, i.e. spark_pi.py:

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Monte Carlo estimate: sample points in the 2x2 square and count how
    # many land inside the unit circle.
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

Since spark.eventLog.enabled is set to true and /tmp/spark-events is Spark's default event log directory (spark.eventLog.dir), you'll need to make sure that /tmp/spark-events already exists before you run the job.
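A quick way to create it from Python (a plain mkdir -p from the shell works just as well):

import os

# /tmp/spark-events is the default value of spark.eventLog.dir;
# exist_ok requires Python 3.2+
os.makedirs("/tmp/spark-events", exist_ok=True)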

Now you can submit it as follows:

curl -X POST http://[spark-cluster-ip]:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
   "action":"CreateSubmissionRequest",
   "appArgs":[
      "/home/eliasah/Desktop/spark_pi.py"
   ],
   "appResource":"file:/home/eliasah/Desktop/spark_pi.py",
   "clientSparkVersion":"2.2.1",
   "environmentVariables":{
      "SPARK_ENV_LOADED":"1"
   },
   "mainClass":"org.apache.spark.deploy.SparkSubmit",
   "sparkProperties":{
      "spark.driver.supervise":"false",
      "spark.app.name":"Simple App",
      "spark.eventLog.enabled":"true",
      "spark.submit.deployMode":"cluster",
      "spark.master":"spark://[spark-master]:6066"
   }
}' 

As you've noticed, we provide the path to our script both as the appResource and as the first element of appArgs.

PS: Replace [spark-cluster-ip] and [spark-master] with the values corresponding to your Spark cluster.
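Since the endpoint is plain HTTP, you can also perform the same submission programmatically. Here is a minimal sketch using Python's requests library (an assumption on my part, any HTTP client will do; the hostnames and paths are the same placeholders as above):

import requests

payload = {
    "action": "CreateSubmissionRequest",
    "appArgs": ["/home/eliasah/Desktop/spark_pi.py"],
    "appResource": "file:/home/eliasah/Desktop/spark_pi.py",
    "clientSparkVersion": "2.2.1",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "mainClass": "org.apache.spark.deploy.SparkSubmit",
    "sparkProperties": {
        "spark.driver.supervise": "false",
        "spark.app.name": "Simple App",
        "spark.eventLog.enabled": "true",
        "spark.submit.deployMode": "cluster",
        "spark.master": "spark://[spark-master]:6066",
    },
}

# Same request as the curl command above
response = requests.post(
    "http://[spark-cluster-ip]:6066/v1/submissions/create",
    json=payload,
)
print(response.json())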

This will result in the following response:

{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180522165321-0001",
  "serverSparkVersion" : "2.2.1",
  "submissionId" : "driver-20180522165321-0001",
  "success" : true
}

You can also check the Spark UI to monitor your job.
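The same REST API also exposes a status endpoint, so you can poll the driver's state with the submissionId returned above. A minimal sketch, again with requests:

import requests

# submissionId taken from the CreateSubmissionResponse above
submission_id = "driver-20180522165321-0001"
response = requests.get(
    "http://[spark-cluster-ip]:6066/v1/submissions/status/" + submission_id
)
print(response.json())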

To pass arguments to the input script, add them to the appArgs property after the script path:

"appArgs": [ "/home/eliasah/Desktop/spark_pi.py", "arg1" ]
