
I have the following basic script that works fine using PyCharm on my machine.

from pyspark.sql import SparkSession

print("START")

spark = SparkSession \
    .builder \
    .appName("myapp") \
    .master('local[*, 4]') \
    .getOrCreate()

print(spark)

data = [('James', '', 'Smith', '1991-04-01', 'M', 3000),
        ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
        ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
        ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
        ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', -1)
        ]

columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
print(df)

However, when I try to run it on a Databricks cluster directly as a Python script, it gives an error.

START
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Workspace/Repos/***********/sdk_test/tests/snippets/spark_tests.py", line 13, in <module>
    class SparkTests:
  File "/Workspace/Repos/*******/sdk_test/tests/snippets/spark_tests.py", line 16, in SparkTests
    sc = SparkContext.getOrCreate()
  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__
    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
  File "/databricks/spark/python/pyspark/context.py", line 192, in _do_init
    raise RuntimeError("A master URL must be set in your configuration")
RuntimeError: A master URL must be set in your configuration
CalledProcessError: Command 'b'cd ../\n\n/databricks/python3/bin/python -m tests.snippets.spark_tests\n# python -m tests.runner --env=qa --runtime_env=databricks --upload=True --package=sdk\n'' returned non-zero exit status 1.

What am I missing?

2 Answers


Remove .master('local[*, 4]') from your code. The master URL is set automatically on Databricks.
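If you want the same script to run both locally and on the cluster, you can make the master URL conditional instead of deleting it. A minimal sketch, assuming the DATABRICKS_RUNTIME_VERSION environment variable (which Databricks runtimes export) as the detection signal; the helper name is hypothetical:

```python
import os

def is_databricks(env=None):
    """Return True when running inside a Databricks runtime.

    Assumption: Databricks clusters export DATABRICKS_RUNTIME_VERSION,
    so its absence means a local run that needs an explicit master URL.
    """
    env = os.environ if env is None else env
    return "DATABRICKS_RUNTIME_VERSION" in env

# Usage sketch (pyspark assumed installed):
#   builder = SparkSession.builder.appName("myapp")
#   if not is_databricks():
#       builder = builder.master("local[*, 4]")
#   spark = builder.getOrCreate()
```

On Databricks, getOrCreate() then attaches to the cluster-managed session instead of trying to construct a new context with its own master.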


4 Comments

Still gives the same error :\ I'm running it through a Python file, not directly from a notebook
Show how you trigger the job
%%sh cd ../ /databricks/python3/bin/python tests/snippets/spark_tests.py
It's not the best way of running tests, etc. Look into dbx instead: docs.databricks.com/dev-tools/dbx.html

Your code works absolutely fine in Databricks:

[Output image: the DataFrame is created without error]

I suggest you restart your Databricks cluster and run it again. Also check the command-line arguments used to run the script.

1 Comment

Thanks for the reply. Through a notebook it works fine, but if I run it through the Python shell (or trigger the script through a job) it throws this error
