
I am fairly new to Databricks, so forgive my lack of knowledge here. I am using the Databricks resource in Azure. I mainly use the UI right now, but I know some features are only available through databricks-cli, which I have set up but not used yet.

I have cloned my Git repo into Databricks Repos using the UI. Inside my repo, there is a Python file that I would like to run as a job.

Can I use Databricks Jobs to create a job that calls this Python file directly? The only way I have been able to make this work is to create another Python file that calls the file in my Databricks Repo, and upload that to DBFS.

Maybe it cannot be done, or maybe the path I am using is incorrect. I tried the following path structure when creating a job from a Python file, and unfortunately it did not work:

file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py
  • Is it just a normal Python file, not a notebook? Commented Nov 24, 2021 at 15:05
  • Yes, I want to use a normal Python file, located in Workspace/Repos/<user_folder>/<repo_name>/ Commented Nov 24, 2021 at 15:31

4 Answers


One workaround is to create a wrapper notebook that calls this file, i.e.

from my_python_file import main
main()

Then you can schedule a job on this notebook.


3 Comments

That is what I am using right now. I would prefer not to have a wrapper notebook, but it works and it is simple.
@EmiliePicard-Cantin can you help me out? I have exactly the same problem as you. But when I write "from my_python_file import main" in the wrapper notebook, it says "No module named 'my_python_file'". Did you have to do anything special to make this wrapper solution work?
@BrendanHill I have had the same problem. Are your notebook and Python file in the same folder? It worked for me when they were in the exact same folder. Otherwise, I will have to do more digging.
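For anyone hitting the "No module named" error from the comments, here is a minimal sketch of a wrapper notebook that should also work when the notebook and the Python file live in different folders (the sys.path line is an addition beyond the original answer, and the repo path is a placeholder):

import sys

# Make the repo folder that contains my_python_file.py importable
# (placeholder path; only needed when the wrapper notebook lives in a different folder)
sys.path.append("/Workspace/Repos/<user_folder>/<repo_name>")

from my_python_file import main
main()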

I resolved this by adding the notebook marker comments to my Python script, so Databricks recognizes it as a Databricks notebook:

# Databricks notebook source

# COMMAND ----------
import pyspark.sql.functions as f

# A trivial cell to confirm the file now runs as a notebook
df = spark.createDataFrame([
    (1, 2)
], ['test_1', 'test_2'])
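For reference, markdown cells can also be embedded in such a file with MAGIC comments (a sketch; the cell text is illustrative):

# COMMAND ----------

# MAGIC %md
# MAGIC ### A markdown cell rendered by Databricks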



1- Install databricks-cli in VS Code by typing pip install databricks-cli

From https://docs.databricks.com/dev-tools/cli/index.html

2- Upload your Python .py file into Azure Storage mounted on Databricks (check how to mount Azure Storage on Databricks).

3- Connect to Databricks from the CLI by typing the following in the VS Code terminal:

databricks configure --token

It will ask you for the Databricks instance URL and then for a personal access token (you can generate one under Settings in Databricks; check how to generate a token).

4- Create a Databricks job by typing in the terminal: databricks jobs create --json-file create-job.json

Contents of create-job.json:

{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_F4",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py",
    "parameters": [
      "10"
    ]
  }
}

I gathered this information from the YouTube video below: https://www.youtube.com/watch?v=XZFN0hOA8mY&ab_channel=JonWood
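As a sketch of step 2, the same CLI can also copy the file to mounted storage directly instead of going through the portal (the paths below are placeholders matching the JSON above):

databricks fs cp ./databricks-connectivity-test.py dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py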

2 Comments

5- Run the job from the Databricks CLI. Just type the following in VS Code: databricks jobs run-now --job-id 95
The question isn't about a file on DBFS, but about a file in Repos - it's a different thing

Here is an example of using the Databricks SDK to run a Python file at a path like the one given above: file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py

Get the latest SDK: https://pypi.org/project/databricks-sdk/

pip install databricks-sdk

Replace the values of host, token, python_path, and "CLUSTER_NAME" in the code below.


from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute
import time

def main():
    #auth: https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html
    w = WorkspaceClient(
        host="https://...",
        token="YOUR_TOKEN"
    )
    python_path = "/Repos/<user_folder>/<repo_name>/my_python_file.py"
    cluster_id = None
    for c in w.clusters.list():
        if c.cluster_name == "CLUSTER_NAME":
            cluster_id = c.cluster_id
    #Create and run a job: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/jobs/jobs.html
    created_job = w.jobs.create(name=f'sdk-test-{time.time_ns()}',
                                tasks=[
                                    jobs.Task(
                                        description="test-job-desc",
                                        existing_cluster_id=cluster_id,
                                        spark_python_task=jobs.SparkPythonTask(python_file=python_path),
                                        task_key='test-job-key',
                                        timeout_seconds=0,
                                        # Add dependent libraries like pytest
                                        libraries=[
                                            compute.Library(
                                                pypi=compute.PythonPyPiLibrary(package='pytest')
                                            )
                                        ]
                                    )
                                ])
    run_by_id = w.jobs.run_now(job_id=created_job.job_id).result()
    # # Uncomment the following section to print out details
    # for i in run_by_id.__dict__:
    #     print(i, ":", run_by_id.__dict__[i])

    # cleanup
    w.jobs.delete(job_id=created_job.job_id)


if __name__ == "__main__":
    main()
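As a possible follow-up, the run's logs can be fetched through the same client once result() returns (a sketch to place inside main() before the cleanup; it assumes the run has exactly one task, as created above):

    # Fetch the output of the single task run (assumption: exactly one task)
    output = w.jobs.get_run_output(run_id=run_by_id.tasks[0].run_id)
    print(output.logs)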

