
I am fairly new to Databricks, so forgive my lack of knowledge here. I am using the Databricks resource in Azure. I mainly use the UI right now, but I know some features are only available through databricks-cli, which I have set up but not used yet.

I have cloned my Git repo into Databricks Repos using the UI. Inside my repo, there is a Python file that I would like to run as a job.

Can I use Databricks Jobs to create a job that calls this Python file directly? The only way I have been able to make this work is to create another Python file that calls the file in my Databricks Repo, and upload that to DBFS.

Maybe it cannot be done, or maybe the path I am using is incorrect. I tried the following path structure when creating a job from a Python file, and unfortunately it did not work:

file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py
  • Is it just a normal Python file, not a notebook? Commented Nov 24, 2021 at 15:05
  • Yes, I want to use a normal Python file, located in Workspace/Repos/<user_folder>/<repo_name>/ Commented Nov 24, 2021 at 15:31

4 Answers


One workaround is to create a wrapper notebook that calls this file, i.e.

from my_python_file import main
main()

Then you can schedule a job on this notebook.


3 Comments

That is what I am using right now. I would prefer not to have a wrapper notebook, but it works and it is simple.
@EmiliePicard-Cantin can you help me out? I have exactly the same problem as you. But when I write "from my_python_file import main" in the wrapper notebook, it says "No module named 'my_python_file'". Did you have to do anything special to make this wrapper solution work?
@BrendanHill I have had the same problem. Are your notebook and Python file in the same folder? It worked for me when they were in the exact same folder. Otherwise, I will have to do more digging.
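For anyone hitting the "No module named" error from the comments, here is a minimal sketch of a wrapper notebook that should also work when the notebook and the Python file live in different folders (the sys.path line is an addition beyond the original answer, and the repo path is a placeholder):

import sys

# Make the repo folder that contains my_python_file.py importable
# (placeholder path; only needed when the wrapper notebook lives in a different folder)
sys.path.append("/Workspace/Repos/<user_folder>/<repo_name>")

from my_python_file import main
main()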

I resolved this by adding the notebook marker comments to my Python script, so Databricks recognizes it as a Databricks notebook:

# Databricks notebook source

# COMMAND ----------
import pyspark.sql.functions as f

# A trivial cell to confirm the file now runs as a notebook
df = spark.createDataFrame([
    (1, 2)
], ['test_1', 'test_2'])
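For reference, markdown cells can also be embedded in such a file with MAGIC comments (a sketch; the cell text is illustrative):

# COMMAND ----------

# MAGIC %md
# MAGIC ### A markdown cell rendered by Databricks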



1- Install databricks-cli in VS Code by typing pip install databricks-cli

From https://docs.databricks.com/dev-tools/cli/index.html

2- Upload your Python .py file into Azure Storage mounted on Databricks (check how to mount Azure Storage on Databricks).

3- Connect to Databricks from the CLI by typing the following in the VS Code terminal:

databricks configure --token

It will ask you for the Databricks instance URL and then for a personal access token (you can generate one under Settings in Databricks; check how to generate a token).

4- Create a Databricks job by typing in the terminal: databricks jobs create --json-file create-job.json

Contents of create-job.json:

{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_F4",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py",
    "parameters": [
      "10"
    ]
  }
}

I gathered this information from the YouTube video below: https://www.youtube.com/watch?v=XZFN0hOA8mY&ab_channel=JonWood
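As a sketch of step 2, the same CLI can also copy the file to mounted storage directly instead of going through the portal (the paths below are placeholders matching the JSON above):

databricks fs cp ./databricks-connectivity-test.py dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py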

2 Comments

5- Run the job from the Databricks CLI. Just type the following in VS Code: databricks jobs run-now --job-id 95
The question isn't about a file on DBFS, but about a file in Repos - it's a different thing

Here is an example of using the Databricks SDK to run a Python file at a path like the one given above: file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py

Get the latest SDK: https://pypi.org/project/databricks-sdk/

pip install databricks-sdk

Replace the values of host, token, python_path, and "CLUSTER_NAME" in the code below.


from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute
import time

def main():
    #auth: https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html
    w = WorkspaceClient(
        host="https://...",
        token="YOUR_TOKEN"
    )
    python_path = "/Repos/<user_folder>/<repo_name>/my_python_file.py"
    cluster_id = None
    for c in w.clusters.list():
        if c.cluster_name == "CLUSTER_NAME":
            cluster_id = c.cluster_id
    #Create and run a job: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/jobs/jobs.html
    created_job = w.jobs.create(name=f'sdk-test-{time.time_ns()}',
                                tasks=[
                                    jobs.Task(
                                        description="test-job-desc",
                                        existing_cluster_id=cluster_id,
                                        spark_python_task=jobs.SparkPythonTask(python_file=python_path),
                                        task_key='test-job-key',
                                        timeout_seconds=0,
                                        # Add dependent libraries like pytest
                                        libraries=[
                                            compute.Library(
                                                pypi=compute.PythonPyPiLibrary(package='pytest')
                                            )
                                        ]
                                    )
                                ])
    run_by_id = w.jobs.run_now(job_id=created_job.job_id).result()
    # # Uncomment the following section to print out details
    # for i in run_by_id.__dict__:
    #     print(i, ":", run_by_id.__dict__[i])

    # cleanup
    w.jobs.delete(job_id=created_job.job_id)


if __name__ == "__main__":
    main()
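As a possible follow-up, the run's logs can be fetched through the same client once result() returns (a sketch to place inside main() before the cleanup; it assumes the run has exactly one task, as created above):

    # Fetch the output of the single task run (assumption: exactly one task)
    output = w.jobs.get_run_output(run_id=run_by_id.tasks[0].run_id)
    print(output.logs)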

