
I have installed the Databricks CLI by running the following command, using the version of pip that matches my Python installation (with Python 3, that is pip3):

pip install databricks-cli
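
After creating a personal access token (PAT) in Databricks, the CLI can be pointed at the workspace with it. A minimal sketch of the interactive setup (it prompts for the workspace host URL and the token value):

# Configure the legacy Databricks CLI with a PAT;
# it prompts interactively for the host URL and the token.
databricks configure --token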

With the CLI configured, I run the following bash script:

#!/bin/bash
# You can run this on Windows as well; just change it to a batch file.
# Note: you need the Databricks CLI installed and a token configured.

echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages

echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh

echo "Listing DBFS directory"
dbfs ls dbfs:/databricks/packages

The python_dependencies.sh script:

#!/bin/bash
# Restart cluster after running.

sudo apt-get install applicationinsights=0.11.9 -V -y
sudo apt-get install azure-servicebus=0.50.2 -V -y
sudo apt-get install azure-storage-file-datalake=12.0.0 -V -y
sudo apt-get install humanfriendly=8.2 -V -y
sudo apt-get install mlflow=1.8.0 -V -y
sudo apt-get install numpy=1.18.3 -V -y
sudo apt-get install opencensus-ext-azure=1.0.2 -V -y
sudo apt-get install packaging=20.4 -V -y
sudo apt-get install pandas=1.0.3 -V -y
sudo apt update
sudo apt-get install scikit-learn=0.22.2.post1 -V -y
status=$?
echo "The date command exit status : ${status}"

I use the above script as the cluster's init script to install the Python libraries.


My problem is that even though everything seems to be fine and the cluster starts successfully, the libraries are not installed properly. When I click on the Libraries tab of the cluster, only 1 out of the 10 Python libraries shows as installed.

I'd appreciate your help and comments.

  • Maybe your cluster is using a Python virtual environment. Commented Jun 22, 2020 at 14:07
  • Yeah I can confirm that. So you propose to install it like pip-install ...? My notebook runs Python and Spark code Commented Jun 22, 2020 at 14:09
  • You need to install your Python packages into the same virtual environment your cluster uses. That will mean "activating" that virtual environment, then installing the packages via pip (see the sketch after these comments). Commented Jun 22, 2020 at 15:09
  • Thank you for the reply! It would be more helpful to me, and to others looking at this question, if you could provide a code example of what you are saying. As I wrote in my question, I need to use an init script and not any other workaround. Commented Jun 22, 2020 at 15:58
  • @RedCricket pls check my comment above. Commented Jun 22, 2020 at 16:20
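
What the commenters are pointing at, as a minimal sketch: per the Databricks init-script documentation, the pip belonging to the notebook Python environment lives at /databricks/python/bin/pip, so an init script can install into that environment directly (package pins taken from the question):

#!/bin/bash
# Install into the same Python environment the notebooks use,
# not the OS package manager's Python.
/databricks/python/bin/pip install numpy==1.18.3 pandas==1.0.3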

2 Answers


I have found the solution based on the comment from @RedCricket:

#!/bin/bash

pip install applicationinsights==0.11.9
pip install azure-servicebus==0.50.2
pip install azure-storage-file-datalake==12.0.0
pip install humanfriendly==8.2
pip install mlflow==1.8.0
pip install numpy==1.18.3
pip install opencensus-ext-azure==1.0.2
pip install packaging==20.4
pip install pandas==1.0.3
pip install --upgrade scikit-learn==0.22.2.post1

The above .sh file installs all the referenced Python dependencies when the cluster starts, so the libraries do not have to be re-installed each time the notebook is executed.
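
To wire this up the same way as in the question, the file is uploaded to DBFS and referenced as the cluster's init script (a sketch reusing the DBFS path from the question):

# Upload the pip-based init script to the DBFS location used above.
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh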


For Azure Databricks, as per the documentation:

https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/

# Set up authentication using an Azure AD token
export DATABRICKS_AAD_TOKEN=$(jq .accessToken -r <<< "$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d)")
# Databricks CLI configuration
databricks configure --host "https://<databricks-instance>" --aad-token

Now, copy the script file to the Databricks file system:

databricks fs cp "./cluster-scoped-init-scripts/db_scope_init_script.sh" "dbfs:/databricks/init-scripts/db_scope_init_script.sh"

Make sure the "db_scope_init_script.sh" shell script contains the required installation commands.
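
For example, the script body might just contain pinned pip installs like those in the accepted answer (a sketch; adjust the package list to your needs):

#!/bin/bash
# Example init script contents: pin the Python packages the cluster needs.
pip install mlflow==1.8.0
pip install pandas==1.0.3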

Finally, configure the cluster-scoped init script using the Clusters REST API:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "cluster_log_conf": {
    "dbfs" : {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/db_scope_init_script.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit
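
Afterwards, you can confirm that the init script is attached to the cluster (a sketch using the legacy CLI; the cluster ID is the one from the JSON above):

# Inspect the cluster definition; init_scripts should list the DBFS path.
databricks clusters get --cluster-id 1202-211320-brick1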
