
I have installed the Databricks CLI by running the following command, using the version of pip that matches my Python installation (with Python 3, that is pip3):

pip install databricks-cli
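
After creating a personal access token (PAT) in Databricks, the CLI can be pointed at the workspace with it. A minimal sketch of the interactive setup (it prompts for the workspace host URL and the token value):

# Configure the legacy Databricks CLI with a PAT;
# it prompts interactively for the host URL and the token.
databricks configure --token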

With the CLI configured, I run the following bash script:

#!/bin/bash
# You can run this on Windows as well; just change it to a batch file.
# Note: you need the Databricks CLI installed and a token configured.

echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages

echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh

echo "Listing DBFS directory"
dbfs ls dbfs:/databricks/packages

The python_dependencies.sh script:

#!/bin/bash
# Restart cluster after running.

sudo apt-get install applicationinsights=0.11.9 -V -y
sudo apt-get install azure-servicebus=0.50.2 -V -y
sudo apt-get install azure-storage-file-datalake=12.0.0 -V -y
sudo apt-get install humanfriendly=8.2 -V -y
sudo apt-get install mlflow=1.8.0 -V -y
sudo apt-get install numpy=1.18.3 -V -y
sudo apt-get install opencensus-ext-azure=1.0.2 -V -y
sudo apt-get install packaging=20.4 -V -y
sudo apt-get install pandas=1.0.3 -V -y
sudo apt update
sudo apt-get install scikit-learn=0.22.2.post1 -V -y
status=$?
echo "The date command exit status : ${status}"

I use the above script as the cluster's init script to install the Python libraries.


My problem is that even though everything seems to be fine and the cluster starts successfully, the libraries are not installed properly. When I click on the Libraries tab of the cluster, only 1 out of the 10 Python libraries shows as installed.

I'd appreciate your help and comments.

  • Maybe your cluster is using a Python virtual environment. Commented Jun 22, 2020 at 14:07
  • Yeah I can confirm that. So you propose to install it like pip-install ...? My notebook runs Python and Spark code Commented Jun 22, 2020 at 14:09
  • You need to install your Python packages into the same virtual environment your cluster uses. That will mean "activating" that virtual environment, then installing the packages via pip (see the sketch after these comments). Commented Jun 22, 2020 at 15:09
  • Thank you for the reply! It would be more helpful to me, and to others looking at this question, if you could provide a code example of what you are saying. As I wrote in my question, I need to use an init script and not any other workaround. Commented Jun 22, 2020 at 15:58
  • @RedCricket pls check my comment above. Commented Jun 22, 2020 at 16:20
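
What the commenters are pointing at, as a minimal sketch: per the Databricks init-script documentation, the pip belonging to the notebook Python environment lives at /databricks/python/bin/pip, so an init script can install into that environment directly (package pins taken from the question):

#!/bin/bash
# Install into the same Python environment the notebooks use,
# not the OS package manager's Python.
/databricks/python/bin/pip install numpy==1.18.3 pandas==1.0.3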

2 Answers


I have found the solution based on the comment from @RedCricket:

#!/bin/bash

pip install applicationinsights==0.11.9
pip install azure-servicebus==0.50.2
pip install azure-storage-file-datalake==12.0.0
pip install humanfriendly==8.2
pip install mlflow==1.8.0
pip install numpy==1.18.3
pip install opencensus-ext-azure==1.0.2
pip install packaging==20.4
pip install pandas==1.0.3
pip install --upgrade scikit-learn==0.22.2.post1

The above .sh file installs all the referenced Python dependencies when the cluster starts, so the libraries do not have to be re-installed each time the notebook is executed.
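
To wire this up the same way as in the question, the file is uploaded to DBFS and referenced as the cluster's init script (a sketch reusing the DBFS path from the question):

# Upload the pip-based init script to the DBFS location used above.
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh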


For Azure Databricks, as per the documentation:

https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/

# Set up authentication using an Azure AD token
export DATABRICKS_AAD_TOKEN=$(jq .accessToken -r <<< "$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d)")
# Databricks CLI configuration
databricks configure --host "https://<databricks-instance>" --aad-token

Now, copy the script file to the Databricks file system:

databricks fs cp "./cluster-scoped-init-scripts/db_scope_init_script.sh" "dbfs:/databricks/init-scripts/db_scope_init_script.sh"

Make sure the "db_scope_init_script.sh" shell script contains the required installation commands.
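
For example, the script body might just contain pinned pip installs like those in the accepted answer (a sketch; adjust the package list to your needs):

#!/bin/bash
# Example init script contents: pin the Python packages the cluster needs.
pip install mlflow==1.8.0
pip install pandas==1.0.3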

Finally, configure the cluster-scoped init script using the Clusters REST API:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "cluster_log_conf": {
    "dbfs" : {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/db_scope_init_script.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit
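
Afterwards, you can confirm that the init script is attached to the cluster (a sketch using the legacy CLI; the cluster ID is the one from the JSON above):

# Inspect the cluster definition; init_scripts should list the DBFS path.
databricks clusters get --cluster-id 1202-211320-brick1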
