
I have a Spark program that runs locally on my Windows machine. I use numpy to do some calculations, but I get an exception:

ModuleNotFoundError: No module named 'numpy'

My code:

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import udf, array
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Playground")\
    .config("spark.master", "local")\
    .getOrCreate()

# cosine distance between the row's word vector and a fixed vector
@udf("float")
def myfunction(x):
    y = np.array([1, 3, 9])
    x = np.array(x)
    return cosine(x, y)


df = spark\
    .createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)])\
    .withColumnRenamed("_1", "doc")\
    .withColumnRenamed("_2", "word1")\
    .withColumnRenamed("_3", "word2")\
    .withColumnRenamed("_4", "word3")

# collect the word columns into a single array column
df2 = df.select("doc", array([c for c in df.columns if c != "doc"]).alias("words"))

df2 = df2.withColumn("cosine", myfunction("words"))

df2.show()  # The exception is thrown here

However, if I run a different file that only includes:

import numpy as np
x = np.array([1, 3, 9])

then it works fine.
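
A quick way to check which interpreter the driver and the workers actually run (a sketch: sys.executable is the path of the current interpreter, spark is the session created above, and the small map job forces the workers to report theirs):

import sys

# interpreter running the driver
print(sys.executable)

# workers are separate python processes and may resolve a different interpreter
print(spark.sparkContext.parallelize([0]).map(lambda _: sys.executable).collect())

If the two paths differ, numpy is visible to one interpreter but not the other.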

Edit:

As pissall suggested in a comment, I've installed both numpy and scipy in the venv. Now if I run it with spark-submit it fails on the first line, and if I run it with python.exe I keep getting the same error message as before.

I run it like that:

spark-submit --master local spark\ricky2.py --conf spark.pyspark.virtualenv.requirements=requirements.txt

requirements.txt:

numpy==1.16.3
pyspark==2.3.4
scipy==1.2.1

But it fails on the first line.

I get the same error for both venv and conda.
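
One thing worth noting about the spark-submit invocation above: spark-submit stops parsing its own options at the application file, so anything placed after spark\ricky2.py is handed to the script as program arguments rather than to Spark. For the --conf flag to reach Spark at all, it would have to come before the script (and, as far as I know, the spark.pyspark.virtualenv.* properties come from a vendor patch and are not part of stock Apache Spark):

spark-submit --master local --conf spark.pyspark.virtualenv.requirements=requirements.txt spark\ricky2.py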

  • try to cast the return type of your function: return float(cosine(x,y)) Commented Oct 3, 2019 at 23:28
  • @jxc his code will fail at the first line itself, don't you think? Commented Oct 4, 2019 at 4:51
  • What is the directory of the python executable that pyspark uses? Have you installed numpy in its site-packages? Commented Oct 4, 2019 at 4:53
  • @jxc it didn't help. Commented Oct 4, 2019 at 6:14
  • @pissall yes, I can see numpy under external libraries/python 3.7/site-packages. However, numpy doesn't exist in the site-packages under project/venv/lib/site-packages. Maybe it has something to do with this? Commented Oct 4, 2019 at 10:42

1 Answer


It looks like numpy is installed in a different Python runtime than the one Spark uses. You can tell Spark which runtime to use by setting the environment variable PYSPARK_PYTHON.

Set it in the Spark configuration file, conf/spark-env.sh in Spark's installation directory. (I'm not sure about Windows, but the Spark distribution ships a spark-env.sh.template, spark-env.cmd.template on Windows I think, which must first be renamed to spark-env.sh or spark-env.cmd.)

PYSPARK_PYTHON=<path to your python runtime/executable>

You can read more about environment variables in the docs.
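
If editing the config file on Windows is inconvenient, here is a sketch of the same idea from the driver script itself, assuming local mode (driver and workers on one machine) and that the variable is set before the session is created:

import os
import sys

# point the workers at the same interpreter that runs this script;
# this must happen before the SparkSession is created
os.environ["PYSPARK_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Playground") \
    .config("spark.master", "local") \
    .getOrCreate()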


2 Comments

Regarding PYSPARK_PYTHON: I've tried it and it didn't work, but it uses the correct SPARK_HOME location anyway. Maybe I should install numpy inside SPARK_HOME somehow?
Regarding spark-env.cmd.template: I didn't understand what you wanted me to do with it. The whole file is commented out.
