20

When running the following in a Python 3.5 Jupyter environment, I get the error below. Any ideas on what is causing it?

import findspark
findspark.init()

Error:

IndexError                                Traceback (most recent call
last) <ipython-input-20-2ad2c7679ebc> in <module>()
      1 import findspark
----> 2 findspark.init()
      3 
      4 import pyspark

/.../anaconda/envs/pyspark/lib/python3.5/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    132     # add pyspark to sys.path
    133     spark_python = os.path.join(spark_home, 'python')
--> 134     py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
    135     sys.path[:0] = [spark_python, py4j]
    136 

IndexError: list index out of range

4 Answers

21

This is most likely due to the SPARK_HOME environment variable not being set correctly on your system. Alternatively, you can just specify it when you're initialising findspark, like so:

import findspark
findspark.init('/path/to/spark/home')

After that, it should all work!
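If you'd rather keep the call argument-free, you can also set SPARK_HOME from Python before initialising (a minimal sketch; the path below is a placeholder for your own install):

import os
import findspark

os.environ["SPARK_HOME"] = "/path/to/spark/home"  # placeholder; use your own install directory
findspark.init()  # with SPARK_HOME set, no argument is needed

import pyspark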


1 Comment

This works, so thanks for the solution! Can you please also explain what these two lines do?
9

I was getting the same error and was able to make it work by entering the exact installation directory:

import findspark
# Use this (a raw string, so the backslashes are not treated as escape sequences)
findspark.init(r"C:\Users\PolestarEmployee\spark-1.6.3-bin-hadoop2.6")
# Test
from pyspark import SparkContext, SparkConf

Basically, it is the directory where Spark was extracted. In the future, wherever you see spark_home, enter the same installation directory. I also tried using toree to create a kernel instead, but that kept failing somehow. A kernel would be a cleaner solution.
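If you're unsure whether the directory is right, you can reproduce the check findspark makes by hand (a small sketch using the path from this answer; adjust it to your own install). The glob in the traceback returns an empty list, and indexing [0] on that empty list is what raises the IndexError:

import os
from glob import glob

spark_home = r"C:\Users\PolestarEmployee\spark-1.6.3-bin-hadoop2.6"  # your install dir
matches = glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
print(matches)  # an empty list here means findspark.init() would fail with IndexError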

1 Comment

For me, I had to use "/" instead of "\" to make it work, i.e. findspark.init("C:/Users/....."). Not sure why tho...
3

You need to update the SPARK_HOME variable inside your .bash_profile. For me, the following command worked (in the terminal):

export SPARK_HOME="/usr/local/Cellar/apache-spark/2.2.0/libexec/"

After this, you can run these commands:

import findspark
findspark.init('/usr/local/Cellar/apache-spark/2.2.0/libexec')
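If it still fails, it may be worth confirming that the Jupyter kernel actually inherited the exported variable (a small check, not specific to findspark):

import os
print(os.environ.get("SPARK_HOME"))  # None means the notebook's shell never sourced your .bash_profile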

2 Comments

From the other solutions, this solution appears to be redundant. You shouldn't have to specify the path in both places.
I would have thought that, too. However, in my environment, I still have to specify the path when calling findspark.init().
0

Maybe this could help:

I found that findspark.init() tries to find data in .\spark-3.0.1-bin-hadoop2.7\bin\python\lib, but the python folder was outside the bin folder. I simply ran findspark.init('.\spark-3.0.1-bin-hadoop2.7'), without the '\bin' folder; see the sketch below.
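In other words, the argument should be the extraction root (the folder that contains python\), not the bin\ subfolder. A short sketch with this answer's layout:

import findspark

# point at the extraction root, which contains python\lib\py4j-*.zip
findspark.init(r'.\spark-3.0.1-bin-hadoop2.7')        # works
# findspark.init(r'.\spark-3.0.1-bin-hadoop2.7\bin')  # would fail: no python\ folder inside bin\

from pyspark import SparkContext  # should now import cleanly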

Comments
