
This is the code I used:

import os

from pyspark.sql.functions import lit

df = None

for category in file_list_filtered:
    data_files = os.listdir('HMP_Dataset/' + category)

    for data_file in data_files:
        print(data_file)
        # read one space-delimited data file with the predefined schema
        temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/' + category + '/' + data_file, schema=schema)
        # tag every row with its activity class and source filename
        temp_df = temp_df.withColumn('class', lit(category))
        temp_df = temp_df.withColumn('source', lit(data_file))

        # accumulate all files into a single DataFrame
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

and I got this error:

NameError                                 Traceback (most recent call last)
<ipython-input-4-4296b4e97942> in <module>
      9     for data_file in data_files:
     10         print(data_file)
---> 11         temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
     12         temp_df = temp_df.withColumn('class', lit(category))
     13         temp_df = temp_df.withColumn('source', lit(data_file))

NameError: name 'spark' is not defined

How can I solve it?


2 Answers


Initialize a SparkSession, then use spark in your loop:

from pyspark.sql.functions import lit
from pyspark.sql import SparkSession

# create (or reuse) the SparkSession before the loop runs
spark = SparkSession.builder.appName('app_name').getOrCreate()

df = None

for category in file_list_filtered:
...
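getOrCreate() returns the already-running session if one exists, so this cell is safe to re-run in a notebook without starting a second context.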

3 Comments

NameError                                 Traceback (most recent call last)
<ipython-input-17-5737e4577c66> in <module>
     10     for data_file in data_files:
     11         print(data_file)
---> 12         temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
     13         temp_df = temp_df.withColumn('class', lit(category))
     14         temp_df = temp_df.withColumn('source', lit(data_file))

NameError: name 'schema' is not defined
@ParamitaBhattacharjee, you are reading the CSV file with a schema, so you need to define the schema first (see stackoverflow.com/a/56504339), or you can remove schema=schema from spark.read.csv. A minimal schema sketch follows these comments.
Thanks. I am actually using a Jupyter notebook, so I am getting so many errors, but if I do the same in Google Colab it works fine. Thank you.
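For completeness, here is a minimal sketch of such a schema definition, assuming each HMP data file holds three space-separated integer accelerometer readings (the column names x, y, and z are assumptions; adjust them to your files):

from pyspark.sql.types import StructType, StructField, IntegerType

# assumed layout: three integer readings per line (column names are placeholders)
schema = StructType([
    StructField('x', IntegerType(), True),
    StructField('y', IntegerType(), True),
    StructField('z', IntegerType(), True)
])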

Try defining the spark variable:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')   # start a local Spark context
spark = SparkSession(sc)     # wrap it in a SparkSession
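Note that SparkContext('local') raises an error if a context is already running, whereas SparkSession.builder.getOrCreate() from the other answer reuses an existing session, which is usually safer in notebooks.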

