
This is the code I used:

import os

from pyspark.sql.functions import lit

df = None

for category in file_list_filtered:
    data_files = os.listdir('HMP_Dataset/' + category)

    for data_file in data_files:
        print(data_file)
        # read one space-delimited data file with the predefined schema
        temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/' + category + '/' + data_file, schema=schema)
        # tag every row with its activity class and source filename
        temp_df = temp_df.withColumn('class', lit(category))
        temp_df = temp_df.withColumn('source', lit(data_file))

        # accumulate all files into a single DataFrame
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

and I got this error:

NameError                                 Traceback (most recent call last)
<ipython-input-4-4296b4e97942> in <module>
      9     for data_file in data_files:
     10         print(data_file)
---> 11         temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
     12         temp_df = temp_df.withColumn('class', lit(category))
     13         temp_df = temp_df.withColumn('source', lit(data_file))

NameError: name 'spark' is not defined

How can I solve it?


2 Answers


Initialize a SparkSession, then use spark in your loop:

from pyspark.sql.functions import lit
from pyspark.sql import SparkSession

# create (or reuse) the SparkSession before the loop runs
spark = SparkSession.builder.appName('app_name').getOrCreate()

df = None

for category in file_list_filtered:
...
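getOrCreate() returns the already-running session if one exists, so this cell is safe to re-run in a notebook without starting a second context.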

3 Comments

NameError                                 Traceback (most recent call last)
<ipython-input-17-5737e4577c66> in <module>
     10     for data_file in data_files:
     11         print(data_file)
---> 12         temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
     13         temp_df = temp_df.withColumn('class', lit(category))
     14         temp_df = temp_df.withColumn('source', lit(data_file))

NameError: name 'schema' is not defined
@ParamitaBhattacharjee, you are reading the CSV file with a schema, so you need to define the schema first (see stackoverflow.com/a/56504339), or you can remove schema=schema from spark.read.csv. A minimal schema sketch follows these comments.
Thanks. I am actually using a Jupyter notebook, so I am getting so many errors, but if I do the same in Google Colab it works fine. Thank you.
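For completeness, here is a minimal sketch of such a schema definition, assuming each HMP data file holds three space-separated integer accelerometer readings (the column names x, y, and z are assumptions; adjust them to your files):

from pyspark.sql.types import StructType, StructField, IntegerType

# assumed layout: three integer readings per line (column names are placeholders)
schema = StructType([
    StructField('x', IntegerType(), True),
    StructField('y', IntegerType(), True),
    StructField('z', IntegerType(), True)
])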

Try defining the spark variable:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')   # start a local Spark context
spark = SparkSession(sc)     # wrap it in a SparkSession
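Note that SparkContext('local') raises an error if a context is already running, whereas SparkSession.builder.getOrCreate() from the other answer reuses an existing session, which is usually safer in notebooks.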

