
I'm trying to read a CSV that has the following data:

    name,date,win,stops,cost
    a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
    b,2021-3-1,true,, 1.3
    c,2023-2-1,true,"[""x""]", 0.3
    d,2021-3-1,true,"[""z""]", 2.3

Using inferSchema results in the stops field spilling over into the next columns and messing up the DataFrame.

If I give my own schema like:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, BooleanType, ArrayType,
                                   DoubleType)

    schema = StructType([
        StructField('name', StringType()),
        StructField('date', TimestampType()),
        StructField('win', BooleanType()),
        StructField('stops', ArrayType(StringType())),
        StructField('cost', DoubleType())])

it results in this exception:

    pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.

So how would I properly read the CSV without this failure?

2 Answers


Since CSV doesn't support arrays, you need to first read the column as a string, then convert it.

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Set the escape option to ", since it is not the default escape character (\).
    df = spark.read.csv('file.csv', header=True, escape='"')

    # Parse the JSON-formatted string into an actual array<string> column.
    df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
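
As a quick check, printing the schema after the conversion should show stops as an array while the other columns stay strings, since inferSchema is off (output sketched from the question's columns, roughly):

    df.printSchema()
    # root
    #  |-- name: string (nullable = true)
    #  |-- date: string (nullable = true)
    #  |-- win: string (nullable = true)
    #  |-- stops: array (nullable = true)
    #  |    |-- element: string (containsNull = true)
    #  |-- cost: string (nullable = true)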

5 Comments

This worked so well. Can you explain the code, though? Thank you.
Sure, which part is confusing you?
The part where we added escape. I was not using escape before and one of the string columns was being cut off. Does this convert it into JSON, and then we are reading the JSON back using from_json?
OP's CSV has "[""x""]" in one of the columns. A string value with special characters has to be wrapped in double quotes, and if you want a literal double quote between the wrapping quotes, you need to escape it. The most common escape is \, as in "[\"x\"]". That is the default escape character, so spark.read.csv without the escape option would read that as the string value ["x"]. However, OP's example is "[""x""]", so I need to set the escape option to ", which then reads it as the string value ["x"].
After the first line, ["x"] is a string value, because CSV does not support array columns. To convert it to an array of strings, I use from_json on the column. from_json takes a JSON string and converts it to a JSON object, which can take the form of an object or an array. Hope this makes sense, but let me know if you still have questions.
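
Putting both steps together on the question's sample data, here is a minimal end-to-end sketch (it assumes file.csv holds exactly the rows from the question; the show() output is approximate):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    # Step 1: escape='"' collapses the doubled quotes, so "[""x""]"
    # is read as the plain string ["x"].
    df = spark.read.csv('file.csv', header=True, escape='"')

    # Step 2: parse that string as JSON into an array<string> column.
    df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))

    df.select('name', 'stops').show()
    # +----+---------+
    # |name|    stops|
    # +----+---------+
    # |   a|[x, y, z]|
    # |   b|     null|
    # |   c|      [x]|
    # |   d|      [z]|
    # +----+---------+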

I guess this is what you are looking for:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('abc').getOrCreate()

    dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")

    dataframe.printSchema()

Let me know if it helps.

9 Comments

Updated the code a bit to serve the purpose better.
There's a syntax error in your code, and no, it doesn't help. Now even the column names aren't read properly; everything is a string and the column names are _c1, _c2 and so on.
What's the error? Can you share it?
delimiter='," has mismatched quotes.
Try the above code now.
