
I'm trying to read a CSV that has the following data:

    name,date,win,stops,cost
    a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
    b,2021-3-1,true,, 1.3
    c,2023-2-1,true,"[""x""]", 0.3
    d,2021-3-1,true,"[""z""]", 2.3

Using inferSchema results in the stops field spilling over into the next columns and messing up the DataFrame.

If I give my own schema like:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, BooleanType, ArrayType,
                                   DoubleType)

    schema = StructType([
        StructField('name', StringType()),
        StructField('date', TimestampType()),
        StructField('win', BooleanType()),
        StructField('stops', ArrayType(StringType())),
        StructField('cost', DoubleType())])

it results in this exception:

    pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.

So how would I properly read the CSV without this failure?

2 Answers


Since CSV doesn't support arrays, you need to first read the column as a string, then convert it.

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Set the escape option to ", since it is not the default escape character (\).
    df = spark.read.csv('file.csv', header=True, escape='"')

    # Parse the JSON-formatted string into an actual array<string> column.
    df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
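
As a quick check, printing the schema after the conversion should show stops as an array while the other columns stay strings, since inferSchema is off (output sketched from the question's columns, roughly):

    df.printSchema()
    # root
    #  |-- name: string (nullable = true)
    #  |-- date: string (nullable = true)
    #  |-- win: string (nullable = true)
    #  |-- stops: array (nullable = true)
    #  |    |-- element: string (containsNull = true)
    #  |-- cost: string (nullable = true)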

5 Comments

This worked so well. Can you explain the code, though? Thank you.
Sure, which part is confusing you?
The part where we added escape. I was not using escape before and one of the string columns was being cut off. Does this convert it into JSON, and then we are reading the JSON back using from_json?
OP's CSV has "[""x""]" in one of the columns. A string value with special characters has to be wrapped in double quotes, and if you want a literal double quote between the wrapping quotes, you need to escape it. The most common escape is \, as in "[\"x\"]". That is the default escape character, so spark.read.csv without the escape option would read that as the string value ["x"]. However, OP's example is "[""x""]", so I need to set the escape option to ", which then reads it as the string value ["x"].
After the first line, ["x"] is a string value, because CSV does not support array columns. To convert it to an array of strings, I use from_json on the column. from_json takes a JSON string and converts it to a JSON object, which can take the form of an object or an array. Hope this makes sense, but let me know if you still have questions.
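
Putting both steps together on the question's sample data, here is a minimal end-to-end sketch (it assumes file.csv holds exactly the rows from the question; the show() output is approximate):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    # Step 1: escape='"' collapses the doubled quotes, so "[""x""]"
    # is read as the plain string ["x"].
    df = spark.read.csv('file.csv', header=True, escape='"')

    # Step 2: parse that string as JSON into an array<string> column.
    df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))

    df.select('name', 'stops').show()
    # +----+---------+
    # |name|    stops|
    # +----+---------+
    # |   a|[x, y, z]|
    # |   b|     null|
    # |   c|      [x]|
    # |   d|      [z]|
    # +----+---------+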

I guess this is what you are looking for:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('abc').getOrCreate()

    dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")

    dataframe.printSchema()

Let me know if it helps.

9 Comments

Updated the code a bit to serve the purpose better.
There's a syntax error in your code, and no, it doesn't help. Now even the column names aren't read properly; everything is a string and the column names are _c1, _c2 and so on.
What's the error? Can you share it?
delimiter='," has mismatched quotes.
Try the above code now.
