I'm trying to read a CSV file with PySpark that contains the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema makes the stops field spill over into the following columns, which messes up the DataFrame.
If I provide my own schema like this:
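from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType,
    BooleanType, ArrayType, DoubleType)

schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])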
I get this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how can I read this CSV properly, with stops as an array of strings, without hitting this failure?
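
As a sanity check on the file itself, here is a minimal sketch in plain Python (standard library only, not Spark) showing that the rows parse into five fields when a doubled quote `""` inside a quoted field is treated as an escaped quote, and that the stops value is then valid JSON. The `parse_rows` helper and the `RAW` constant are names I made up for this illustration. My assumption is that the Spark equivalent would be to read stops as a string with the CSV `escape` option set to `"` and then convert it with `from_json`, but I have not confirmed that this is the recommended approach.

```python
import csv
import io
import json

# Sample rows copied from the data shown above.
RAW = '''name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
'''


def parse_rows(text):
    """Parse the CSV, treating "" inside a quoted field as a literal quote."""
    reader = csv.DictReader(io.StringIO(text), doublequote=True)
    rows = []
    for row in reader:
        # stops holds a JSON array as text, or is empty when the value is missing.
        row["stops"] = json.loads(row["stops"]) if row["stops"] else None
        # float() ignores the leading space in values such as " 2.3".
        row["cost"] = float(row["cost"])
        row["win"] = row["win"] == "true"
        rows.append(row)
    return rows


rows = parse_rows(RAW)
for row in rows:
    print(row["name"], row["stops"], row["cost"])
```

Since the stops column stays intact here, the spill-over I see in Spark looks like a quoting/escaping mismatch in how the file is read rather than a problem with the data.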