
One of my DataFrames (a spark.sql DataFrame) has this schema:

root
 |-- ValueA: string (nullable = true)
 |-- ValueB: struct (nullable = true)
 |    |-- abc: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- a0: string (nullable = true)
 |    |    |    |-- a1: string (nullable = true)
 |    |    |    |-- a2: string (nullable = true)
 |    |    |    |-- a3: string (nullable = true)
 |-- ValueC: struct (nullable = true)
 |    |-- pqr: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- info1: string (nullable = true)
 |    |    |    |-- info2: struct (nullable = true)
 |    |    |    |    |-- x1: long (nullable = true)
 |    |    |    |    |-- x2: long (nullable = true)
 |    |    |    |    |-- x3: string (nullable = true)
 |    |    |    |-- info3: string (nullable = true)
 |    |    |    |-- info4: string (nullable = true)
 |-- Value4: struct (nullable = true)
 |    |-- xyz: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- b0: string (nullable = true)
 |    |    |    |-- b2: string (nullable = true)
 |    |    |    |-- b3: string (nullable = true)
 |-- Value5: string (nullable = true)

I need to save this to a CSV file, without using any flatten or explode, in the format below:

 |-- ValueA: string (nullable = true)
 |-- ValueB: struct (nullable = true)
 |-- ValueC: struct (nullable = true)
 |-- ValueD: struct (nullable = true)
 |-- ValueE: string (nullable = true)

I have directly used the command df.toPandas().to_csv("output.csv"), and this serves my purpose, but I need a better approach. I am using PySpark.

1 Answer


Spark's CSV writer does not yet support complex types such as struct or array.

Write as Parquet file:

A better approach in Spark would be writing in Parquet format, since Parquet supports all the nested data types and gives better performance when reading and writing.

df.write.parquet("<path>")
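For example, a minimal round-trip sketch (df is the question's DataFrame; the output path is illustrative):

df.write.mode("overwrite").parquet("/tmp/out_parquet")
# reading it back preserves the nested struct/array schema as-is
spark.read.parquet("/tmp/out_parquet").printSchema()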

Write as Json file:

If writing in JSON format is acceptable, then:

df.write.json("path")
#or
df.toJSON().saveAsTextFile("path")
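As a minimal sketch (output path is illustrative), both variants keep the nested values intact; toJSON() turns each row into a single JSON string, which you can inspect before writing:

df.write.mode("overwrite").json("/tmp/out_json")
# or look at the per-row JSON strings first
print(df.toJSON().first())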

Write as CSV file:

Use the to_json function, which converts a struct/array column into a JSON string, and then store the result in CSV format.

df.selectExpr("ValueA", "to_json(ValueB) as ValueB", "to_json(ValueC) as ValueC", "to_json(Value4) as Value4", "Value5").write.csv("path")
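Equivalently, a minimal sketch with the DataFrame API, using to_json plus a header option (column names are taken from the question's schema; the output path is illustrative):

from pyspark.sql import functions as F

(df.select(
        "ValueA",
        F.to_json("ValueB").alias("ValueB"),
        F.to_json("ValueC").alias("ValueC"),
        F.to_json("Value4").alias("Value4"),
        "Value5")
   .write.mode("overwrite")
   .option("header", True)
   .csv("/tmp/out_csv"))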

1 Comment

Thank you for your response @Shu. However, I only want the output in CSV format. I have tried to_json(ValueB); it saves the file as CSV, but the created file has columns broken up in unwanted places. I guess those come from the to_json output.
