
One of my DataFrames (a spark.sql DataFrame) has this schema:

root
 |-- ValueA: string (nullable = true)
 |-- ValueB: struct (nullable = true)
 |    |-- abc: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- a0: string (nullable = true)
 |    |    |    |-- a1: string (nullable = true)
 |    |    |    |-- a2: string (nullable = true)
 |    |    |    |-- a3: string (nullable = true)
 |-- ValueC: struct (nullable = true)
 |    |-- pqr: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- info1: string (nullable = true)
 |    |    |    |-- info2: struct (nullable = true)
 |    |    |    |    |-- x1: long (nullable = true)
 |    |    |    |    |-- x2: long (nullable = true)
 |    |    |    |    |-- x3: string (nullable = true)
 |    |    |    |-- info3: string (nullable = true)
 |    |    |    |-- info4: string (nullable = true)
 |-- Value4: struct (nullable = true)
 |    |-- xyz: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- b0: string (nullable = true)
 |    |    |    |-- b2: string (nullable = true)
 |    |    |    |-- b3: string (nullable = true)
 |-- Value5: string (nullable = true)

I need to save this to a CSV file, without using any flatten or explode, in the format below:

 |-- ValueA: string (nullable = true)
 |-- ValueB: struct (nullable = true)
 |-- ValueC: struct (nullable = true)
 |-- ValueD: struct (nullable = true)
 |-- ValueE: string (nullable = true)

I have directly used the command df.toPandas().to_csv("output.csv"), and this serves my purpose, but I need a better approach. I am using PySpark.

1 Answer


Spark's CSV writer does not yet support complex types such as struct or array.

Write as Parquet file:

A better approach in Spark would be writing in Parquet format, since Parquet supports all the nested data types and gives better performance when reading and writing.

df.write.parquet("<path>")
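For example, a minimal round-trip sketch (df is the question's DataFrame; the output path is illustrative):

df.write.mode("overwrite").parquet("/tmp/out_parquet")
# reading it back preserves the nested struct/array schema as-is
spark.read.parquet("/tmp/out_parquet").printSchema()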

Write as Json file:

If writing in JSON format is acceptable, then:

df.write.json("path")
#or
df.toJSON().saveAsTextFile("path")
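As a minimal sketch (output path is illustrative), both variants keep the nested values intact; toJSON() turns each row into a single JSON string, which you can inspect before writing:

df.write.mode("overwrite").json("/tmp/out_json")
# or look at the per-row JSON strings first
print(df.toJSON().first())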

Write as CSV file:

Use the to_json function, which converts a struct/array column into a JSON string, and then store the result in CSV format.

df.selectExpr("ValueA", "to_json(ValueB) as ValueB", "to_json(ValueC) as ValueC", "to_json(Value4) as Value4", "Value5").write.csv("path")
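Equivalently, a minimal sketch with the DataFrame API, using to_json plus a header option (column names are taken from the question's schema; the output path is illustrative):

from pyspark.sql import functions as F

(df.select(
        "ValueA",
        F.to_json("ValueB").alias("ValueB"),
        F.to_json("ValueC").alias("ValueC"),
        F.to_json("Value4").alias("Value4"),
        "Value5")
   .write.mode("overwrite")
   .option("header", True)
   .csv("/tmp/out_csv"))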

1 Comment

Thank you for your response @Shu. However, I only want the output in CSV format. I have tried to_json(ValueB); it saves the file as CSV, but the created file has columns broken up in unwanted places. I guess those come from the to_json output.
