Using Spark with Scala, I am trying to extract an array of structs from a Parquet file. The input is a Parquet file; the output is a CSV file. Each CSV field can hold "multi-values" delimited by "#;", and the CSV itself is delimited by ",". What is the best way to accomplish this?
Schema
root
|-- llamaEvent: struct (nullable = true)
| |-- activity: struct (nullable = true)
| | |-- Animal: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- time: string (nullable = true)
| | | | |-- status: string (nullable = true)
| | | | |-- llamaType: string (nullable = true)
Example Input as json (the input will be parquet)
{
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-1-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-2-2020",
"status":"Sitting",
"llamaType":"blue llama"
}
]
}
}
}
Desired CSV Output
time,status,llamaType
5-1-2020#;6-2-2020,Running#;Sitting,red llama#;blue llama
Update: Based on some trial and error, I believe a solution like this may be appropriate, depending on the use case. It takes a shortcut by grabbing the array column, casting it to a string, and then parsing out the extraneous characters, which is good enough for some use cases.
df.select(col("llamaEvent.activity").getItem("Animal").getItem("time").cast("String"))
Then you can perform whatever parsing you want afterwards, such as regexp_replace:
df.withColumn("time", regexp_replace(col("time"), ",", "#;"))
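Putting the shortcut together end-to-end, here is a minimal sketch. The local SparkSession, the inline JSON input (standing in for the real Parquet read), the `flatten` helper, and the output path are all assumptions for illustration, not part of the original solution. Casting an array column to string yields something like `[5-1-2020, 6-2-2020]`, so the helper strips the brackets and swaps the `", "` separator for `"#;"`:

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().master("local[*]").appName("llama").getOrCreate()
import spark.implicits._

// Inline JSON stand-in for the Parquet input (the real job would use spark.read.parquet(...)).
val df = spark.read.json(Seq(
  """{"llamaEvent":{"activity":{"Animal":[{"time":"5-1-2020","status":"Running","llamaType":"red llama"},{"time":"6-2-2020","status":"Sitting","llamaType":"blue llama"}]}}}"""
).toDS)

// Cast the array column to its string form, e.g. "[5-1-2020, 6-2-2020]",
// strip the surrounding brackets, then turn the ", " separator into "#;".
def flatten(path: String, name: String): Column =
  regexp_replace(
    regexp_replace(col(path).cast("string"), "^\\[|\\]$", ""),
    ", ", "#;"
  ).as(name)

val out = df.select(
  flatten("llamaEvent.activity.Animal.time", "time"),
  flatten("llamaEvent.activity.Animal.status", "status"),
  flatten("llamaEvent.activity.Animal.llamaType", "llamaType")
)

out.write.mode("overwrite").option("header", "true").csv("/tmp/llama_csv") // hypothetical output path
```

Note that this string-cast approach is fragile if a value itself contains `", "`, since the regex cannot tell a separator from data.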
Several appropriate solutions were also proposed using groupBy, explode, and aggregate.
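As for the explode/aggregate family of solutions: since Spark's `concat_ws` also joins the elements of an `array<string>` column directly, the same result can be had without exploding at all. The sketch below is one possible shape, again using an inline JSON string as a stand-in for the Parquet input:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

val spark = SparkSession.builder().master("local[*]").appName("llama-concat").getOrCreate()
import spark.implicits._

val df = spark.read.json(Seq(
  """{"llamaEvent":{"activity":{"Animal":[{"time":"5-1-2020","status":"Running","llamaType":"red llama"},{"time":"6-2-2020","status":"Sitting","llamaType":"blue llama"}]}}}"""
).toDS)

// Selecting a field through an array of structs yields an array column
// (e.g. Animal.time -> array<string>), which concat_ws joins with "#;".
val out = df.select(
  concat_ws("#;", col("llamaEvent.activity.Animal.time")).as("time"),
  concat_ws("#;", col("llamaEvent.activity.Animal.status")).as("status"),
  concat_ws("#;", col("llamaEvent.activity.Animal.llamaType")).as("llamaType")
)

out.show(false)
```

This keeps the original element order and avoids the cast-and-strip regex work entirely.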