I have a schema of this form, read from a JSON file:

root
 |-- fruit_id: string (nullable = true)
 |-- fruit_type: array (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- info: struct (nullable = true)
 |    |    |-- fruit_quality: array (nullable = true)
 |    |    |    |-- quality: string (nullable = true)
 |    |    |-- likes: string (nullable = true)
 |    |-- finance: struct (nullable = true)
 |    |    |-- last_year_price: string (nullable = true)
 |    |    |-- current_price: string (nullable = true)
 |    |-- shops: struct (nullable = true)
 |    |    |-- shop1: string (nullable = true)
 |    |    |-- shop2: string (nullable = true)
 |-- season: string (nullable = true)

How can I get it into this form?

root
 |-- fruit_id: string (nullable = true)
 |-- fruit_type_name: string (nullable = true)
 |-- fruit_type_info_fruit_quality_quality: string (nullable = true)
 |-- fruit_type_info_likes: string (nullable = true)
 |-- fruit_type_finance_last_year_price: string (nullable = true)
 |-- fruit_type_finance_current_price: string (nullable = true)
 |-- fruit_type_shops_shop1: string (nullable = true)
 |-- fruit_type_shops_shop2: string (nullable = true)
 |-- season: string (nullable = true)

This is for the case of fruits. How would I flatten it in a similar way if I receive a file with info on vegetables?
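
To make the target shape concrete, here is a minimal plain-Python sketch (not Spark code) of the intended transformation on a single parsed record. The helper name `flatten_record` and the sample values are made up for illustration. It assumes structs arrive as dicts and arrays as lists, joins nested names with underscores, and emits one row per array element, roughly what exploding an array does in Spark. Nothing in it refers to fruit-specific field names, so a vegetables record with a similar layout would go through the same code.

```python
from itertools import product


def flatten_record(value, prefix=""):
    """Return a list of flat dicts (rows) for one nested JSON value.

    Dicts (structs) contribute prefixed column names; lists (arrays)
    are expanded into one row per element.
    """
    if isinstance(value, dict):
        # Flatten each field on its own, then combine the results row-wise.
        per_field = [
            flatten_record(child, f"{prefix}_{key}" if prefix else key)
            for key, child in value.items()
        ]
        rows = []
        for combo in product(*per_field):
            row = {}
            for part in combo:
                row.update(part)
            rows.append(row)
        return rows
    if isinstance(value, list):
        # One output row per element; an empty array keeps the parent row
        # but contributes no columns.
        rows = [r for item in value for r in flatten_record(item, prefix)]
        return rows or [{}]
    # Scalar leaf: the accumulated prefix is the flattened column name.
    return [{prefix: value}]


# Hypothetical sample record shaped like the schema above.
record = {
    "fruit_id": "f1",
    "fruit_type": [
        {
            "name": "apple",
            "info": {
                "fruit_quality": [{"quality": "A"}, {"quality": "B"}],
                "likes": "10",
            },
            "finance": {"last_year_price": "1.0", "current_price": "1.2"},
            "shops": {"shop1": "s1", "shop2": "s2"},
        }
    ],
    "season": "autumn",
}

rows = flatten_record(record)
for row in rows:
    print(row)
```

With this sample, the two `fruit_quality` entries produce two rows, each carrying the nine flattened column names listed in the target schema.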

I am having trouble flattening the array part. I can flatten structs nested inside structs by following this: link

I also added this piece of code to the code from the link above, to see whether this approach would work:

import pyspark.sql.functions as F

array_cols = [c[0] for c in df.dtypes if c[1][:6] == 'array']
df = df.select(
    [F.col(nc + '.' + c).alias(nc + '_' + c)
     for nc in array_cols
     for c in df.select(nc + '.*').columns])

But it's not working.

I then checked this link as well: link

The issue here is that, although I can flatten the JSON file of fruits this way, if I then send a JSON file of vegetables with a similar schema, I would have to rewrite the code.

Another approach I tried was converting the array to a struct, so that I could then flatten the nested structs, but that did not help.

Lastly, I checked this link as well: link

But this approach threw an error saying that flattening is not possible, since I have an array of structs and not an array of arrays.

So how can I solve this?
