
I have the following JSON structure, stored as a string in a column of my Hive table:

[
  {"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}
]

Now I want to parse this into a table where each element of the outer array becomes a row. The fields of each JSON object are parsed as columns, while inner_list is kept as a string:

| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

Now I am aware of the trick of using a regex to create a custom separator, splitting on it, and then using lateral view explode (see Parse json arrays using HIVE), but in this case there are also nested arrays which will match the regex.

Any ideas on how to do this? I want to do it in raw Spark SQL if possible; no UDFs or SerDes.

What I have tried:

  1. select explode(get_json_object(outer_list, "$[*]")) from wt_test;

This doesn't work; it fails with: input to function explode should be array or map type, not string.

  2. select explode(split(substr(outer_list, 2, length(outer_list)-2),",")) from wt_test;

This splits on every single comma, turning each fragment into a row, which is not what we want:

{"outer_id": 123000
"outer_field_1": blah
"inner_list": [{"inner_id": 456}
{"inner_id": 789}]}
... more rows ...
  • Could you maybe edit your desired output a little more? I am not sure what you expect. Commented Oct 30, 2020 at 8:49
  • @Matt edited for clarity Commented Oct 30, 2020 at 9:00
  • so inner_list is not important? Commented Oct 30, 2020 at 9:15
  • It is, but at this point I just want to explode the outer list; the whole requirement is kinda complicated... Let me edit the question to still keep the inner list as a string. Commented Oct 30, 2020 at 9:21
  • Ahhh, sorry, I didn't fully read it. So basically you have a table like this, right: {"a": 1, "outer_list":[{"outer_id": 123000, "outer_field_1": blah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": blahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": blahblahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, ]} Commented Oct 30, 2020 at 9:26

1 Answer


Assuming I understood correctly, you have the following:

Input

{
   "some_id":1,
   "outer_list":'[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
}

Desired Output:

| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

First of all, you need to parse the string. For that, define a schema:

schema = ArrayType(
   StructType([StructField('outer_id', IntegerType()), 
               StructField('outer_field_1', StringType()), 
               StructField('inner_list', StringType())])
)

Notice this is the simple version, where inner_list is just kept as a string.

Apply that schema to your DataFrame:

df = df.select(from_json('outer_list', schema).alias('test'))

Now you have a column with an array:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|test                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[123000, blah, [{"inner_id":456},{"inner_id":789}]], [123001, blahblah, [{"inner_id":456},{"inner_id":789}]], [123002, blahblahblah, [{"inner_id":456},{"inner_id":789}]]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Now you can explode it:

df.select(explode('test').alias('exploded')).select('exploded.*')

Which gives:

+--------+-------------+-----------------------------------+
|outer_id|outer_field_1|inner_list                         |
+--------+-------------+-----------------------------------+
|123000  |blah         |[{"inner_id":456},{"inner_id":789}]|
|123001  |blahblah     |[{"inner_id":456},{"inner_id":789}]|
|123002  |blahblahblah |[{"inner_id":456},{"inner_id":789}]|
+--------+-------------+-----------------------------------+

Since you are parsing outer_list anyway, you can parse inner_list in the same pass from the beginning. But maybe try that yourself first; everything you need for it is already here. A sketch of the fuller schema follows below.
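For reference, here is a sketch of what that fuller schema could look like, assuming inner_id is an integer as in the sample data (untested; adapt to your real fields):

# Sketch: parse inner_list as a real nested array instead of a string.
# Assumes every inner object has only an integer inner_id field.
full_schema = ArrayType(
   StructType([
       StructField('outer_id', IntegerType()),
       StructField('outer_field_1', StringType()),
       StructField('inner_list', ArrayType(
           StructType([StructField('inner_id', IntegerType())])
       ))
   ])
)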

Don't forget the imports:

from pyspark.sql.functions import *
from pyspark.sql.types import *
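
If you want to verify the whole flow end to end, here is a minimal, self-contained sketch (the toy row and the some_id/outer_list column names are taken from the input above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# One row, with the outer array stored as a JSON string.
data = [(1, '[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]')]
df = spark.createDataFrame(data, ['some_id', 'outer_list'])

schema = ArrayType(
   StructType([StructField('outer_id', IntegerType()),
               StructField('outer_field_1', StringType()),
               StructField('inner_list', StringType())])
)

df.select(explode(from_json('outer_list', schema)).alias('exploded')) \
  .select('exploded.*') \
  .show(truncate=False)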

SQL version, if the input is given as a table json_test:

select exploded.* from (
   select explode(
            from_json(
              outer_list,
              "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
            )
          ) as exploded
   from json_test
)
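
If you are working from a pyspark session rather than the spark-sql CLI, the same statement can be run through spark.sql (a sketch; it assumes the json_test table is visible there, e.g. registered via df.createOrReplaceTempView('json_test')):

spark.sql("""
   select exploded.* from (
      select explode(
               from_json(
                 outer_list,
                 'array<struct<outer_id:int,outer_field_1:string,inner_list:string>>'
               )
             ) as exploded
      from json_test
   )
""").show(truncate=False)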

5 Comments

Hello Matt, thanks for your answer. If I were using pyspark or Scala this would all be easy, but I would like to do this in raw SQL, meaning the "SELECT * FROM xxx" kind of queries you fire into the CLI.
Hi @Matt, the input is given as a JSON string, so the conversion needs to be done with pure SQL as well.
This is pure SQL; see the end of my answer, where I use SQL to parse JSON. I am not sure what more you need.
Hi @matt, the above SQL is not working for me. I wrote the following in my spark-sql query and it's throwing errors: ``` select from_json("[{\"outer_id\": 123000, \"outer_field_1\": \"blah\", \"inner_list\": [{\"inner_id\": 456}, {\"inner_id\": 789}]}]", "array<struct<outer_id:int,outer_field_1:string>>") ``` Seems like a syntax error with from_json: == SQL == array<struct<outer_id:int,outer_field_1:string>> -----^^^ ; line 1 pos 7
I think we need to add a root key and we are good, see: stackoverflow.com/questions/41107835/…
