I need to process a JSON file with the following schema:

root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)

When I load it into a DataFrame there is only one Row, but that row contains many Data and Event elements in the Payload array. (Each element has either Data or Event, never both.)

I would like to extract all the events so that I can process them further or load them into a DB table later. For that I need only the elements of Payload that have an Event; the ones that only have Data can be dropped. Ideally I would end up with a DataFrame whose rows contain only the members of Event.

Unfortunately, when I tried something like this:

df.select("Payload.Event") or df.select("Payload").filter(...)

the filter was still applied at the root level, and since the DataFrame has only one row, that was not very helpful. How can I filter the inner array and get its elements as a separate DataFrame?

Sample JSON:

{
    "Header": {
        "Version": {
            "vfield": "0.6"
        },
        "Format": "DEFAULT"
    },
    "Payload": [
        {"Data": [
            [0, 1, 2],
            [5, 6]
        ]},

        {"Event": {
            "eventName" : "event1",
            "eventCount": 123
        }},
        {"Event": {
            "eventName" : "event2",
            "eventCount": 124
        }},
        { "Data": [
            [5,8],
            [1,2,6]
        ] }
    ]        
}    
  • Can you update your sample JSON data? Commented Jul 18, 2020 at 13:54
  • Have you tried searching SO for "json explode"? For example, "dataframe Spark scala explode json array". Commented Jul 18, 2020 at 14:22

1 Answer

Because Payload is an array, selecting a nested field from it without explode returns a result that is itself an array.

Change df.select("Payload.Event") to df.withColumn("Payload",explode($"Payload")).select("Payload.Event")

Check the code below.

scala> df.printSchema
root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)


scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event").printSchema
root
 |-- Event: struct (nullable = true)
 |    |-- eventCount: long (nullable = true)
 |    |-- eventName: string (nullable = true)


scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event.*").printSchema
root
 |-- eventCount: long (nullable = true)
 |-- eventName: string (nullable = true)
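
The steps above can be put together into a small standalone program. This is only a sketch under a few assumptions: the object name PayloadEvents, the helper extractEvents and the local SparkSession settings are made up for illustration, and the sample document from the question is collapsed onto one line because spark.read.json expects one JSON document per line by default. Elements of Payload that carry only Data end up with a null Event after explode, so the sketch also drops those rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical wrapper object, used only for this illustration.
object PayloadEvents {

  // The sample document from the question, collapsed onto a single line.
  val sample: String =
    """{"Header":{"Version":{"vfield":"0.6"},"Format":"DEFAULT"},"Payload":[{"Data":[[0,1,2],[5,6]]},{"Event":{"eventName":"event1","eventCount":123}},{"Event":{"eventName":"event2","eventCount":124}},{"Data":[[5,8],[1,2,6]]}]}"""

  // Hypothetical helper: one output row per Payload element that has an Event.
  def extractEvents(df: DataFrame): DataFrame =
    df.select(explode(col("Payload")).as("Payload")) // one row per array element
      .where(col("Payload.Event").isNotNull)         // drop the Data-only elements
      .select("Payload.Event.*")                     // flatten Event into columns

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("payload-events")
      .getOrCreate()
    import spark.implicits._

    // For a pretty-printed file, something like
    // spark.read.option("multiLine", true).json(path) would be needed instead.
    val df = spark.read.json(Seq(sample).toDS())

    extractEvents(df).show()
    spark.stop()
  }
}
```

With the sample data this should leave two rows, one for event1 and one for event2, each with the columns eventCount and eventName.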

4 Comments

Thank you, that did the trick. I also added a where clause so that the nulls disappear as well: df.withColumn("Payload",explode(df.col("Payload"))).where(!isnull($"Payload.Event")).select("Payload.Event.*")
Please accept or upvote if this solution helps you .. :)
Hmm. already accepted it, not sure why the tick disappeared. I try again :) Unfortunately I'm a newbie, I can't upvote :(
let me upvote on OP's behalf. :) I indeed like this solution @Srinivas
