I have some JSON that's being read from a file, where each row looks something like this:

    {
        "id": "someGuid",
        "data": {
            "id": "someGuid",
            "data": {
                "players": {
                    "player_1": {
                        "id": "player_1",
                        "locationId": "someGuid",
                        "name": "someName",
                        "assets": {
                            "assetId1": {
                                "isActive": true,
                                "playlists": {
                                    "someId1": true,
                                    "someOtherId1": false
                                }
                            },
                            "assetId2": {
                                "isActive": true,
                                "playlists": {
                                    "someId1": true
                                }
                            }
                        }
                    },
                    "player_2": {
                        "id": "player_2",
                        "locationId": "someGuid",
                        "name": "someName",
                        "dict": {
                            "assetId3": {
                                "isActive": true,
                                "playlists": {
                                    "someId1": true,
                                    "someOtherId1": false
                                }
                            },
                            "assetId4": {
                                "isActive": true,
                                "playlists": {
                                    "someId1": true
                                }
                            }
                        }
                    }
                }
            },
            "lastRefreshed": "2020-01-23T19:29:15.6354794Z",
            "expiresAt": "9999-12-31T23:59:59.9999999",
            "dataSourceId": "someId"
        }
    }

I'm having difficulty figuring out how to use Python or SQL in PySpark on Azure Databricks to turn this JSON into a tabular format like this:

+===========+=============+===============+===========+==============+=============+=================+
| Location  | Player_ID   |    Player     | Asset_ID  | Asset_Active | Playlist_ID | Playlist_Status |
+===========+=============+===============+===========+==============+=============+=================+
|  someId   | player_1    | ThisIsAPlayer | anotherId | TRUE         | someOtherId | FALSE           |
+-----------+-------------+---------------+-----------+--------------+-------------+-----------------+

The challenge is transforming the players property above into multiple rows per location. A location may have any number of players with varying ids. I probably would not be asking this question if the players property were an array of player objects instead of a dictionary, but I have no control over the structure of this document, so this is what I must work with. This is a non-issue in something like Power BI, where manipulating the data is more straightforward.

The farthest I've been able to get is doing something like this:

    df = spark.read.json(filePath).select("data.id", "data.lastRefreshed", "data.expiresAt", "data.dataSourceId", "data.data.players.*")

But this results in a dataframe/table that expands all the nested structs underneath players into columns. I've scoured SO looking for someone with a similar situation, but no luck.

How do I go about exploding/expanding the players column in this dataframe to separate rows?

I'm using PySpark on Spark 2.4.3.

1 Answer

You can use to_json followed by from_json to convert the field from a StructType into a MapType, explode the map, and then pull out the fields you need. For your example JSON, you will need to do this several times:

    from pyspark.sql.functions import explode, from_json, to_json, json_tuple, coalesce

    # Assumes df = spark.read.json(filePath), without the select from the question.
    # 1. players struct -> JSON string -> map of player JSON strings, one row per player
    df.select(explode(from_json(to_json('data.data.players'), "map<string,string>"))) \
      .select(json_tuple('value', 'locationId', 'id', 'name', 'assets', 'dict').alias('Location', 'Player_ID', 'Player', 'assets', 'dict')) \
      .select('*', explode(from_json(coalesce('assets', 'dict'), "map<string,struct<isActive:boolean,playlists:string>>"))) \
      .selectExpr(
        'Location',
        'Player_ID',
        'Player',
        'key as Asset_ID',
        'value.isActive',
        # 3. playlists JSON string -> map, one row per playlist
        'explode(from_json(value.playlists, "map<string,string>")) as (Playlist_ID, Playlist_Status)'
      ) \
      .show()
    # Step 2 is the third select: the assets live under either 'assets' or 'dict',
    # so coalesce picks whichever is present before exploding one row per asset.

    +--------+---------+--------+--------+--------+------------+---------------+
    |Location|Player_ID|  Player|Asset_ID|isActive| Playlist_ID|Playlist_Status|
    +--------+---------+--------+--------+--------+------------+---------------+
    |someGuid| player_1|someName|assetId1|    true|     someId1|           true|
    |someGuid| player_1|someName|assetId1|    true|someOtherId1|          false|
    |someGuid| player_1|someName|assetId2|    true|     someId1|           true|
    |someGuid| player_2|someName|assetId3|    true|     someId1|           true|
    |someGuid| player_2|someName|assetId3|    true|someOtherId1|          false|
    |someGuid| player_2|someName|assetId4|    true|     someId1|           true|
    +--------+---------+--------+--------+--------+------------+---------------+
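
To make the logic of the three explodes easier to follow, here is a rough plain-Python sketch of the same flattening applied to one already-parsed row. It is not a replacement for the Spark code and only uses the standard library; the function name `flatten_players` and the trimmed-down `sample` row are made up for illustration. Each nested loop plays the role of one `explode`, and the `assets`/`dict` fallback mirrors the `coalesce` call.

```python
import json


def flatten_players(row):
    """Yield one tuple per (player, asset, playlist) combination in a row."""
    players = row["data"]["data"]["players"]
    for player in players.values():              # first explode: one item per player
        # Same idea as coalesce('assets', 'dict'): use whichever key is present.
        assets = player.get("assets") or player.get("dict") or {}
        for asset_id, asset in assets.items():   # second explode: one item per asset
            playlists = asset.get("playlists") or {}
            for playlist_id, status in playlists.items():  # third explode
                yield (
                    player["locationId"],
                    player["id"],
                    player["name"],
                    asset_id,
                    asset["isActive"],
                    playlist_id,
                    status,
                )


# Hypothetical sample: a shortened row with the same shape as the question's JSON.
sample = json.loads("""
{"id": "someGuid", "data": {"id": "someGuid", "data": {"players": {
  "player_1": {"id": "player_1", "locationId": "someGuid", "name": "someName",
    "assets": {"assetId1": {"isActive": true,
      "playlists": {"someId1": true, "someOtherId1": false}}}},
  "player_2": {"id": "player_2", "locationId": "someGuid", "name": "someName",
    "dict": {"assetId3": {"isActive": true, "playlists": {"someId1": true}}}}
}}}}
""")

rows = list(flatten_players(sample))
for r in rows:
    print(r)
```

One difference worth noting: in the Spark version the playlist map is declared as `map<string,string>`, so Playlist_Status comes back as a string, whereas this sketch keeps the JSON booleans. As with `explode`, a player with no assets or an asset with no playlists simply produces no rows here.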
1 Comment

This was incredibly useful, and did solve my immediate problem. I don't know if I would have ever come up with a solution like this. How did you arrive at this answer?
