I have a dataframe with schema
root
|-- properties: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
An example of 3 rows from the df appears as:
[]
[{"key":"id", "value":"as143"},
{"key":"user", "value":"John Doe"},
{"key":"email", "value":"[email protected]"},
{"key":"location", "value":"KY, USA"}]
[{"key":"id", "value":"bd143"},
{"key":"user", "value":"Adam Smith"},
{"key":"email", "value":"[email protected]"}]
So each new user is in a new row in the df. The amount of data in each row can be different with null values being allowed.
I would like to make a new df such that each key would be the new column name and the values in the column would be all the corresponding values of that key. For example, I would have a column titled 'user' and each row would have a different user name.
I have tried to access the sub-fields using this code but I get an error.
keys = table.select('properties.key').distinct().collect()[0][0]
table.withColumn('value', split(table.properties.value, ',')).
select(col('value'[0].alias(keys[0]),
select(col('value'[1].alias(keys[1]),
select(col('value'[2].alias(keys[2]),
select(col('value'[3].alias(keys[3])).display()
I have also tried to create a key, value map because I need to do these for other columns in the dataframe and for some of them, it is hard to tell have many key/value pairs there are since they are allowed to be null values so I may use pyspark.sql.functions.from_json.
I feel this is the preferred way, however, I have not had success with that either. I have not been able to convert to a map and I think it is because rather than having "key":"value" pairs I have "key": "key 1", "value": "value 1", "key": "key 2", "value": "value 2", "key": "key 3", "value": "value 3", etc all in the same row.
This is the code I used:
table.withColumn('properties', from_json(table.properties, MapType(StringType(), StringType())))
and I get the error:
cannot resolve 'entries' due to data type mismatch: argument 1 requires string type, however 'table.properties' is of array<struct<key:string, value:string>> type.
I am not exactly sure how to go about converting an array type like this to a map.
propertiescontain multiple users? If so, can you show example? I think this can be done using some pivot after exploding the array but It's not clear on how you can have same key.