I have a DataFrame with this schema:

root
 |-- id: string (nullable = false)
 |-- data_zone_array: array (nullable = true)
 |    |-- element: string (containsNull = true)

The data_zone_array column holds a variable number of more or less predictable string values (or none at all), where each key and value are separated by :. A show(5) output looks like this:

id  |   data_zone_array
1   |   ['name:john', 'surname:wick', 'group:a', 'group:b', 'group:c']
2   |   ['name:joe', 'surname:linda', 'surname:boss', 'group:d', 'group:b']
3   |   ['name:david', 'group:a', 'age:7']
4   |   ['name:mary', 'surname:gilles']
5   |   ['name:charles', 'surname:paul', 'group:d', 'group:b', 'group:c', 'age:6', 'unplanned_attribute_165:thisvalue']
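
For reference, the sample data above can be recreated with something like this (a minimal sketch; column nullability may differ slightly from the schema shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("1", ["name:john", "surname:wick", "group:a", "group:b", "group:c"]),
        ("2", ["name:joe", "surname:linda", "surname:boss", "group:d", "group:b"]),
        ("3", ["name:david", "group:a", "age:7"]),
        ("4", ["name:mary", "surname:gilles"]),
        ("5", ["name:charles", "surname:paul", "group:d", "group:b", "group:c", "age:6", "unplanned_attribute_165:thisvalue"]),
    ],
    "id string, data_zone_array array<string>",
)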

I want to:

  • Extract some of those values according to a list of keys (such as name and surname) - knowing that their destination types are predictable (name will always be a single string and surname an array of strings)
  • Place all other found attributes in a struct containing string arrays. Note that there will be unpredictable attributes such as unplanned_attribute_165.

It would give this kind of schema:

root
 |-- id: string (nullable = false)
 |-- name: string (nullable = true)
 |-- surname: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- other_attributes: struct (nullable = true)
 |    |-- <attrx>: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- <attry>: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- ...

With records like:

id  |   name        |   surname             |   other_attributes
1   |   'john'      |   ['wick']            |   {group:['a','b','c']}
2   |   'joe'       |   ['boss', 'linda']   |   {group:['b', 'd']}
3   |   'david'     |   <null>              |   {group: ['a'], age:['7']}
4   |   'mary'      |   ['gilles']          |   <null>
5   |   'charles'   |   ['paul']            |   {group: ['b','c','d'], age:['6'], unplanned_attribute_165:['thisvalue']}

Any idea on how to perform such operations?

1 Answer


Here's one way of doing it.

First, explode the column data_zone_array and extract keys and values into separate columns key and value by splitting on :.
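
For illustration only, the intermediate DataFrame after this first step looks like the sketch below (exploded is just an illustrative name; the full code further down does the same inline):

import pyspark.sql.functions as F

exploded = (df.withColumn("data_zone_array", F.explode("data_zone_array"))
            .withColumn("key", F.split("data_zone_array", ":")[0])
            .withColumn("value", F.split("data_zone_array", ":")[1]))
# Each array element becomes its own row; for id 1 this gives rows like
# ('1', 'name:john', 'name', 'john'), ('1', 'surname:wick', 'surname', 'wick'), ('1', 'group:a', 'group', 'a'), ...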

Then, group by id and key and collect the list of values associated with each key, then group by id again to create a map key -> [values].
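
Continuing the sketch, the two aggregations leave one map per id (keyed is just an illustrative name; the order of values inside each list is not guaranteed by collect_list):

keyed = (exploded.groupBy("id", "key").agg(F.collect_list("value").alias("values"))
         .groupBy("id").agg(F.map_from_arrays(F.collect_list("key"),
                                              F.collect_list("values")).alias("attributes")))
# For id 1, attributes is roughly {'name' -> ['john'], 'surname' -> ['wick'], 'group' -> ['a', 'b', 'c']}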

Finally, select the keys you want as columns and filter the rest of the keys out of the map using map_keys + filter + transform to create the other_attributes column.

import pyspark.sql.functions as F

# explode to one row per element, then split each 'key:value' string
df1 = (df.withColumn("data_zone_array", F.explode("data_zone_array"))
       .withColumn("key", F.split("data_zone_array", ":")[0])
       .withColumn("value", F.split("data_zone_array", ":")[1])
       # collect all values per (id, key), then build a map key -> [values] per id
       .groupBy("id", "key").agg(F.collect_list("value").alias("values"))
       .groupBy("id").agg(F.map_from_arrays(F.collect_list("key"), F.collect_list("values")).alias("attributes"))
       # pull out the known keys and keep the remaining ones as an array of (key, value) structs
       .select("id",
               F.col("attributes")["name"].alias("name"),
               F.col("attributes")["surname"].alias("surname"),
               F.expr("""transform(
                          filter(map_keys(attributes), k -> k not in('name', 'surname')),
                           x -> struct(x as key, attributes[x] as value)
                      )""").alias("other_attributes")
               )
       )

df1.show(truncate=False)
# +---+---------+-------------+------------------------------------------------------------------------+
# |id |name     |surname      |other_attributes                                                        |
# +---+---------+-------------+------------------------------------------------------------------------+
# |5  |[charles]|[paul]       |[{group, [d, b, c]}, {age, [6]}, {unplanned_attribute_165, [thisvalue]}]|
# |1  |[john]   |[wick]       |[{group, [a, b, c]}]                                                    |
# |3  |[david]  |null         |[{group, [a]}, {age, [7]}]                                              |
# |2  |[joe]    |[linda, boss]|[{group, [d, b]}]                                                       |
# |4  |[mary]   |[gilles]     |[]                                                                      |
# +---+---------+-------------+------------------------------------------------------------------------+
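
A possible variation, if a single map of key -> values is preferred for other_attributes instead of an array of structs (a sketch assuming Spark 3.0+, where the SQL function map_filter is available):

import pyspark.sql.functions as F

df2 = (df.withColumn("data_zone_array", F.explode("data_zone_array"))
       .withColumn("key", F.split("data_zone_array", ":")[0])
       .withColumn("value", F.split("data_zone_array", ":")[1])
       .groupBy("id", "key").agg(F.collect_list("value").alias("values"))
       .groupBy("id").agg(F.map_from_arrays(F.collect_list("key"), F.collect_list("values")).alias("attributes"))
       .select("id",
               F.col("attributes")["name"].alias("name"),
               F.col("attributes")["surname"].alias("surname"),
               # keep only the keys that were not extracted as dedicated columns
               F.expr("map_filter(attributes, (k, v) -> k not in ('name', 'surname'))").alias("other_attributes")))
# other_attributes is then a map<string, array<string>>, e.g. {group -> [a, b, c]} for id 1.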

2 Comments

Thank you for your solution. I actually came across this approach for the early stages, but there's something that bothers me: the combination of explode and collect_list might cause a serious performance issue in my opinion (correct me if I am wrong). As data_zone_array in reality contains more than 15 different attributes, it would multiply the number of rows in the resulting dataset by at least 15 (more if there are several values for the same attribute) for an initial dataset of 1M rows, which grows quickly. And the subsequent aggregation is a heavy operation as well.
I have actually tested this solution and it works well! I am just wondering if, in other_attributes, it would be possible to have a struct (key, value) of arrays instead of an array of (key, value) structs, such as {group: [d,b,c], age: [6]} instead of [{key:group, value:[d,b,c]}, {key:age, value:[6]}].
