I have a DataFrame with this schema:

root
 |-- id: string (nullable = false)
 |-- data_zone_array: array (nullable = true)
 |    |-- element: string (containsNull = true)

The data_zone_array column holds a variable number of more or less predictable string values (or none at all), where each key and value are separated by :. A show(5) output looks like this:

id  |   data_zone_array
1   |   ['name:john', 'surname:wick', 'group:a', 'group:b', 'group:c']
2   |   ['name:joe', 'surname:linda', 'surname:boss', 'group:d', 'group:b']
3   |   ['name:david', 'group:a', 'age:7']
4   |   ['name:mary', 'surname:gilles']
5   |   ['name:charles', 'surname:paul', 'group:d', 'group:b', 'group:c', 'age:6', 'unplanned_attribute_165:thisvalue']
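
For reference, the sample data above can be recreated with something like this (a minimal sketch; column nullability may differ slightly from the schema shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("1", ["name:john", "surname:wick", "group:a", "group:b", "group:c"]),
        ("2", ["name:joe", "surname:linda", "surname:boss", "group:d", "group:b"]),
        ("3", ["name:david", "group:a", "age:7"]),
        ("4", ["name:mary", "surname:gilles"]),
        ("5", ["name:charles", "surname:paul", "group:d", "group:b", "group:c", "age:6", "unplanned_attribute_165:thisvalue"]),
    ],
    "id string, data_zone_array array<string>",
)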

I want to:

  • Extract some of those values according to a list of keys (such as name and surname) - knowing that their destination types are predictable (name will always be a single string and surname an array of strings)
  • Place all other found attributes in a struct containing string arrays. Note that there will be unpredictable attributes such as unplanned_attribute_165.

It would give this kind of schema:

root
 |-- id: string (nullable = false)
 |-- name: string (nullable = true)
 |-- surname: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- other_attributes: struct (nullable = true)
 |    |-- <attrx>: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- <attry>: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- ...

With records like:

id  |   name        |   surname             |   other_attributes
1   |   'john'      |   ['wick']            |   {group:['a','b','c']}
2   |   'joe'       |   ['boss', 'linda']   |   {group:['b', 'd']}
3   |   'david'     |   <null>              |   {group: ['a'], age:['7']}
4   |   'mary'      |   ['gilles']          |   <null>
5   |   'charles'   |   ['paul']            |   {group: ['b','c','d'], age:['6'], unplanned_attribute_165:['thisvalue']}

Any idea on how to perform such operations?

1 Answer


Here's one way of doing it.

First, explode the column data_zone_array and extract keys and values into separate columns key and value by splitting on :.
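
For illustration only, the intermediate DataFrame after this first step looks like the sketch below (exploded is just an illustrative name; the full code further down does the same inline):

import pyspark.sql.functions as F

exploded = (df.withColumn("data_zone_array", F.explode("data_zone_array"))
            .withColumn("key", F.split("data_zone_array", ":")[0])
            .withColumn("value", F.split("data_zone_array", ":")[1]))
# Each array element becomes its own row; for id 1 this gives rows like
# ('1', 'name:john', 'name', 'john'), ('1', 'surname:wick', 'surname', 'wick'), ('1', 'group:a', 'group', 'a'), ...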

Then, group by id and key and collect the list of values associated with each key, then group by id again to create a map key -> [values].
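
Continuing the sketch, the two aggregations leave one map per id (keyed is just an illustrative name; the order of values inside each list is not guaranteed by collect_list):

keyed = (exploded.groupBy("id", "key").agg(F.collect_list("value").alias("values"))
         .groupBy("id").agg(F.map_from_arrays(F.collect_list("key"),
                                              F.collect_list("values")).alias("attributes")))
# For id 1, attributes is roughly {'name' -> ['john'], 'surname' -> ['wick'], 'group' -> ['a', 'b', 'c']}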

Finally, select the keys you want as columns and filter the rest of the keys out of the map using map_keys + filter + transform to create the other_attributes column.

import pyspark.sql.functions as F

# explode to one row per element, then split each 'key:value' string
df1 = (df.withColumn("data_zone_array", F.explode("data_zone_array"))
       .withColumn("key", F.split("data_zone_array", ":")[0])
       .withColumn("value", F.split("data_zone_array", ":")[1])
       # collect all values per (id, key), then build a map key -> [values] per id
       .groupBy("id", "key").agg(F.collect_list("value").alias("values"))
       .groupBy("id").agg(F.map_from_arrays(F.collect_list("key"), F.collect_list("values")).alias("attributes"))
       # pull out the known keys and keep the remaining ones as an array of (key, value) structs
       .select("id",
               F.col("attributes")["name"].alias("name"),
               F.col("attributes")["surname"].alias("surname"),
               F.expr("""transform(
                          filter(map_keys(attributes), k -> k not in('name', 'surname')),
                           x -> struct(x as key, attributes[x] as value)
                      )""").alias("other_attributes")
               )
       )

df1.show(truncate=False)
# +---+---------+-------------+------------------------------------------------------------------------+
# |id |name     |surname      |other_attributes                                                        |
# +---+---------+-------------+------------------------------------------------------------------------+
# |5  |[charles]|[paul]       |[{group, [d, b, c]}, {age, [6]}, {unplanned_attribute_165, [thisvalue]}]|
# |1  |[john]   |[wick]       |[{group, [a, b, c]}]                                                    |
# |3  |[david]  |null         |[{group, [a]}, {age, [7]}]                                              |
# |2  |[joe]    |[linda, boss]|[{group, [d, b]}]                                                       |
# |4  |[mary]   |[gilles]     |[]                                                                      |
# +---+---------+-------------+------------------------------------------------------------------------+
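
A possible variation, if a single map of key -> values is preferred for other_attributes instead of an array of structs (a sketch assuming Spark 3.0+, where the SQL function map_filter is available):

import pyspark.sql.functions as F

df2 = (df.withColumn("data_zone_array", F.explode("data_zone_array"))
       .withColumn("key", F.split("data_zone_array", ":")[0])
       .withColumn("value", F.split("data_zone_array", ":")[1])
       .groupBy("id", "key").agg(F.collect_list("value").alias("values"))
       .groupBy("id").agg(F.map_from_arrays(F.collect_list("key"), F.collect_list("values")).alias("attributes"))
       .select("id",
               F.col("attributes")["name"].alias("name"),
               F.col("attributes")["surname"].alias("surname"),
               # keep only the keys that were not extracted as dedicated columns
               F.expr("map_filter(attributes, (k, v) -> k not in ('name', 'surname'))").alias("other_attributes")))
# other_attributes is then a map<string, array<string>>, e.g. {group -> [a, b, c]} for id 1.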

2 Comments

Thank you for your solution. I actually came across this approach for the early stages, but there's something that bothers me: the combination of explode and collect_list might cause a serious performance issue in my opinion (correct me if I am wrong). As data_zone_array in reality contains more than 15 different attributes, it would multiply the number of rows in the resulting dataset by at least 15 (more if there are several values for the same attribute) for an initial dataset of 1M rows, which grows quickly. And the subsequent aggregation is a heavy operation as well.
I have actually tested this solution and it works well! I am just wondering if, in other_attributes, it would be possible to have a struct (key, value) of arrays instead of an array of (key, value) structs, such as {group: [d,b,c], age: [6]} instead of [{key:group, value:[d,b,c]}, {key:age, value:[6]}].
