I have a DataFrame with this schema:
root
|-- id: string (nullable = false)
|-- data_zone_array: array (nullable = true)
| |-- element: string (containsNull = true)
The data_zone_array column holds zero or more strings, each made of a key and a value separated by a colon (:). The keys are more or less predictable. Here is the show(5) output:
id | data_zone_array
1 | ['name:john', 'surname:wick', 'group:a', 'group:b', 'group:c']
2 | ['name:joe', 'surname:linda', 'surname:boss', 'group:d', 'group:b']
3 | ['name:david', 'group:a', 'age:7']
4 | ['name:mary', 'surname:gilles']
5 | ['name:charles', 'surname:paul', 'group:d', 'group:b', 'group:c', 'age:6', 'unplanned_attribute_165:thisvalue']
I want to:
- Extract some of those values according to a list of keys (such as name and surname), knowing that their destination types are predictable (name will always be a single string and surname an array of strings).
- Place all other found attributes in a struct containing string arrays. Note that there will be unpredictable attributes such as unplanned_attribute_165.
This would give a schema like the following:
root
|-- id: string (nullable = false)
|-- name: string (nullable = true)
|-- surname: array (nullable = true)
| |-- element: string (containsNull = true)
|-- other_attributes: struct (nullable = true)
| |-- <attrx>: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- <attry>: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- ......
With records like:
id | name | surname | other_attributes
1 | 'john' | ['wick'] | {group:['a','b','c']}
2 | 'joe' | ['boss', 'linda'] | {group:['b', 'd']}
3 | 'david' | <null> | {group: ['a'], age:['7']}
4 | 'mary' | ['gilles'] | <null>
5 | 'charles' | ['paul'] | {group: ['b','c','d'], age:['6'], unplanned_attribute_165:['thisvalue']}
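To make the intended per-row logic concrete, here is a minimal plain-Python sketch of the transformation for a single data_zone_array value. It is not a Spark solution. The function name split_attributes and its parameters scalar_keys and array_keys are hypothetical names chosen for illustration. It assumes values are sorted (as in the records above), that only the first ":" separates key from value, and that when a scalar key appears more than once the first value in sorted order is kept.

```python
from collections import defaultdict


# Hypothetical helper for illustration only; not part of Spark or any library.
def split_attributes(pairs, scalar_keys=("name",), array_keys=("surname",)):
    """Split 'key:value' strings into planned columns and a catch-all dict."""
    grouped = defaultdict(list)
    for pair in pairs or []:  # tolerate a null array
        if pair is None or ":" not in pair:
            continue  # skip null or malformed elements
        key, value = pair.split(":", 1)  # assumption: split on the first ':' only
        grouped[key].append(value)

    row = {}
    for key in scalar_keys:
        values = sorted(grouped.pop(key, []))
        # Assumption: keep the first sorted value if a scalar key repeats.
        row[key] = values[0] if values else None
    for key in array_keys:
        values = sorted(grouped.pop(key, []))
        row[key] = values or None  # null when the key is absent
    # Everything left over goes into the catch-all, one sorted list per key.
    others = {key: sorted(values) for key, values in grouped.items()}
    row["other_attributes"] = others or None
    return row


if __name__ == "__main__":
    sample = ["name:joe", "surname:linda", "surname:boss", "group:d", "group:b"]
    print(split_attributes(sample))
```

Something like this could presumably be wrapped in a UDF, but since the keys under other_attributes vary from row to row while a DataFrame has a single fixed schema, I suspect a map of string to array of strings may be needed there rather than a true struct.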
Any idea on how to perform such operations?