Merging column with array from multiple rows

Question

I'm trying to merge the data from a dataset as follows:

id	sms	longDescription	OtherFields
123	contentSms	ContentDesc	xxx
123	contentSms2	ContentDesc2	xxx
123	contentSms3	ContentDesc3	xxx
456	contentSms4	ContentDesc	xxx

The sms and longDescription columns have the following structure:

sms:array
|----element:struct
      |----content:string
      |----languageId:string

The aim is to group the rows that share the same id and merge the sms and longDescription columns, so that each becomes a single array of structs (with languageId as the key):

id	sms	longDescription	OtherFields
123	contentSms, contentSms2, contentSms3	ContentDesc, ContentDesc2, ContentDesc3	xxx
456	contentSms4	ContentDesc	xxx

I've tried using

x = df.select("*").groupBy("id").agg( collect_list("sms"))

but the result is:

collect_list(longDescription): array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- content: string (nullable = true)
 |    |    |    |-- languageId: string (nullable = true)

which has one level of array nesting too many. The goal is a single array of structs, giving the following result:

sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]

blackbishop · Accepted Answer · 2022-01-20 14:57:42Z

You're looking for the flatten function, which merges an array of arrays into a single array:

from pyspark.sql.functions import collect_list, flatten
x = df.groupBy("id").agg(flatten(collect_list("sms")))
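To make the difference concrete, here is a minimal plain-Python sketch (not Spark code) of what collect_list alone produces for each id and what flatten does to it. The sample rows, the helper names collect_list_by_id and flatten_one_level, and the "qsd" value are hypothetical and only mirror the shape of the data in the question.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample rows mirroring the question: each "sms" value is an
# array (list) of structs (dicts) with "content" and "languageId".
rows = [
    {"id": 123, "sms": [{"content": "aze", "languageId": "en-GB"}]},
    {"id": 123, "sms": [{"content": "rty", "languageId": "fr-BE"}]},
    {"id": 123, "sms": [{"content": "poiu", "languageId": "nl-BE"}]},
    {"id": 456, "sms": [{"content": "qsd", "languageId": "en-GB"}]},
]


def collect_list_by_id(rows, column):
    """Model of groupBy("id").agg(collect_list(column)).

    Each row's value is appended unchanged, so a column that already holds
    arrays ends up as an array of arrays.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append(row[column])
    return dict(grouped)


def flatten_one_level(nested):
    """Model of flatten: concatenate the inner arrays into one array."""
    return list(chain.from_iterable(nested))


nested = collect_list_by_id(rows, "sms")
flat = {key: flatten_one_level(value) for key, value in nested.items()}

print(nested[123])  # array of arrays of structs (one level too many)
print(flat[123])    # array of structs (the desired shape)
```

This model keeps the input row order, whereas Spark does not guarantee the order of elements returned by collect_list. In Spark itself, the same pattern would presumably be applied to the second column as well, for example by adding flatten(collect_list("longDescription")) to the same agg call.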
