2

I'm trying to merge the data from a dataset as follow:

id sms longDescription OtherFields
123 contentSms ContentDesc xxx
123 contentSms2 ContentDesc2 xxx
123 contentSms3 ContentDesc3 xxx
456 contentSms4 ContentDesc xxx

the sms and longDescription have the following structure:

sms:array
|----element:struct
      |----content:string
      |----languageId:string

The aim is to capture the data with the same Id and merge the column sms and longDescription into one array with multiple struct( with the languageID as key):

id sms longDescription OtherFields
123 contentSms, ContentSms2,contentSms3 ContentDesc,ContentDesc2,ContentDesc3 xxx
456 contentSms4 ContentDesc xxx

I've tried using

x = df.select("*").groupBy("id").agg( collect_list("sms"))

but the result is :

collect_list(longDescription): array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- content: string (nullable = true)
 |    |    |    |-- languageId: string (nullable = true)

which is an array too much, as the goal is to have an array of struct in order to have the following result:

sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]

1 Answer 1

3

You're looking for flatten function:

x = df.groupBy("id").agg(flatten(collect_list("sms")))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.