I initially have a DataFrame like the following:
Key   Emails                      PassportNum  Age
0001  [Alan@gmail,Alan@hotmail]   passport1    23
0002  [Ben@gmail,Ben@hotmail]     passport2    28
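For reference, a minimal sketch of how this sample could be built (assuming a SparkSession named spark is already in scope):

import spark.implicits._

// Sample data matching the table above
val df = Seq(
  ("0001", Seq("Alan@gmail", "Alan@hotmail"), "passport1", 23),
  ("0002", Seq("Ben@gmail", "Ben@hotmail"), "passport2", 28)
).toDF("Key", "Emails", "PassportNum", "Age")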
I need to apply a function to each Email, something dummy like appending "_2" for example; the exact operation is not relevant. So I explode this column like this:
val dfExploded = df.withColumn("Email", explode($"Emails")).drop("Emails")  // assumes import org.apache.spark.sql.functions._ and import spark.implicits._
Now I have a DataFrame like this:
Key   Email         PassportNum  Age
0001  Alan@gmail    passport1    23
0001  Alan@hotmail  passport1    23
0002  Ben@gmail     passport2    28
0002  Ben@hotmail   passport2    28
I apply the change to each email (a sketch of this step is shown after the table), and then what I want to get back is this:
Key   Emails                          PassportNum  Age
0001  [Alan_2@gmail,Alan_2@hotmail]   passport1    23
0002  [Ben_2@gmail,Ben_2@hotmail]     passport2    28
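For concreteness, that dummy step could look something like the following sketch; the exact operation doesn't matter, and inserting "_2" before the @ is just my assumption to match the example output:

// Hypothetical dummy transformation applied to each exploded row
val dfChanged = dfExploded.withColumn("Email", regexp_replace($"Email", "@", "_2@"))  // regexp_replace comes from org.apache.spark.sql.functions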
The option I was considering was this:
dfOriginal = dfExploded.groupBy("Key","PassportNum","Age").agg(collect_set("Email").alias("Emails"))
In this case that may not be such a bad approach. But in my real case, I perform the explode over a single column and have around 20 other columns like PassportNum, Age, etc., which are going to be duplicated.
That means I would need to add around 20 columns to the groupBy, when I could really group by a single one, for example Key, which is unique.
I was thinking of adding these columns to the agg as well, like this:
dfOriginal = dfExploded.groupBy("Key").agg(collect_set("Email").alias("Emails"),collect_set("PassportNum"),collect_set("Age"))
But I don't want them to end up in single-element arrays.
Is there any way to aggregate without any collect_*? Is there a simpler approach to undo the explode? Or could I use first for PassportNum and Age, since they will have the same values anyway after the explode?
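For reference, what I have in mind with first is roughly the following sketch (assuming dfChanged is the exploded DataFrame after the email change; I haven't verified this is a good approach):

import org.apache.spark.sql.functions.{collect_list, first}

val dfRebuilt = dfChanged
  .groupBy("Key")                                // group only by the unique key
  .agg(
    collect_list("Email").alias("Emails"),       // rebuild the array column
    first("PassportNum").alias("PassportNum"),   // identical for every row of a Key
    first("Age").alias("Age")
  )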