3

I'm using Java Spark and I have 1 Dataframe like this

+---+-----+------+
|id |color|datas |
+----------------+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+----------------+

I need to modify this dataframe to look like this :

+---+--------------------+---------------------+
|id |color               |datas                |
+----------------------------------------------+
|1  |[blue, red, orange] |[data1, data2, data3]|
|2  |[black, yellow]     |[data4, data5]       |
|3  |[white]             |[data7, data8]       |
+----------------------------------------------+

I want to merge the data to create an 'array' of the same column but from differents rows based on the 'id' column.

I'm able to do it throught UserDefinedAggregateFunction but I can only do it one column at a time and it takes too much time to process.

Thank you

Edit : I'm using Spark 1.6

2 Answers 2

2

you can group by "id" and then use collect_list function to get the aggregated values.

dataframe.groupBy("id").agg(collect_list(struct("color")).as("color"), collect_list(struct("dates")).as("dates") )

Hope this helps

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for the fast answer but unfortunately, I got a "org.apache.spark.sql.AnalysisException: undefined function collect_list" when I try with collect_list I'm using Spark 1.6
did you import the function?? import static org.apache.spark.sql.functions.*;
yes I did, I found this answer for the undefined function stackoverflow.com/questions/35324049/… I'm trying to solve it
if you import function from spark sql it should work.
0

The actual function that works for me is:

dataframe.groupBy("id").agg(collect_list("color").as("color"), collect_list("date").as("date") ) dataframe.createOrReplaceTempView("dataframe")

Then create a new query where you can use the struct()

dffinal = spark.sql(s"""SELECT struct(a.color) AS colors, struct(a.date) AS dates FROM dataframe a """)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.