I am trying to split a DataFrame into multiple arrays, one per id.

So I have a table

id|name
12|a
12|b
12|c
13|x
13|y
13|z

and I want to get multiple vectors that look like:

<a,b,c> <x,y,z> 

So I have managed to get all the different IDs using:

val ids=dataframe.select("id").distinct.collect.flatMap(_.toSeq)

and that would return 12 and 13. And I have tried to get for each one of them the names:

val namesArray=ids.map(id=>dataframe.where($"id"===id))

but that doesn't seem to be the right approach, since it returns the column names rather than the values. I need a way to get only the names out of it.

1 Answer

If you already have a DataFrame with your data, you can do the following:

val yourDataSet = sc.parallelize(List((12, "a"), (12, "b"), (13, "y"), (13, "z"))).toDF("id", "val")

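import org.apache.spark.sql.functions.collect_list

val requiredDataSet = yourDataSet
  .groupBy("id")
  .agg(collect_list("val").as("vals"))  // alias the aggregate so it can be selected by name
  .select("vals")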
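
For the question's own column names, the same idea can be sketched end to end. This is only a sketch under assumptions: it assumes Spark 2.x or later (for `SparkSession`), a local session created purely for illustration, and sample rows where id 13 maps to x, y and z. The name `namesById` is made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-names-by-id")
  .getOrCreate()
import spark.implicits._

// Assumed sample data mirroring the table in the question.
val dataframe = Seq(
  (12, "a"), (12, "b"), (12, "c"),
  (13, "x"), (13, "y"), (13, "z")
).toDF("id", "name")

// One row per id, with all names gathered into an array column,
// then brought back to the driver as a Map from id to Vector of names.
val namesById: Map[Int, Vector[String]] = dataframe
  .groupBy("id")
  .agg(collect_list("name").as("names"))
  .collect()
  .map(row => row.getInt(0) -> row.getSeq[String](1).toVector)
  .toMap
```

Note that `collect_list` does not guarantee the order of elements within each list after a shuffle, so sort the vectors if the order matters. Collecting to the driver is only reasonable when the grouped result is small.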

Or you can achieve this by getting the underlying RDD and transforming it:

val valueVectorRdd = dataframe.rdd
  .map(row => (row.getInt(0), row.getString(1)))    // (id, name) pairs
  .groupByKey()                                     // id -> all names for that id
  .map({ case (_, names) => names.toVector })       // keep only the names
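
To check the expected shape of the result without a cluster, the same grouping can be mimicked on a plain Scala collection. This is only an illustration of what the RDD pipeline computes, not Spark code; the sample rows and the names `rows` and `vectorsById` are assumptions made for the example.

```scala
// Assumed sample data: (id, name) pairs as in the question's table.
val rows = Seq(
  (12, "a"), (12, "b"), (12, "c"),
  (13, "x"), (13, "y"), (13, "z")
)

// Group the pairs by id, then keep only the names of each group.
val vectorsById: Map[Int, Vector[String]] =
  rows.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2).toVector }

// Drop the ids to get just the vectors, as asked in the question.
val vectors: Seq[Vector[String]] = vectorsById.values.toSeq
```

On a local `Seq`, `groupBy` keeps the elements of each group in their original order. The distributed `groupByKey` makes no such promise.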

2 Comments

Thank you @Saravesh Kumar Singh for your reply. collect_list is not being recognized by the compiler. What did you mean by that?
org.apache.spark.sql.functions.collect_list
