
I have data in the following format:

+-----+---------------+
| name|           Data|
+-----+---------------+
|Alpha|      [A, B, C]|
| Beta|   [A, B, C, D]|
|Gamma|[A, B, C, D, E]|
+-----+---------------+

How can I transform it into the following?

+-----+----+-----+-----+-----+-----+
| name|   A|    B|    C|    D|    E|
+-----+----+-----+-----+-----+-----+
|Alpha|   1|    1|    1|    0|    0|
| Beta|   1|    1|    1|    1|    0|
|Gamma|   1|    1|    1|    1|    1|
+-----+----+-----+-----+-----+-----+

Thanks to @Jarrod Baker for help in similar transformation earlier

Here is the code that I have:

val df = Seq(
  ("Alpha", Array("A", "B", "C")),
  ("Beta", Array("A", "B", "C", "D")),
  ("Gamma", Array("A", "B", "C", "D", "E"))
).toDF("name", "Data")
df.show()

val arrayDataSize = df.withColumn("arr_size", size(col("Data"))).agg(max("arr_size") as "maxSize")

val newDF = df.select(($"name") +: (0 until arrayDataSize.first.getInt(0)).map(i => ($"Data")(i).contains("A").alias("A")): _*)

newDF.show()
+-----+----+-----+-----+-----+-----+
| name|   A|    A|    A|    A|    A|
+-----+----+-----+-----+-----+-----+
|Alpha|true|false|false| null| null|
| Beta|true|false|false|false| null|
|Gamma|true|false|false|false|false|
+-----+----+-----+-----+-----+-----+

Thanks in advance for your help.


1 Answer


You can use the RelationalGroupedDataset's pivot method to achieve what you want. To obtain such a grouped dataset, call groupBy on your Dataset.

It would look something like this:

import spark.implicits._

val df = Seq(
  ("Alpha", Seq("A", "B", "C")),
  ("Beta", Seq("A", "B", "C", "D")),
  ("Gamma", Seq("A", "B", "C", "D", "E"))
).toDF("name", "Data")

val output = df
  .select(df("name"), explode(col("Data")).alias("Data"))
  .groupBy("name")
  .pivot("Data")
  .count()

output.show()
+-----+---+---+---+----+----+
| name|  A|  B|  C|   D|   E|
+-----+---+---+---+----+----+
| Beta|  1|  1|  1|   1|null|
|Gamma|  1|  1|  1|   1|   1|
|Alpha|  1|  1|  1|null|null|
+-----+---+---+---+----+----+

As you can see, we first explode the sequences into separate rows. This lets us treat each element of each sequence as a separate "entity".
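
For illustration, this is roughly what the intermediate, exploded DataFrame looks like (using the `df` defined above; row order in `show()` output is not guaranteed):

```scala
// After explode, each array element becomes its own row,
// paired with the name from its original row.
val exploded = df.select(df("name"), explode(col("Data")).alias("Data"))
exploded.show()
// +-----+----+
// | name|Data|
// +-----+----+
// |Alpha|   A|
// |Alpha|   B|
// |Alpha|   C|
// | Beta|   A|
// ...
```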

Then we use groupBy to obtain our RelationalGroupedDataset, after which we pivot on the exploded column and count the occurrences.
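
One last detail: count() leaves null where a letter is absent, whereas your desired output shows 0. Assuming that is what you want, you can fill those gaps with na.fill:

```scala
// Replace null counts with 0 so absent letters show as 0 rather than null.
val filled = output.na.fill(0)
filled.show()
```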
