
What is the best way to transpose rows into columns when each row carries multiple value fields?

I have a dataframe as below.

val inputDF = Seq(
                  ("100","A", 10, 200),
                  ("100","B", 20, 300),
                  ("101","A", 30, 100)
              ).toDF("ID", "Type", "Value", "Allocation")

I want to generate a dataframe as below.

val outputDF = Seq(
                  ("100", Some(10), Some(200), Some(20), Some(300)),
                  ("101", Some(30), Some(100), None, None)
              ).toDF("ID", "Value_A", "Allocation_A", "Value_B", "Allocation_B")

I tried to use pivot as below.

val outputDF = inputDF.groupBy("ID", "Type").pivot("Type").agg(first("Value"), first("Allocation"))

It generated something as below, which is not what I wanted.

+---+----+--------------+-------------------+--------------+-------------------+
| ID|Type|A_first(Value)|A_first(Allocation)|B_first(Value)|B_first(Allocation)|
+---+----+--------------+-------------------+--------------+-------------------+
|100|   B|          null|               null|            20|                300|
|100|   A|            10|                200|          null|               null|
|101|   A|            30|                100|          null|               null|
+---+----+--------------+-------------------+--------------+-------------------+

Thank you very much!

  • I identified the issue. It should be val outputDF = inputDF.groupBy("ID").pivot("Type").agg(first("Value"), first("Allocation")). It then generated the dataframe I expected. The columns are named A_first(Value), A_first(Allocation), B_first(Value), B_first(Allocation). Can I rename them somehow in the same statement with the pivot functions? Commented Feb 8, 2021 at 2:14
  • you could just provide aliases to the columns and it should be good Commented Feb 8, 2021 at 3:56
  • Thanks, but how should I give the aliases for "Value_A" and "Value_B" separately? For example, I can give first("Value").alias("Value"), but I cannot make it produce the names "Value_A" and "Value_B" automatically according to the pivot results. Commented Feb 8, 2021 at 6:56

1 Answer


This might not be the most efficient/clean approach but it works when you provide aliases.

//Source data
val inputDF = Seq(
                  ("100","A", 10, 200),
                  ("100","B", 20, 300),
                  ("101","A", 30, 100)
              ).toDF("ID", "Type", "Value", "Allocation")

//Grab the value and allocation column names from the source schema
val valueColumn = inputDF.columns.tail(1)       // "Value"
val allocationColumn = inputDF.columns.tail(2)  // "Allocation"

import org.apache.spark.sql.functions._
val outputDF = inputDF.groupBy("ID").pivot("Type")
  .agg(first(valueColumn).as(valueColumn), first(allocationColumn).as(allocationColumn))
outputDF.show()

You can see the output as below :

+---+-------+------------+-------+------------+
| ID|A_Value|A_Allocation|B_Value|B_Allocation|
+---+-------+------------+-------+------------+
|100|     10|         200|     20|         300|
|101|     30|         100|   null|        null|
+---+-------+------------+-------+------------+

I could not identify a way to make it more generic. Note also that the column names come out prefixed with the pivot value (A_Value rather than Value_A), but that should still be enough to distinguish the columns by their Type value.

See if it helps or someone could come up with a more generic/dynamic approach based on this answer.
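For the renaming question in the comments (getting Value_A instead of A_Value), one option is to post-process the pivoted column names: split each on the first underscore and swap the halves. This is just a sketch, not part of the answer above; RenamePivotColumns and its rename helper are hypothetical names.

```scala
// Sketch: turn pivot-style names like "A_Value" into "Value_A".
// `pivotValues` are the distinct values of the pivot column ("A", "B" here);
// column names that do not start with one of them (e.g. "ID") pass through.
object RenamePivotColumns {
  def rename(columns: Seq[String], pivotValues: Seq[String]): Seq[String] =
    columns.map { c =>
      c.split("_", 2) match {
        case Array(prefix, rest) if pivotValues.contains(prefix) => s"${rest}_$prefix"
        case _ => c
      }
    }
}
```

You could then apply it to the pivoted dataframe with something like `outputDF.toDF(RenamePivotColumns.rename(outputDF.columns, Seq("A", "B")): _*)`.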
