
What is the best way to transpose rows into columns when each row carries multiple value fields?

I have a dataframe as below.

val inputDF = Seq(
                  ("100","A", 10, 200),
                  ("100","B", 20, 300),
                  ("101","A", 30, 100)
              ).toDF("ID", "Type", "Value", "Allocation")

I want to generate a dataframe as below.

val outputDF = Seq(
                  ("100", Some(10), Some(200), Some(20), Some(300)),
                  ("101", Some(30), Some(100), None, None)
              ).toDF("ID", "Value_A", "Allocation_A", "Value_B", "Allocation_B")

I tried to use pivot as below.

val outputDF = inputDF.groupBy("ID", "Type").pivot("Type").agg(first("Value"), first("Allocation"))

It generated something as below, which is not what I wanted.

+---+----+--------------+-------------------+--------------+-------------------+
| ID|Type|A_first(Value)|A_first(Allocation)|B_first(Value)|B_first(Allocation)|
+---+----+--------------+-------------------+--------------+-------------------+
|100|   B|          null|               null|            20|                300|
|100|   A|            10|                200|          null|               null|
|101|   A|            30|                100|          null|               null|
+---+----+--------------+-------------------+--------------+-------------------+

Thank you very much!

  • I identified the issue. It should be val outputDF = inputDF.groupBy("ID").pivot("Type").agg(first("Value"), first("Allocation")). It then generated the dataframe I expected. The columns are named A_first(Value), A_first(Allocation), B_first(Value), B_first(Allocation). Can I rename them somehow in the same statement with the pivot functions? Commented Feb 8, 2021 at 2:14
  • you could just provide aliases to the columns and it should be good Commented Feb 8, 2021 at 3:56
  • Thanks, but how should I give the aliases for "Value_A" and "Value_B" separately? For example, I can give first("Value").alias("Value"), but I cannot make it produce the names "Value_A" and "Value_B" automatically according to the pivot results. Commented Feb 8, 2021 at 6:56

1 Answer


This might not be the most efficient/clean approach but it works when you provide aliases.

//Source data
val inputDF = Seq(
                  ("100","A", 10, 200),
                  ("100","B", 20, 300),
                  ("101","A", 30, 100)
              ).toDF("ID", "Type", "Value", "Allocation")

//Grab the value and allocation column names from the source schema
val valueColumn = inputDF.columns.tail(1)       // "Value"
val allocationColumn = inputDF.columns.tail(2)  // "Allocation"

import org.apache.spark.sql.functions._
val outputDF = inputDF.groupBy("ID").pivot("Type")
  .agg(first(valueColumn).as(valueColumn), first(allocationColumn).as(allocationColumn))
outputDF.show()

You can see the output as below :

+---+-------+------------+-------+------------+
| ID|A_Value|A_Allocation|B_Value|B_Allocation|
+---+-------+------------+-------+------------+
|100|     10|         200|     20|         300|
|101|     30|         100|   null|        null|
+---+-------+------------+-------+------------+

I could not identify a way to make it more generic. Note also that the column names come out prefixed with the pivot value (A_Value rather than Value_A), but that should still be enough to distinguish the columns by their Type value.

See if it helps or someone could come up with a more generic/dynamic approach based on this answer.
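For the renaming question in the comments (getting Value_A instead of A_Value), one option is to post-process the pivoted column names: split each on the first underscore and swap the halves. This is just a sketch, not part of the answer above; RenamePivotColumns and its rename helper are hypothetical names.

```scala
// Sketch: turn pivot-style names like "A_Value" into "Value_A".
// `pivotValues` are the distinct values of the pivot column ("A", "B" here);
// column names that do not start with one of them (e.g. "ID") pass through.
object RenamePivotColumns {
  def rename(columns: Seq[String], pivotValues: Seq[String]): Seq[String] =
    columns.map { c =>
      c.split("_", 2) match {
        case Array(prefix, rest) if pivotValues.contains(prefix) => s"${rest}_$prefix"
        case _ => c
      }
    }
}
```

You could then apply it to the pivoted dataframe with something like `outputDF.toDF(RenamePivotColumns.rename(outputDF.columns, Seq("A", "B")): _*)`.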
