Perform Arithmetic Operations on multiple columns in Spark dataframe

Question

I have an input Spark DataFrame named df:

+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
|         725153|  1|  0|  2|          3|
|         873008|  0|  0|  3|          3|
|         625109|  1|  1|  0|          2|
+---------------+---+---+---+-----------+

Here, P1, P2 and P3 are product names, and Total_Count is the sum of P1, P2 and P3. I need to find the frequency of each product by dividing each product's value by Total_Count. The result should be a new Spark DataFrame named frequencyTable, as follows:

+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|                P1| P2|                P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
|         725153|0.3333333333333333|0.0|0.6666666666666666|          3|
|         873008|               0.0|0.0|               1.0|          3|
|         625109|               0.5|0.5|               0.0|          2|
+---------------+------------------+---+------------------+-----------+

I have done this in Scala as follows:

val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
  if (index != "Main_CustomerID" && index != "Total_Count") {
    frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
  }
}

But I would rather avoid this for loop because my df is large. Is there a more optimized solution?

Anahcolus · Accepted Answer · 2018-06-29 14:22:55Z


If you have a DataFrame such as

val df = Seq(
  ("725153", 1, 0, 2, 3),
  ("873008", 0, 0, 3, 3),
  ("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153         |1  |0  |2  |3          |
|873008         |0  |0  |3  |3          |
|625109         |1  |1  |0  |2          |
+---------------+---+---+---+-----------+

You can simply use foldLeft over all columns except Main_CustomerID and Total_Count, i.e. over P1, P2 and P3:

// Filtering the column array (instead of converting it to a Set) keeps the original column order
val df_columns = df.columns.filterNot(c => c == "Main_CustomerID" || c == "Total_Count").toList

df_columns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))
}.show(false)

which should give you

+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1                |P2 |P3                |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153         |0.3333333333333333|0.0|0.6666666666666666|3          |
|873008         |0.0               |0.0|1.0               |3          |
|625109         |0.5               |0.5|0.0               |2          |
+---------------+------------------+---+------------------+-----------+
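
As a possible alternative to chaining withColumn, the same division can be expressed as a single select. Spark's API documentation for withColumn notes that calling it repeatedly in a loop can generate large plans, and suggests using select when adding or replacing several columns at once. The following is only a minimal sketch under stated assumptions: it assumes a local SparkSession, and the names FrequencySketch and frequencyTable are hypothetical helpers introduced for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FrequencySketch {
  // Hypothetical helper: divides every column not listed in `keep` by `totalCol`,
  // building one projection instead of one withColumn call per column.
  def frequencyTable(df: DataFrame, totalCol: String, keep: Set[String]): DataFrame = {
    val projected = df.columns.map { name =>
      if (keep.contains(name)) col(name)          // pass through unchanged
      else (col(name) / col(totalCol)).as(name)   // replace with its frequency
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("frequency-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("725153", 1, 0, 2, 3),
      ("873008", 0, 0, 3, 3),
      ("625109", 1, 1, 0, 2)
    ).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

    frequencyTable(df, "Total_Count", Set("Main_CustomerID", "Total_Count")).show(false)

    spark.stop()
  }
}
```

With the sample data this should print the same frequency table as the foldLeft version. In both variants the transformations are lazy, so nothing is computed until an action such as show is called.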

I hope the answer is helpful


2 Comments

PRIYA M Over a year ago

Yes, this works perfectly. But if I use foldLeft, will it occupy more heap space?

Anahcolus Over a year ago

I don't think so :)
