I have an input Spark DataFrame named df:

+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
|         725153|  1|  0|  2|          3|
|         873008|  0|  0|  3|          3|
|         625109|  1|  1|  0|          2|
+---------------+---+---+---+-----------+

Here, Total_Count is the sum of P1, P2 and P3, which are product names. I need to find the frequency of each product by dividing each product's value by that row's Total_Count. The result should be a new Spark DataFrame named frequencyTable, as follows:

+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|                P1| P2|                P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
|         725153|0.3333333333333333|0.0|0.6666666666666666|          3|
|         873008|               0.0|0.0|               1.0|          3|
|         625109|               0.5|0.5|               0.0|          2|
+---------------+------------------+---+------------------+-----------+

I have done this in Scala as follows:

val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
  if (index != "Main_CustomerID" && index != "Total_Count") {
    frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
  }
}

But I would rather avoid this for loop because my df is large. Is there a more optimized solution?

1 Answer

If you have a DataFrame such as

val df = Seq(
  ("725153", 1, 0, 2, 3),
  ("873008", 0, 0, 3, 3),
  ("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153         |1  |0  |2  |3          |
|873008         |0  |0  |3  |3          |
|625109         |1  |1  |0  |2          |
+---------------+---+---+---+-----------+

You can simply use foldLeft over the columns other than Main_CustomerID and Total_Count, i.e. over P1, P2 and P3:

val df_columns = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList

df_columns.foldLeft(df){(tempdf, colName) => tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))}.show(false)

which should give you

+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1                |P2 |P3                |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153         |0.3333333333333333|0.0|0.6666666666666666|3          |
|873008         |0.0               |0.0|1.0               |3          |
|625109         |0.5               |0.5|0.0               |2          |
+---------------+------------------+---+------------------+-----------+
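As an alternative to the foldLeft above (this is not part of the original answer), the same table can be produced with a single select that builds every column expression up front. The following is a minimal sketch, assuming a local SparkSession; the object name FrequencyTableSelect and the helper toFrequencies are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not taken from the original answer.
object FrequencyTableSelect {

  // Divide every column except idCol and totalCol by totalCol.
  // All expressions are built first and applied in one select,
  // so the column order of the input is preserved.
  def toFrequencies(df: DataFrame,
                    idCol: String = "Main_CustomerID",
                    totalCol: String = "Total_Count"): DataFrame = {
    val untouched = Set(idCol, totalCol)
    val projected = df.columns.map { name =>
      if (untouched.contains(name)) col(name)
      else (col(name) / col(totalCol)).as(name)
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration only
      .appName("frequency-table-select")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer.
    val df = Seq(
      ("725153", 1, 0, 2, 3),
      ("873008", 0, 0, 3, 3),
      ("625109", 1, 1, 0, 2)
    ).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

    toFrequencies(df).show(false)
    spark.stop()
  }
}
```

Both versions describe the same projection; the select form adds it to the query plan in one step rather than once per column.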

I hope the answer is helpful

2 Comments

Yes, this works perfectly. But if I use foldLeft, will it occupy more heap space?
I don't think so :)
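
On the heap-space question: DataFrame transformations are lazy, so foldLeft only builds up a query plan on the driver, and no rows are processed until an action such as show runs. One way to see what Spark will actually execute is to print the plans with explain(true). A minimal sketch, assuming a local SparkSession; FoldLeftPlanCheck and withFoldLeft are hypothetical names used only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object for inspecting the plan built by foldLeft.
object FoldLeftPlanCheck {

  // Same foldLeft as in the answer: each step only adds a projection
  // to the logical plan; nothing is computed here.
  def withFoldLeft(df: DataFrame): DataFrame = {
    val productCols = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList
    productCols.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, df.col(name) / df.col("Total_Count"))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration only
      .appName("foldleft-plan-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("725153", 1, 0, 2, 3),
      ("873008", 0, 0, 3, 3),
      ("625109", 1, 1, 0, 2)
    ).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

    val frequencyTable = withFoldLeft(df)

    // Prints the parsed, analyzed, optimized and physical plans
    // without running the job.
    frequencyTable.explain(true)
    spark.stop()
  }
}
```

The Spark API documentation for withColumn does caution that calling it many times in a loop can generate large plans, so for a DataFrame with a very large number of columns a single select may be the safer choice.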
