I have an input spark-dataframe named df as
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
Here,Total_Count is the sum of P1,P2,P3 and P1,P2,P3 were the product names. I need to find the frequency of each product by dividing the values of products with its Total_Count. I need to create a new spark-dataframe named frequencyTable as follows,
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
I have done this using Scala as,
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
if (index != "Main_CustomerID" && index != "Total_Count") {
frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
}
}
But I don't prefer this for loop because my df is of larger size. What is the optimized solution?