I am trying to add an Array of values as a new column to the DataFrame.
Ex: Lets assume there is an Array(4,5,10) and a dataframe
+----------+-----+
| name | age |
+----------+-----+
| John | 32 |
| Elizabeth| 28 |
| Eric | 41 |
+----------+-----+
My requirement is to add the above array as a new column to the dataframe. My expected output is as follows:
+----------+-----+------+
| name | age | rank |
+----------+-----+------+
| John | 32 | 4 |
| Elizabeth| 28 | 5 |
| Eric | 41 | 10 |
+----------+-----+------+
I am trying if I can achieve this using rdd and zipWithIndex.
df.rdd.zipWithIndex.map(_.swap).join(array_rdd.zipWithIndex.map(_.swap))
This is resulting in something of this sort.
(0,([John, 32],4))
I want to convert the above RDD back to required dataframe. Let me know how to achieve this.
Are there any alternatives available for achieving the desired result other than using rdd and zipWithIndex? What is the best way to do it?
PS:
Context for better understanding:
I am using Xpress optimization suite to solve a mathematical problem. Xpress takes inputs interms of Arrays and also outputs the result in an Array. I get input as a DataFrame and I am extracting columns as Arrays(using collect) and passing to Xpress. Xpress outputs Array[Double] as solution. I want to add this solution back to the dataframe as a column and every value in the solution array corresponds to the row of the dataframe at its index i.e value at index 'n' of the output Array corresponds to 'n'th row of the dataframe