
I have a PySpark DataFrame and a 2D NumPy array with the same number of rows. I want to add each row of the NumPy array as a new column value on the corresponding row of the DataFrame, so that each row receives a different list.

For example, the PySpark DataFrame looks like this:

| Id     | Name   |
| ------ | ------ |
| 1      | Bob    |
| 2      | Alice  |
| 3      | Mike   |

And the NumPy array is:

[[2, 3, 5]
 [5, 2, 6]
 [1, 4, 7]]

The expected result is:

| Id     | Name   | customized_list |
| ------ | ------ | --------------- |
| 1      | Bob    | [2, 3, 5]       |
| 2      | Alice  | [5, 2, 6]       |
| 3      | Mike   | [1, 4, 7]       |

The Id column corresponds to the row order of the NumPy array.

Is there an efficient way to implement this?

  • Does the Id column correspond to the order of the entries in the numpy matrix? Commented Oct 4, 2019 at 18:44
  • Yes, I will add it to the description. Commented Oct 4, 2019 at 19:56

1 Answer


Create a DataFrame from your NumPy array and add an Id column to indicate the row number. Then you can join it to your original PySpark DataFrame on the Id column.

import numpy as np

a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])

# tolist() converts NumPy scalars to plain Python ints (Spark cannot infer
# a schema from NumPy types); enumerate pairs each row with a 1-based Id.
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#|  1|      [2, 3, 5]|
#|  2|      [5, 2, 6]|
#|  3|      [1, 4, 7]|
#+---+---------------+

Here I used enumerate(..., start=1) to add the row number.
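To see exactly what createDataFrame receives, the enumerate call can be checked on its own in plain Python, with no Spark session needed:

```python
import numpy as np

a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])

# Each element is an (Id, row) tuple, which maps onto the
# two-column ["Id", "customized_list"] schema above.
rows = list(enumerate(a.tolist(), start=1))
print(rows)  # [(1, [2, 3, 5]), (2, [5, 2, 6]), (3, [1, 4, 7])]
```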

Now just do an inner join:

df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#|  1|  Bob|      [2, 3, 5]|
#|  3| Mike|      [1, 4, 7]|
#|  2|Alice|      [5, 2, 6]|
#+---+-----+---------------+

2 Comments

What's the solution if I do not have an identifier like "Id" (a list of increasing numbers starting from 1)?
@XINLIU then you will have to add an Id column first: see Pyspark add sequential and deterministic index to dataframe.
