
I have a PySpark DataFrame and a 2D NumPy array with the same number of rows. I want to add each row of the NumPy array as a new column value on the corresponding row of the DataFrame, so that each row receives a different list.

For example, the PySpark DataFrame looks like this:

| Id     | Name   |
| ------ | ------ |
| 1      | Bob    |
| 2      | Alice  |
| 3      | Mike   |

And the NumPy array is:

[[2, 3, 5]
 [5, 2, 6]
 [1, 4, 7]]

The expected result is:

| Id     | Name   | customized_list |
| ------ | ------ | --------------- |
| 1      | Bob    | [2, 3, 5]       |
| 2      | Alice  | [5, 2, 6]       |
| 3      | Mike   | [1, 4, 7]       |

The Id column corresponds to the row order of the NumPy array.

Is there an efficient way to implement this?

  • Does the Id column correspond to the order of the entries in the numpy matrix? Commented Oct 4, 2019 at 18:44
  • Yes, I will add it to the description. Commented Oct 4, 2019 at 19:56

1 Answer


Create a DataFrame from your NumPy array and add an Id column to indicate the row number. Then you can join it to your original PySpark DataFrame on the Id column.

import numpy as np

a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])

# tolist() converts NumPy scalars to plain Python ints (Spark cannot infer
# a schema from NumPy types); enumerate pairs each row with a 1-based Id.
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#|  1|      [2, 3, 5]|
#|  2|      [5, 2, 6]|
#|  3|      [1, 4, 7]|
#+---+---------------+

Here I used enumerate(..., start=1) to add the row number.
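To see exactly what createDataFrame receives, the enumerate call can be checked on its own in plain Python, with no Spark session needed:

```python
import numpy as np

a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])

# Each element is an (Id, row) tuple, which maps onto the
# two-column ["Id", "customized_list"] schema above.
rows = list(enumerate(a.tolist(), start=1))
print(rows)  # [(1, [2, 3, 5]), (2, [5, 2, 6]), (3, [1, 4, 7])]
```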

Now just do an inner join:

df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#|  1|  Bob|      [2, 3, 5]|
#|  3| Mike|      [1, 4, 7]|
#|  2|Alice|      [5, 2, 6]|
#+---+-----+---------------+

2 Comments

What's the solution if I do not have an identifier like "Id" (a list of increasing numbers starting from 1)?
@XINLIU then you will have to add an Id column first: see Pyspark add sequential and deterministic index to dataframe.
