
I have a DataFrame like the one below:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import numpy as np

# SparkConf() does not take a master URL as an argument; set it explicitly
config = SparkConf().setMaster("local")
sc = SparkContext(conf=config)
sqlContext = SQLContext(sc)
# passing the column names as a schema avoids the chained withColumnRenamed calls
df = sqlContext.createDataFrame(
    [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)],
    ["doc", "word1", "word2", "word3"])

Now I need to keep the first column and turn the remaining columns into a single numpy array column (two columns: "doc" and a numpy array).
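For example, the result I am after would pair each doc with its values as a numpy array, something like:

[("doc_3", np.array([1, 3, 9])),
 ("doc_1", np.array([9, 6, 0])),
 ("doc_2", np.array([9, 9, 3]))]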

I know that

sdf = np.array(df.select([c for c in df.columns if c != 'doc']).collect())
print(sdf)

translates all the remaining columns into a numpy array, but how do I pair that array with the first column? Any help is appreciated.
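For reference, sdf above contains just the word values without the doc labels (row order may vary):

[[1 3 9]
 [9 6 0]
 [9 9 3]]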

1 Answer

Unfortunately you cannot make a numpy.array column in a PySpark DataFrame, but you can use a regular Python list instead and convert it to a numpy array when reading the data back:

>>> df = sqlContext.createDataFrame([("doc_3", [1, 3, 9]), ("doc_1", [9, 6, 0]), ("doc_2", [9, 9, 3])], ["doc", "words"])
>>> df.show()
+-----+---------+
|  doc|    words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+

>>> df
DataFrame[doc: string, words: array<bigint>]
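This is the "convert while reading" step; a minimal sketch of pulling the lists back as numpy arrays on the driver (assuming numpy is imported as np, as in the question):

>>> [np.array(row.words) for row in df.collect()]
[array([1, 3, 9]), array([9, 6, 0]), array([9, 9, 3])]

(Row order is not guaranteed in general, though a small local example like this typically preserves it.)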

And to build this from the four columns you started with:

>>> from pyspark.sql.functions import array
>>> df2 = df.select("doc", array("word1", "word2", "word3").alias("words"))
>>> df2
DataFrame[doc: string, words: array<bigint>]
>>> df2.show()
+-----+---------+
|  doc|    words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
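And the same collect-and-convert step on df2 gives you the pairing the question asked for, each "doc" together with its numpy array (again assuming numpy is imported as np):

>>> [(row.doc, np.array(row.words)) for row in df2.collect()]
[('doc_3', array([1, 3, 9])), ('doc_1', array([9, 6, 0])), ('doc_2', array([9, 9, 3]))]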