
I have a DataFrame like the one below:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import numpy as np

# SparkConf() does not take a master URL as an argument; set it explicitly
config = SparkConf().setMaster("local")
sc = SparkContext(conf=config)
sqlContext = SQLContext(sc)
# passing the column names as a schema avoids the chained withColumnRenamed calls
df = sqlContext.createDataFrame(
    [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)],
    ["doc", "word1", "word2", "word3"])

Now I need to keep the first column and turn the remaining columns into a single numpy array column (two columns: "doc" and a numpy array).
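For example, the result I am after would pair each doc with its values as a numpy array, something like:

[("doc_3", np.array([1, 3, 9])),
 ("doc_1", np.array([9, 6, 0])),
 ("doc_2", np.array([9, 9, 3]))]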

I know that

sdf = np.array(df.select([c for c in df.columns if c != 'doc']).collect())
print(sdf)

translates all the remaining columns into a numpy array, but how do I pair that array with the first column? Any help is appreciated.
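For reference, sdf above contains just the word values without the doc labels (row order may vary):

[[1 3 9]
 [9 6 0]
 [9 9 3]]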

1 Answer

Unfortunately you cannot make a numpy.array column in a PySpark DataFrame, but you can use a regular Python list instead and convert it to a numpy array when reading the data back:

>>> df = sqlContext.createDataFrame([("doc_3", [1, 3, 9]), ("doc_1", [9, 6, 0]), ("doc_2", [9, 9, 3])], ["doc", "words"])
>>> df.show()
+-----+---------+
|  doc|    words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+

>>> df
DataFrame[doc: string, words: array<bigint>]
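This is the "convert while reading" step; a minimal sketch of pulling the lists back as numpy arrays on the driver (assuming numpy is imported as np, as in the question):

>>> [np.array(row.words) for row in df.collect()]
[array([1, 3, 9]), array([9, 6, 0]), array([9, 9, 3])]

(Row order is not guaranteed in general, though a small local example like this typically preserves it.)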

And to build this from the four columns you started with:

>>> from pyspark.sql.functions import array
>>> df2 = df.select("doc", array("word1", "word2", "word3").alias("words"))
>>> df2
DataFrame[doc: string, words: array<bigint>]
>>> df2.show()
+-----+---------+
|  doc|    words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
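And the same collect-and-convert step on df2 gives you the pairing the question asked for, each "doc" together with its numpy array (again assuming numpy is imported as np):

>>> [(row.doc, np.array(row.words)) for row in df2.collect()]
[('doc_3', array([1, 3, 9])), ('doc_1', array([9, 6, 0])), ('doc_2', array([9, 9, 3]))]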