Collect Spark dataframe into Numpy matrix

Question

I've used Spark to compute PCA on a large dataset. Now I have a Spark dataframe with the following structure:

Row('pcaFeatures'=DenseVector(elem1, elem2, ...))

where elem1, ..., elemN are doubles. I would like to transform it into a NumPy matrix. Right now I'm using the following code:

numpymatrix = datapca.toPandas().as_matrix()

but I get a NumPy array with elements of type object instead of a numeric matrix. Is there a way to get the matrix I need?

Alper t. Turker · Accepted Answer · 2018-01-29 20:37:48Z

Your request makes sense only if the resulting data can fit into your main memory (i.e. you can safely use collect()); on the other hand, if this is the case, admittedly you have absolutely no reason to use Spark at all.
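As a rough check on that memory assumption, the size of the final dense array can be estimated before calling collect(). The sketch below assumes float64 values (8 bytes each); the helper name dense_matrix_bytes and the example dimensions are made up for illustration. The intermediate list of Row objects that collect() returns adds overhead on top, so the figure is a lower bound.

```python
def dense_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (itemsize=8 for float64)."""
    return n_rows * n_cols * itemsize

# Hypothetical sizes: 1,000,000 rows with 50 principal components each.
# With a real dataframe, n_rows would come from df.count().
needed = dense_matrix_bytes(1_000_000, 50)
print(f"{needed / 1e9:.1f} GB")  # 0.4 GB
```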

Anyway, given this assumption, here is a general way to convert a single-column features Spark dataframe (Rows of DenseVector) to a NumPy array using toy data:

spark.version
# u'2.2.0' 

from pyspark.ml.linalg import Vectors
import numpy as np

# toy data:
df = spark.createDataFrame([(Vectors.dense([0,45,63,0,0,0,0]),),
                            (Vectors.dense([0,0,0,85,0,69,0]),),
                            (Vectors.dense([0,89,56,0,0,0,0]) ,),
                           ], ['features'])

dd = df.collect()
dd
# [Row(features=DenseVector([0.0, 45.0, 63.0, 0.0, 0.0, 0.0, 0.0])), 
#  Row(features=DenseVector([0.0, 0.0, 0.0, 85.0, 0.0, 69.0, 0.0])), 
#  Row(features=DenseVector([0.0, 89.0, 56.0, 0.0, 0.0, 0.0, 0.0]))] 

np.asarray([x[0] for x in dd])
# array([[ 0., 45., 63., 0., 0., 0., 0.],
#        [ 0., 0., 0., 85., 0., 69., 0.],
#        [ 0., 89., 56., 0., 0., 0., 0.]])
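The list comprehension above can also be wrapped in a small helper that calls toArray() on each vector explicitly, so the result does not rely on how NumPy interprets the vector objects. This is only a sketch: rows_to_matrix is a hypothetical name, not part of PySpark, and the plain tuples at the bottom stand in for collected Rows so the snippet runs without a Spark session.

```python
import numpy as np

def rows_to_matrix(rows, col=0):
    """Stack one vector-valued field of already-collected rows into a 2-D float array."""
    arrays = []
    for row in rows:
        value = row[col]
        # pyspark.ml.linalg vectors (dense and sparse) expose toArray();
        # anything else is passed to NumPy as a plain sequence.
        if hasattr(value, "toArray"):
            value = value.toArray()
        arrays.append(np.asarray(value, dtype=float))
    return np.vstack(arrays)

# With Spark this would be called as: rows_to_matrix(df.select("features").collect())
# Stand-in data (tuples instead of Rows), same values as the toy dataframe:
fake_rows = [([0, 45, 63, 0, 0, 0, 0],),
             ([0, 0, 0, 85, 0, 69, 0],),
             ([0, 89, 56, 0, 0, 0, 0],)]
m = rows_to_matrix(fake_rows)
print(m.shape, m.dtype)  # (3, 7) float64
```

Note that np.vstack raises on an empty list, so an empty dataframe would need separate handling.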
