With pandas' read_csv() function I read an ISO-8859-1 encoded file as follows:
df = pd.read_csv('path/file', sep='|',
                 names=['A', 'B'], encoding='iso-8859-1')
Then I would like to use MLlib's Word2Vec. However, it only accepts an RDD as a parameter. So I tried to convert the pandas DataFrame to an RDD as follows:
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(df['A'])
spDF.show()
Anyhow, I got the following exception:
TypeError: Can not infer schema for type: <type 'unicode'>
I went through PySpark's documentation to see if there is something like an encoding parameter, but I did not find anything. Any idea how to transform a specific pandas DataFrame column into a PySpark RDD?
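For context on what createDataFrame is actually receiving (my assumption about the cause, not a confirmed fix): df['A'] is a pandas Series, not a DataFrame, so there is no named-column schema to infer from. Selecting with double brackets, df[['A']], keeps a one-column DataFrame. A minimal pandas-side sketch with made-up data:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded with read_csv(...)
df = pd.DataFrame({'A': ['uno', 'dos'], 'B': ['x', 'y']})

single_bracket = df['A']     # Series: 1-D values, no column schema
double_bracket = df[['A']]   # DataFrame: one named column 'A'

print(type(single_bracket).__name__)  # Series
print(type(double_bracket).__name__)  # DataFrame
```

If the Series is indeed the problem, passing df[['A']] to createDataFrame instead may avoid the schema-inference error, but I have not verified that on the Spark side.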
Update:
Following @zeros' answer, I tried saving the column as a DataFrame, like this:
new_dataframe = df_3.loc[:,'A']
new_dataframe.head()
Then:
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
And I got the same exception:
TypeError: Can not infer schema for type: <type 'unicode'>
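A sketch (with hypothetical data) of why this attempt may fail the same way: passing a single label to .loc still returns a Series, while passing a list of labels returns a DataFrame. The pandas behavior below is verifiable; whether it is the root cause of the Spark error is my assumption:

```python
import pandas as pd

# Hypothetical data standing in for df_3
df_3 = pd.DataFrame({'A': ['uno', 'dos'], 'B': [1, 2]})

scalar_label = df_3.loc[:, 'A']    # single label -> still a Series
list_label = df_3.loc[:, ['A']]    # list of labels -> one-column DataFrame

print(type(scalar_label).__name__)  # Series
print(type(list_label).__name__)    # DataFrame
```

So new_dataframe = df_3.loc[:, 'A'] produces the same kind of object as df['A'] did, which would explain hitting the identical TypeError.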