With pandas' read_csv() function I read an ISO-8859-1 encoded file as follows:
df = pd.read_csv('path/file', sep='|',
                 names=['A', 'B'], encoding='iso-8859-1')
Then I would like to use MLlib's Word2Vec. However, it only accepts an RDD as a parameter. So I tried to convert the pandas DataFrame to an RDD as follows:
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(df['A'])
spDF.show()
Anyhow, I got the following exception:
TypeError: Can not infer schema for type: <type 'unicode'>
I went through PySpark's documentation to see if there is something like an encoding parameter, but I did not find anything. Any idea how to transform a specific pandas DataFrame column into a PySpark RDD?
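For context on what createDataFrame is actually receiving (my assumption about the cause, not a confirmed fix): df['A'] is a pandas Series, not a DataFrame, so there is no named-column schema to infer from. Selecting with double brackets, df[['A']], keeps a one-column DataFrame. A minimal pandas-side sketch with made-up data:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded with read_csv(...)
df = pd.DataFrame({'A': ['uno', 'dos'], 'B': ['x', 'y']})

single_bracket = df['A']     # Series: 1-D values, no column schema
double_bracket = df[['A']]   # DataFrame: one named column 'A'

print(type(single_bracket).__name__)  # Series
print(type(double_bracket).__name__)  # DataFrame
```

If the Series is indeed the problem, passing df[['A']] to createDataFrame instead may avoid the schema-inference error, but I have not verified that on the Spark side.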
Update:
Following @zeros' answer, I tried saving the column as a DataFrame, like this:
new_dataframe = df_3.loc[:,'A']
new_dataframe.head()
Then:
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
And I got the same exception:
TypeError: Can not infer schema for type: <type 'unicode'>
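A sketch (with hypothetical data) of why this attempt may fail the same way: passing a single label to .loc still returns a Series, while passing a list of labels returns a DataFrame. The pandas behavior below is verifiable; whether it is the root cause of the Spark error is my assumption:

```python
import pandas as pd

# Hypothetical data standing in for df_3
df_3 = pd.DataFrame({'A': ['uno', 'dos'], 'B': [1, 2]})

scalar_label = df_3.loc[:, 'A']    # single label -> still a Series
list_label = df_3.loc[:, ['A']]    # list of labels -> one-column DataFrame

print(type(scalar_label).__name__)  # Series
print(type(list_label).__name__)    # DataFrame
```

So new_dataframe = df_3.loc[:, 'A'] produces the same kind of object as df['A'] did, which would explain hitting the identical TypeError.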