PySpark: TypeError: StructType can not accept object in type <type 'unicode'> or <type 'str'>

Question

I am reading data from a CSV file and then creating a DataFrame from it. But when I try to access the data in the DataFrame, I get a TypeError.

fields = [StructField(field_name, StringType(), True) for field_name in schema.split(',')]
schema = StructType(fields)

input_dataframe = sql_context.createDataFrame(input_data_1, schema)

print input_dataframe.filter(input_dataframe.diagnosis_code == '11').count()

Neither 'unicode' nor 'str' input works with the Spark DataFrame. I get the following TypeError:

TypeError: StructType can not accept object in type <type 'unicode'>

I tried encoding the data as UTF-8, as below, but I still get the error, now complaining about 'str' instead of 'unicode':

input_data_2 = input_data_1.map(lambda x: x.encode("utf-8"))
input_dataframe = sql_context.createDataFrame(input_data_2, schema)

print input_dataframe.filter(input_dataframe.diagnosis_code == '410.11').count()

I also tried parsing the CSV directly as UTF-8 or unicode using the use_unicode=True/False parameter.

Alper t. Turker · Accepted Answer · 2017-12-07 20:27:59Z


Reading between the lines, you are

reading data from a CSV file

and get

TypeError: StructType can not accept object in type <type 'unicode'>

This happens because each row you pass is a plain string, not an object compatible with a struct (such as a list, tuple, or Row). You probably pass data like:

input_data_1 = sc.parallelize(["1,foo,2", "2,bar,3"])

and schema

schema = "x,y,z"

fields = [StructField(field_name, StringType(), True) for field_name in schema.split(',')]
schema = StructType(fields)

and you expect Spark to figure things out. But it doesn't work that way. You could split each line into fields yourself:

input_dataframe = sqlContext.createDataFrame(input_data_1.map(lambda s: s.split(",")), schema)

but honestly, just use Spark's CSV reader:

spark.read.schema(schema).csv("/path/to/file")
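
If you do split the lines by hand, keep in mind that a plain split(",") breaks on quoted fields that contain commas. Below is a minimal sketch in plain Python (no Spark required) of a per-line parser that could be used in the map step instead. The helper name parse_line and its expected_fields parameter are hypothetical and only illustrate the idea that each row must be a list with one element per schema field:

```python
import csv


def parse_line(line, expected_fields):
    """Split one CSV line into a list of string fields (hypothetical helper)."""
    # csv.reader understands quoted fields, unlike a bare line.split(",")
    row = next(csv.reader([line]))
    # A row whose length differs from the schema would be rejected later,
    # so fail early with a message that shows the offending line
    if len(row) != expected_fields:
        raise ValueError(
            "expected %d fields, got %d: %r" % (expected_fields, len(row), line)
        )
    return row


lines = ["1,foo,2", '2,"bar,baz",3']
rows = [parse_line(line, 3) for line in lines]
print(rows)  # [['1', 'foo', '2'], ['2', 'bar,baz', '3']]
```

With an RDD of raw lines, this would presumably be applied as input_data_1.map(lambda s: parse_line(s, len(schema.fields))) before calling createDataFrame, assuming schema is the StructType built above.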


1 Comment

kev Over a year ago

I get pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 55' when I try spark.read.schema. I am reading from a directory that contains partitioned data in multiple gzipped .csv files.
