Unexpected tuple with StructType - Error in pyspark when using schema to create a data frame

Question

I am trying to do the following:

from pyspark.sql import Row
import pyspark.sql.types as T

subschema = T.ArrayType(T.StructType([
    T.StructField("AA", T.LongType(), True),
    T.StructField("BB", T.StringType(), True),
]), True)
s = T.StructType([
    T.StructField("B", subschema, True),
    T.StructField("A", T.StringType(), True),
])
d = [Row(
    B=None,
    A="AAA",
)]
df = spark.createDataFrame(d, schema=s)

But I am getting an error that does not make sense to me: ValueError: Unexpected tuple 'A' with StructType

If I comment out either field A or field B, the error disappears, but I don't understand why. What is the problem? Is this a bug, or is there something wrong with my code?

blackbishop · Accepted Answer · 2021-02-03 22:57:38Z

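This is due to the alphabetical ordering of the fields when you create a Row using keyword arguments. In Spark < 3, Row(B=None, A="AAA") sorts its fields by name, so the values are stored in the order (A, B). Your schema lists B first and is applied by position, so Spark tries to apply the type of B to the value of A.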
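To make the mismatch concrete, here is a toy model in plain Python. It is not PySpark code and not how Spark is implemented; `legacy_row` is a hypothetical helper that only mimics the name sorting described above.

```python
# Toy model only: plain Python, no Spark involved.
# legacy_row is a hypothetical stand-in for Row(**kwargs) in Spark < 3.

def legacy_row(**kwargs):
    """Return (field names, values) with the fields sorted by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[name] for name in names)

# Field order as declared in the schema from the question.
schema_fields = ["B", "A"]

row_fields, row_values = legacy_row(B=None, A="AAA")

# Pair schema fields with row values by position, not by name.
pairing = list(zip(schema_fields, row_values))

print(row_fields)  # ['A', 'B']
print(pairing)     # [('B', 'AAA'), ('A', None)]
```

In this model the array-of-structs field B ends up paired with the string "AAA". Iterating over that string yields 'A', which is presumably why the error message mentions 'A'.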

In Spark 3, this sorting was removed, and I was able to run your code without any error.

For Spark < 3, you need to sort the fields in your schema too, with A before B:

s = T.StructType([
    T.StructField("A", T.StringType(), True),
    T.StructField("B", subschema, True)
])

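Or simply create an RDD from a tuple whose values follow the field order of your original schema (B first, then A):

sc = spark.sparkContext
rdd = sc.parallelize([(None, "AAA")])  # values in schema order: B, A
df = spark.createDataFrame(rdd, schema=s)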
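If the data starts out keyed by field name, the tuples can be built in schema order instead of by hand. This is a minimal sketch in plain Python; `to_schema_tuple` is a hypothetical helper, and the field list is hard-coded here, although with PySpark it could be taken from `s.fieldNames()`.

```python
def to_schema_tuple(record, field_names):
    """Return the values of `record` as a tuple ordered like `field_names`.

    Keys missing from `record` become None.
    """
    return tuple(record.get(name) for name in field_names)

# Field order of the original schema (B first, then A).
# Assumption: with PySpark this list could come from s.fieldNames().
field_names = ["B", "A"]

records = [{"A": "AAA", "B": None}]
rows = [to_schema_tuple(record, field_names) for record in records]

print(rows)  # [(None, 'AAA')]
```

The resulting list of tuples can then be passed to spark.createDataFrame(rows, schema=s), so the values line up with the schema whatever order the keys were written in.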

edited Feb 3, 2021 at 22:57

answered Feb 3, 2021 at 22:43

1 Comment

someguy Over a year ago

Thanks! I have seen that using a dictionary instead of a Row also works, so I don't need to change the order in my schema: d = [{"B": None, "A": "AAA"}]
