I am trying to do the following:

from pyspark.sql import Row
from pyspark.sql import types as T

subschema = T.ArrayType(T.StructType([
    T.StructField("AA", T.LongType(), True),
    T.StructField("BB", T.StringType(), True),
]), True)
s = T.StructType([
    T.StructField("B", subschema, True),
    T.StructField("A", T.StringType(), True),
])
d = [Row(
    B=None,
    A="AAA",
)]
df = spark.createDataFrame(d, schema=s)

But I get an error that does not make sense to me:

ValueError: Unexpected tuple 'A' with StructType

If I comment out either field A or field B, the error disappears, but I don't understand why. What is the problem? Is this a bug, or is something wrong in my code?

1 Answer

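This happens because, before Spark 3, a Row created with keyword arguments sorts its fields alphabetically, so your row is stored as (A, B) while the schema declares (B, A). The values are then matched to the schema by position, so Spark tries to apply the type of B to the value of A.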
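To make the mismatch concrete, here is a minimal pure-Python sketch of the effect. The helper `legacy_row` is hypothetical and only imitates the alphabetical sorting described above; it is not PySpark's actual implementation.

```python
# Hypothetical stand-in for the pre-Spark-3 Row behaviour: keyword
# arguments are sorted by name, then stored as positional values.
def legacy_row(**kwargs):
    names = sorted(kwargs)
    return names, tuple(kwargs[name] for name in names)

# Field order as declared in the schema from the question.
schema_order = ["B", "A"]

names, values = legacy_row(B=None, A="AAA")

# Pairing by position instead of by name hands each schema field
# the wrong value.
paired = dict(zip(schema_order, values))

print(names)   # ['A', 'B']
print(paired)  # {'B': 'AAA', 'A': None}
```

In this sketch the array-typed field B receives the string "AAA", which is consistent with the error complaining about an unexpected value for a StructType.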

In Spark 3 this sorting was removed, and I was able to run your code without any error.

For Spark < 3, you need to declare the schema fields in sorted order too, A before B:

s = T.StructType([
    T.StructField("A", T.StringType(), True),
    T.StructField("B", subschema, True)
])

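Or keep the original schema (B before A) and create the RDD from a plain tuple, whose values are matched to the schema by position:

rdd = spark.sparkContext.parallelize([(None, "AAA")])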
df = spark.createDataFrame(rdd, schema=s)
1 Comment

Thanks! I have seen that using a dictionary instead of a Row also works, so I don't need to change the order in my schema: d = [{"B": None, "A": "AAA"}]
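A plausible reason the dictionary works is that its values can be looked up by field name, so the declaration order of the schema no longer matters. The sketch below shows that idea in plain Python; it assumes name-based lookup and does not call PySpark itself.

```python
# Field order as declared in the schema from the question.
schema_order = ["B", "A"]

record = {"B": None, "A": "AAA"}

# Looking each field up by name yields values in schema order,
# regardless of how the dictionary keys are ordered.
by_name = tuple(record.get(name) for name in schema_order)

print(by_name)  # (None, 'AAA')
```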
