Assuming I have the following table:

body
{"Day":1,"vals":[{"id":"1", "val":"3"}, {"id":"2", "val":"4"}]}

My goal is to write the PySpark schema for this nested JSON column. I've tried the following two things:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  StructType([
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
  ])
  )
])

Here I get the following error:

'StructType' object has no attribute 'name'

Another approach was to declare the nested Arrays as ArrayType:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  ArrayType(
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    , True)
  )
])

Here I get the following error:

takes from 2 to 3 positional arguments but 5 were given

This probably comes from ArrayType only taking a single SQL type (plus an optional nullability flag) as its arguments.

Can anybody tell me how they would create the schema? I'm new to this whole topic.

1 Answer

This is the structure you are looking for:

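from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

spark = SparkSession.builder.getOrCreate()

# One row: Day = 1, vals = a list of (id, val) pairs
data = [
    (1, [("1", "3"), ("2", "4")])
]

# vals is an array whose elements are structs with the fields id and val
schema = StructType([
    StructField("Day", IntegerType(), True),
    StructField("vals", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("val", StringType(), True)
    ]), True))
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)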

This produces the following output:

root
 |-- Day: integer (nullable = true)
 |-- vals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- val: string (nullable = true)

+---+----------------+
|Day|vals            |
+---+----------------+
|1  |[{1, 3}, {2, 4}]|
+---+----------------+

2 Comments

Thank you for your answer! I saw that I made a mistake when creating the example table. Both StructTypes belong to the same StructField "vals". I still haven't found a solution for this.
My bad, I thought you were missing just the name for the second nested StructType. Check my new answer! Hope it helps
