I need to convert the array-typed columns in my DataFrame to struct columns. I can produce the result I want by hand on a sample of the data (by editing it in an editor), but I need to do the same thing in PySpark.

Input dataframe schema:

root
 |-- id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- documents: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- doc_name: string (nullable = true)
 |    |    |-- obligations: struct (containsNull = true)
 |-- contacts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- contact_first_name: string (nullable = true)
 |    |    |-- contact_last_name: string (nullable = true)

Data:

{
   "id":"123",
   "description": "agreement",
   "documents":[
     {
       "id":"doc_id_1",
       "doc_name":"doc_name_1",
       "obligations":{}
     }
   ],
   "contacts":[
    {
      "id":"contact_id_1",
      "contact_first_name":"John",
      "contact_last_name":"Doe"
    }
  ]
}

Schema that I need:

root
 |-- id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- documents: struct (containsNull = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- doc_name: string (nullable = true)
 |    |    |-- obligations: struct (containsNull = true)
 |-- contacts: struct (containsNull = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- contact_first_name: string (nullable = true)
 |    |    |-- contact_last_name: string (nullable = true)

Data that I need:

{
   "id":"123",
   "description": "agreement",
   "documents":{
     {
       "id":"doc_id_1",
       "doc_name":"doc_name_1",
       "obligations":{}
     }
   },
   "contacts":{
    {
      "id":"contact_id_1",
      "contact_first_name":"John",
      "contact_last_name":"Doe"
    }
  }
}
1 Answer

Arrays differ from structs in that an array can hold many items. In your current setup you have an array of structs, so each array may hold several structs. Only if you are sure that every array holds exactly one struct can you safely extract the first element and move it one level up (removing the array), like this:

df = df.withColumn('contacts', F.col('contacts')[0])

Full example:

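from pyspark.sql import SparkSession, functions as F

# Reuse the active session, or create one so the example runs standalone.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(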
    [("123", "agreement", [("doc_id_1", "doc_name_1",())], [("contact_id_1", "John", "Doe")],)],
    "id string, description string, documents array<struct<id:string,doc_name:string,obligations:struct<>>>, contacts array<struct<id:string,contact_first_name:string,contact_last_name:string>>")
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- description: string (nullable = true)
#  |-- documents: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- id: string (nullable = true)
#  |    |    |-- doc_name: string (nullable = true)
#  |    |    |-- obligations: struct (nullable = true)
#  |-- contacts: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- id: string (nullable = true)
#  |    |    |-- contact_first_name: string (nullable = true)
#  |    |    |-- contact_last_name: string (nullable = true)

# Take the first element of each array, turning array<struct> into struct.
df = df.withColumn('documents', F.col('documents')[0])
df = df.withColumn('contacts', F.col('contacts')[0])

df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- description: string (nullable = true)
#  |-- documents: struct (nullable = true)
#  |    |-- id: string (nullable = true)
#  |    |-- doc_name: string (nullable = true)
#  |    |-- obligations: struct (nullable = true)
#  |-- contacts: struct (nullable = true)
#  |    |-- id: string (nullable = true)
#  |    |-- contact_first_name: string (nullable = true)
#  |    |-- contact_last_name: string (nullable = true)