
I have a PySpark df whose schema looks like this

 root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109: long (nullable = true)

How do I convert all these fields to String in PySpark?

  • can you share the code to reproduce a test data? Commented Jun 25, 2020 at 13:41
  • df = self.spark.createDataFrame( data=[ ('0',1001), ('1', 1002), ('2', 1003), ('3', 1005), ], schema=T.StructType([ T.StructField('id', T.StringType(), nullable=True), T.StructField('value1', T.LongType(), nullable=True), ]), ) Commented Jun 25, 2020 at 14:03
  • did you try the answer? did it work? Commented Jun 28, 2020 at 4:59
  • Yes It worked Thanks for the Solution Commented Jun 29, 2020 at 4:53

1 Answer


I have tried this with my own test dataset; check whether it works for you. The answer is inspired by: Pyspark - Looping through structType and ArrayType to do typecasting in the structfield. Refer to it for more details.

# Imports needed by this snippet
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Create test data frame
tst = sqlContext.createDataFrame([(1,1,2,11),(1,3,4,12),(1,5,6,13),(1,7,8,14),(2,9,10,15),(2,11,12,16),(2,13,14,17)], schema=['col1','col2','x','y'])
tst_struct = tst.withColumn("str_col", F.struct('x', 'y'))
old_schema = tst_struct.schema

# Function to convert every field of a struct schema to string
def transform(schema):
    res = []
    for f in schema.fields:
        res.append(StructField(f.name, StringType(), f.nullable))
    return StructType(res)

# Traverse the existing schema and rewrite struct fields only
new_schema = []
for f in old_schema.fields:
    if isinstance(f.dataType, StructType):
        new_schema.append(StructField(f.name, transform(f.dataType), f.nullable))
    else:
        new_schema.append(StructField(f.name, f.dataType, f.nullable))

# Cast the dataframe columns to the new schema
tst_trans = tst_struct.select([F.col(f.name).cast(f.dataType) for f in new_schema])

This is the schema of the test dataset:

tst_struct.printSchema()
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)
 |-- str_col: struct (nullable = false)
 |    |-- x: long (nullable = true)
 |    |-- y: long (nullable = true)

This is the transformed schema

tst_trans.printSchema()
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)
 |-- str_col: struct (nullable = false)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)

If you need to explode the struct column into separate columns, you can do the below (refer: How to unwrap nested Struct column into multiple columns?).

So, finally:

tst_exp = tst_trans.select(tst_trans.columns + [F.col('str_col.*')])
tst_exp.show()
+----+----+---+---+--------+---+---+
|col1|col2|  x|  y| str_col|  x|  y|
+----+----+---+---+--------+---+---+
|   1|   1|  2| 11| [2, 11]|  2| 11|
|   1|   3|  4| 12| [4, 12]|  4| 12|
|   1|   5|  6| 13| [6, 13]|  6| 13|
|   1|   7|  8| 14| [8, 14]|  8| 14|
|   2|   9| 10| 15|[10, 15]| 10| 15|
|   2|  11| 12| 16|[12, 16]| 12| 16|
|   2|  13| 14| 17|[14, 17]| 14| 17|
+----+----+---+---+--------+---+---+

1 Comment

Happy to hear. Can you also upvote the answer? That is the common community practice.
