
I have a PySpark df whose schema looks like this

 root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109: long (nullable = true)

How do I convert all these fields to String in PySpark?

  • can you share the code to reproduce a test data? Commented Jun 25, 2020 at 13:41
  • df = self.spark.createDataFrame( data=[ ('0',1001), ('1', 1002), ('2', 1003), ('3', 1005), ], schema=T.StructType([ T.StructField('id', T.StringType(), nullable=True), T.StructField('value1', T.LongType(), nullable=True), ]), ) Commented Jun 25, 2020 at 14:03
  • did you try the answer? did it work? Commented Jun 28, 2020 at 4:59
  • Yes It worked Thanks for the Solution Commented Jun 29, 2020 at 4:53

1 Answer


I have tried this with my own test dataset; check whether it works for you. The answer is inspired by: Pyspark - Looping through structType and ArrayType to do typecasting in the structfield. Refer to it for more details.

# Imports needed by this snippet
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Create test data frame
tst = sqlContext.createDataFrame([(1,1,2,11),(1,3,4,12),(1,5,6,13),(1,7,8,14),(2,9,10,15),(2,11,12,16),(2,13,14,17)], schema=['col1','col2','x','y'])
tst_struct = tst.withColumn("str_col", F.struct('x', 'y'))
old_schema = tst_struct.schema

# Function to convert every field of a struct schema to string
def transform(schema):
    res = []
    for f in schema.fields:
        res.append(StructField(f.name, StringType(), f.nullable))
    return StructType(res)

# Traverse the existing schema and rewrite struct fields only
new_schema = []
for f in old_schema.fields:
    if isinstance(f.dataType, StructType):
        new_schema.append(StructField(f.name, transform(f.dataType), f.nullable))
    else:
        new_schema.append(StructField(f.name, f.dataType, f.nullable))

# Cast the dataframe columns to the new schema
tst_trans = tst_struct.select([F.col(f.name).cast(f.dataType) for f in new_schema])

This is the schema of the test dataset:

tst_struct.printSchema()
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)
 |-- str_col: struct (nullable = false)
 |    |-- x: long (nullable = true)
 |    |-- y: long (nullable = true)

This is the transformed schema

tst_trans.printSchema()
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)
 |-- str_col: struct (nullable = false)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)

If you need to explode the struct column into separate columns, you can do the below (refer: How to unwrap nested Struct column into multiple columns?).

So, finally:

tst_exp = tst_trans.select(tst_trans.columns + [F.col('str_col.*')])
tst_exp.show()
+----+----+---+---+--------+---+---+
|col1|col2|  x|  y| str_col|  x|  y|
+----+----+---+---+--------+---+---+
|   1|   1|  2| 11| [2, 11]|  2| 11|
|   1|   3|  4| 12| [4, 12]|  4| 12|
|   1|   5|  6| 13| [6, 13]|  6| 13|
|   1|   7|  8| 14| [8, 14]|  8| 14|
|   2|   9| 10| 15|[10, 15]| 10| 15|
|   2|  11| 12| 16|[12, 16]| 12| 16|
|   2|  13| 14| 17|[14, 17]| 14| 17|
+----+----+---+---+--------+---+---+

1 Comment

Happy to hear. Can you also upvote the answer? That is the common community practice.
