
I am trying to apply the PySpark SQL hash function to every row of two DataFrames to identify the differences between them. The hash is computed over strings, so I am converting every non-string column to string first. Most of my issues are with date columns: the date format needs to be normalized before the conversion to string so that the hash-based matching is consistent. Please help me with the approach.

# Identify the fields which are not strings
from pyspark.sql.types import StringType, DateType
from pyspark.sql.functions import col

fields = df_db1.schema.fields
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))

# Convert the date fields to a specific date format and then to string.
DateFields = map(lambda f: col(f.name), filter(lambda f: isinstance(f.dataType, DateType), fields))

# Convert all other non-string fields to string.

1 Answer


For numeric and date fields you can use cast:

# filter the date fields
DateFields = filter(lambda f: isinstance(f.dataType, DateType), fields)

# cast each date column to string, keeping the original column name
dateFieldsWithCast = map(lambda f: col(f.name).cast("string").alias(f.name), DateFields)
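As a quick check that the cast worked (assuming df_db1 from the question; the select consumes the generator built above):

# select only the date columns, now cast to string
df_dates = df_db1.select(*dateFieldsWithCast)
df_dates.printSchema()  # each selected column should now be of string type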

In an analogous way you can build lists of columns of LongType, etc., and then select them all at once, as in this answer.
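Putting it all together, here is a minimal sketch of the whole approach, assuming df_db1 as in the question; the "yyyy-MM-dd HH:mm:ss" format and the "||" separator are illustrative choices, and sha2 stands in for whichever hash from pyspark.sql.functions you prefer:

from pyspark.sql.functions import col, concat_ws, date_format, sha2
from pyspark.sql.types import DateType, StringType, TimestampType

def to_string_columns(df):
    # Build one select-list that normalizes every column to string.
    cols = []
    for f in df.schema.fields:
        c = col(f.name)
        if isinstance(f.dataType, (DateType, TimestampType)):
            # Pin the format so both DataFrames render dates identically.
            cols.append(date_format(c, "yyyy-MM-dd HH:mm:ss").alias(f.name))
        elif not isinstance(f.dataType, StringType):
            cols.append(c.cast("string").alias(f.name))
        else:
            cols.append(c)
    return cols

normalized = df_db1.select(*to_string_columns(df_db1))

# One hash per row; note that concat_ws silently skips nulls, so coalesce
# nulls to a sentinel value first if null vs. empty string must differ.
hashed = normalized.withColumn(
    "row_hash", sha2(concat_ws("||", *normalized.columns), 256)
)

Repeating the same on the second DataFrame lets you compare the row_hash columns (e.g. via a join or subtract) to find the differing rows.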
