
I am trying to apply the PySpark SQL hash function to every row of two DataFrames to identify the differences between them. The hash is computed over strings, so I am converting every non-string column to string first. Most of my issues are with date columns: the date format needs to be normalized before the conversion to string so that the hash-based matching is consistent. Please help me with the approach.

# Identify the fields which are not strings
from pyspark.sql.types import StringType, DateType
from pyspark.sql.functions import col

fields = df_db1.schema.fields
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))

# Convert the date fields to a specific date format and then to string.
DateFields = map(lambda f: col(f.name), filter(lambda f: isinstance(f.dataType, DateType), fields))

# Convert all other non-string fields to string.

1 Answer


For numeric and date fields you can use cast:

# filter the date fields
DateFields = filter(lambda f: isinstance(f.dataType, DateType), fields)

# cast each date column to string, keeping the original column name
dateFieldsWithCast = map(lambda f: col(f.name).cast("string").alias(f.name), DateFields)
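As a quick check that the cast worked (assuming df_db1 from the question; the select consumes the generator built above):

# select only the date columns, now cast to string
df_dates = df_db1.select(*dateFieldsWithCast)
df_dates.printSchema()  # each selected column should now be of string type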

In an analogous way you can build lists of columns of LongType, etc., and then select them all at once, as in this answer.
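Putting it all together, here is a minimal sketch of the whole approach, assuming df_db1 as in the question; the "yyyy-MM-dd HH:mm:ss" format and the "||" separator are illustrative choices, and sha2 stands in for whichever hash from pyspark.sql.functions you prefer:

from pyspark.sql.functions import col, concat_ws, date_format, sha2
from pyspark.sql.types import DateType, StringType, TimestampType

def to_string_columns(df):
    # Build one select-list that normalizes every column to string.
    cols = []
    for f in df.schema.fields:
        c = col(f.name)
        if isinstance(f.dataType, (DateType, TimestampType)):
            # Pin the format so both DataFrames render dates identically.
            cols.append(date_format(c, "yyyy-MM-dd HH:mm:ss").alias(f.name))
        elif not isinstance(f.dataType, StringType):
            cols.append(c.cast("string").alias(f.name))
        else:
            cols.append(c)
    return cols

normalized = df_db1.select(*to_string_columns(df_db1))

# One hash per row; note that concat_ws silently skips nulls, so coalesce
# nulls to a sentinel value first if null vs. empty string must differ.
hashed = normalized.withColumn(
    "row_hash", sha2(concat_ws("||", *normalized.columns), 256)
)

Repeating the same on the second DataFrame lets you compare the row_hash columns (e.g. via a join or subtract) to find the differing rows.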
