How to compare 2 columns in a PySpark DataFrame using assert functions

Question

I am using the code below to compare 2 columns in a DataFrame. I don't want to do it in pandas. Can someone show me how to do the comparison using Spark DataFrames?

    df1=context.spark.read.option("header",True).csv("./test/input/test/Book1.csv",) 
    df1=df1.withColumn("Curated", dataclean.clean_email(col("email")))
    df1.show()
    assert_array_almost_equal(df1['expected'], df1['Curated'],verbose=True)

abiratsis · Accepted Answer · 2022-09-18 10:45:02Z

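One efficient way is to look for the first difference and stop as soon as one is found. A left-anti join returns the rows of the left side that have no match on the right, so the assertion should pass only when that result is empty. Note that the join matches values across rows, so it checks that every `expected` value occurs somewhere in `Curated` rather than comparing row by row:

    mismatches = df1.alias("l").join(df1.alias("r"), col("l.expected") == col("r.Curated"), "leftanti")
    assert mismatches.first() is None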
