PySpark: remove duplicates based on 2 columns

Question

I have the following df in PySpark:

+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname|  ncf|      date|salary|
+---------+----------+--------+-----+----------+------+
|    James|          |       V|36636|2021-09-03|  3000| remove
|  Michael|      Rose|        |40288|2021-09-10|  4000|
|   Robert|          |Williams|42114|2021-08-03|  4000|
|    Maria|      Anne|   Jones|39192|2021-05-13|  4000|
|      Jen|      Mary|   Brown|     |2020-09-03|    -1|
|    James|          |   Smith|36636|2021-09-03|  3000| remove
|    James|          |   Smith|36636|2021-09-04|  3000|
+---------+----------+--------+-----+----------+------+

I need to remove the rows that share the same ncf and date with another row (the ones marked "remove" above). The resulting df should be:

+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname|  ncf|      date|salary|
+---------+----------+--------+-----+----------+------+
|  Michael|      Rose|        |40288|2021-09-10|  4000|
|   Robert|          |Williams|42114|2021-08-03|  4000|
|    Maria|      Anne|   Jones|39192|2021-05-13|  4000|
|      Jen|      Mary|   Brown|     |2020-09-03|    -1|
|    James|          |   Smith|36636|2021-09-04|  3000|
+---------+----------+--------+-----+----------+------+

Have you tried distinct() on it?

– user17243995, Commented Oct 25, 2021 at 17:34

greenie · Accepted Answer · 2021-10-25 18:11:39Z

The dropDuplicates method removes duplicate rows based on a subset of columns:

df.dropDuplicates(['ncf', 'date'])

pltc Over a year ago

I didn't know dropDuplicates existed, so good to know! However, the OP wants to remove both rows, so dropDuplicates wouldn't work in this case.

pltc · Accepted Answer · 2021-10-25 18:03:00Z

You can use a window function to count the rows that share the same ncf and date, and flag every group that has two or more rows:

from pyspark.sql import functions as F
from pyspark.sql import Window as W


df.withColumn('duplicated', F.count('*').over(W.partitionBy('ncf', 'date').orderBy(F.lit(1))) > 1)

# +---------+----------+--------+-----+----------+------+----------+
# |firstname|middlename|lastname|  ncf|      date|salary|duplicated|
# +---------+----------+--------+-----+----------+------+----------+
# |      Jen|      Mary|   Brown|     |2020-09-03|    -1|     false|
# |    James|          |       V|36636|2021-09-03|  3000|      true|
# |    James|          |   Smith|36636|2021-09-03|  3000|      true|
# |  Michael|      Rose|        |40288|2021-09-10|  4000|     false|
# |   Robert|          |Williams|42114|2021-08-03|  4000|     false|
# |    James|          |   Smith|36636|2021-09-04|  3000|     false|
# |    Maria|      Anne|   Jones|39192|2021-05-13|  4000|     false|
# +---------+----------+--------+-----+----------+------+----------+

You can now filter on the duplicated column to keep only the rows you want.
