I have the following DataFrame in PySpark:

+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname|  ncf|      date|salary|
+---------+----------+--------+-----+----------+------+
|    James|          |       V|36636|2021-09-03|  3000| remove
|  Michael|      Rose|        |40288|2021-09-10|  4000|
|   Robert|          |Williams|42114|2021-08-03|  4000|
|    Maria|      Anne|   Jones|39192|2021-05-13|  4000|
|      Jen|      Mary|   Brown|     |2020-09-03|    -1|
|    James|          |   Smith|36636|2021-09-03|  3000| remove
|    James|          |   Smith|36636|2021-09-04|  3000|
+---------+----------+--------+-----+----------+------+

I need to remove every row whose ncf and date values both match those of another row (the rows marked "remove" above). The expected result is:

+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname|  ncf|      date|salary|
+---------+----------+--------+-----+----------+------+
|  Michael|      Rose|        |40288|2021-09-10|  4000|
|   Robert|          |Williams|42114|2021-08-03|  4000|
|    Maria|      Anne|   Jones|39192|2021-05-13|  4000|
|      Jen|      Mary|   Brown|     |2020-09-03|    -1|
|    James|          |   Smith|36636|2021-09-04|  3000|
+---------+----------+--------+-----+----------+------+
  • Have you tried distinct() on it? Commented Oct 25, 2021 at 17:34

2 Answers

The dropDuplicates method removes duplicate rows based on a subset of columns, keeping one row from each group of duplicates:

df.dropDuplicates(['ncf', 'date'])
1 Comment

I didn't know dropDuplicates existed, so good to know! However, the OP wants to remove both rows of a duplicated pair, so dropDuplicates wouldn't work in this case.
You can use a window function to count how many rows share the same ncf and date, and flag the groups that contain two or more rows:

from pyspark.sql import functions as F
from pyspark.sql import Window as W


# Count the rows in each (ncf, date) partition; no ordering is needed for a whole-partition count.
df.withColumn('duplicated', F.count('*').over(W.partitionBy('ncf', 'date')) > 1).show()

# +---------+----------+--------+-----+----------+------+----------+
# |firstname|middlename|lastname|  ncf|      date|salary|duplicated|
# +---------+----------+--------+-----+----------+------+----------+
# |      Jen|      Mary|   Brown|     |2020-09-03|    -1|     false|
# |    James|          |       V|36636|2021-09-03|  3000|      true|
# |    James|          |   Smith|36636|2021-09-03|  3000|      true|
# |  Michael|      Rose|        |40288|2021-09-10|  4000|     false|
# |   Robert|          |Williams|42114|2021-08-03|  4000|     false|
# |    James|          |   Smith|36636|2021-09-04|  3000|     false|
# |    Maria|      Anne|   Jones|39192|2021-05-13|  4000|     false|
# +---------+----------+--------+-----+----------+------+----------+

You can now filter on the duplicated column to keep only the rows where it is false, then drop the column.
