
Let's say I am trying to find how many duplicates I have for a pair of values in a table. The columns are "A" and "B". I can do

select A, B, count(*) as counter from table group by A, B

In fact, I could also do

select A, B from (select A, B, count(*) as counter from table group by A, B) t where counter >= 2

to deal only with pairs that occur at least n times.
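For what it's worth, the subquery can also be avoided with HAVING, which filters groups directly. A minimal sketch run against an in-memory SQLite table (the table name t and the sample rows are assumptions for illustration):

```python
import sqlite3

# Hypothetical setup: a tiny table "t" with the two columns of interest
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("x", "a"), ("x", "a"), ("x", "b"), ("y", "b"), ("y", "a")])

# HAVING filters the grouped rows, so no derived table is needed
rows = conn.execute(
    "SELECT A, B, COUNT(*) AS counter FROM t "
    "GROUP BY A, B HAVING counter >= 2"
).fetchall()
print(rows)  # [('x', 'a', 2)]
```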

How can I do the same in pandas?

I can do

df.groupby(["A", "B"]).count()

but that gives me every group; I only want those where the count >= 2.

For example if I have:

   A  B  C
0  x  a  1
1  x  a  1
2  x  b  2
3  y  b  3
4  y  a  1

I want to identify the first two rows, because groupby() gives a count of 2 (the pair (x, a) is repeated). I would like to do the same for any count n, not just 2.
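For reference, the frame above can be rebuilt and the per-pair counts inspected with size(), which counts rows per group regardless of NaNs (a sketch; the filter threshold 2 is just the example from the question):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": ["a", "a", "b", "b", "a"],
                   "C": [1, 1, 2, 3, 1]})

counts = df.groupby(["A", "B"]).size()  # one count per (A, B) pair
print(counts[counts >= 2])              # only (x, a) appears twice
```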

  • Can you show us some sample data? Commented Apr 25, 2019 at 2:32

1 Answer


It seems you can filter after groupby:

df.groupby(["A", "B"])['A'].count().loc[lambda x : x >= 2]
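Applied to the sample frame from the question, a variant of that filter (with the threshold written as >= 2 to match the question) keeps only the repeated pairs:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": ["a", "a", "b", "b", "a"],
                   "C": [1, 1, 2, 3, 1]})

# count() the groups, then filter the resulting Series by its own values
dupes = df.groupby(["A", "B"])["A"].count().loc[lambda x: x >= 2]
print(dupes.index.tolist())  # [('x', 'a')]
```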

Update: use duplicated

df[df.duplicated(['A','B'],keep=False)]
Out[1178]: 
   A  B  C
0  x  a  1
1  x  a  1

Use transform for a different n:

n=2

df[df.groupby(['A','B'])['A'].transform('count')==n]
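Run against the sample data, the transform approach returns the duplicated rows themselves rather than the group counts (a sketch with n = 2, as in the answer):

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": ["a", "a", "b", "b", "a"],
                   "C": [1, 1, 2, 3, 1]})

n = 2
# transform broadcasts each group's count back onto its rows,
# so the resulting mask lines up with df for boolean indexing
result = df[df.groupby(["A", "B"])["A"].transform("count") == n]
print(result)  # rows 0 and 1, the repeated (x, a) pair
```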

6 Comments

What if I want the count = 3, or something larger?
Thank you! Why do I need ['A'] before .transform? What does that do? Can I put any column?
@user that is just required for count; you need something to count.
So, I guess there is no option like * in SQL? What if my column had a null in it?
@user if you need to count all columns you can do df.groupby(['A','B']).transform('count')
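On the null question: count skips NaN, so transform('count') can give a smaller number than the group's row count; transform('size') counts rows like SQL's count(*). A sketch with made-up data containing one NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "x", "y"],
                   "B": ["a", "a", "a", "b"],
                   "C": [1.0, np.nan, 2.0, 3.0]})

# The (x, a) group has 3 rows but only 2 non-null C values
print(df.groupby(["A", "B"])["C"].transform("count").tolist())  # non-null counts
print(df.groupby(["A", "B"])["C"].transform("size").tolist())   # row counts
```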
