How to find identical rows in a DataFrame (Python)

Question

I have a DataFrame A as follows, and I want to find the rows with the same values in their first 3 columns.

import pandas as pd
import io
import numpy as np
import datetime
A= """
   c0   c1   c2   c3   c4
0  1    a    d    3    4
1  1    a    c    0    0
2  1    a    d    3    1
3  1    b    c    0    0
4  2    b    d    8    5
5  2    b    d    3    3
    """

df = pd.read_csv(io.StringIO(A), delimiter='\s+')
df2= pd.DataFrame(df, columns=['c0', 'c1', 'c2'])
for i in range(0,4):
    row1 = df2.irow(i)
    row2 = df2.irow(i+1)
    val=all(unique_columns = row1 != row2)   
    print(i)

I want it to print 2, 5.

This does not work, and even if it did, it would only compare rows that directly follow each other.

Alternatively, I tried np.unique(df2) to see if the number of columns is different from df2, but that didn't work either.

Any help is appreciated.

...but only row 2 has the same values in c0-c2 as row 0; row 6 does not.
– CT Zhu, Commented Nov 9, 2015 at 16:59

EdChum · Accepted Answer · 2015-11-09 17:11:55Z

4

IIUC then use duplicated:

In [132]:
df2.index[df2.duplicated()]

Out[132]:
Int64Index([2, 5], dtype='int64')

This works because duplicated() flags every row whose values repeat those of an earlier row. Since df2 contains only the columns of interest, all of its columns are compared.
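As a minimal sketch of what happens under the hood, the snippet below rebuilds the question's data by hand (rather than parsing the text block) and prints the boolean mask that duplicated() returns. The names `mask` and `dup_labels` are only illustrative.

```python
import pandas as pd

# The question's data, typed in directly so the example is self-contained.
df = pd.DataFrame({
    "c0": [1, 1, 1, 1, 2, 2],
    "c1": ["a", "a", "a", "b", "b", "b"],
    "c2": ["d", "c", "d", "c", "d", "d"],
    "c3": [3, 0, 3, 0, 8, 3],
    "c4": [4, 0, 1, 0, 5, 3],
})

# Keep only the three columns that should be compared.
df2 = df[["c0", "c1", "c2"]]

# duplicated() returns a boolean Series: True where a row repeats an earlier row.
mask = df2.duplicated()
print(mask.tolist())   # [False, False, True, False, False, True]

# Indexing the index with the mask yields the labels of the repeated rows.
dup_labels = df2.index[mask].tolist()
print(dup_labels)      # [2, 5]
```

The first occurrence of each combination (rows 0 and 4) is not flagged, which is why only 2 and 5 come back.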

EDIT

df2 seems superfluous here; you can just do:

In [133]:
df.index[df.duplicated(subset=['c0', 'c1', 'c2'])]

Out[133]:
Int64Index([2, 5], dtype='int64')
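If the first occurrence of each repeated combination is also wanted, the `keep` parameter of duplicated() may help. The following is a sketch on the question's data, rebuilt by hand; `key_cols`, `repeats`, `all_members` and `earlier` are illustrative names, not part of the original answer.

```python
import pandas as pd

# The question's data, typed in directly so the example is self-contained.
df = pd.DataFrame({
    "c0": [1, 1, 1, 1, 2, 2],
    "c1": ["a", "a", "a", "b", "b", "b"],
    "c2": ["d", "c", "d", "c", "d", "d"],
    "c3": [3, 0, 3, 0, 8, 3],
    "c4": [4, 0, 1, 0, 5, 3],
})

key_cols = ["c0", "c1", "c2"]

# Default keep="first": flag every occurrence except the first one.
repeats = df.index[df.duplicated(subset=key_cols)].tolist()
print(repeats)       # [2, 5]

# keep=False: flag every row that belongs to a repeated combination.
all_members = df.index[df.duplicated(subset=key_cols, keep=False)].tolist()
print(all_members)   # [0, 2, 4, 5]

# keep="last": flag every occurrence except the last one.
earlier = df.index[df.duplicated(subset=key_cols, keep="last")].tolist()
print(earlier)       # [0, 4]
```

Passing the same mask to `df[...]` instead of `df.index[...]` returns the full rows, including c3 and c4.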

edited Nov 9, 2015 at 17:11

answered Nov 9, 2015 at 17:02


2 Comments

Leb Over a year ago

Possibly include subset since only first 3 columns are needed.

Leb Over a year ago

You're right, the OP might need to consider removing df2 to prevent unnecessary steps and possibly doubling the data

luismf · Accepted Answer · 2015-11-09 17:01:57Z

1

In [211]: df.groupby(['c0','c1','c2']).indices
Out[211]:
{(1, 'a', 'c'): array([1]),
 (1, 'a', 'd'): array([0, 2]),
 (1, 'b', 'c'): array([3]),
 (2, 'b', 'd'): array([4, 5])}

This should do the trick.
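To get from that dictionary to the rows the question asks for, one possible follow-up is to keep only the groups with more than one member. This is a sketch on the question's data, rebuilt by hand; `dup_groups` and `repeats` are illustrative names. Note that `.indices` holds row positions, which coincide with the index labels here only because the frame uses the default 0..n-1 index.

```python
import pandas as pd

# The question's data, typed in directly so the example is self-contained.
df = pd.DataFrame({
    "c0": [1, 1, 1, 1, 2, 2],
    "c1": ["a", "a", "a", "b", "b", "b"],
    "c2": ["d", "c", "d", "c", "d", "d"],
    "c3": [3, 0, 3, 0, 8, 3],
    "c4": [4, 0, 1, 0, 5, 3],
})

# Map each (c0, c1, c2) combination to the positions of its rows.
groups = df.groupby(["c0", "c1", "c2"]).indices

# Keep only combinations that occur more than once.
dup_groups = {key: pos.tolist() for key, pos in groups.items() if len(pos) > 1}
print(dup_groups)

# Positions of every row after the first one in each repeated group.
repeats = sorted(p for pos in dup_groups.values() for p in pos[1:])
print(repeats)   # [2, 5]
```

Compared with duplicated(), this keeps the information about which rows belong together, at the cost of a little more code.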

answered Nov 9, 2015 at 17:01


1 Comment

Ana Over a year ago

This is great for when you actually care about the groups and want to categorize your data. Thanks.
