
I'm dealing with a very large DataFrame and I'm using pandas to do the analysis. The data frame is structured as follows:

import pandas as pd

df = pd.read_csv("data.csv")
df.head()

    Source  Target  Weight
0       0   25846       1
1       0    1916       1
2   25846       0       1
3       0    4748       1
4       0   16856       1

The issue is that I want to remove all the "duplicates", in the sense that if I already have a row with a given Source and Target, I do not want the same pair to appear again with the values swapped. For instance, rows 0 and 2 are "duplicates" in this sense and only one of them should be retained.

A simple way to get rid of all the "duplicates" is

for index, row in df.iterrows():
    df = df[~((df.Source==row.Target)&(df.Target==row.Source))]

However, this approach is horribly slow since my data frame has about 3 million rows. Do you think there's a better way of doing this?

3 Answers


Create two temporary columns to hold minimum(df.Source, df.Target) and maximum(df.Source, df.Target), and then flag repeated rows with the duplicated() method:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, (20, 2)), columns=["Source", "Target"])

# Order each pair so that (a, b) and (b, a) become identical rows
df["T1"] = np.minimum(df.Source, df.Target)
df["T2"] = np.maximum(df.Source, df.Target)

# Keep only the first occurrence of each ordered pair
df[~df[["T1", "T2"]].duplicated()]



No need (as usual) to use a loop with a DataFrame. Use the Series.isin method:

So start with this:

import pandas

df = pandas.DataFrame({
    'src': [0, 0, 25, 0, 0],
    'tgt': [25, 12, 0, 85, 363]
})

print(df)

   src  tgt
0    0   25
1    0   12
2   25    0
3    0   85
4    0  363

Then drop every row whose src value appears in the tgt column and whose tgt value appears in the src column:

df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))]

   src  tgt
1    0   12
3    0   85
4    0  363

4 Comments

This will remove rows that aren't really duplicates. Consider df = pd.DataFrame({"src": [1,2,3], "tgt": [2,3,1]}).
@DSM I don't think the OP means "duplicate" the way most pandas power users mean it. Given the limited amount of sample data provided, this reproduces the output from the OP's code.
Yeah, I think I get that, but IIUC the OP wants to remove [2,1] if he's already seen [1,2]. The example I gave doesn't have any "duplicates" in that sense, but your code removes them all.
@DSM I think you're right -- but I'll leave this here as-is since it is still a loopless improvement over the original code.
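For reference, DSM's counter-example from the comments can be checked directly. A quick sketch (not part of the original answer): with the frame below, the isin filter removes every row even though no reversed pair occurs twice.

import pandas as pd

df = pd.DataFrame({"src": [1, 2, 3], "tgt": [2, 3, 1]})

# Every src value also appears in tgt and vice versa, so the combined
# mask is True for all three rows and the filter returns an empty frame.
print(df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))])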

Your Source and Target values appear to be mutually exclusive (i.e. in each row one of them is zero). Why not add them together (e.g. 25846 + 0) to get a unique identifier? You can then delete the unneeded Target column (reducing memory) and drop duplicates. In the event your weights are not the same, drop_duplicates will keep the first one by default.

# Fold the non-zero value into Source, drop Target, then de-duplicate
df.Source += df.Target
df.drop('Target', axis=1, inplace=True)
df.drop_duplicates(inplace=True)

>>> df
   Source  Weight
0   25846       1
1    1916       1
3    4748       1
4   16856       1

2 Comments

Even if that made sense (say Source and Target are position labels, and Weight is a distance: then it wouldn't really make sense to add Source and Target), what if there are Source/Target rows of 0,2 and 1,1? [Oh, wait, sorry -- you're assuming that doesn't happen. I still don't think this makes sense in the OP's context, but you've explicitly ruled out that case.]
It is based on the assumption that there is only one value in a given source/target row, the other being zero. It is true in the example above with 5 rows, but obviously depends on knowledge of how the data was encoded. If the assumption holds, this method should be very efficient.
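If you want to verify that assumption before relying on this shortcut, a one-line check (a sketch, using the question's column names) is:

# Holds only if at least one of Source/Target is zero in every row
assert ((df.Source == 0) | (df.Target == 0)).all()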
