
I'm dealing with a very large DataFrame and I'm using pandas to do the analysis. The data frame is structured as follows:

import pandas as pd

df = pd.read_csv("data.csv")
df.head()

    Source  Target  Weight
0       0   25846       1
1       0    1916       1
2   25846       0       1
3       0    4748       1
4       0   16856       1

The issue is that I want to remove all the "duplicates", in the sense that if I already have a row with a given Source and Target, I do not want the same pair to appear again with the values swapped. For instance, rows 0 and 2 are "duplicates" in this sense and only one of them should be retained.

A simple way to get rid of all the "duplicates" is

for index, row in df.iterrows():
    df = df[~((df.Source==row.Target)&(df.Target==row.Source))]

However, this approach is horribly slow since my data frame has about 3 million rows. Do you think there's a better way of doing this?

3 Answers


Create two temporary columns to hold minimum(df.Source, df.Target) and maximum(df.Source, df.Target), and then flag repeated rows with the duplicated() method:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, (20, 2)), columns=["Source", "Target"])

# Order each pair so that (a, b) and (b, a) become identical rows
df["T1"] = np.minimum(df.Source, df.Target)
df["T2"] = np.maximum(df.Source, df.Target)

# Keep only the first occurrence of each ordered pair
df[~df[["T1", "T2"]].duplicated()]



No need (as usual) to use a loop with a DataFrame. Use the Series.isin method:

So start with this:

import pandas

df = pandas.DataFrame({
    'src': [0, 0, 25, 0, 0],
    'tgt': [25, 12, 0, 85, 363]
})

print(df)

   src  tgt
0    0   25
1    0   12
2   25    0
3    0   85
4    0  363

Then drop every row whose src value appears in the tgt column and whose tgt value appears in the src column:

df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))]

   src  tgt
1    0   12
3    0   85
4    0  363

4 Comments

This will remove rows that aren't really duplicates. Consider df = pd.DataFrame({"src": [1,2,3], "tgt": [2,3,1]}).
@DSM I don't think the OP means "duplicate" the way most pandas power users mean it. Given the limited amount of sample data provided, this reproduces the output from the OP's code.
Yeah, I think I get that, but IIUC the OP wants to remove [2,1] if he's already seen [1,2]. The example I gave doesn't have any "duplicates" in that sense, but your code removes them all.
@DSM I think you're right -- but I'll leave this here as-is since it is still a loopless improvement over the original code.
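For reference, DSM's counter-example from the comments can be checked directly. A quick sketch (not part of the original answer): with the frame below, the isin filter removes every row even though no reversed pair occurs twice.

import pandas as pd

df = pd.DataFrame({"src": [1, 2, 3], "tgt": [2, 3, 1]})

# Every src value also appears in tgt and vice versa, so the combined
# mask is True for all three rows and the filter returns an empty frame.
print(df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))])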

Your Source and Target values appear to be mutually exclusive (i.e. in each row one of them is zero). Why not add them together (e.g. 25846 + 0) to get a unique identifier? You can then delete the unneeded Target column (reducing memory) and drop duplicates. In the event your weights are not the same, drop_duplicates will keep the first one by default.

# Fold the non-zero value into Source, drop Target, then de-duplicate
df.Source += df.Target
df.drop('Target', axis=1, inplace=True)
df.drop_duplicates(inplace=True)

>>> df
   Source  Weight
0   25846       1
1    1916       1
3    4748       1
4   16856       1

2 Comments

Even if that made sense (say Source and Target are position labels, and Weight is a distance: then it wouldn't really make sense to add Source and Target), what if there are Source/Target rows of 0,2 and 1,1? [Oh, wait, sorry -- you're assuming that doesn't happen. I still don't think this makes sense in the OP's context, but you've explicitly ruled out that case.]
It is based on the assumption that there is only one value in a given source/target row, the other being zero. It is true in the example above with 5 rows, but obviously depends on knowledge of how the data was encoded. If the assumption holds, this method should be very efficient.
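If you want to verify that assumption before relying on this shortcut, a one-line check (a sketch, using the question's column names) is:

# Holds only if at least one of Source/Target is zero in every row
assert ((df.Source == 0) | (df.Target == 0)).all()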
