I'm dealing with a very large DataFrame and I'm using pandas to do the analysis.
The DataFrame is structured as follows:
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
   Source  Target  Weight
0       0   25846       1
1       0    1916       1
2   25846       0       1
3       0    4748       1
4       0   16856       1
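In case it helps to reproduce the problem without the CSV, the head() output above corresponds to a small frame like this (values copied from the table):

import pandas as pd

# Sample frame with the values shown in df.head() above
df = pd.DataFrame({
    "Source": [0, 0, 25846, 0, 0],
    "Target": [25846, 1916, 0, 4748, 16856],
    "Weight": [1, 1, 1, 1, 1],
})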
The issue is that I want to remove all the "duplicates", in the sense that once a (Source, Target) pair appears in one row, the same pair should not appear in any other row, even with Source and Target swapped.
For instance, rows 0 and 2 above are "duplicates" in this sense, and only one of them should be retained.
A simple way to get rid of all the "duplicates" is
for index, row in df.iterrows():
    # drop every row whose (Source, Target) is the reverse of the current pair
    df = df[~((df.Source == row.Target) & (df.Target == row.Source))]
However, this approach is horribly slow, since my DataFrame has about 3 million rows. Is there a better way of doing this?
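For what it's worth, the direction I've been considering is to sort each Source/Target pair within its row so that reversed pairs become identical, and then let duplicated() find the repeats in one vectorized pass. I haven't verified that this matches the loop's output on my real data, so treat it as a sketch:

import numpy as np
import pandas as pd

# Sort Source/Target within each row so (a, b) and (b, a) look the same
pairs = np.sort(df[["Source", "Target"]].to_numpy(), axis=1)

# Keep only the first occurrence of each unordered pair
mask = pd.DataFrame(pairs, index=df.index).duplicated()
df_unique = df[~mask]

This keeps the first row of each unordered pair, but I don't know whether it's the fastest option for a frame of this size.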