I am trying to de-duplicate rows in pandas. My data contains millions of duplicated rows, which makes it unsuitable for what I'm trying to do.
From this:
   col1  col2
0     1    23
1     1    47
2     1    58
3     1     9
4     1     4
I want to get this:
   col1                col2
0     1  [23, 47, 58, 9, 4]
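For reference, here is a minimal reproducible version of the input above (the variable names are just placeholders):

```python
import pandas as pd

# Same values as the example table above
df = pd.DataFrame({"col1": [1, 1, 1, 1, 1],
                   "col2": [23, 47, 58, 9, 4]})
print(df)
```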
I have managed to do it manually by writing individual scripts for each spreadsheet, but it would really be great to have a more generalized way of doing it.
So far I've tried:
def remove_duplicates(self, df):
    ids = df[self.key_field].unique()
    numdicts = []
    for i in ids:
        instdict = {self.key_field: i}
        for col in self.deduplicate_fields:
            xf = df.loc[df[self.key_field] == i]
            instdict[col] = str(list(xf[col]))
        numdicts.append(instdict)
    for n in numdicts:
        print(pd.DataFrame(data=n, index=self.key_field))
    return df
But unbelievably, this returns the same thing I started with.
The only way I've managed it so far is to manually create a list for each column, loop through the unique index keys in the dataframe, append every duplicate row's values to those lists, then zip the lists together and build a new dataframe from them.
However, this doesn't seem to work when there is an unknown number of columns that need to be de-duplicated.
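To illustrate what I mean by "generalized", here is a rough sketch of the behaviour I'm after, using groupby (the function name and column names are made up for the example, and I'm not sure this is the idiomatic way):

```python
import pandas as pd

def collapse_duplicates(df, key_field, dedup_fields):
    # Group on the key column and collect each remaining column's
    # values into a list; dedup_fields can be any number of columns.
    return df.groupby(key_field, as_index=False)[list(dedup_fields)].agg(list)

# Example with two columns to collapse and two distinct keys
df = pd.DataFrame({"col1": [1, 1, 2],
                   "col2": [23, 47, 58],
                   "col3": ["a", "b", "c"]})
print(collapse_duplicates(df, "col1", ["col2", "col3"]))
```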
Any better way of doing this would be appreciated!
Thanks in advance!