
I am trying to de-duplicate rows in pandas. I have millions of rows containing duplicates, and that isn't suitable for what I'm trying to do.

From this:

   col1  col2
0     1     23
1     1     47
2     1     58
3     1     9
4     1     4

I want to get this:

   col1  col2
0     1     [23, 47, 58, 9, 4]

I have managed to do it manually by writing individual scripts for each spreadsheet, but it would really be great to have a more generalized way of doing it.

So far I've tried:

def remove_duplicates(self, df):
    ids = df[self.key_field].unique()
    numdicts = []
    for i in ids:
        instdict = {self.key_field: i}
        for col in self.deduplicate_fields:
            xf = df.loc[df[self.key_field] == i]
            instdict[col] = str(list(xf[col]))
        numdicts.append(instdict)

    for n in numdicts:
        print(pd.DataFrame(data=n, index=self.key_field))
    return df

But unbelievably, this returns the same thing I started with.

The only way I've managed it so far is to create a list for each column manually, loop through the unique index keys from the dataframe, append all of the duplicates' values to those lists, then zip the lists and create a dataframe from them.

However, this doesn't seem to work when there is an unknown number of columns that need to be de-duplicated.
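For context, the manual version I described looks roughly like this (a sketch with the column names hard-coded, which is exactly the problem):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1],
                   'col2': [23, 47, 58, 9, 4]})

col1_vals, col2_vals = [], []
for key in df['col1'].unique():
    col1_vals.append(key)
    # gather every duplicate row's value for this key into one list
    col2_vals.append(list(df.loc[df['col1'] == key, 'col2']))

result = pd.DataFrame(list(zip(col1_vals, col2_vals)),
                      columns=['col1', 'col2'])
```

This works, but every new column means another hand-written list.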

Any better way of doing this would be appreciated!

Thanks in advance!

  • For millions of rows, do you really want to put lists in a dataframe? The efficiency gain of reducing rows can easily be lost by all those pointers in lists. In addition, you lose the ability to do vectorised calculations. Commented Apr 19, 2018 at 12:48
  • Interesting, thanks! Unfortunately I'm stuck with it this way because of restrictions we have with a development provider we are using. Commented Apr 19, 2018 at 12:53

3 Answers


Is this what you are looking for, when you only need one column:

df.groupby('col1')['col2'].apply(lambda x: list(x)).reset_index()
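As a quick check, on the sample data from the question this collapses the five duplicate rows into one (`apply(list)` works equally well here):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1],
                   'col2': [23, 47, 58, 9, 4]})

# each group's col2 values are gathered into a single list
out = df.groupby('col1')['col2'].apply(lambda x: list(x)).reset_index()
# out.loc[0, 'col2'] == [23, 47, 58, 9, 4]
```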

And to do it for all of the other columns at once, use agg:

df.groupby('col1').agg(lambda x: list(x)).reset_index()

With agg you can also specify which columns to use:

df.groupby('col1')[['col2', 'col3']].agg(lambda x: list(x)).reset_index()
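If the set of columns isn't known up front, the same idea can be wrapped in a small helper (a sketch; `key_field` and `dedup_fields` are stand-ins for the attributes in the question's code):

```python
import pandas as pd

def collapse_duplicates(df, key_field, dedup_fields):
    # group on the key and collect each listed column's values into a list
    return df.groupby(key_field)[list(dedup_fields)].agg(list).reset_index()

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [23, 47, 58],
                   'col3': ['a', 'b', 'c']})

result = collapse_duplicates(df, 'col1', ['col2', 'col3'])
```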

4 Comments

Wow, that's so weird. When I put a list in the place of "col2", I get the actual column names as the values after the "col1" field... so when I use that, it works perfectly for ONE column. I'm after doing the same for an arbitrary number of columns.
Yes, apply is meant to work for your example and for multiple columns use agg.
Thanks - this is brilliant.
With apply you could just use list... df.groupby('col1').col2.apply(list)

You can try the following:

df.groupby('col1').agg(lambda x: list(x))

2 Comments

Thanks, but I get an error: Data must be 1 dimensional
Probably because you already have lists in col2, you should use list instead then. I will update my answer.

For multiple columns it should look like this instead to avoid errors:

df.groupby('col1')[['col2','col3']].agg(lambda x: list(x)).reset_index()
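For example, with a second value column added to the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [23, 47, 58],
                   'col3': [9, 4, 5]})

# one row per key; each column's duplicate values end up in a list
out = df.groupby('col1')[['col2', 'col3']].agg(lambda x: list(x)).reset_index()
```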
