
I am trying to de-duplicate rows in pandas. I have millions of rows containing duplicates, and that isn't suitable for what I'm trying to do.

From this:

   col1  col2
0     1     23
1     1     47
2     1     58
3     1     9
4     1     4

I want to get this:

   col1  col2
0     1     [23, 47, 58, 9, 4]

I have managed to do it manually by writing individual scripts for each spreadsheet, but it would really be great to have a more generalized way of doing it.

So far I've tried:

def remove_duplicates(self, df):
    ids = df[self.key_field].unique()
    numdicts = []
    for i in ids:
        instdict = {self.key_field: i}
        for col in self.deduplicate_fields:
            xf = df.loc[df[self.key_field] == i]
            instdict[col] = str(list(xf[col]))
        numdicts.append(instdict)

    for n in numdicts:
        print(pd.DataFrame(data=n, index=self.key_field))
    return df

But unbelievably, this returns the same thing I started with.

The only way I've managed it so far is to create a list for each column manually, loop through the unique index keys from the dataframe, append all of the duplicates' values to those lists, then zip the lists and create a dataframe from them.

However, this doesn't seem to work when there is an unknown number of columns that need to be de-duplicated.
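For context, the manual version I described looks roughly like this (a sketch with the column names hard-coded, which is exactly the problem):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1],
                   'col2': [23, 47, 58, 9, 4]})

col1_vals, col2_vals = [], []
for key in df['col1'].unique():
    col1_vals.append(key)
    # gather every duplicate row's value for this key into one list
    col2_vals.append(list(df.loc[df['col1'] == key, 'col2']))

result = pd.DataFrame(list(zip(col1_vals, col2_vals)),
                      columns=['col1', 'col2'])
```

This works, but every new column means another hand-written list.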

Any better way of doing this would be appreciated!

Thanks in advance!

  • For millions of rows, do you really want to put lists in a dataframe? The efficiency gain of reducing rows can easily be lost by all those pointers in lists. In addition, you lose the ability to do vectorised calculations. Commented Apr 19, 2018 at 12:48
  • Interesting, thanks! Unfortunately I'm stuck with it this way because of restrictions we have with a development provider we are using. Commented Apr 19, 2018 at 12:53

3 Answers


Is this what you are looking for, when you only need one column:

df.groupby('col1')['col2'].apply(lambda x: list(x)).reset_index()
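As a quick check, on the sample data from the question this collapses the five duplicate rows into one (`apply(list)` works equally well here):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1],
                   'col2': [23, 47, 58, 9, 4]})

# each group's col2 values are gathered into a single list
out = df.groupby('col1')['col2'].apply(lambda x: list(x)).reset_index()
# out.loc[0, 'col2'] == [23, 47, 58, 9, 4]
```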

And to do it for all of the other columns at once, use agg:

df.groupby('col1').agg(lambda x: list(x)).reset_index()

With agg you can also specify which columns to use:

df.groupby('col1')[['col2', 'col3']].agg(lambda x: list(x)).reset_index()
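If the set of columns isn't known up front, the same idea can be wrapped in a small helper (a sketch; `key_field` and `dedup_fields` are stand-ins for the attributes in the question's code):

```python
import pandas as pd

def collapse_duplicates(df, key_field, dedup_fields):
    # group on the key and collect each listed column's values into a list
    return df.groupby(key_field)[list(dedup_fields)].agg(list).reset_index()

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [23, 47, 58],
                   'col3': ['a', 'b', 'c']})

result = collapse_duplicates(df, 'col1', ['col2', 'col3'])
```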

4 Comments

Wow, that's so weird. When I put a list in the place of "col2", I get the actual column names as the values after the "col1" field... so when I use that, it works perfectly for ONE column. I'm after doing the same for an arbitrary number of columns.
Yes, apply is meant to work for your example and for multiple columns use agg.
Thanks - this is brilliant.
With apply you could just use list... df.groupby('col1').col2.apply(list)

You can try the following:

df.groupby('col1').agg(lambda x: list(x))

2 Comments

Thanks, but I get an error: Data must be 1 dimensional
Probably because you already have lists in col2, you should use list instead then. I will update my answer.

For multiple columns it should look like this instead to avoid errors:

df.groupby('col1')[['col2','col3']].agg(lambda x: list(x)).reset_index()
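For example, with a second value column added to the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [23, 47, 58],
                   'col3': [9, 4, 5]})

# one row per key; each column's duplicate values end up in a list
out = df.groupby('col1')[['col2', 'col3']].agg(lambda x: list(x)).reset_index()
```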
