
In Python pandas I have a large data frame that looks like this:

import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})


     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2

It contains a dozen columns with duplicated values (a, b, c, ...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values, i.e.:

     a     b     c  d   e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2

We can safely assume that unique values appear only in 'd' and 'e', while the rest are always duplicated.

One solution I can conceive is to group by all the duplicated columns and then apply a concatenation operation on the unique values:

df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))

One inconvenience is that I have to list all the duplicated columns if I want to keep them in my output. A bigger problem is that I am concatenating only the strings in 'd', while 'e' is needed as well.

Any suggestions?

2 Comments
  • Just an aside you don't need to pass the columns as df columns the following would suffice: df.groupby(['a', 'b', 'c']).apply(lambda x: "{%s}" % ', '.join(x.d)) Commented Nov 27, 2014 at 15:24
  • Thanks! The reason I used df-notation was because originally these names are really long, so I take advantage of autocompletion mechanism. To spare others the details I replaced them with letters. Commented Nov 27, 2014 at 21:28

2 Answers


I think you can do something like this:

>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2

Another way to do this, which avoids listing all the grouping columns, is to list only the ones with unique values:

>>> gr_columns = [x for x in df.columns if x not in ['d','e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
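Note that the grouped result above carries the keys in a MultiIndex; a small sketch showing that chaining `.reset_index()` flattens them back into ordinary columns, matching the flat layout asked for in the question:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

gr_columns = [x for x in df.columns if x not in ['d', 'e']]

# reset_index() turns the group keys (a, b, c) back into regular columns
out = df.groupby(gr_columns).agg(lambda col: ','.join(col)).reset_index()
print(out)
```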

2 Comments

Note that lambda col: ','.join(col) can be written as just ','.join. But on second thought, I'm not sure that will work, because I'm not sure how strict the OP wants to be with "unique". (I.e. we might have to get rid of the two ws if a group produced 'w,w,y' instead.)
Thanks, worked great! All values within a group are unique, so plain ','.join works as well. Thanks for the suggestion!
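If duplicates could occur within a group (the concern raised in the first comment), one hedged variation is to deduplicate inside the aggregation. This sketch uses dict.fromkeys, which drops repeats while preserving first-seen order (the toy frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x'],
                   'd': ['w', 'w', 'y']})  # 'w' repeats within the group

# dict.fromkeys removes duplicates but keeps first-seen order,
# unlike set(), which would scramble the order
out = df.groupby('a').agg(lambda col: ','.join(dict.fromkeys(col)))
print(out.loc['x', 'd'])  # 'w,y' rather than 'w,w,y'
```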

You could use df.pivot_table(), although it appears to be slightly slower than df.groupby() (as suggested in Roman's answer):

>>> %timeit df.pivot_table(index=['a','b','c'], values=['c','d','e'], aggfunc=lambda x: ','.join(x)).reset_index()
6.17 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs

>>> %timeit df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
4.09 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Also, if you want the new columns to contain an ACTUAL list (and not a comma-separated list-as-string), you can replace the lambda function ','.join(x) with list(x). And if you want the list to only include unique elements, you can change the lambda function to list(set(x)).
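A small sketch of that list-building variant, using agg(list) on a cut-down frame (the frame itself is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y']})

# Collect each group's values into actual Python lists
# instead of comma-joined strings
lists = df.groupby('a').agg(list)
print(lists.loc['bar', 'd'])  # ['w', 'r', 'y']
```

Swapping list for lambda x: list(set(x)) keeps only unique elements, at the cost of the original ordering.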

1 Comment

Roman's "correct answer" still stands, but nowadays I'd indeed use this approach.
