
In Python pandas I have a large data frame that looks like this:

import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})


     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2

It contains a dozen columns with duplicated values (a, b, c, ...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values, i.e.:

     a     b     c  d   e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2

We can safely assume that unique values appear only in 'd' and 'e', while the rest are always duplicated.

One solution I can conceive is to group by all the duplicated columns and then apply a concatenation operation on the unique values:

df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))

One inconvenience is that I have to list all the duplicated columns if I want to keep them in my output. A bigger problem is that I am concatenating only the strings in 'd', while 'e' is needed as well.

Any suggestions?

2 Comments
  • Just an aside you don't need to pass the columns as df columns the following would suffice: df.groupby(['a', 'b', 'c']).apply(lambda x: "{%s}" % ', '.join(x.d)) Commented Nov 27, 2014 at 15:24
  • Thanks! The reason I used df-notation was because originally these names are really long, so I take advantage of autocompletion mechanism. To spare others the details I replaced them with letters. Commented Nov 27, 2014 at 21:28

2 Answers


I think you can do something like this:

>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2

Another way to do this, which avoids listing all the grouping columns, is to list only the ones with unique values:

>>> gr_columns = [x for x in df.columns if x not in ['d','e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
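Note that the grouped result above carries the keys in a MultiIndex; a small sketch showing that chaining `.reset_index()` flattens them back into ordinary columns, matching the flat layout asked for in the question:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

gr_columns = [x for x in df.columns if x not in ['d', 'e']]

# reset_index() turns the group keys (a, b, c) back into regular columns
out = df.groupby(gr_columns).agg(lambda col: ','.join(col)).reset_index()
print(out)
```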

2 Comments

Note that lambda col: ','.join(col) can be written as just ','.join. But on second thought, I'm not sure that will work, because I'm not sure how strict the OP wants to be with "unique". (I.e. we might have to get rid of the two ws if a group produced 'w,w,y' instead.)
Thanks, worked great! All values within a group are unique, so plain ','.join works as well. Thanks for the suggestion!
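If duplicates could occur within a group (the concern raised in the first comment), one hedged variation is to deduplicate inside the aggregation. This sketch uses dict.fromkeys, which drops repeats while preserving first-seen order (the toy frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x'],
                   'd': ['w', 'w', 'y']})  # 'w' repeats within the group

# dict.fromkeys removes duplicates but keeps first-seen order,
# unlike set(), which would scramble the order
out = df.groupby('a').agg(lambda col: ','.join(dict.fromkeys(col)))
print(out.loc['x', 'd'])  # 'w,y' rather than 'w,w,y'
```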

You could use df.pivot_table(), although it appears to be slightly slower than df.groupby() (as suggested in Roman's answer):

>>> %timeit df.pivot_table(index=['a','b','c'], values=['c','d','e'], aggfunc=lambda x: ','.join(x)).reset_index()
6.17 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs

>>> %timeit df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
4.09 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Also, if you want the new columns to contain an ACTUAL list (and not a comma-separated list-as-string), you can replace the lambda function ','.join(x) with list(x). And if you want the list to only include unique elements, you can change the lambda function to list(set(x)).
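A small sketch of that list-building variant, using agg(list) on a cut-down frame (the frame itself is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y']})

# Collect each group's values into actual Python lists
# instead of comma-joined strings
lists = df.groupby('a').agg(list)
print(lists.loc['bar', 'd'])  # ['w', 'r', 'y']
```

Swapping list for lambda x: list(set(x)) keeps only unique elements, at the cost of the original ordering.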

1 Comment

Roman's "correct answer" still stands, but nowadays I'd indeed use this approach.
