Python DataFrame group by column while making new columns

Question

I have a DataFrame in pandas that looks like

memberid	ticketid	agegrp	c1	c2	c3	d1	date
1	1	1	A	A		AA	2019-01-01
1	1	1	A	A		AB	2019-01-02
1	1	1	A	A	C	AC	2019-01-01
1	2	1	A	D	C	AA	2019-02-01
1	2	1	A	D	C	AC	2019-02-01
2	3	4	C	A	C	CA	2019-03-01
2	3	4	C	A	C	CD	2019-03-01
2	3	4	C	A	C	BB	2019-03-01
2	3	4	C	A	C	AA	2019-03-02

df = pd.DataFrame( {
   'memberiD': [1,1,1,1,1,2,2,2,2],
   'ticketid': [1,1,1,2,2,3,3,3,3],
   'agegrp': [1,1,1,1,4,4,4,4],
   'c1': ['a','a','a','a','a','c','c','c','c'],
   'c2': ['a','a','a','d','d','a','a','a','a'],
   'c3': ['','','c','c','c','c','c','c','c'],
   'd1': ['aa','ab','ac','aa','ac','ca','cd','bb','aa']  
    } );

I want to group by ticketid so that one ticket id gets represented on exactly one row. For each ticketid, memberid and agegrp should be exactly the same. For the c1,c2,c3 in a ticketid, just pull the most-frequent distinct 3 that appear - in case of a tie, any of the top 3 is fine. For all d1 in any one ticketid, take the most-frequent distinct 3 that appear, and insert them into columns d1,d2,d3 - similar to the c1,c2,c3 columns: if there is a tie, any of the top 3 is fine. For date, just select the earliest date that appears for any ticketid.

So, a resulting dataframe could be:

memberid	ticketid	agegrp	c1	c2	c3	d1	d2	d3	date
1	1	1	A	C		AA	AB	AC	2019-01-01
1	2	1	A	D	C	AA	AC		2019-02-01
1	3	4	C	A		CA	CD	BB	2019-03-01

I tried looking at indexing on ticketid, but I'm not exactly sure how to make the new columns with that indexing...although I'm not sure if this approach is correct in general.

Nk03 · Accepted Answer · 2021-04-28 03:13:38Z

1

Do you want this?

from statistics import mode
from collections import Counter

final_df =df.groupby('ticketid', as_index=False).agg({'memberid': mode,'c1':mode, 'c2': mode, 'c3': mode,'date': min,'d1': list})
final_df['d1']  = final_df.d1.apply(lambda x: ','.join(list(Counter(x))[:3]) if len(x) >= 3 else ','.join(x))
final_df[['d1','d2','d3']] = final_df['d1'].str.split(',', expand=True)

Output -

   ticketid  memberid c1 c2   c3        date  d1  d2    d3
0         1         1  A  A  NaN  2019-01-01  AA  AB    AC
1         2         1  A  D    C  2019-02-01  AA  AC  None
2         3         2  C  A    C  2019-03-01  CA  CD    BB

edited Apr 28, 2021 at 3:13

answered Apr 27, 2021 at 21:05

Nk03

15k2 gold badges11 silver badges24 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Python DataFrame group by column while making new columns

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related