1

I have a dataframe shown below

     pid          date      label   age
0   0001    2001-09-24      False   34
1   0001    2006-03-02      True    39
2   0001    2006-03-02      True    39
3   0002    2003-02-07      True    23
4   0002    2004-08-02      True    24
5   0002    2004-08-02      False   24
6   0003    2001-05-25      False   25
7   0003    2001-05-25      False   25

Which I want to aggregate based on both pid and date, where if multiple rows have the same pid and data then age stays the same and label = True if any or all of these rows have True label:

     pid          date      label   age
0   0001    2001-09-24      False   34
1   0001    2006-03-02      True    39
3   0002    2003-02-07      True    23
4   0002    2004-08-02      True    24
5   0003    2001-05-25      False   25

I have tried groupby but all I could do is with one column - pid and no success with two columns.

1 Answer 1

1

In your case sort_values + drop_duplicates

out = df.sort_values('label').drop_duplicates(['pid','date','age'],keep='last').sort_index()
Out[240]: 
   pid        date  label  age
0    1  2001-09-24  False   34
2    1  2006-03-02   True   39
3    2  2003-02-07   True   23
4    2  2004-08-02   True   24
7    3  2001-05-25  False   25
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.