3

I am a Python/pandas user and have a question. I have an Excel file like the one below.

   C1  C2  C3  C4     C5     C6  ID  Value
0  aa  ee  ii  mm  aaaaa   bbbb   1    100
1  bb  ff  jj  nn   cccc  ddddd   2     50
2  aa  ee  ii  mm   eeee   ffff   3     20
3  dd  hh  ll  pp   gggg   hhhh   4     10
4  aa  ee  ii  mm   abcd   efgh   5      5
5  bb  ff  jj  nn  aaaaa   bbbb   6      2

Code to reproduce:

import pandas as pd

df = pd.DataFrame({'Value': [100,50,20,10,5,2],
                   'ID': [1,2,3,4,5,6],
                   'C1': ['aa','bb','aa','dd','aa','bb'],
                   'C2': ['ee','ff','ee','hh','ee','ff'],
                   'C3': ['ii','jj','ii','ll','ii','jj'],
                   'C4': ['mm','nn','mm','pp','mm','nn'],
                   'C5': ['aaaaa','cccc','eeee','gggg','abcd','aaaaa'],
                   'C6': ['bbbb','ddddd','ffff','hhhh','efgh','bbbb']})

Some rows are duplicates in columns C1 to C4 (e.g. ID 1, ID 3 and ID 5 are duplicates of each other, and so are ID 2 and ID 6). Is there any way to combine duplicate rows? (I am only looking at columns C1 to C4; I do not care whether C5 and C6 match.)

I want to sum the "Value" of the duplicate rows and keep the other columns (ID, C5, C6) from the top row of each group. For example, here is the output I want to produce.

    Value   ID  C1  C2  C3  C4  C5      C6
0   125     1   aa  ee  ii  mm  aaaaa   bbbb
1   52      2   bb  ff  jj  nn  cccc    ddddd
2   10      4   dd  hh  ll  pp  gggg    hhhh

If you could give me your opinion, I would be grateful for that very much.

2 Answers

3

There may be a more efficient way, but one approach is to:

  • Create new_df by keeping only the first occurrence of each unique combination of Column1 to Column4.

  • Then group the original df by the same columns, sum Value, and write the sums into new_df.

You can try it as shown below:

keys = ['Column1', 'Column2', 'Column3', 'Column4']
# keep the first row of each group; drop=True avoids an extra index column
new_df = df.drop_duplicates(subset=keys, keep='first').reset_index(drop=True)
# sort=False keeps groups in first-occurrence order, matching new_df's rows
new_df['Value'] = df.groupby(keys, as_index=False, sort=False)['Value'].sum()['Value']
print(new_df)

Result:

   ID  Value Column1 Column2 Column3 Column4 Column5 Column6
0   1    125      aa      ee      ii      mm   aaaaa    bbbb
1   2     52      bb      ff      jj      nn    cccc   ddddd
2   4     10      dd      hh      ll      pp    gggg    hhhh

Update:

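Checking with the dataframe from the edited question:

keys = ['C1', 'C2', 'C3', 'C4']
# keep the first row of each group; drop=True avoids an extra index column
new_df = df.drop_duplicates(subset=keys, keep='first').reset_index(drop=True)
# sort=False keeps groups in first-occurrence order, matching new_df's rows
new_df['Value'] = df.groupby(keys, as_index=False, sort=False)['Value'].sum()['Value']
print(new_df)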

Result:

   C1  C2  C3  C4     C5     C6  ID  Value
0  aa  ee  ii  mm  aaaaa   bbbb   1    125
1  bb  ff  jj  nn   cccc  ddddd   2     52
2  dd  hh  ll  pp   gggg   hhhh   4     10
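
One caveat with writing the sums back by position: it only works if the grouped sums come back in the same order as the rows kept by drop_duplicates. groupby sorts the group keys by default, so the two orders agree only when the keys happen to first appear in sorted order (as they do in the question's data). A minimal sketch of the difference, using a made-up frame whose column names `key` and `val` are purely illustrative:

```python
import pandas as pd

# Hypothetical data: keys first appear in non-sorted order ('b' before 'a').
df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'],
                   'val': [1, 10, 2, 20]})

# Rows kept by drop_duplicates follow first-appearance order: b, a.
first_rows = df.drop_duplicates(subset='key', keep='first').reset_index(drop=True)

# Default groupby sorts the keys (a, b), so these sums do not line up with first_rows.
sorted_sums = df.groupby('key')['val'].sum().to_numpy()

# sort=False keeps first-appearance order (b, a), which matches first_rows.
ordered_sums = df.groupby('key', sort=False)['val'].sum().to_numpy()

first_rows['val'] = ordered_sums
print(first_rows)
```

Passing sort=False to groupby is what keeps the positional assignment safe for arbitrary key order.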

1 Comment

Thank you so much!
1

You can use groupby.agg. I assume you wish to sum value and take the first id for each group, as in your desired output. Here's a minimal example:

import pandas as pd

df = pd.DataFrame([[100, 1, 'a', 'b'], [20, 2, 'a', 'b'],
                   [15, 3, 'c', 'd'], [5, 4, 'a', 'b'],
                   [25, 5, 'c', 'd']], columns=['value', 'id', 'col1', 'col2'])

# sum value per (col1, col2) group and keep the first id of each group
res = df.groupby(['col1', 'col2']).agg({'id': 'first', 'value': 'sum'}).reset_index()

print(res)

  col1 col2  id  value
0    a    b   1    125
1    c    d   3     40

3 Comments

Thanks, jpp. However, I would like to keep the ID, column5 and column6 of the top row in the original file.
@Tom_Hanks, Then just add to your dictionary, e.g. 'col5': 'first', etc, if you wish to keep (any number of) other columns.
@jpp sure. Sometimes OPs are beginners and might have trouble generalizing solutions, but I think this is pretty straight forward ;}
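
For readers who want the suggestion from the comments spelled out, here is a minimal sketch applied to the data from the question. It assumes the goal is to sum Value and keep ID, C5 and C6 from the first row of each group; the `keys` variable and the final column reordering are illustrative additions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Value': [100, 50, 20, 10, 5, 2],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'C1': ['aa', 'bb', 'aa', 'dd', 'aa', 'bb'],
                   'C2': ['ee', 'ff', 'ee', 'hh', 'ee', 'ff'],
                   'C3': ['ii', 'jj', 'ii', 'll', 'ii', 'jj'],
                   'C4': ['mm', 'nn', 'mm', 'pp', 'mm', 'nn'],
                   'C5': ['aaaaa', 'cccc', 'eeee', 'gggg', 'abcd', 'aaaaa'],
                   'C6': ['bbbb', 'ddddd', 'ffff', 'hhhh', 'efgh', 'bbbb']})

# Columns that define a duplicate.
keys = ['C1', 'C2', 'C3', 'C4']

# Sum Value per group; take ID, C5 and C6 from the first row of each group.
# sort=False keeps the groups in order of first appearance.
res = (df.groupby(keys, sort=False, as_index=False)
         .agg({'Value': 'sum', 'ID': 'first', 'C5': 'first', 'C6': 'first'}))

# Put the columns back in the order used in the question's desired output.
res = res[['Value', 'ID'] + keys + ['C5', 'C6']]
print(res)
```

Any further columns can be carried along the same way by adding them to the dictionary with 'first'.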
