python/dataframe - merge duplicated rows

Question

I have a dataframe like this:

id	year	data_1	data_2
A	2019	nan	11
A	2019	abc	11
A	2020	nan	22
B	2019	345	nan
B	2019	nan	456
B	2020	234	33

I want to identify duplicated rows based on some columns ("id" and "year" in this case) and merge the remaining columns, i.e. for each column of a given id and year, keep the non-NaN value:

id	year	data_1	data_2
A	2019	abc	11
A	2020	nan	22
B	2019	345	456
B	2020	234	33

I can find all duplicated rows (which is easy) but can't think of how to "merge" by replacing np.nan with values.

@timgeb Ha, sorry. If you mean for each id, year, and column, then yes, there is always at most one non-NaN value. Actually, there will only ever be two duplicated rows, so there can't be more than one non-NaN value for each column.
– Grumpy Civet, Commented Aug 23, 2021 at 7:56

mozway · Accepted Answer · 2021-08-23 08:31:27Z

Something that will work in this particular case is taking the max per group:

df.groupby(['id', 'year'], as_index=False).max()

output (for an all-numeric version of the example, with 123 in place of abc):

  id  year  data_1  data_2
0  A  2019   123.0    11.0
1  A  2020     NaN    22.0
2  B  2019   345.0   456.0
3  B  2020   234.0    33.0

However, this might not work if you have duplicates without NaNs; in that case, please provide an updated example and the rules for merging.
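For anyone wanting to reproduce this, here is a minimal self-contained sketch of the max-per-group approach. It assumes the all-numeric version of the example (123 instead of abc); the `conflict` frame is a made-up illustration of the caveat and is not from the question.

```python
import numpy as np
import pandas as pd

# All-numeric variant of the question's data (123 instead of 'abc').
df = pd.DataFrame({
    "id":     ["A", "A", "A", "B", "B", "B"],
    "year":   [2019, 2019, 2020, 2019, 2019, 2020],
    "data_1": [np.nan, 123, np.nan, 345, np.nan, 234],
    "data_2": [11, 11, 22, np.nan, 456, 33],
})

# max() skips NaN, so each group keeps its single non-NaN value per column;
# a group that is NaN throughout stays NaN.
merged = df.groupby(["id", "year"], as_index=False).max()
print(merged)

# Caveat (hypothetical data): if both duplicated rows hold a value,
# max() silently keeps the larger one instead of flagging a conflict.
conflict = pd.DataFrame({"id": ["A", "A"], "year": [2019, 2019], "data_1": [1.0, 5.0]})
picked = conflict.groupby(["id", "year"], as_index=False).max()
print(picked)
```

So max only acts as a "merge" under the asker's guarantee of at most one non-NaN value per id, year, and column.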

Here is a quick fix of the above for mixed types: convert to string, do the merge, then convert back to float. However, mixed types in a single column are not really good practice.

(df.fillna('').astype(str)
   .groupby(['id', 'year'], as_index=False).max()
   .astype(float, errors='ignore')
   .replace('', float('nan'))
)
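A possible alternative to the string round-trip, not part of the original answer: pandas' `GroupBy.first()` returns the first non-null value of each column within a group, so it should handle a mixed string/number column directly. This is only a sketch, and it relies on the asker's statement that there is at most one non-NaN value per id, year, and column.

```python
import numpy as np
import pandas as pd

# The question's data, with the mixed-type data_1 column ('abc' next to numbers).
df = pd.DataFrame({
    "id":     ["A", "A", "A", "B", "B", "B"],
    "year":   [2019, 2019, 2020, 2019, 2019, 2020],
    "data_1": [np.nan, "abc", np.nan, 345, np.nan, 234],
    "data_2": [11, 11, 22, np.nan, 456, 33],
})

# first() takes the first non-null entry per column in each (id, year) group,
# without comparing values, so mixed types in a column are not a problem.
merged_first = df.groupby(["id", "year"], as_index=False).first()
print(merged_first)
```

Unlike max, this never compares values, so if a group did contain two different non-NaN values it would keep whichever comes first in row order.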

edited Aug 23, 2021 at 8:31

answered Aug 23, 2021 at 8:05

mozway


7 Comments

Grumpy Civet Over a year ago

What if there are non-numerical values like string?

mozway Over a year ago

please provide an example and the expected output

mozway Over a year ago

@GrumpyCivet I provided a fix. Do you really have mixed strings and floats in the same column?

mozway Over a year ago

OK, then fillna with empty string (if not an issue) in the string columns and the first answer will work

mozway Over a year ago

You mean you want to keep the same rows and ffill? If this doesn't help, please open a new question, as this is a different problem.
