Replacing Duplicate Strings in Pandas Dataframe

Question

I have a dataframe df:

    Name        Reagent
0   Experiment1 water
1   Experiment1 oil
2   Experiment1 water
3   Experiment1 milk
4   Experiment1 water
5   Experiment1 tea
6   Experiment1 water
7   Experiment1 coffee
8   Experiment2 water
9   Experiment2 coffee

I want to make duplicate reagent names within the same experiment distinct by appending a differentiator of some sort. In this example, only water is duplicated within a given experiment.

For example:

    Name        Reagent
0   Experiment1 water1
1   Experiment1 oil
2   Experiment1 water2
3   Experiment1 milk
4   Experiment1 water3
5   Experiment1 tea
6   Experiment1 water4
7   Experiment1 coffee
8   Experiment2 water
9   Experiment2 coffee

Thanks for any help

End genocide - save Gaza · Accepted Answer · 2019-04-03 12:37:35Z


Solution: append a counter from GroupBy.cumcount to every value, grouping by both columns. Replacing the counter value 0 with an empty string leaves the first occurrence in each group unchanged:

df['Reagent'] += df.groupby(['Name','Reagent']).cumcount().astype(str).replace('0','')
print (df)
          Name Reagent
0  Experiment1   water
1  Experiment1     oil
2  Experiment1  water1
3  Experiment1    milk
4  Experiment1  water2
5  Experiment1     tea
6  Experiment1  water3
7  Experiment1  coffee
8  Experiment2   water
9  Experiment2  coffee
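
To unpack what the one-liner does, here is a step-by-step sketch on the question's data. The intermediate names `counter` and `suffix` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "Name": ["Experiment1"] * 8 + ["Experiment2"] * 2,
    "Reagent": ["water", "oil", "water", "milk", "water",
                "tea", "water", "coffee", "water", "coffee"],
})

# Number each row within its (Name, Reagent) group: 0, 1, 2, ...
counter = df.groupby(["Name", "Reagent"]).cumcount()

# Convert to strings and blank out the zeros, so that the first
# occurrence in each group receives no suffix.
suffix = counter.astype(str).replace("0", "")

# Element-wise string concatenation, aligned on the index.
df["Reagent"] = df["Reagent"] + suffix
print(df)
```

Because `Series.replace("0", "")` matches whole values rather than substrings, a counter such as `"10"` should be left intact.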

If you need to number every duplicate, including the first occurrence, filter the duplicated rows with DataFrame.duplicated on both columns (keep=False) and add 1 to the counter:

mask = df.duplicated(['Name','Reagent'], keep=False)
df.loc[mask, 'Reagent'] += df[mask].groupby(['Name','Reagent']).cumcount().add(1).astype(str)
print (df)
          Name Reagent
0  Experiment1  water1
1  Experiment1     oil
2  Experiment1  water2
3  Experiment1    milk
4  Experiment1  water3
5  Experiment1     tea
6  Experiment1  water4
7  Experiment1  coffee
8  Experiment2   water
9  Experiment2  coffee
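
The masked approach can also be wrapped in a small helper. This is a minimal sketch: the function name `number_duplicates` and its parameters (`group_cols`, `target`, `sep`) are hypothetical and not part of pandas or the original answer.

```python
import pandas as pd

def number_duplicates(df, group_cols, target, sep=""):
    """Return a copy in which every value of `target` that repeats within
    `group_cols` gets a 1-based suffix; unique values are left alone."""
    out = df.copy()
    keys = list(group_cols) + [target]
    # keep=False marks every member of a duplicated group, including the first.
    mask = out.duplicated(keys, keep=False)
    # Count only the masked rows, starting from 1.
    counter = out[mask].groupby(keys).cumcount().add(1).astype(str)
    # Both sides share the masked rows' index, so the pieces line up row by row.
    out.loc[mask, target] = out.loc[mask, target] + sep + counter
    return out

df = pd.DataFrame({
    "Name": ["Experiment1"] * 8 + ["Experiment2"] * 2,
    "Reagent": ["water", "oil", "water", "milk", "water",
                "tea", "water", "coffee", "water", "coffee"],
})

result = number_duplicates(df, ["Name"], "Reagent")
hyphenated = number_duplicates(df, ["Name"], "Reagent", sep="-")
print(result)
```

Returning a copy instead of modifying `df` in place is a design choice of this sketch; it makes the helper safe to call repeatedly with different separators.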

edited Apr 3, 2019 at 12:37


answered Apr 3, 2019 at 12:11

jezrael


4 Comments

user11305439 Over a year ago

Oh wow, that was quick. Could you please give a brief description of what the line is doing? How would I put a hyphen between the name and the number?

jezrael Over a year ago

@ukemi - Thank you.

jezrael Over a year ago

@user11305439 - Sorry, I didn't see the full comment. Use df['Reagent'] += ('-' + df.groupby(['Name','Reagent']).cumcount().astype(str)).replace('-0','')

jezrael Over a year ago

@user11305439 - Or df.loc[mask, 'Reagent'] += '-' + df[mask].groupby(['Name','Reagent']).cumcount().add(1).astype(str)
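
For the hyphenated, zero-based variant, one thing to watch is that prepending `'-'` to a counter that has already been blanked would leave a trailing hyphen on every first occurrence. A sketch that avoids this with `Series.where` (the names `counter` and `suffix` are illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Experiment1"] * 8 + ["Experiment2"] * 2,
    "Reagent": ["water", "oil", "water", "milk", "water",
                "tea", "water", "coffee", "water", "coffee"],
})

# Zero-based position of each row within its (Name, Reagent) group.
counter = df.groupby(["Name", "Reagent"]).cumcount()

# Build "-1", "-2", ... and substitute an empty string wherever the
# counter is 0, so first occurrences get neither hyphen nor number.
suffix = ("-" + counter.astype(str)).where(counter > 0, "")

df["Reagent"] = df["Reagent"] + suffix
print(df)
```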
