Modify duplicated rows in dataframe (Python)

Question

I am working with a dataframe in Pandas and need a way to automatically modify one column that contains duplicate values. The column has dtype 'object', and I want to rename the duplicated values. The dataframe is the following:

      City           Year       Restaurants
0   New York         2001       20
1      Paris         2000       40
2   New York         1999       41
3   Los Angeles      2004       35
4     Madrid         2001       22
5   New York         1998       33
6   Barcelona        2001       15

As you can see, New York is repeated 3 times. I would like to create a new dataframe in which this value is automatically modified, giving the following result:

      City           Year       Restaurants
0   New York 2001    2001       20
1      Paris         2000       40
2   New York 1999    1999       41
3   Los Angeles      2004       35
4     Madrid         2001       22
5   New York 1998    1998       33
6   Barcelona        2001       15

I would also be happy with "New York 1", "New York 2" and "New York 3". Any option would be good.

sophocles · Accepted Answer · 2021-12-28 19:58:12Z

3

Use np.where to modify the City column only where its value is duplicated:

import numpy as np
df['City'] = np.where(df['City'].duplicated(keep=False), df['City'] + ' ' + df['Year'].astype(str), df['City'])
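For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the dataframe from the question and applies the same np.where idea. The intermediate name `is_dup` is only for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "City": ["New York", "Paris", "New York", "Los Angeles",
             "Madrid", "New York", "Barcelona"],
    "Year": [2001, 2000, 1999, 2004, 2001, 1998, 2001],
    "Restaurants": [20, 40, 41, 35, 22, 33, 15],
})

# keep=False flags every occurrence of a repeated city, not just the later ones
is_dup = df["City"].duplicated(keep=False)

# Append the year where the city is repeated; keep the original name otherwise
df["City"] = np.where(is_dup, df["City"] + " " + df["Year"].astype(str), df["City"])

print(df)
```

With the sample data, only the three New York rows change (to "New York 2001", "New York 1999" and "New York 1998"). If the same city and year could appear together more than once, the names would still collide, so this relies on the City/Year pairs being unique.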

answered Dec 28, 2021 at 19:49 by wwnde
edited Dec 28, 2021 at 19:58 by sophocles


sophocles · Accepted Answer · 2021-12-28 20:19:58Z

2

A different approach, without numpy, is groupby.cumcount(). It gives you the alternative numbering (New York 1, New York 2, ...), but it appends a number to every value, not only to the duplicated ones.

df['City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)

            City  Year  Restaurants
0     New York 1  2001           20
1        Paris 1  2000           40
2     New York 2  1999           41
3  Los Angeles 1  2004           35
4       Madrid 1  2001           22
5     New York 3  1998           33
6    Barcelona 1  2001           15
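To see why every row gets a suffix, it may help to look at what cumcount() returns on its own. The sketch below uses only the City column from the question; the names `counter` and `labeled` are just for illustration.

```python
import pandas as pd

# City column copied from the question
df = pd.DataFrame({
    "City": ["New York", "Paris", "New York", "Los Angeles",
             "Madrid", "New York", "Barcelona"],
})

# cumcount() numbers the rows inside each group, starting at 0,
# so a city that appears only once still gets a 0
counter = df.groupby("City").cumcount()
print(counter.tolist())

# add(1) shifts the numbering to start at 1; astype(str) allows string concatenation
labeled = df["City"] + " " + counter.add(1).astype(str)
print(labeled.tolist())
```

The counter here is [0, 0, 1, 0, 0, 2, 0]: unique cities sit at position 0 of their own group, which is why they end up as "Paris 1", "Madrid 1" and so on.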

To add the counter only to the duplicated values, you can use loc:

df.loc[df['City'].duplicated(keep=False), 'City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)

          City  Year  Restaurants
0   New York 1  2001           20
1        Paris  2000           40
2   New York 2  1999           41
3  Los Angeles  2004           35
4       Madrid  2001           22
5   New York 3  1998           33
6    Barcelona  2001           15
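Since the question asks for a new dataframe rather than an in-place change, the same loc idea can be wrapped in a small function that works on a copy. This is only a sketch; `number_duplicates` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def number_duplicates(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df in which repeated values of `column` get a 1-based suffix."""
    out = df.copy()
    # Flag every occurrence of a repeated value
    is_dup = out[column].duplicated(keep=False)
    # 1-based position of each row within its group, as text
    counter = out.groupby(column).cumcount().add(1).astype(str)
    # Rewrite only the flagged rows; unique values are left as they are
    out.loc[is_dup, column] = out.loc[is_dup, column] + " " + counter[is_dup]
    return out

# Sample data copied from the question
df = pd.DataFrame({
    "City": ["New York", "Paris", "New York", "Los Angeles",
             "Madrid", "New York", "Barcelona"],
    "Year": [2001, 2000, 1999, 2004, 2001, 1998, 2001],
    "Restaurants": [20, 40, 41, 35, 22, 33, 15],
})

result = number_duplicates(df, "City")
print(result)
```

Because the function copies the input first, `df` keeps the original city names and `result` holds the renamed ones.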

answered Dec 28, 2021 at 20:04 by sophocles
edited Dec 28, 2021 at 20:19

1 Comment

nokvk Over a year ago

Fantastic, thanks a lot!
