How to modify duplicated rows in Python pandas

Question

Let's say I have a DataFrame (that I sorted by some priority criterion) with a "name" column. A few names are duplicated, and I want to append a simple indicator to the duplicates.

E.g.,

'jones a'
... 
'jones a'    # this should become 'jones a2'

To get the subset of duplicates, I could do

df.loc[df.duplicated(subset=['name'], take_last=True), 'name']

However, I think the apply function does not allow for in-place modification, right? So what I basically ended up doing is:

df.loc[df.duplicated(subset=['name'], take_last=True), 'name'] = \
df.loc[df.duplicated(subset=['name'], take_last=True), 'name'].apply(lambda x: x+'2')

But my feeling is that there might be a better way. Any ideas or tips? I would really appreciate your feedback!

Note that your solution only works if there is a maximum of one duplicate. Also, you should be able to replace everything after the = with df.name.duplicated(take_last=True).apply...
– ari, Commented Jan 6, 2015 at 21:25
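
For readers on a recent pandas release, where `duplicated` takes `keep` in place of `take_last`, the mask-based approach from the question might look roughly like the sketch below. The sample names are hypothetical, and `keep='first'` is an assumption chosen so that the later rows get the suffix, as in the question's example. It also shows the limitation mentioned in the comment: a third occurrence receives the same '2' suffix.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({'name': ['jones a', 'smith b', 'jones a', 'jones a']})

# keep='first' (the default) flags every occurrence after the first one
mask = df.duplicated(subset=['name'], keep='first')

# String concatenation is vectorized, so no apply call is needed
df.loc[mask, 'name'] = df.loc[mask, 'name'] + '2'

print(df['name'].tolist())
```

Here both the second and the third 'jones a' end up as 'jones a2', so this only disambiguates names that occur at most twice.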

BrenBarn · Accepted Answer · 2015-01-06 20:52:58Z

3

Here is one way:

import numpy as np
import pandas

# sample data
d = pandas.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry'],
     'ShoeSize': [8, 9, 10, 12, 14, 11, 10, 12]
    }
)

>>> d.groupby('Name').Name.apply(lambda n: n + (np.arange(len(n))+1).astype(str))
0      bob1
1      bob2
2      bob3
3     bill1
4     fred1
5     fred2
6      joe1
7    larry1

This appends an indicator to all. If you want to append the indicator to only those after the first, you can do it with a little special casing:

>>> d.groupby('Name').Name.apply(lambda n: n + np.concatenate(([''], (np.arange(len(n))+1).astype(str)[1:])))
0      bob
1     bob2
2     bob3
3     bill
4     fred
5    fred2
6      joe
7    larry
dtype: object

If you want to use this to replace the original names, just do d.Name = ..., where ... is the expression shown above.
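
As a possible alternative to the apply-based expression (this is not part of the answer as posted), the same result can be sketched with `groupby(...).cumcount()`, which numbers the rows within each group and keeps the original row order. The variable names `k`, `suffix`, and `new_names` are made up for illustration.

```python
import pandas as pd

d = pd.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry']}
)

# 0 for the first occurrence of each name, 1 for the second, and so on
k = d.groupby('Name').cumcount()

# 1-based position as a string, blanked out for first occurrences
suffix = (k + 1).astype(str).where(k > 0, '')

# Element-wise string concatenation, aligned on the original index
new_names = d['Name'] + suffix
print(new_names.tolist())
```

Assigning `d['Name'] = new_names` would then overwrite the original column.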

You should think about why you're doing this. It is usually better to have this sort of information in a separate column than smashed into a string.
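
A minimal sketch of the separate-column idea, assuming a hypothetical column name `Occurrence` that does not appear in the original post: the names stay untouched and the position within each group is stored next to them.

```python
import pandas as pd

d = pd.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry'],
     'ShoeSize': [8, 9, 10, 12, 14, 11, 10, 12]}
)

# 'Occurrence' is a made-up column name: 1 for the first row of each name,
# 2 for the second, and so on
d['Occurrence'] = d.groupby('Name').cumcount() + 1

# The original names are still intact, so selecting the first row per name
# is a plain boolean filter
first_only = d[d['Occurrence'] == 1]
print(first_only)
```

Keeping the counter separate means the name column can still be used as-is for lookups or joins.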

2 Comments

user2489252 Over a year ago

Thanks, that's a pretty good solution! The problem is that I want to merge and update DataFrames that come from different sources. I thought about taking more "first name letters", but some sources only have 1 first name letter, so that would not be an option ...

cglacet Over a year ago

Thanks, you could also use np.arange(2, len(n) + 1), that seems clearer to me.
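
To illustrate the suggestion in this comment, the sketch below compares the suffix array from the answer with the `np.arange(2, len(n) + 1)` form for a few group sizes. The helper names `suffixes_original` and `suffixes_suggested` are hypothetical and only exist for this comparison.

```python
import numpy as np

def suffixes_original(size):
    # Form used in the answer: build 1..size, then drop the first label
    return np.concatenate(([''], (np.arange(size) + 1).astype(str)[1:]))

def suffixes_suggested(size):
    # Form from the comment: start counting at 2 directly
    return np.concatenate(([''], np.arange(2, size + 1).astype(str)))

# Both forms should agree for any group size of at least 1
for size in (1, 2, 3, 5):
    assert list(suffixes_original(size)) == list(suffixes_suggested(size))

print(list(suffixes_suggested(3)))
```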
