Let's say I have a DataFrame (sorted by some priority criterion) with a "name" column. A few names are duplicated, and I want to append a simple indicator to the duplicates.

E.g.,

'jones a'
... 
'jones a'    # this should become 'jones a2'

To get the subset of duplicates, I could do

# note: take_last=True was later replaced by keep='last' in pandas
df.loc[df.duplicated(subset=['name'], keep='last'), 'name']

However, I think the apply function does not allow for in-place modification, right? So what I basically ended up doing is:

df.loc[df.duplicated(subset=['name'], keep='last'), 'name'] = \
df.loc[df.duplicated(subset=['name'], keep='last'), 'name'].apply(lambda x: x+'2')

But my feeling is that there might be a better way. Any ideas or tips? I would really appreciate your feedback!

  • Note that your solution only works if there is a maximum of one duplicate. Also, you should be able to replace everything after the = with df.name.duplicated(take_last=True).apply... Commented Jan 6, 2015 at 21:25

1 Answer

Here is one way:

import numpy as np
import pandas

# sample data
d = pandas.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry'],
     'ShoeSize': [8, 9, 10, 12, 14, 11, 10, 12]
    }
)

>>> d.groupby('Name').Name.apply(lambda n: n + (np.arange(len(n))+1).astype(str))
0      bob1
1      bob2
2      bob3
3     bill1
4     fred1
5     fred2
6      joe1
7    larry1

This appends an indicator to every name. If you want to append the indicator only to the occurrences after the first, you can do it with a little special casing:

>>> d.groupby('Name').Name.apply(lambda n: n + np.concatenate(([''], (np.arange(len(n))+1).astype(str)[1:])))
0      bob
1     bob2
2     bob3
3     bill
4     fred
5    fred2
6      joe
7    larry
dtype: object

If you want to use this to replace the original names just do d.Name = ... where ... is the expression shown above.
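
A hedged aside on the assignment step: depending on the pandas version, the result of groupby(...).apply may come back indexed by group key rather than by the original row labels, in which case assigning it straight back to d.Name can misalign. The sketch below is an alternative, not the approach shown above. It assumes a pandas version that provides GroupBy.cumcount, and the names counts and suffix are illustrative only.

```python
import pandas as pd

# Same sample data as in the answer
d = pd.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry'],
     'ShoeSize': [8, 9, 10, 12, 14, 11, 10, 12]}
)

# 1-based position of each row within its group of identical names;
# the result keeps the original row index, so it aligns with d directly
counts = d.groupby('Name').cumcount() + 1

# Empty suffix for the first occurrence, the position number for later ones
suffix = counts.astype(str).where(counts > 1, '')

# Replace the original names in place
d['Name'] = d['Name'] + suffix
# d['Name'] is now: bob, bob2, bob3, bill, fred, fred2, joe, larry
```

Because cumcount numbers rows in their existing order, the DataFrame should already be sorted by whatever priority decides which occurrence counts as the first.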

You should think about why you're doing this. It is usually better to have this sort of information in a separate column than smashed into a string.
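
A minimal sketch of the separate-column idea, assuming GroupBy.cumcount is available in your pandas version. The column name Occurrence is made up for illustration.

```python
import pandas as pd

# Same sample data as above
d = pd.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry'],
     'ShoeSize': [8, 9, 10, 12, 14, 11, 10, 12]}
)

# Keep the name untouched and store the within-name position separately
# ('Occurrence' is a hypothetical column name)
d['Occurrence'] = d.groupby('Name').cumcount() + 1
# d['Occurrence'] is now: 1, 2, 3, 1, 1, 2, 1, 1
```

The pair (Name, Occurrence) then identifies each row uniquely, and the original names remain available for matching against other sources.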


2 Comments

Thanks, that's a pretty good solution! The problem is that I want to merge and update DataFrames that come from different sources. I thought about taking more "first name letters", but some sources only have 1 first name letter, so that would not be an option ...
Thanks, you could also use np.arange(2, len(n) + 1), that seems clearer to me.
