Python regex removing duplicated names

Question

I have a Series of names. If a name is repeated within an entry, I'd like to keep only one copy.

John Smith
David BrownDavid Brown

I'd like to have output

John Smith
David Brown

I found ways to use '\b(\w+)( \1\b)+' to catch repeated words separated by whitespace and keep a single copy with r'\1'. However, in my case there is no whitespace between the repeated names. Does that mean I need to compare the strings character by character to find duplicates? Is there a simpler way?

You could try normalizing the name (removing whitespace and making it all upper or lower case) and then checking whether name[:len(name)//2] == name[len(name)//2:]. Or could there be extra junk characters? The regex should work too if you normalize first: (\b(\w+)(\1\b)+)
– Nerdrigo, Commented Jun 13, 2022 at 4:01
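
The halving check suggested in the comment could look roughly like the sketch below. The helper name `dedupe_exact_double` is hypothetical, and the sketch assumes the name is written exactly twice back to back, with no separator or stray characters. It compares the raw halves instead of a normalized copy so that the original spacing and capitalization are kept.

```python
def dedupe_exact_double(name: str) -> str:
    """Return one copy if `name` is the same string written twice in a row."""
    half, remainder = divmod(len(name), 2)
    # Only an even-length, non-empty string can be an exact doubling.
    if remainder == 0 and half > 0 and name[:half] == name[half:]:
        return name[:half]
    return name


names = ["John Smith", "David BrownDavid Brown"]
print([dedupe_exact_double(n) for n in names])
# ['John Smith', 'David Brown']
```

This only handles a name repeated exactly twice; three or more repeats, or extra characters around the name, would need a different check.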

Hao Wu · Accepted Answer · 2022-06-13 04:23:27Z

2

You can use non-greedy quantifiers (+?) to match each word lazily, and \1* to optionally consume any duplicates that follow:

\b(\w+? \w+?)\1*\b

Check the test cases

You can also add an optional third word to support middle names:

\b(\w+? \w+?(?: \w+?)?)\1*\b
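
As a rough illustration, both patterns from this answer can be applied with `re.sub`, replacing the whole match with the first captured group. The names `PAIR`, `PAIR_WITH_MIDDLE`, and `collapse` are made up for this sketch, and the "Mary Ann Lee" input is an invented example of a middle name.

```python
import re

# "first last", matched lazily, followed by zero or more repeats of itself.
PAIR = re.compile(r"\b(\w+? \w+?)\1*\b")
# Same idea with an optional third word for a middle name.
PAIR_WITH_MIDDLE = re.compile(r"\b(\w+? \w+?(?: \w+?)?)\1*\b")


def collapse(pattern: re.Pattern, text: str) -> str:
    """Replace each match (name plus its repeats) with a single copy."""
    return pattern.sub(r"\1", text)


print(collapse(PAIR, "John Smith"))                          # John Smith
print(collapse(PAIR, "David BrownDavid Brown"))              # David Brown
print(collapse(PAIR_WITH_MIDDLE, "Mary Ann LeeMary Ann Lee"))  # Mary Ann Lee
```

A name that is not repeated still matches the pattern, but it is replaced by itself, so it comes out unchanged.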

edited Jun 13, 2022 at 4:23

answered Jun 13, 2022 at 4:16

Hao Wu


Wiktor Stribiżew · Accepted Answer · 2022-06-13 09:58:49Z

0

You can use

\b(.+?)\1\b

See the regex demo. Details:

\b - a word boundary
(.+?) - Group 1: one or more characters other than line break characters, as few as possible
\1 - Same value as in Group 1
\b - a word boundary
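
To see how these pieces behave outside Pandas, here is a small sketch using the standard `re` module. The names `DOUBLED` and `collapse_doubled` are hypothetical, and the "bonbon" input is an invented example showing that the pattern collapses any doubled run sitting between word boundaries, not only names.

```python
import re

# Group 1 grows lazily until the text right after it is an exact copy of it
# and that copy ends at a word boundary.
DOUBLED = re.compile(r"\b(.+?)\1\b")


def collapse_doubled(text: str) -> str:
    """Replace a run written twice in a row with a single copy."""
    return DOUBLED.sub(r"\1", text)


print(collapse_doubled("David BrownDavid Brown"))  # David Brown
print(collapse_doubled("John Smith"))              # John Smith
print(collapse_doubled("bonbon"))                  # bon
```

The match is case-sensitive, so a capitalized name is only collapsed when the second copy repeats the capital letter as well.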

In Pandas, you can use

df['column_name'] = df['column_name'].str.replace(r'\b(.+?)\1\b', r'\1', regex=True)

answered Jun 13, 2022 at 9:58

Wiktor Stribiżew
