I have a Series of names. If a name is repeated within a value, I'd like to keep only one copy.

John Smith
David BrownDavid Brown

I'd like to have output

John Smith
David Brown

I found approaches that use '\b(\w+)( \1\b)+' to match repeated words separated by whitespace and keep a single copy with r'\1'. However, in my case there is no whitespace between the repeats. Does that mean I need to compare the strings character by character to find duplicates? Is there a simpler way?

  • You could try normalizing the name (removing whitespace and converting it to all upper or lower case) and then checking whether name[:len(name)//2] == name[len(name)//2:]. Or could there be extra junk characters? The regex should work too if you normalize first (\b(\w+)(\1\b)+). Commented Jun 13, 2022 at 4:01
  • For the second line, the regex is `(\w+ .+?)\1+`. Commented Jun 13, 2022 at 4:16
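
The half-comparison idea from the first comment can be sketched in plain Python as follows. This is only an illustration, not the commenter's code: the helper name collapse_doubled is hypothetical, it uses integer division so the slice index is an int, and it skips the normalization step, so it only catches a value made of exactly two identical halves.

```python
def collapse_doubled(name: str) -> str:
    """Return the first half of `name` if the string is that half written twice."""
    half, remainder = divmod(len(name), 2)
    # Only an even-length, non-empty string can consist of two identical halves.
    if remainder == 0 and half > 0 and name[:half] == name[half:]:
        return name[:half]
    return name


names = ["John Smith", "David BrownDavid Brown"]
cleaned = [collapse_doubled(n) for n in names]
print(cleaned)
```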

2 Answers

You can use non-greedy quantifiers (+?) to match a first and last name, followed by zero or more repeats of that same name:

\b(\w+? \w+?)\1*\b

You can also add an optional third name part to support middle names:

\b(\w+? \w+?(?: \w+?)?)\1*\b
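
One way to try both patterns is with Python's re.sub, keeping only Group 1. This is a sketch under assumptions: the helper name dedupe and the sample "Mary Ann LeeMary Ann Lee" are made up for illustration and do not come from the question.

```python
import re

# The two patterns from this answer, compiled once.
TWO_PART = re.compile(r"\b(\w+? \w+?)\1*\b")
WITH_MIDDLE = re.compile(r"\b(\w+? \w+?(?: \w+?)?)\1*\b")


def dedupe(pattern: re.Pattern, name: str) -> str:
    """Replace each match with Group 1, i.e. a single copy of the name."""
    return pattern.sub(r"\1", name)


two_part_results = [
    dedupe(TWO_PART, n) for n in ["John Smith", "David BrownDavid Brown"]
]
middle_result = dedupe(WITH_MIDDLE, "Mary Ann LeeMary Ann Lee")  # hypothetical sample

print(two_part_results)
print(middle_result)
```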

You can use

\b(.+?)\1\b

Details:

  • \b - a word boundary
  • (.+?) - Group 1: one or more characters other than line break characters, as few as possible
  • \1 - Same value as in Group 1
  • \b - a word boundary
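
To see how these pieces interact, here is a small sketch using Python's re module. The helper name collapse and the "mama" input are illustrative additions. The last case shows a side effect worth knowing about: the pattern collapses any boundary-delimited text made of two identical halves, not only names.

```python
import re

DOUBLED = re.compile(r"\b(.+?)\1\b")


def collapse(text: str) -> str:
    """Replace a doubled run with a single copy (Group 1)."""
    return DOUBLED.sub(r"\1", text)


print(collapse("David BrownDavid Brown"))  # David Brown
print(collapse("John Smith"))              # John Smith (no doubled run, unchanged)
print(collapse("mama"))                    # ma (any doubled run is collapsed)
```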

In Pandas, you can use

df['column_name'] = df['column_name'].str.replace(r'\b(.+?)\1\b', r'\1', regex=True)
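
Since the question mentions a Series rather than a DataFrame column, the same call can be sketched directly on a Series. The variable names are assumptions, and dtype=object is chosen here only so the values stay plain Python strings.

```python
import pandas as pd

# Sample data taken from the question; object dtype keeps plain Python str values.
names = pd.Series(["John Smith", "David BrownDavid Brown"], dtype=object)

# Same pattern and replacement as above, applied element-wise to the Series.
cleaned = names.str.replace(r"\b(.+?)\1\b", r"\1", regex=True)

print(cleaned.tolist())
```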
