Remove duplicate words in a string using regex

Question

I'm working on my regex skills and I found that some of my strings have a duplicated word at the start. I would like to remove the duplicate and keep just one occurrence of the word:

server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log

I used the regex below, but I still get both server_server in my output:

((.*?))_(?!\D)

How can I reduce the output to a single server_ if there are two or more, and keep it as is if there is only one? The output doesn't need to contain the digits or the part after the dot, i.e. .zzz, .xyz etc.

Expected output:

server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check

Jean-François Fabre · Accepted Answer · 2018-09-21 12:43:23Z

4

You could back-reference the word in your search expression:

>>> import re
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'

and add a "one or more" quantifier ({1,}, equivalent to +) to the backreference so that further repetitions are matched too:

>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
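As a hedged aside: because .* is greedy, the group can capture more than one copy of the word once it is repeated four or more times, since two copies followed by two copies also satisfies the backreference. A minimal sketch of one way around this, assuming the repeated word never contains an underscore itself, restricts the group to [^_]+ and anchors it at the start of the string. The helper name collapse_prefix is made up for illustration.

```python
import re

# Hypothetical helper: collapse a word repeated at the start of the string.
# Assumes the repeated word itself contains no underscore.
def collapse_prefix(s):
    # ([^_]+_) captures one word plus its underscore; \1+ matches further copies.
    return re.sub(r"^([^_]+_)\1+", r"\1", s)

four = "server_server_server_server_dev1_check_1233.zzz"

# Greedy pattern: the group captures "server_server_" for this input.
print(re.sub(r"(.*_)\1{1,}", r"\1", four))  # server_server_dev1_check_1233.zzz

# Restricted pattern: all four copies collapse to one.
print(collapse_prefix(four))  # server_dev1_check_1233.zzz
```

Strings without a repeated prefix pass through collapse_prefix unchanged.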

Getting rid of the suffix is not the hardest part: capture the rest in a second group and leave the trailing _digits.extension part out of the replacement:

>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
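One thing to check with this pattern: \1{1,} requires at least one repetition, so a string without a duplicated prefix, such as server_dev1_1233.zzz from the question, does not match at all and comes back unchanged, suffix included. A possible sketch that covers both cases, under the assumption that the suffix always has the form _digits.extension and that the repeated word contains no underscore, splits the work into two substitutions. The function name clean_name is hypothetical.

```python
import re

def clean_name(s):
    # Step 1: collapse a word repeated at the start (no-op if it is not repeated).
    s = re.sub(r"^([^_]+_)\1+", r"\1", s)
    # Step 2: drop the trailing "_<digits>.<extension>" part.
    return re.sub(r"_\d+\.[^.]*$", "", s)

for name in ["server_server_dev1_check_1233.zzz", "server_dev1_1233.zzz"]:
    print(clean_name(name))
# server_dev1_check
# server_dev1
```

Keeping the two steps separate means the suffix is stripped whether or not a duplicate was found.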

edited Sep 21, 2018 at 12:43

answered Sep 21, 2018 at 12:37

2 Comments

sdgd Over a year ago

Hi, thanks for the solution. I don't actually need the part from the number onward, i.e. _1233.zzz is not required, just server_dev1_check.

Jean-François Fabre Over a year ago

Just noticed. Not the heart of the problem, but okay :)

Wiktor Stribiżew · Accepted Answer · 2018-09-21 12:39:02Z

3

You may use a single re.sub call to match and remove what you do not need and match and capture what you need:

re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)

Details

^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed by the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.

The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
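
The breakdown above can also be written directly into the pattern with the re.VERBOSE flag, which ignores whitespace in the regex and allows comments inside it. This is only a sketch of the same pattern with annotations; the name RX is chosen here for illustration.

```python
import re

# Same pattern as above, annotated piece by piece via re.VERBOSE.
RX = re.compile(r"""
    ^            # start of string
    ([^_]+)      # group 1: one or more chars other than underscore
    (?:_\1)*     # zero or more repetitions of underscore + group 1 text
    (.*)         # group 2: any chars, as many as possible
    _\d+         # an underscore followed by one or more digits
    \.\w+        # a dot and one or more word chars (the extension)
    $            # end of string
""", re.VERBOSE)

print(RX.sub(r"\1\2", "data_data_dev9_check_660.log"))  # data_dev9_check
```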

Python demo:

import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))

Output:

server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
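
Because the pattern is anchored with ^ and $, an input that does not end in _digits.extension simply does not match, and re.sub returns it unchanged. A hedged sketch of making that case explicit, using re.match and returning None for non-matching input (a choice made here for illustration, not part of the answer; clean_or_none is a hypothetical name):

```python
import re

RX = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')

def clean_or_none(s):
    # Hypothetical helper: cleaned name, or None if s does not fit the pattern.
    m = RX.match(s)
    return m.group(1) + m.group(2) if m else None

print(clean_or_none("server_server_qa1_run_1233.xyz"))  # server_qa1_run
print(clean_or_none("no-digits-here.txt"))              # None
```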

answered Sep 21, 2018 at 12:39

1 Comment

sdgd Over a year ago

Wiktor Stribiżew - thank you so much. Exactly what I needed, and thanks for the clear explanation.
