I'm working on my regex skills and found that some of my strings have a duplicated word at the start. I would like to remove the duplicates and keep just one copy of the word:

server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log

I used the regex below, but I still get both words (server_server) in my output:

((.*?))_(?!\D)

How can I reduce the output to a single server_ when there are two or more, and keep the string as is when there is only one? The output does not need to include the trailing digits or the part after the dot (.zzz, .xyz, etc.).

Expected output:

server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check

2 Answers

You could back-reference the word in your search expression:

>>> import re
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'

and use the {1,} ("one or more") quantifier so that it still works when there are more than two occurrences:

>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
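
One caveat that may be worth checking: because (.*_) is greedy, the group can capture a doubled prefix when the word repeats an even number of times (four or more), which leaves one duplicate behind. A minimal sketch, assuming a lazy group (.*?_) and a ^ anchor suit your inputs; neither is part of the pattern above, and the helper name collapse_prefix is hypothetical:

```python
import re

def collapse_prefix(s):
    # Lazy group: capture the shortest leading "word_" first, then let the
    # backreference \1+ consume every copy that follows immediately.
    return re.sub(r"^(.*?_)\1+", r"\1", s)

four = "server_server_server_server_dev1_check_1233.zzz"

# Greedy version from above: the group grabs "server_server_".
greedy = re.sub(r"(.*_)\1{1,}", r"\1", four)
lazy = collapse_prefix(four)

print(greedy)  # server_server_dev1_check_1233.zzz
print(lazy)    # server_dev1_check_1233.zzz
```

With two or three repetitions both versions give the same result, so the difference only shows up on longer runs.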

Getting rid of the suffix is not the hardest part: just capture the rest and discard the end:

>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
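
Note that this pattern requires at least one repeat (\1{1,}), so a string with a single server_ is not matched and comes back unchanged, suffix included. A rough sketch of one way to cover that case too, assuming an anchored pattern with a lazy first group and \1* (zero or more repeats) is acceptable; the names VARIANT and strip_name are made up for illustration:

```python
import re

# Pattern from the answer above: needs at least one repeat (\1{1,}).
ORIGINAL = r"(.*_)\1{1,}(.*)(_\d+\..*)"
# Assumed variant: anchored, lazy first group, zero or more repeats (\1*).
VARIANT = r"^(.*?_)\1*(.*)_\d+\..*$"

def strip_name(s):
    # Keep the first word (group 1) and the middle part (group 2),
    # drop the repeats and the trailing "_digits.extension".
    return re.sub(VARIANT, r"\1\2", s)

print(re.sub(ORIGINAL, r"\1\2", "server_dev1_1233.zzz"))  # server_dev1_1233.zzz
print(strip_name("server_dev1_1233.zzz"))                 # server_dev1
print(strip_name("server_server_dev1_check_1233.zzz"))    # server_dev1_check
```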

2 Comments

Hi, thanks for the solution. I don't actually need the part starting at the number, i.e. _1233.zzz is not required, just server_dev1_check.
Just noticed. Not the heart of the problem, but okay :)

You may use a single re.sub call to match and remove what you do not need and match and capture what you need:

re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)

See the regex demo

Details

  • ^ - start of string
  • ([^_]+) - Capturing group 1: any 1+ chars other than _
  • (?:_\1)* - zero or more repetitions of _ followed by the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
  • (.*) - Group 2: any 0+ chars, as many as possible
  • _ - an underscore
  • \d+ - 1+ digits
  • \. - a dot
  • \w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
  • $ - end of string.

The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
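
The breakdown above can also be written into the pattern itself with re.VERBOSE, which makes it easy to inspect what each group captures. A minimal sketch; the name RX and the chosen sample string are only for illustration:

```python
import re

RX = re.compile(r"""
    ^            # start of string
    ([^_]+)      # group 1: first word (1+ chars other than _)
    (?:_\1)*     # zero or more repeats of _ plus the same word
    (.*)         # group 2: everything before the final _digits.ext
    _\d+         # an underscore and 1+ digits
    \.\w+        # a dot and 1+ word chars (the extension)
    $            # end of string
""", re.VERBOSE)

sample = "server_server_dev1_check_1233.zzz"
m = RX.match(sample)
print(m.group(1))                # server
print(m.group(2))                # _dev1_check
print(RX.sub(r"\1\2", sample))   # server_dev1_check
```

Group 2 keeps its leading underscore, which is why plain concatenation of the two groups yields a well-formed name.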

Python demo:

import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))

Output:

server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
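
Keep in mind that re.sub returns its input unchanged when the pattern does not match, so a name without the trailing _digits.ext part passes through silently. A small sketch of a wrapper, assuming you would rather detect that case explicitly; the function name clean_name and the choice of returning None are assumptions, not part of the answer above:

```python
import re

RX = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')

def clean_name(s):
    # Return the cleaned name, or None when s does not have the
    # expected word(_word)*..._digits.ext shape.
    m = RX.match(s)
    if m is None:
        return None
    return m.group(1) + m.group(2)

print(clean_name("server_server_dev1_check_1233.zzz"))  # server_dev1_check
print(clean_name("server_dev1.zzz"))                    # None
```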

1 Comment

Wiktor Stribiżew - thank you so much. Exactly what I needed, and thanks for the clear explanation.
