0

I have different strings of the form _AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_ and I want to cut out the (SGYA) part (always capital letters in round brackets), along with any spaces directly before or after it. So the result should be _AHDHDUHD[Tsfs]AHUDSHDI_.

I had the idea of matching the content of the square brackets with ([A-Z_])(\[.+\])([A-Z_]), then splitting it and re-inserting it using the re module (although I am not sure which re function is suited for this).

However, this feels inelegant. Is there a regex that would do what I want directly, without the intermediary steps?

5 Answers

1

You may use

re.sub(r'(\[[^][]*?)\s*\([A-Z]*\)\s*([^][]*])', r'\1\2', text)

Details

  • (\[[^][]*?) - Group 1: a [ and then any 0+ chars other than [ and ] as few as possible
  • \s* - 0+ whitespaces
  • \( - a ( char
  • [A-Z]* - 0+ uppercase ASCII letters
  • \) - a ) char
  • \s* - 0+ whitespaces
  • ([^][]*]) - Group 2: any 0+ chars other than ] and [ (as many as possible) and then a ]
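The breakdown above can also be kept next to the pattern itself. The following is only a sketch, assuming re.VERBOSE is acceptable in your code base; the name rx_verbose is made up for illustration, and the second check shows that [A-Z]* also accepts an empty pair of parentheses.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so that each part
# carries its own comment. rx_verbose is an illustrative name only.
rx_verbose = re.compile(
    r"""
    (\[[^][]*?)    # group 1: '[' then as few non-bracket chars as possible
    \s*            # optional whitespace before the parenthesised part
    \([A-Z]*\)     # '(' + zero or more uppercase ASCII letters + ')'
    \s*            # optional whitespace after it
    ([^][]*])      # group 2: remaining non-bracket chars and the closing ']'
    """,
    re.VERBOSE,
)

result = rx_verbose.sub(r"\1\2", "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_")
print(result)  # _AHDHDUHD[Tsfs]AHUDSHDI_

# Because of [A-Z]*, an empty "()" is removed as well.
empty_parens = rx_verbose.sub(r"\1\2", "[a ()]")
print(empty_parens)  # [a]
```

If empty parentheses should be left alone, [A-Z]+ can be used instead of [A-Z]*.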

Python demo:

import re
rx = r"(\[[^][]*?)\s*\([A-Z]*\)\s*([^][]*])"
s = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI"
print(re.sub(rx, r'\1\2', s))
# => _AHDHDUHD[Tsfs]AHUDSHDI

Another idea: only remove all \s*\([A-Z]+\)\s* matches when found inside [...] substrings:

import re
s = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI"
print(re.sub(r"\[[^][]+]", lambda x: re.sub(r'\s*\([A-Z]+\)\s*', "", x.group()), s))
# => _AHDHDUHD[Tsfs]AHUDSHDI

Here, the \[[^][]+] pattern finds every chunk that consists of a [, then 1+ chars other than square brackets, and then a ]. Only inside those matches, every occurrence of 0+ whitespaces, (, 1+ uppercase ASCII letters, ) and 0+ whitespaces is removed.
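To see where the callback approach matters, here is a small sketch on a made-up string (not from the question) that has a parenthesised part outside the square brackets and two of them inside one pair of brackets. The function name strip_codes_in_brackets is hypothetical.

```python
import re

def strip_codes_in_brackets(text):
    """Remove '(CAPS)' parts and surrounding whitespace, but only inside [...]."""
    return re.sub(
        r"\[[^][]+]",                                          # each [...] chunk
        lambda m: re.sub(r"\s*\([A-Z]+\)\s*", "", m.group()),  # clean inside it
        text,
    )

# Hypothetical input: "(KEEP)" is outside any brackets and stays untouched.
sample = "_AB[Tsfs (SGYA)]CD (KEEP) EF[Tsfs (SG) (YA)]_"
cleaned = strip_codes_in_brackets(sample)
print(cleaned)  # _AB[Tsfs]CD (KEEP) EF[Tsfs]_
```

Note that the whitespace on both sides of a match is removed, so text on either side of a removed part inside the brackets ends up joined together.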

5 Comments

The regex can be simplified to r'(?:\s?\((.*?)\))' since OP just wants to get rid of what is inside () and any preceding spaces. Check my answer.
@accdias Well, OP added some more requirements for that, see always capital letters in round brackets. \(.*?\) will "overfire".
Indeed but I'm guessing he just wanted to mean anything inside round brackets. But it is just a guess and you're being more precise anyway.
@accdias It should only be capital letters. Matching everything is not wrong, of course, but only matching capital letters is more precise indeed.
@accdias In the very first regex I wrote in my answer I used \([^][()]*\), to make sure I stay within [...] and match any chars inside (...) but parentheses. Then, I paid attention to the capital letters restriction.
1
import re


weirdstring = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_"
weirdstring = re.sub(r'(.*?)(\s*\(.*?\)\s*)(.*?)', r'\1\3', weirdstring)

print(weirdstring)

# prints _AHDHDUHD[Tsfs]AHUDSHDI_

1

This will do what you want:

Python 3.7.5 (default, Oct 17 2019, 12:16:48) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s='_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_'
>>> re.sub(r'(?:\s?\((.*)\))', '', s)
'_AHDHDUHD[Tsfs]AHUDSHDI_'
>>> 

If you want to match only capital letters inside the round brackets, then the expression should be:

>>> re.sub(r'(?:\s?\(([A-Z]+)\))', '', s)
'_AHDHDUHD[Tsfs]AHUDSHDI_'
>>>
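One caveat, shown as a rough sketch on made-up strings that are not from the question: the greedy (.*) only works above because the input has a single pair of round brackets. With two pairs it also swallows everything between them, while the lazy and the capital-letters-only variants do not. The lazy variant, in turn, removes lowercase content too.

```python
import re

# Hypothetical inputs, not taken from the question.
two_groups = "_AB[Tsfs (SGYA)]CD[Xyz (QW)]EF_"
lowercase = "_AB[Tsfs (note)]CD_"

greedy = re.sub(r"\s?\(.*\)", "", two_groups)      # runs from first '(' to last ')'
lazy = re.sub(r"\s?\(.*?\)", "", two_groups)       # stops at the nearest ')'
strict = re.sub(r"\s?\([A-Z]+\)", "", two_groups)  # capital letters only

print(greedy)  # _AB[Tsfs]EF_
print(lazy)    # _AB[Tsfs]CD[Xyz]EF_
print(strict)  # _AB[Tsfs]CD[Xyz]EF_

lazy_lower = re.sub(r"\s?\(.*?\)", "", lowercase)
strict_lower = re.sub(r"\s?\([A-Z]+\)", "", lowercase)
print(lazy_lower)    # _AB[Tsfs]CD_
print(strict_lower)  # _AB[Tsfs (note)]CD_
```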

I hope it helps.

0

You are looking for the re.sub function

import re
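s = "AHDHDUHD[Tsfs (SGYA)]AHUDSHDI"
# Match only the parenthesised part and the whitespace around it, so that
# replacing with '' does not also delete the text before it.
s_re = re.sub(r"\s*\(.*?\)\s*", '', s)
print(s_re)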

It will print:

AHDHDUHD[Tsfs]AHUDSHDI

2 Comments

Check the desired output. Your solution is not what OP wants.
Yep, the solution of Wiktor Stribiżew is more appropriate. I changed the regex in order to have the desired output.
0

You could use 2 capturing groups and refer to both of them in the replacement as \1\2:

([A-Z_]+\[[^(\s]+)[^\S\r\n]*\([A-Z]+\)[^\S\r\n]*(\][A-Z_]+)

In parts

  • ( Capture group 1
    • [A-Z_]+ Match 1+ chars A-Z or _
    • \[[^(\s]+ Match [ and 1+ chars other than ( or a whitespace char
  • ) Close group
  • [^\S\r\n]* Match 0+ whitespace chars except newline
  • \([A-Z]+\) Match 1+ chars A-Z between parentheses
  • [^\S\r\n]* Match 0+ whitespace chars except newline
  • ( Capture group 2
    • \][A-Z_]+ Match ] and 1+ chars A-Z or _
  • ) Close group

For example

import re

regex = r"([A-Z_]+\[[^(\s]+)[^\S\r\n]*\([A-Z]+\)[^\S\r\n]*(\][A-Z_]+)"
test_str = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_"
print(re.sub(regex, r"\1\2", test_str))

Output

_AHDHDUHD[Tsfs]AHUDSHDI_
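As a rough sketch of what [^\S\r\n]* buys you, the same pattern can be run on a few variants of the input. The second and third strings are made up for illustration: spaces and tabs around the parentheses are removed, but a newline is not crossed, so that string is left unchanged.

```python
import re

pattern = re.compile(
    r"([A-Z_]+\[[^(\s]+)[^\S\r\n]*\([A-Z]+\)[^\S\r\n]*(\][A-Z_]+)"
)

samples = [
    "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_",    # the string from the question
    "_AHDHDUHD[Tsfs\t(SGYA) ]AHUDSHDI_",  # hypothetical: tab before, space after
    "_AHDHDUHD[Tsfs\n(SGYA)]AHUDSHDI_",   # hypothetical: newline before '('
]

results = [pattern.sub(r"\1\2", s) for s in samples]
for r in results:
    print(repr(r))
# '_AHDHDUHD[Tsfs]AHUDSHDI_'
# '_AHDHDUHD[Tsfs]AHUDSHDI_'
# '_AHDHDUHD[Tsfs\n(SGYA)]AHUDSHDI_'
```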
