2

I have a string with parameters in it:

text =  "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"

I want to remove spaces to obtain all parameters individually in the following way:

pred_res = ["Uncertain significance","PVS1=0","PS=[0, 0, 0, 0, 0]","PM=[0, 0, 0, 0, 0, 0, 0]","PP=[0, 0, 0, 0, 0, 0]","BA1=0","BS=[0, 0, 0, 0, 0]","BP=[0, 0, 0, 0, 0, 0, 0, 0]"]

So far I have used this regex pattern:

pat = re.compile('[a-z]\s[A-Z]|[0-9]\s[A-Z]|]\s[A-Z]')

But it's giving me the result in the following way where it removes characters:

res = ["Uncertain significanc","VS1=","S=[0, 0, 0, 0, 0","M=[0, 0, 0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0","A1=","S=[0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0, 0, 0]"]

So is there a way to prevent this and obtain the result shown in pred_res?

3
  • So you want (list of words) OR (XX=[...]) ? Also you didn't show the regex method that you used on your pat pattern Commented Apr 27, 2021 at 11:00
  • Maybe re.split(r'\s+(?=[A-Z])', text) will do? Commented Apr 27, 2021 at 11:03
  • @azro I want the result like pred_res. And i used the pattern in Series.str.split(). As i have column with data like variable text. I just used a single example. Commented Apr 27, 2021 at 11:07

3 Answers 3

4

You can use a lookahead to check that there is an = in the text immediately following a space.

import re
text = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
pred_res = re.split(r' (?=\w+=)', text)
print(pred_res)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you so much it worked like magic. If its not much trouble can you can explain how did you come up with this pattern?
@NikhilPanchal There's a brief description of lookaheads here: Regex lookahead, lookbehind and atomic groups, but ultimately they're just something you learn at some point, the example used here allows you to search for a string which has another string following ("Look ahead positive (?=)" on the post linked). The reason I went for a space with = in the following text was that that is where all the splits occured in the string in your example.
3

Another option could be matching all the separate parts.

\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)
  • \w+= Match 1+ word char followed by =
  • (?: Non capture group
    • \[[^][]*] match from [ till ]
    • | Or
    • [^][\s]+ Match any char except a whitespace char or char [ and ]
  • ) Close the group
  • | or
  • \w+(?: \w+)*(?= \w+=|$) Match word chars optionally repeated by a space and word chars asserting word chars followed by = or the end of the string at the right

Regex demo

import re

s = "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"
pattern = r"\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)"

pred_res = re.findall(pattern, s)
print(pred_res)

Output

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

Comments

1

Use

\s+(?=[A-Z])

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code:

import re
test_str = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
matches = re.split(r'\s+(?=[A-Z])', test_str)
print(matches)

Results:

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

1 Comment

That's one beautiful explanation :o

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.