Split a string with different condition without removing the character in python

Question

I have a string with parameters in it:

text =  "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"

I want to split on the spaces between parameters so that I get each parameter individually, like this:

pred_res = ["Uncertain significance","PVS1=0","PS=[0, 0, 0, 0, 0]","PM=[0, 0, 0, 0, 0, 0, 0]","PP=[0, 0, 0, 0, 0, 0]","BA1=0","BS=[0, 0, 0, 0, 0]","BP=[0, 0, 0, 0, 0, 0, 0, 0]"]

So far I have used this regex pattern:

pat = re.compile('[a-z]\s[A-Z]|[0-9]\s[A-Z]|]\s[A-Z]')

But the result loses the characters on either side of each split:

res = ["Uncertain significanc","VS1=","S=[0, 0, 0, 0, 0","M=[0, 0, 0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0","A1=","S=[0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0, 0, 0]"]

So is there a way to prevent this and obtain the result shown in pred_res?

So you want (list of words) OR (XX=[...])? Also, you didn't show which regex method you used with your pat pattern.
– azro, Commented Apr 27, 2021 at 11:00
@azro I want the result to look like pred_res. I used the pattern in Series.str.split(), because I have a column with data like the variable text; I just showed a single example.
– Nikhil Panchal, Commented Apr 27, 2021 at 11:07

Nick is tired · Accepted Answer · 2021-04-27 11:04:30Z

4

You can use a lookahead to check that there is an = in the text immediately following a space.

import re
text = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
pred_res = re.split(r' (?=\w+=)', text)
print(pred_res)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']


2 Comments

Nikhil Panchal Over a year ago

Thank you so much, it worked like magic. If it's not too much trouble, can you explain how you came up with this pattern?

Nick is tired Over a year ago

@NikhilPanchal There's a brief description of lookaheads here: Regex lookahead, lookbehind and atomic groups. Ultimately they're just something you learn at some point. The positive lookahead used here (see "Look ahead positive (?=)" in the linked post) lets you match a string only when it is followed by another string, without including that second string in the match. I split on a space followed by a word and = because that is where all the splits occurred in your example string.
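
To make the difference concrete, here is a minimal sketch contrasting a consuming pattern with the lookahead. The shortened sample string and the names consuming and zero_width are assumptions for illustration; the first pattern is a condensed equivalent of the one in the question.

```python
import re

# Shortened sample in the same format as the question (an assumption for brevity).
text = "Uncertain significance PVS1=0 PS=[0, 0] BA1=0"

# Consuming pattern: the characters on both sides of the space belong to the
# match, and re.split discards everything that was matched.
consuming = re.split(r"[a-z0-9\]]\s[A-Z]", text)
print(consuming)
# ['Uncertain significanc', 'VS1=', 'S=[0, 0', 'A1=0']

# Lookahead: only the space is matched; "\w+=" is checked but not consumed,
# so the key that follows the space stays in the next piece.
zero_width = re.split(r" (?=\w+=)", text)
print(zero_width)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0]', 'BA1=0']
```

The only character removed by the second split is the space itself, which is why no parameter loses its first or last character.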

The fourth bird · Accepted Answer · 2021-04-27 11:15:28Z

Another option is to match each part instead of splitting on the separators.

\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)

\w+= Match 1+ word chars followed by =
(?: Non-capturing group
- \[[^][]*] Match from [ to the next ]
- | Or
- [^][\s]+ Match 1+ chars other than whitespace, [ and ]
) Close the group
| Or
\w+(?: \w+)*(?= \w+=|$) Match 1+ word chars, optionally followed by repetitions of a space and 1+ word chars, then assert that what follows is either a space, 1+ word chars and =, or the end of the string

import re

s = "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"
pattern = r"\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)"

pred_res = re.findall(pattern, s)
print(pred_res)

Output

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
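
For readability, the same pattern can also be written with re.VERBOSE so that each part carries its own comment. This is only a sketch of one way to lay it out: the name PARTS and the shortened sample string are assumptions, and the literal spaces are written as [ ] because verbose mode ignores unescaped whitespace outside character classes.

```python
import re

# Same pattern as above, spread over several lines with comments.
PARTS = re.compile(
    r"""
      \w+=                # a key followed by "="
      (?:
          \[[^][]*\]      # a bracketed list: "[" up to the next "]"
        |
          [^][\s]+        # or a bare value without brackets or whitespace
      )
    |
      \w+(?:[ ]\w+)*      # otherwise: words separated by single spaces
      (?=[ ]\w+=|$)       # that must be followed by " key=" or the end
    """,
    re.VERBOSE,
)

# Shortened sample in the same format as the question (an assumption for brevity).
sample = "Uncertain significance PVS1=0 PS=[0, 0] BA1=0"
parts = PARTS.findall(sample)
print(parts)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0]', 'BA1=0']
```

Because the pattern has no capturing groups, findall returns the full text of each match.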

Ryszard Czech · Accepted Answer · 2021-04-27 22:29:53Z

1

Use

\s+(?=[A-Z])

EXPLANATION

--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code:

import re
test_str = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
matches = re.split(r'\s+(?=[A-Z])', test_str)
print(matches)

Results:

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
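
One caveat that may be worth checking against real data: this pattern splits before any capital letter, so it relies on the free-text label having no capitalised word after the first. A small sketch of the difference, where the input string is a made-up variation on the question's format and not taken from the original data:

```python
import re

# Hypothetical input: same layout as the question, but the label
# contains two capitalised words.
sample = "Likely Pathogenic PVS1=1 PS=[0, 1]"

# Splitting before every capital letter also breaks the label apart.
by_capital = re.split(r"\s+(?=[A-Z])", sample)
print(by_capital)
# ['Likely', 'Pathogenic', 'PVS1=1', 'PS=[0, 1]']

# Requiring "key=" after the whitespace keeps the label in one piece.
by_key = re.split(r"\s+(?=\w+=)", sample)
print(by_key)
# ['Likely Pathogenic', 'PVS1=1', 'PS=[0, 1]']
```

For the exact string in the question both patterns give the same result, so the choice only matters if labels like this can occur in the column.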


1 Comment

Nikhil Panchal Over a year ago

That's one beautiful explanation :o
