Using regex to remove the unnecessary whitespace to get the expected output

Question

I have some unstructured data like this:

test1     21;
 test2  22;
test3    [ 23 ];

I want to remove the unnecessary whitespace and convert each row into a two-item list. The expected output looks like this:

['test1', '21']
['test2', '22']
['test3', ['23']]

Currently, I am using re.sub to remove the unnecessary whitespace:

re.sub(r"\s+", " ", z.rstrip('\n').lstrip(' ').rstrip(';')).split(' ')

This collapses each run of whitespace into a single space, which is fine. The problem is the third example: there is whitespace after the opening bracket and before the closing bracket, and I want to remove that too. With the regex above I am not able to.

This is the output I am currently getting:

['test1', '21']
['test2', '22']
['test3', '[', '23', ']']

You may check the example here on pythontutor.

Sorry, my bad. Actually, I need those square brackets. Let me update the post. Sorry again.
– Tony Montana, Commented Nov 9, 2021 at 14:57
This is not a possible outcome with regex, since ['23'] is a result of making an array. Not really a regex strong point.
– sln, Commented Nov 9, 2021 at 21:34

anubhava · Accepted Answer · 2021-11-09 15:59:41Z

2

You may use this regex with 2 capture groups:

(\w+)\s+(\[[^]]+\]|\w+);


RegEx Details:

(\w+): Match 1+ word characters in first capture group
\s+: Match 1+ whitespaces
(\[[^]]+\]|\w+): Match a [...] string or a word in second capture group
;: Match a ;
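
To see what the two capture groups return before any post-processing, here is a minimal sketch. The names `pattern`, `sample`, and `pairs` are illustrative only and not part of the answer.

```python
import re

# Same pattern as above; `pattern`, `sample`, and `pairs` are illustrative names.
pattern = re.compile(r'(\w+)\s+(\[[^]]+\]|\w+);')

sample = "test1     21;\n test2  22;\ntest3    [ 23 ];\n"

# With two capture groups, findall returns one (group 1, group 2) tuple per match.
pairs = pattern.findall(sample)
for key, value in pairs:
    print(repr(key), repr(value))
```

Note that the bracketed value comes back as the raw string '[ 23 ]', inner spaces included, so it still needs a second step to become a list.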

Code:

>>> import re
>>> data = '''
... test1     21;
...  test2  22;
... test3    [ 23 ];
... '''
>>> res = []
>>>
>>> for i in re.findall(r'(\w+)\s+(\[[^]]+\]|\w+);', data):
...     res.append([ i[0], eval(re.sub(r'^(\[)\s*|\s*(\])$', r'\1"\2', i[1])) if i[1].startswith('[') else i[1] ])
...
>>> print (res)
[['test1', '21'], ['test2', '22'], ['test3', ['23']]]
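
If you would rather not call eval on input text, a possible variant is to strip the brackets and split what is inside. This is only a sketch: `parse_pairs` is a hypothetical helper name, and it assumes that items inside the brackets are separated by whitespace (the sample data only shows a single item).

```python
import re

def parse_pairs(text):
    """Return [key, value] rows; a bracketed value becomes a list of its items."""
    rows = []
    for key, value in re.findall(r'(\w+)\s+(\[[^]]+\]|\w+);', text):
        if value.startswith('['):
            # Drop the surrounding brackets, then split the inside on whitespace.
            # Assumption: items inside the brackets are whitespace-separated.
            value = value[1:-1].split()
        rows.append([key, value])
    return rows

data = """
test1     21;
 test2  22;
test3    [ 23 ];
"""
print(parse_pairs(data))
# [['test1', '21'], ['test2', '22'], ['test3', ['23']]]
```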

edited Nov 9, 2021 at 15:59

answered Nov 9, 2021 at 15:18

anubhava


Wiktor Stribiżew · Accepted Answer · 2021-11-10 12:56:27Z

1

You can use

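import re

x = "test1     21"
y = "     test2  22"
z = "    test3    [ 23 ]"

# Remove whitespace that follows "[", other whitespace, or the start of the string,
# and whitespace that precedes "]"; one separating space survives for split(' ').
for a in [x, y, z]:
    print(re.sub(r"(?<![^[\s])\s+|\s+(?=])", "", a.rstrip('\n').lstrip(' ').rstrip(';')).split(' '))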

Output:

['test1', '21']
['test2', '22']
['test3', '[23]']

Details:

(?<![^[\s])\s+ - one or more whitespaces that are preceded by a [ char, a whitespace, or the start of the string
| - or
\s+(?=]) - one or more whitespaces that are followed by a ] char.
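
The output above keeps the bracketed value as the string '[23]'. If the nested list from the question is needed, one possible follow-up step is sketched below. `split_row` is a hypothetical helper, and it assumes each row holds exactly one key and one value, with any items inside the brackets separated by whitespace.

```python
import re

def split_row(line):
    """Hypothetical helper: clean one row with the pattern above, then unwrap a bracketed value."""
    cleaned = re.sub(r"(?<![^[\s])\s+|\s+(?=])", "", line.strip().rstrip(';'))
    # Split only on the first space, so a multi-item bracket such as "[1 2]" stays together.
    key, value = cleaned.split(' ', 1)
    if value.startswith('[') and value.endswith(']'):
        # Assumption: items inside the brackets are whitespace-separated.
        value = value[1:-1].split()
    return [key, value]

for row in ["test1     21", "     test2  22", "    test3    [ 23 ]"]:
    print(split_row(row))
# ['test1', '21']
# ['test2', '22']
# ['test3', ['23']]
```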

edited Nov 10, 2021 at 12:56

answered Nov 9, 2021 at 15:21

Wiktor Stribiżew
