I have some unstructured data like this:

test1     21;
 test2  22;
test3    [ 23 ];

I want to remove the unnecessary whitespace and convert each row into a two-item list. The expected output should look like this:

['test1', '21']
['test2', '22']
['test3', ['23']]

Currently I am using re.sub to collapse the unnecessary whitespace:

re.sub(r"\s+", " ", z.rstrip('\n').lstrip(' ').rstrip(';')).split(' ')

This replaces each run of whitespace with a single space, which is fine. The problem is the third example: there is whitespace after the opening bracket and before the closing bracket, and I want to remove it. With the regex above, that whitespace is kept, so the split produces the brackets as separate items.

This is the output I am currently getting:

['test1', '21']
['test2', '22']
['test3', '[', '23', ']']

You may check the example here on pythontutor.

  • Why not remove the square brackets before? Commented Nov 9, 2021 at 14:50
  • Sorry, my bad. Actually, I need those square brackets. Let me update the post. Sorry again. Commented Nov 9, 2021 at 14:57
  • This is not a possible outcome with regex, since ['23'] is a result of making an array. Not really a regex strong point. Commented Nov 9, 2021 at 21:34

2 Answers

2

You may use this regex with 2 capture groups:

(\w+)\s+(\[[^]]+\]|\w+);

RegEx Details:

  • (\w+): Match 1+ word characters in first capture group
  • \s+: Match 1+ whitespaces
  • (\[[^]]+\]|\w+): Match a [...] string or a word in second capture group
  • ;: Match a ;

Code:

>>> import re
>>> data = '''
... test1     21;
...  test2  22;
... test3    [ 23 ];
... '''
>>> res = []
>>>
>>> for i in re.findall(r'(\w+)\s+(\[[^]]+\]|\w+);', data):
...     res.append([ i[0], eval(re.sub(r'^(\[)\s*|\s*(\])$', r'\1"\2', i[1])) if i[1].startswith('[') else i[1] ])
...
>>> print (res)
[['test1', '21'], ['test2', '22'], ['test3', ['23']]]
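
If you would rather not call eval() on text taken from the input, here is a minimal sketch of the same idea that unwraps the bracketed value with plain string methods. The parse function and the PATTERN name are only illustrative, and it assumes that items inside [...] are separated by whitespace, which the question does not actually specify.

```python
import re

data = '''
test1     21;
 test2  22;
test3    [ 23 ];
'''

# Same pattern as above: a name, whitespace, then either [...] or a word, then ';'
PATTERN = re.compile(r'(\w+)\s+(\[[^]]+\]|\w+);')

def parse(text):
    """Hypothetical helper: return one [name, value] row per matched line."""
    rows = []
    for key, value in PATTERN.findall(text):
        if value.startswith('['):
            # Drop the brackets and split the inside on whitespace,
            # so '[ 23 ]' becomes ['23'] without calling eval().
            # Assumption: bracketed items are whitespace-separated.
            value = value[1:-1].split()
        rows.append([key, value])
    return rows

print(parse(data))
```

For the sample data this should give the same nested result as the eval() version.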

1

You can use

import re
 
x = "test1     21"
y = "     test2  22"
z = "    test3    [ 23 ]"
 
for a in [x, y, z]:
    print(re.sub(r"(?<![^[\s])\s+|\s+(?=])", "", a.rstrip('\n').lstrip(' ').rstrip(';')).split(' '))

Output:

['test1', '21']
['test2', '22']
['test3', '[23]']

Details:

  • (?<![^[\s])\s+ - one or more whitespaces that are preceded by a [ char, a whitespace, or the start of the string
  • | - or
  • \s+(?=]) - one or more whitespaces that are followed by a ] char.
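
Note that this leaves the third value as the string '[23]' rather than the nested list ['23'] shown in the question. A minimal sketch of one way to close that gap is below; to_row and SPACES are hypothetical names, and the sketch assumes each line holds exactly one name followed by one value, with bracketed items separated by whitespace.

```python
import re

# Same substitution pattern as in the answer above.
SPACES = re.compile(r"(?<![^[\s])\s+|\s+(?=])")

def to_row(line):
    """Hypothetical helper: turn one raw line into a [name, value] row."""
    # Trim the line and the trailing ';', then remove the extra whitespace.
    cleaned = SPACES.sub("", line.strip().rstrip(';'))
    # Assumption: exactly one name and one value per line.
    key, value = cleaned.split(' ', 1)
    if value.startswith('[') and value.endswith(']'):
        # Unwrap '[23]' into ['23'].
        value = value[1:-1].split()
    return [key, value]

for raw in ["test1     21;", " test2  22;", "test3    [ 23 ];"]:
    print(to_row(raw))
```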
