I wrote this regex, which splits the expression 'res=3+x_sum*11' into lexemes:

import re
print(re.findall(r'(\w+)(=)(\d+)(\*|\+)(\w+)(\*|\+)(\d+)', 'res=3+x_sum*11'))

My output looks like this:

[('res', '=', '3', '+', 'x_sum', '*', '11')]

But I want re.findall to return a list of the lexemes and their tokens, with each lexeme in its own group. That output should look like this:

[('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'),
 ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]

How do I get re.findall to return output like that?

1 Answer

You may tokenize the string using

re.findall(r'(\d+)|([^\W\d]+)|(\W)', s)

See the regex demo. Note that re.findall returns a list of tuples when the pattern contains more than one capturing group. The pattern above contains three capturing groups, so each tuple has three elements: one or more digits, one or more letters/underscores, or a single non-word character. Only the group that actually matched is filled in; the other two elements are empty strings.
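To make the effect of capturing groups on the return value concrete, here is a small sketch that runs the same alternation with zero, one, and three groups. The variable names are only illustrative.

```python
import re

s = "res=3+x_sum*11"

# No capturing group: findall returns a flat list of whole matches.
no_groups = re.findall(r"\d+|[^\W\d]+|\W", s)

# Exactly one capturing group: a flat list of that group's text.
one_group = re.findall(r"(\d+|[^\W\d]+|\W)", s)

# Several capturing groups: a list of tuples with one slot per group.
# Groups that did not take part in a match show up as empty strings.
three_groups = re.findall(r"(\d+)|([^\W\d]+)|(\W)", s)

print(no_groups)        # ['res', '=', '3', '+', 'x_sum', '*', '11']
print(three_groups[0])  # ('', 'res', '')
```

If only the lexemes are needed, without the information about which alternative matched, the group-free form is enough.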

More details

  • (\d+) - Capturing group 1: 1+ digits
  • | - or
  • ([^\W\d]+) - Capturing group 2: 1+ chars other than non-word and digit chars (letters or underscores)
  • | - or
  • (\W) - Capturing group 3: a non-word char.
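Two consequences of these definitions are worth checking before reusing the pattern on other input. The sketch below uses two made-up inputs, "x1=2" and "a = 1", which are not from the question.

```python
import re

rx = r"(\d+)|([^\W\d]+)|(\W)"

# Group 2 excludes digits, so a name such as x1 is split into 'x' and '1'.
split_name = re.findall(rx, "x1=2")
print(split_name)
# => [('', 'x', ''), ('1', '', ''), ('', '', '='), ('2', '', '')]

# \W also matches whitespace, so every space becomes a token in group 3.
spaced = re.findall(rx, "a = 1")
print(spaced)
# => [('', 'a', ''), ('', '', ' '), ('', '', '='), ('', '', ' '), ('1', '', '')]
```

Neither case occurs in 'res=3+x_sum*11', so the pattern gives the requested output for that string as it stands.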

See Python demo:

import re
rx = r"(\d+)|([^\W\d]+)|(\W)"
s = "res=3+x_sum*11"
print(re.findall(rx, s))
# => [('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'), ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
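If the position inside the tuple is only being used to tell token kinds apart, one possible alternative is to name the groups and read `m.lastgroup` from `re.finditer`. This is a sketch, not part of the original answer; the group names NUMBER, IDENT and OP and the helper `tokenize` are hypothetical.

```python
import re

# Same three alternatives as above; the group names are illustrative labels.
TOKEN_RX = re.compile(r"(?P<NUMBER>\d+)|(?P<IDENT>[^\W\d]+)|(?P<OP>\W)")

def tokenize(text):
    """Return (kind, lexeme) pairs; kind is the name of the group that matched."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RX.finditer(text)]

tokens = tokenize("res=3+x_sum*11")
print(tokens)
# => [('IDENT', 'res'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'),
#     ('IDENT', 'x_sum'), ('OP', '*'), ('NUMBER', '11')]
```

Each match fills exactly one named group, so `m.lastgroup` identifies the token kind without scanning a tuple for the non-empty slot.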