Python regex tokenizer for simple expression

Question

I wrote this regex, which splits the expression 'res=3+x_sum*11' into lexemes:

import re
print(re.findall(r'(\w+)(=)(\d+)(\*|\+)(\w+)(\*|\+)(\d+)', 'res=3+x_sum*11'))

with my output looking like this:

[('res', '=', '3', '+', 'x_sum', '*', '11')]

But I want re.findall to return a list of the lexemes and their tokens, with each lexeme in its own group. That output should look like this:

[('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'),
 ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]

How do I get re.findall to return output like that?

Wiktor Stribiżew · Accepted Answer · 2018-05-03 13:15:04Z

You may tokenize the string using

re.findall(r'(\d+)|([^\W\d]+)|(\W)', s)

See the regex demo. Note that re.findall returns a list of tuples when the pattern contains more than one capturing group. The pattern above contains 3 capturing groups, so each tuple has 3 elements: one or more digits, one or more letters/underscores, or a single non-word char. Only the group that took part in the match is non-empty; the other two are empty strings.
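To make the return shape concrete, here is a small sketch comparing re.findall with no capturing groups, one group, and several groups on the same input. The variable names (no_groups, one_group, many_groups) are only illustrative and are not part of the original answer.

```python
import re

s = "res=3+x_sum*11"

# No capturing groups: a list of whole-match strings.
no_groups = re.findall(r"\d+|[^\W\d]+|\W", s)

# Exactly one capturing group: a list of strings (that group's text).
one_group = re.findall(r"(\d+)", s)

# Several capturing groups: a list of tuples, one slot per group.
# Groups that did not take part in a match come back as ''.
many_groups = re.findall(r"(\d+)|([^\W\d]+)|(\W)", s)

print(no_groups)       # ['res', '=', '3', '+', 'x_sum', '*', '11']
print(one_group)       # ['3', '11']
print(many_groups[0])  # ('', 'res', '')
```

If only the lexemes are needed and not their kinds, the group-free form is probably the simplest choice.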

More details

(\d+) - Capturing group 1: 1+ digits
| - or
([^\W\d]+) - Capturing group 2: 1+ chars other than non-word and digit chars (letters or underscores)
| - or
(\W) - Capturing group 3: a non-word char.

See Python demo:

import re
rx = r"(\d+)|([^\W\d]+)|(\W)"
s = "res=3+x_sum*11"
print(re.findall(rx, s))
# => [('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'), ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
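If the position of the non-empty slot in each tuple is awkward to work with, one possible variation (not part of the answer above) is to name the groups and read the matching group's name from match.lastgroup. The group names NUMBER, NAME and OP and the helper tokenize are hypothetical, chosen only for this sketch.

```python
import re

# Same three alternatives as above, but with named groups.
# NUMBER, NAME and OP are illustrative names, not from the original answer.
TOKEN_RX = re.compile(r"(?P<NUMBER>\d+)|(?P<NAME>[^\W\d]+)|(?P<OP>\W)")

def tokenize(text):
    """Return (kind, lexeme) pairs, one per match in text."""
    # Only one alternative matches at a time, so lastgroup is its name.
    return [(m.lastgroup, m.group()) for m in TOKEN_RX.finditer(text)]

tokens = tokenize("res=3+x_sum*11")
print(tokens)
# [('NAME', 'res'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'),
#  ('NAME', 'x_sum'), ('OP', '*'), ('NUMBER', '11')]
```

Note that \W also matches whitespace, so with either pattern a space in the input would come back as its own token unless it is filtered out.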

