Python regex module vs re module - pattern mismatch

Question

Update: This issue was resolved by the developer in commit be893e9

If you encounter the same problem, update your regex module.
You need version 2017.04.23 or above.
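
To check whether an installed copy is new enough, something like the following could work. This is only a sketch: the helper names `parse_version`, `has_backtracking_fix` and `installed_regex_version` are made up for illustration, and it assumes the version string consists of dot-separated integers, as in the releases named in this post. The version is read from package metadata via `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata

FIXED_IN = (2017, 4, 23)  # first release reported as fixed in this post


def parse_version(version):
    """Turn a dotted version such as '2017.04.23' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def has_backtracking_fix(version):
    """Return True if the given regex release is 2017.04.23 or newer."""
    return parse_version(version) >= FIXED_IN


def installed_regex_version():
    """Return the installed regex distribution's version, or None if absent."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_regex_version()
    if found is None:
        print("regex is not installed")
    else:
        print(found, "fixed" if has_backtracking_fix(found) else "needs update")
```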

As pointed out in this answer I need this regular expression:

(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})

working with the regex module too...

import re     # standard library
import regex  # https://pypi.python.org/pypi/regex/

content = '"Erm....yes. T..T...Thank you for that."'
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
substitute = r"\2-\4"

print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:

"Erm....yes. T-Thank you for that."
"-yes. T..T...Thank you for that."

Q: How do I have to write this regex so that the regex module handles it the same way the re module does?

Using the re module is not an option as I require look-behinds with dynamic lengths.
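
To illustrate the limitation: the standard-library re module refuses to compile a look-behind whose width is not fixed. The two patterns below are made-up examples, not the ones from my real use case, and `re_accepts` is just a hypothetical helper name.

```python
import re


def re_accepts(pattern):
    """Return True if the standard-library re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Fixed-width look-behind: re compiles it.
print(re_accepts(r"(?<=ab)c"))
# Variable-width look-behind: re raises an error at compile time.
print(re_accepts(r"(?<=a+)c"))
```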

For clarification: it would be nice if the pattern worked with both modules, but in the end I only need it for regex.

To clarify: do you need this regex to work with both re and regex, or just with regex?
– Aran-Fey, Commented Apr 22, 2017 at 19:52
Why use (?<=\b) instead of \b, which is already a zero-length assertion?
– bobble bubble, Commented Apr 22, 2017 at 19:56
@Rawing I updated the question. The BitBucket issue can be found here: bitbucket.org/mrabarnett/mrab-regex/issues/238/…
– Fabian N., Commented Apr 22, 2017 at 20:51
The issue is fixed. I added a note to the question telling everyone who encounters the same bug to update their regex module.
– Fabian N., Commented Apr 23, 2017 at 15:01

Aran-Fey · Accepted Answer · 2017-04-23 13:32:29Z

6

It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.

An example:

>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'

For reference, the expected output would be:

>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'

In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, the match is attempted a 2nd time. This time, (\d{1,3}) matches "5", but the x fails to match. However, the capture group's value is now (re)set to the empty string instead of the expected 123.
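
The reference behaviour can be inspected with the standard-library re module alone. The snippet below is a small sketch (the variable names are arbitrary) showing that, after the failed second iteration, re keeps the capture from the last successful iteration:

```python
import re

pattern = re.compile(r"(?:(\d{1,3})x)+")
match = pattern.match("123x5")

overall = match.group(0)   # text consumed by the whole repeated group
captured = match.group(1)  # value kept from the last successful iteration

# The failed second iteration (which briefly captured "5") leaves no trace.
print(repr(overall), repr(captured))
```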

As a workaround, we can prevent the capture group from matching. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):

>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'

As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):

>>> pattern = r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'
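
As a sanity check that the added lookahead does not change what the pattern means, one can compare both versions under the standard-library re module, which is not affected by the bug. This is only a sketch: the helper name `rewritten_pattern_agrees` and the choice of sample strings are mine, and agreement on a few samples is of course not a proof.

```python
import re

SUBSTITUTE = r"\2-\4"
ORIGINAL = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
LOOKAHEAD = r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"

SAMPLES = [
    '"Erm....yes. T..T...Thank you for that."',  # input from the question
    "T...Tha....Thank",                          # extra stutter sample
    "a-abc-abcxy",                               # extra hyphen sample
]


def rewritten_pattern_agrees(text):
    """Return True if both patterns produce the same substitution under re."""
    return re.sub(ORIGINAL, SUBSTITUTE, text) == re.sub(LOOKAHEAD, SUBSTITUTE, text)


for sample in SAMPLES:
    print(rewritten_pattern_agrees(sample), repr(sample))
```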

edited Apr 23, 2017 at 13:32

answered Apr 22, 2017 at 20:49

Aran-Fey


5 Comments

Aran-Fey Over a year ago

@Aprillion Nice catch, thanks. Not sure why that's happening.

Fabian N. Over a year ago

@Rawing you still have a typo, your produced output is '"Erm....yes. T-Thank you for that."'

Aran-Fey Over a year ago

@FabianN. I would like to agree, but it actually does produce T..T-Thank on my machine.

Aran-Fey Over a year ago

I figured it out; it was necessary to prevent the capture group from matching at all. I had to move the lookahead assertion in front of the capture group. Answer updated.

Aprillion Over a year ago

Nice solution... this would be consistent with my theory that empty match groups from an unsuccessful longer match are not reverted during backtracking (with an unsuccessful lookahead, the longer match attempt fails before an empty match is stored in the \2 group).

Aprillion · Accepted Answer · 2017-04-23 17:21:59Z

1

edit: the bug is now resolved in regex 2017.04.23

just tested in Python 3.6.1 and the original pattern works the same in re and regex

Original workaround: you can use the lazy operator +? (note that this is a different regex, which behaves differently from the original pattern in edge cases such as T...Tha....Thank):

pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
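
To see how the lazy variant differs from the original, both can be run through the standard-library re module, which is not affected by the bug. A small sketch follows; the sample inputs are illustrative edge cases and the helper name `compare` is made up.

```python
import re

SUBSTITUTE = r"\2-\4"
GREEDY = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
LAZY = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"


def compare(text):
    """Return the (greedy, lazy) substitution results for one input."""
    return re.sub(GREEDY, SUBSTITUTE, text), re.sub(LAZY, SUBSTITUTE, text)


for sample in ["T...Tha....Thank", "a-abc-abcxy"]:
    greedy_result, lazy_result = compare(sample)
    # The lazy pattern stops at the first repetition that lets \2\w{2,} match,
    # so it can leave part of the stutter unreplaced.
    print(repr(sample), "greedy:", greedy_result, "lazy:", lazy_result)
```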

The bug in 2017.04.05 was due to backtracking, something like this:

The unsuccessful longer match creates an empty \2 group. Conceptually, it should trigger backtracking to the shorter match, where the nested group is not empty. However, regex seems to "optimize": instead of computing the shorter match from scratch, it uses some cached values and forgets to undo the update of the nested match groups.

Example: greedy matching of ((\w{1,3})(\.{2,10})){1,3} first attempts 3 repetitions, then backtracks to fewer:

import re
import regex

content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']

for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual:  ', tc, actual)

output:

expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual:   1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual:   3 []
expected: 2 [('T...', 'T', '...')]
actual:   2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual:   1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
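
The TODO above could be turned into a reusable check roughly like this. It is a sketch under a few assumptions: `find_mismatches` is a made-up helper, it takes the two `findall` functions as arguments so it can be exercised without regex installed, and the regex import is treated as optional.

```python
import re

try:
    import regex  # third-party; optional for this sketch
except ImportError:
    regex = None

CONTENT = '"Erm....yes. T..T...Thank you for that."'
TEMPLATE = r"((\w{1,3})(\.{2,10})){%s}"
COUNTS = ["1,3", "3", "2", "1"]


def find_mismatches(reference, candidate, content=CONTENT, counts=COUNTS):
    """Return (count, expected, actual) for every repetition count that differs."""
    mismatches = []
    for count in counts:
        pattern = TEMPLATE % count
        expected = reference(pattern, content)
        actual = candidate(pattern, content)
        if expected != actual:
            mismatches.append((count, expected, actual))
    return mismatches


if regex is not None:
    # Based on the output above, 2017.04.05 should report a mismatch for '1,3'.
    print(find_mismatches(re.findall, regex.findall))
```

In a pytest test this would boil down to asserting that the returned list is empty.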

edited Apr 23, 2017 at 17:21

answered Apr 22, 2017 at 20:59

Aprillion


5 Comments

Aran-Fey Over a year ago

That's not really a workaround, that's a modification of the pattern. If you try it with input like a-abc-abcxy, it will produce different output than the original pattern.

Fabian N. Over a year ago

@Rawing Thanks for pointing that out. I did some testing regarding the original use case of this pattern (see here stackoverflow.com/questions/43560759/…) and it is indeed possible that undesired output will be produced, e.g. for T...Tha....Thanks

Aprillion Over a year ago

Right, not a solution, just a workaround for a subset of problems: when you don't need to handle T...Tha....Thank and/or if T-Tha...Thank output is as good as Tha-Thank would have been (both are meaningless to me, so I would give my workaround a chance and ask the customer if it is good enough for them).

Aprillion Over a year ago

also for T..T..Th...Thank the lazy pattern gives Th-Thank in re but T-Th...Thank in regex

Fabian N. Over a year ago

Unfortunately, the text-to-speech engine this regex is sanitizing input for is a bit picky: it makes awkward breaks or even says 'dot' when there are still some dots left. Hopefully, the developer will find this post useful and fix this in the near future...
