
Update: This issue was caused by a bug in the regex module, which the developer resolved in commit be893e9.

If you encounter a similar problem, update your regex module.
You need version 2017.04.23 or above.
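
A quick way to check which release is installed, as a rough sketch. It assumes regex was installed as a distribution named `regex` (as it is on PyPI) and Python 3.8+ for `importlib.metadata`; `parse_version`, `has_fix` and `installed_regex_version` are hypothetical helper names, not part of any library.

```python
from importlib import metadata

# First release containing the fix, per the note above
FIXED_IN = (2017, 4, 23)


def parse_version(text):
    """Turn '2017.04.23' into (2017, 4, 23); non-numeric parts are skipped."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_fix(version_text):
    """True if the given release string is at or above the fixed release."""
    return parse_version(version_text) >= FIXED_IN


def installed_regex_version():
    """Return the installed regex release string, or None if it is missing."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


installed = installed_regex_version()
if installed is None:
    print("regex is not installed")
else:
    print("regex %s, fix included: %s" % (installed, has_fix(installed)))
```

If the check reports that the fix is missing, upgrading the package should be enough.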

See here for further information.


Background: I'm using a collection of regular expressions (english.lex) in a 3rd party Text2Speech engine to normalize the input text before speaking it.

For debugging purposes, I wrote the script below to see what impact my regex collection actually has on the input text.

My problem is that the script applies the substitution for a regex that simply should not match.


I have 3 files:

regex_preview.py

#!/usr/bin/env python
import codecs
import regex as re

input_file = "Text2Speach Regex Test.txt"  # not named `input`, which would shadow the built-in
dictionary = "english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()
    with codecs.open(input_file, "r+", "utf16") as g:
        content = g.read().replace(r'\\\\\"','"')

        # apply all regular expressions to content
        for line in reg_exen:
            line=line.strip()

            # skip comments
            if line == "" or line[0] == "#":
                pass
            else:
                # remove " from lines and split them into pattern and substitute
                pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
                substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')

                print("\n'%s' ==> '%s'" % (pattern, substitute))

                print(content.strip())
                content = re.sub(pattern, substitute, content)
                print(content.strip())

english.lex - utf16 encoded

# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."

# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O
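
The two nested re.sub calls in the script are hard to follow. One possible alternative is a single anchored match per line. This is only a sketch: it assumes every rule has the form "pattern" "substitute" with inner quotes escaped as \", as in the file above, and `parse_lex_line` is a hypothetical helper name.

```python
import re

# Assumed line format: "pattern" "substitute", inner quotes written as \"
LEX_LINE = re.compile(r'^"((?:[^"\\]|\\.)*)"\s+"((?:[^"\\]|\\.)*)"$')


def parse_lex_line(line):
    """Return (pattern, substitute), or None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LEX_LINE.match(line)
    if match is None:
        raise ValueError("unrecognized lex line: %r" % line)
    # Only \" is unescaped; all other backslash sequences stay as written
    pattern, substitute = (part.replace('\\"', '"') for part in match.groups())
    return pattern, substitute


print(parse_lex_line('"(«|»)+" "\\""'))
print(parse_lex_line("# punctuation normalization"))
```

Raising on lines that do not fit the assumed format makes a malformed rule visible instead of silently producing an odd pattern.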

Text2Speach Regex Test.txt - utf16 encoded

“Erm….yes. Thank you for that.”

Running the script produces this output with the last regex somehow matching the content:

'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."

'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."

'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."

What I tried so far:

I created this snippet to reproduce the issue:

#!/usr/bin/env python

import re
import codecs

content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)

print(content)

But this actually behaves like it should, so I'm at a loss as to what's happening here.

Hopefully, someone can point me in the right direction for further investigation...

  • Your regex will match strings like abc....abc123 or abc-abc123, it cannot match Erm....yes. Commented Apr 22, 2017 at 15:19
  • I know - I wrote it - and that's exactly my problem: why the hell is the script able to match that regex at all. I suspect it's some kind of encoding issue but I have no idea where I went wrong. Commented Apr 22, 2017 at 15:23
  • When debugging Python regexes, a good start is to use r'text' style literal strings for the regex expressions. It's too easy to get confused by multiple backslashes and by which backslash-plus-character combinations are escapes and which aren't. Try that, and if the problem continues it'll be easier for us to understand and make more suggestions. Commented Apr 22, 2017 at 15:37
  • OK, for clarification: are you suggesting I change the contents of english.lex or of the reproduction snippet? (I changed the parts in the reproduction snippet and it's still working, like before. The main script is still acting up, though, also like before...) Commented Apr 22, 2017 at 15:41
  • IIRC, the python unicode string encoding is UTF-8, not UTF-16, if it matters. Commented Apr 22, 2017 at 15:44
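
The claim in the first comment can be checked with the standard library re module alone. A minimal sketch: the first two sample strings come from that comment, the last one is the question's input after the earlier substitutions.

```python
import re

pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"

erm = '"Erm....yes. Thank you for that."'
samples = ["abc....abc123", "abc-abc123", erm]

# Apply the stammer rule to each sample with the standard library engine
results = {text: re.sub(pattern, substitute, text) for text in samples}

for text, result in results.items():
    print("%r -> %r" % (text, result))
# 'abc....abc123' -> 'abc-abc123'
# 'abc-abc123' -> 'abc-abc123'
# the Erm sentence comes back unchanged
```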

1 Answer

The original script is using the alternative regex module instead of the standard library re module.

import regex as re

There's clearly some difference between the two in this case. My guess is that it has something to do with nested groups. This expression contains a capturing group within a non-capturing group, which is way too magical for my taste.
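
To illustrate the construct in question: when a capturing group sits inside a repeated non-capturing group, the group ends up holding the text from the last iteration that succeeded, and a later \1 refers to that text. A small sketch using only the standard library re module, with made-up sample strings and a simplified version of the pattern:

```python
import re

# Group 1 is overwritten on every successful pass through the (?:...)+ loop
match = re.match(r"(?:(\w{1,3})-)+", "ab-cd-ef-")
last_capture = match.group(1)
print(last_capture)  # ef

# \1 in the second group must repeat whatever group 1 captured last
stammer = re.sub(r"(?:(\w{1,3})-)+(\1\w{2,})", r"\1-\2", "w-w-what")
print(stammer)  # w-what
```

An engine that restores group 1 incorrectly after a failed pass through the loop could make \1 match something it should not, which would fit the symptom here. That is a guess, though, not something I have verified.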

import re     # standard library
import regex  # completely different implementation

content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"

print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:

"Erm....yes. Thank you for that."
"-yes. Thank you for that."

2 Comments

Thanks for finding that one. Some of my other expressions require variable-length lookbehinds, so I switched to the regex module instead of re, and in my test script I imported re without even noticing... Well, looks like I just have to simplify the regular expressions. Do you happen to know a good source of information on what is magical and what is not?
