Update: This issue was caused by a bug in the regex module, which the developer resolved in commit be893e9.
If you encounter a similar problem, update your regex module.
You need version 2017.04.23 or above.
See here for further information.
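To check whether an installation already has the fix, something like the following sketch might help. It assumes Python 3.8+ (for `importlib.metadata`) and that the installed distribution is named `regex` and uses the date-style version numbers mentioned above; the helper names `version_key`, `has_fix` and `installed_regex_version` are made up for this example.

```python
from importlib import metadata  # available in Python 3.8+

FIXED_IN = "2017.04.23"  # first release containing the fix, per the update above


def version_key(version):
    """Turn '2017.04.23' into (2017, 4, 23) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def has_fix(installed, minimum=FIXED_IN):
    """Return True if the installed version is at least the fixed release."""
    return version_key(installed) >= version_key(minimum)


def installed_regex_version():
    """Return the installed regex distribution version, or None if absent."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_regex_version()
    if found is None:
        print("regex is not installed")
    else:
        print(found, "has the fix" if has_fix(found) else "is too old, please update")
```

The comparison is done on integer tuples rather than on the raw strings, so a version such as `2017.4.5` is still ordered correctly against `2017.04.23`.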
Background: I'm using a collection of regular expressions (english.lex) in a 3rd party Text2Speech engine to normalize the input text before speaking it.
For debugging purposes, I wrote the script below to see what impact my regex collection actually has on the input text.
My problem is that it performs a replacement for a regex that simply should not match the text.
I have 3 files:
regex_preview.py
#!/usr/bin/env python
import codecs
import regex as re

input="Text2Speach Regex Test.txt"
dictionary="english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()

with codecs.open(input, "r+", "utf16") as g:
    content = g.read().replace(r'\\\\\"','"')
    # apply all regular expressions to content
    for line in reg_exen:
        line=line.strip()
        # skip empty lines and comments
        if line == "" or line[0] == "#":
            pass
        else:
            # remove " from lines and split them into pattern and substitute
            pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
            substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')
            print("\n'%s' ==> '%s'" % (pattern, substitute))
            print(content.strip())
            content = re.sub(pattern, substitute, content)
            print(content.strip())
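The quote stripping above is the fiddliest part of the script, so to rule it out as the culprit, here is a rough sketch of a simpler way to split a lex line. It assumes every rule line has the form `"pattern" "substitute"` with `\"` as the only escape that needs undoing; `LEX_LINE` and `parse_lex_line` are names invented for this example, not part of the Text2Speech engine.

```python
import re

# Two double-quoted fields separated by whitespace. Inside a field, a
# backslash always consumes the following character, so \" does not end it.
LEX_LINE = re.compile(r'^"((?:[^"\\]|\\.)*)"\s+"((?:[^"\\]|\\.)*)"$')


def parse_lex_line(line):
    """Return (pattern, substitute) for a rule line, or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LEX_LINE.match(line)
    if match is None:
        raise ValueError("not a valid lex rule: %r" % line)
    # Undo the \" escaping; every other backslash sequence is kept verbatim
    # because it belongs to the regex or to the replacement template.
    pattern, substitute = (field.replace('\\"', '"') for field in match.groups())
    return pattern, substitute


if __name__ == "__main__":
    for raw in ['# punctuation normalization', '"(…|—)" "..."']:
        print(repr(raw), "->", parse_lex_line(raw))
```

Printing the parsed pair with `repr()` makes it easy to confirm that the pattern reaching `re.sub` is exactly the one written in the lex file.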
english.lex - utf16 encoded
# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."
# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O
Text2Speach Regex Test.txt - utf16 encoded
“Erm….yes. Thank you for that.”
Running the script produces this output with the last regex somehow matching the content:
'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."
'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."
'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."
What I tried so far:
I created this snippet to reproduce the issue:
#!/usr/bin/env python
import re
import codecs
content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)
print(content)
But this actually behaves as it should, so I'm at a loss as to what's happening here.
Hopefully, someone can point me in the right direction for further investigation...
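One difference between the two programs is that the main script does `import regex as re`, while the snippet imports the standard library `re`. The sketch below runs the same pattern through both engines and reports what each one matched, which might help narrow things down. It assumes `regex` may or may not be installed and skips it if the import fails; `load_engines` and `compare_engines` are names made up for this example.

```python
import importlib
import re


def load_engines():
    """Return the regex engines available here, keyed by module name."""
    engines = {"re": re}
    try:
        engines["regex"] = importlib.import_module("regex")
    except ImportError:
        pass  # third-party module not installed; compare with re only
    return engines


def compare_engines(pattern, substitute, text):
    """Run search() and sub() with every available engine and collect the results."""
    report = {}
    for name, engine in load_engines().items():
        match = engine.search(pattern, text)
        report[name] = {
            "matched": match.group(0) if match else None,
            "groups": match.groups() if match else None,
            "result": engine.sub(pattern, substitute, text),
        }
    return report


if __name__ == "__main__":
    content = u'"Erm....yes. Thank you for that."\n'
    pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
    for name, info in compare_engines(pattern, r"\1-\2", content).items():
        print(name, info)
```

If the two engines disagree on `matched` for the same pattern and text, the problem is more likely in one of the modules than in the lex file or the parsing code.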
abc....abc123 or abc-abc123, it cannot match Erm....yes.
r'text' style literal strings for the regex expressions. It's too easy to get confused with multiple backslashes and which backslash-plus-character combinations are escapes and which aren't. Try that, and if the problem continues it'll be easier for us to understand and make more suggestions.
english.lex or of the "reproduce snippet"? (I changed the parts in the "reproduce snippet" and it's still working, like before. The main script is still acting up though, like before...)