I'm trying to compile a large pattern with re.compile in Python 3.

The pattern is made up of 500 short words (I want to remove them from a text). The problem is that the compiled pattern seems to stop after about 18 words.

Python doesn't raise any error.

What I do is:

stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)

The stopstring is fine (it contains all the words), but the compiled pattern appears to be much shorter. It even stops in the middle of a word!

Is there a max length for the regex pattern?

  • Could you post a full working example program? This is impossible to reproduce right now. Commented May 13, 2015 at 17:49
  • I think you're confusing the string representation of stopword_pattern with the pattern it actually stores internally. Commented May 13, 2015 at 17:51
  • Making an alternation with 500 items is a very bad idea. Commented May 13, 2015 at 18:20
  • @Grief: The Python re module doesn't work the way you think. Most of the regex engines used in modern languages (Perl, Python, PHP, Java...) don't generate a DFA. The main reasons are to offer better control over how the regex engine searches the string, to reduce compilation time, and to provide features that are impossible (or don't make sense) with a DFA regex engine (backreferences, atomic grouping, non-greedy quantifiers, backtracking...). The downside of this choice is that these engines work in a more naive way and the search is slower in some situations. Commented Jul 12, 2016 at 11:41
  • @Grief: In particular, they don't work in parallel. To speed up the search, some of them have an optimization phase before the normal walk of the engine (i.e. character by character for the string, token by token for the pattern), called "transmission" by J. Friedl, in which, for instance, the positions of literal strings from the pattern are first located in the string with a fast algorithm. This isn't always possible, though, and I doubt that the re module has many of these features. However, regex engines that produce a DFA still exist and are used by lex, MySQL, egrep... Commented Jul 12, 2016 at 12:00

1 Answer

Consider this example:

import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stop_string = "|".join(stop_list)
stop_word_pattern = re.compile(stop_string)

If you try to print the pattern, you'll see something like:

>>> print(stop_word_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)

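which seems to indicate that the pattern is incomplete. However, this is only a limitation of how compiled pattern objects are displayed: the repr shows roughly the first 200 characters of the pattern string and cuts the rest off, which is also why it can stop in the middle of a word. The compiled pattern itself is complete. If you try to match against the "missing" part of the pattern, you'll see that it still succeeds: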

>>> stop_word_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>
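
To check this more thoroughly than with a single value, here is a minimal sketch that tries every alternative in turn. It rebuilds the same pattern as the example; the names unmatched and outside are only illustrative.

```python
import re

# Same construction as the example: "\b1000\b|\b1001\b|...|\b1999\b".
stop_string = "|".join(r"\b" + str(n) + r"\b" for n in range(1000, 2000))
stop_word_pattern = re.compile(stop_string)

# Collect every number in the range that the compiled pattern fails to match.
# If the pattern were really cut off after ~18 words, this list would be long.
unmatched = [n for n in range(1000, 2000)
             if stop_word_pattern.fullmatch(str(n)) is None]
print(unmatched)

# A value that is not one of the alternatives is rejected.
outside = stop_word_pattern.fullmatch("2000")
print(outside)
```

An empty unmatched list means all 1000 alternatives were compiled, including the ones that never show up when the pattern object is printed.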

As explained in the comments, you can retrieve the complete pattern string from the .pattern attribute, e.g.:

stop_word_pattern.pattern
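
As a rough illustration of the difference between the two, the sketch below compares the string stored in .pattern with the printed form. It assumes a Python version whose repr shortens long patterns (the thread reports this for 3.4 and 3.9), so the exact repr length may differ on other versions.

```python
import re

stop_string = "|".join(r"\b" + str(n) + r"\b" for n in range(1000, 2000))
stop_word_pattern = re.compile(stop_string)

# .pattern holds exactly the string that was passed to re.compile.
same_string = stop_word_pattern.pattern == stop_string
print(same_string)

# 1000 alternatives of 8 characters each, plus 999 "|" separators.
full_length = len(stop_word_pattern.pattern)
print(full_length)

# The printed form is far shorter on versions that truncate the repr.
shown_length = len(repr(stop_word_pattern))
print(shown_length)
```

So when debugging a long alternation, inspect .pattern (or the original string) rather than the printed pattern object.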

3 Comments

stopword_pattern.pattern should contain the complete string he's looking for. (This is in Python 2.6 and 3.1, where compiled regexes don't appear to have custom __str__ or __repr__ methods. It may have changed since then.)
Good to know. The above was tested in 3.4; I can confirm that in 2.6, at least, object.__repr__ is used to output a generic instance string.
@mquantin (and others) -- fyi: running in 3.9 a ~1000 chars long alternation pattern is getting badly cut off when printed (print(pat)). With .pattern it comes out whole :)
