Python regex pattern max length in re.compile?

Question

I am trying to compile a large pattern with re.compile in Python 3.

The pattern is built from 500 short words (I want to remove them from a text). The problem is that the pattern appears to be cut off after about 18 words.

Python doesn't raise any error.

What I do is:

import re

# stoplist is the existing list of ~500 words; wrap each in word boundaries
stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)

The stopstring is fine (all the words are in it), but the compiled pattern looks much shorter. It even stops in the middle of a word!

Is there a maximum length for a regex pattern?
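
For reference, here is a minimal runnable sketch of the setup described above. The four-word stoplist and the sample text are hypothetical stand-ins for the real data. It also swaps in two changes that are not in the original snippet: re.escape on each word, and a single non-capturing group sharing one pair of \b anchors instead of a pair per word.

```python
import re

# Hypothetical stand-in for the real 500-word list.
stoplist = ["the", "a", "of", "and"]

# re.escape guards against words that contain regex metacharacters;
# one shared pair of \b anchors wraps the whole alternation.
stopstring = r"\b(?:" + "|".join(map(re.escape, stoplist)) + r")\b"
stopword_pattern = re.compile(stopstring)

text = "the cat and a dog of war"  # hypothetical sample text
cleaned = stopword_pattern.sub("", text)

# Collapse the whitespace left behind by the removed words.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

The compiled object keeps the full source string in its pattern attribute, so stopword_pattern.pattern can be compared against stopstring to check that nothing was lost.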

Could you post a full working example program? This is impossible to reproduce right now.
– orlp, Commented May 13, 2015 at 17:49
I think you're confusing the string representation of stopword_pattern with the pattern it actually stores internally.
– chepner, Commented May 13, 2015 at 17:51
@Grief: The Python re module doesn't work the way you think. Most of the regex engines used in modern languages (Perl, Python, PHP, Java...) don't generate a DFA. The main reasons are to give better control over how the regex engine searches the string, to reduce compilation time, and to provide features that are impossible (or make no sense) with a DFA regex engine (backreferences, atomic grouping, non-greedy quantifiers, backtracking...). The trade-off is that these engines work in a more naive way and the search is slower in some situations.
– Casimir et Hippolyte, Commented Jul 12, 2016 at 11:41
@Grief: In particular, they don't work in parallel. To speed up the search, some of them have an optimization phase before the normal walk of the engine (i.e. character by character for the string, token by token for the pattern), called "transmission" by J. Friedl, where, for instance, the positions of the pattern's literal strings are first located in the string with a fast algorithm. This isn't always possible, though, and I doubt that the re module has many of these features. However, regex engines that produce a DFA still exist and are used by lex, MySQL, egrep...
– Casimir et Hippolyte, Commented Jul 12, 2016 at 12:00
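
To make the last two comments more concrete, here is a small sketch of behaviours that come from re being a backtracking engine. The sample strings are made up for illustration. Alternatives are tried left to right, so the first one that succeeds wins even when a longer one would also match, whereas an engine with leftmost-longest semantics would typically report the longer match. Backreferences and non-greedy quantifiers are among the features the comment lists as unavailable in a pure DFA engine.

```python
import re

# Ordered alternation: "a" is tried first and succeeds, so "ab" is never tried.
first = re.match(r"a|ab", "ab").group()

# Backreference: \1 must repeat whatever group 1 captured (a doubled word).
repeated = re.search(r"\b(\w+) \1\b", "it is is fine")

# Non-greedy quantifier: .+? stops at the first closing bracket.
lazy = re.match(r"<.+?>", "<a><b>").group()

print(first, repeated.group(1), lazy)
```

One practical consequence for a long stopword alternation is that the order of the words can matter when one alternative is a prefix of another, although the surrounding \b anchors force the engine to backtrack to the full word in the question's pattern.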

ASCIImo · Accepted Answer · 2023-12-09 19:43:09Z

12

Consider this example:

import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stop_string = "|".join(stop_list)
stopword_pattern = re.compile(stop_string)

If you try to print the pattern, you'll see something like:

>>> print(stopword_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)

which seems to indicate that the pattern is incomplete. However, this is only a limitation of the __repr__ of compiled pattern objects: the displayed text is truncated (here, after roughly the first 200 characters of the pattern string's repr), while the pattern itself is stored in full. If you try to match against the "missing" part of the pattern, you'll see that it still succeeds:

>>> stopword_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>

As explained in the comments, you can retrieve the complete pattern string through the .pattern attribute, e.g.:

stopword_pattern.pattern
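
A minimal sketch that checks this on the same example. It assumes Python 3.4 or later, where compiled patterns have a custom repr; on older versions the repr is a generic instance string and the length comparison is not meaningful.

```python
import re

stop_list = ["\\b" + str(n) + "\\b" for n in range(1000, 2000)]
stop_string = "|".join(stop_list)
stopword_pattern = re.compile(stop_string)

# What print() shows: a shortened repr, not the stored pattern.
shown = repr(stopword_pattern)

# What the object actually stores: the full source string.
stored = stopword_pattern.pattern

print(len(shown), len(stored))
print(stored == stop_string)

# The last alternative is far beyond the truncated repr, yet it still matches.
print(stopword_pattern.match("1999") is not None)
```

Comparing stored against stop_string is a quick way to confirm that nothing was dropped at compile time.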

answered May 13, 2015 at 17:56 by chepner
edited Dec 9, 2023 at 19:43 by ASCIImo


3 Comments

Kevin J. Chase Over a year ago

stopword_pattern.pattern should contain the complete string he's looking for. (This is in Python 2.6 and 3.1, where compiled regexes don't appear to have custom __str__ or __repr__ methods. It may have changed since then.)

chepner Over a year ago

Good to know. The above was tested in 3.4; I can confirm that in 2.6, at least, object.__repr__ is used to output a generic instance string.

zdim Over a year ago

@mquantin (and others) -- fyi: running in 3.9 a ~1000 chars long alternation pattern is getting badly cut off when printed (print(pat)). With .pattern it comes out whole :)
