regular expression code size limit exceeded python [duplicate]

Question

I'm using a dict file and regular expressions to change some words in a script, but have now come across this error:

Exception caught in plugin < class 'pagerprinter.plugins.tts.TTS' >
regular expression code size limit exceeded

My dict has some 5300 entries, set out as:

'SE': 'South East',
'NE': 'North East',

You get the idea: changing abbreviations to full words. On average, 6 to 8 abbreviations are changed per message.

For this I'm using:

import re
from abbreviations import abbreviations  # my dict

# One alternation containing every key in the dict
pattern = re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')
msg = pattern.sub(lambda x: abbreviations[x.group()], msg)

I also use 4 more regexes for other tasks, like removing words and numbers from a number of strings.

What is the cause of the error I get? If I remove my dict it works, and if I have 300 entries it works.

Looking into it on Google, most people say that there are no limits on dict sizes.

I tried to reproduce your error using a 99,000 element dict (based on a list of English words), but the code worked fine. A more complete example would help, but that's tricky given the 5000-entry dictionary.
– Rory Yorke, Commented Oct 11, 2015 at 10:01
The limit is on the length of regular expressions, if I'm not mistaken. Just go through the dictionary in smaller chunks and do the replacements for each of them.
– L3viathan, Commented Oct 11, 2015 at 10:22
@Rory Yorke the dict can be downloaded from GitHub if required.
– shaggs, Commented Oct 11, 2015 at 10:27
I'm not quite sure, but I think there's simply a size limit for regular expressions.
– L3viathan, Commented Oct 11, 2015 at 10:28

Sjuul Janssen · Accepted Answer · 2015-10-11 12:12:03Z

2

As L3viathan mentions, you're building a regex pattern that is too long. This line is your problem:

re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')

The longer your abbreviations dict grows, the longer the regex pattern grows. You'll have to either split the work across two or more regexes or use another solution.
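One possible "other solution", sketched here as an assumption rather than anything from the question, is to stop putting the keys into the pattern at all: match every word with a small fixed regex and look each match up in the dict. The `abbreviations` dict below is a hypothetical three-entry stand-in for the real one, and `expand` is an illustrative name. This only works if every key is a single run of word characters (no spaces or punctuation inside a key).

```python
import re

# Hypothetical stand-in for the real 5300-entry abbreviations dict.
abbreviations = {'SE': 'South East', 'NE': 'North East', 'RD': 'Road'}

# A small, fixed pattern: its size no longer depends on the dict.
word = re.compile(r'\b\w+\b')

def expand(msg):
    # Replace a word only if it is a key; otherwise leave it unchanged.
    return word.sub(lambda m: abbreviations.get(m.group(), m.group()), msg)

print(expand('Fire on Main RD, SE corner'))
```

Because the lookup is a plain dict access, the dict can grow without changing the compiled pattern.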

Edit, to answer a question in the comments below: you could do it like this:

from abbreviations import dct1, dct2, dct3
import re

for dct in (dct1, dct2, dct3):
    pattern = re.compile(r'\b(' + '|'.join(dct.keys()) + r')\b')
    msg = pattern.sub(lambda x: dct[x.group()], msg)

where dct1, dct2, and dct3 are your categories.
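If the dict has no natural categories, the same loop could run over fixed-size chunks instead. The following is a minimal sketch under assumptions: `chunks` and `expand_chunked` are hypothetical helper names, and the default chunk size of 300 is taken only from the question's remark that 300 entries worked. `re.escape` is added in case a key contains regex metacharacters.

```python
import re
from itertools import islice

def chunks(dct, size):
    # Yield successive sub-dicts with at most `size` entries each.
    it = iter(dct.items())
    while True:
        part = dict(islice(it, size))
        if not part:
            return
        yield part

def expand_chunked(msg, dct, size=300):
    # size=300 is an assumption based on the question, not a known limit.
    for part in chunks(dct, size):
        # Escape each key so it is matched literally.
        pattern = re.compile(r'\b(' + '|'.join(map(re.escape, part)) + r')\b')
        msg = pattern.sub(lambda m, part=part: part[m.group()], msg)
    return msg
```

One caveat with any chunked approach: text inserted by an earlier chunk is scanned again by later chunks, so a replacement that happens to contain another key may be expanded a second time.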

edited Oct 11, 2015 at 12:12

answered Oct 11, 2015 at 11:31


4 Comments

shaggs Over a year ago

OK, so I moved the code above to one part of the script to find only 3 things in the list, and I still got the error?

shaggs Over a year ago

Is it possible to split the dict up, and say look for road-use = {'RD': 'Road'}, Directions = {'NE': 'North East'}?

Sjuul Janssen Over a year ago

I'm guessing you don't have any context by which you can split the dict into the categories you suggest. You will either have to do that manually or split the dict into chunks.

shaggs Over a year ago

Doing it manually isn't hard, as it's already in "sections" marked with #. So if I changed it, how would I accomplish it that way?
