I'm using a dict file and regular expressions to replace some words in a script, but I have now run into this error:

Exception caught in plugin < class 'pagerprinter.plugins.tts.TTS' >
regular expression code size limit exceeded

My dict has about 5300 entries, laid out like this:

'SE': 'South East',
'NE': 'North East',

You get the idea: it expands abbreviations into full words. On average, 6 to 8 abbreviations are replaced per message.

For this I'm using:

import re
from abbreviations import abbreviations  # my dict

pattern = re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')
msg = pattern.sub(lambda x: abbreviations[x.group()], msg)

I also use four more regexes for other tasks, such as removing words and numbers from a number of strings.

What causes this error? If I remove my dict it works, and with 300 entries it also works.

From what I found on Google, most people say there is no limit on dict size.

  • I tried to reproduce your error using a 99,000 element dict (based on a list of English words), but the code worked fine. A more complete example would help, but that's tricky given the 5000-entry dictionary. Commented Oct 11, 2015 at 10:01
  • The limit is on the length of regular expressions, if I'm not mistaken. Just go through the dictionary in smaller chunks and do the replacements for each of them. Commented Oct 11, 2015 at 10:22
  • How do you mean length? As in code in one line? Commented Oct 11, 2015 at 10:26
  • @Roy Yorke the dict can be downloaded from GitHub if required. Commented Oct 11, 2015 at 10:27
  • I'm not quite sure, but I think there's simply a size limit for regular expressions. Commented Oct 11, 2015 at 10:28

1 Answer

Just as L3viathan mentions, you're building a regex pattern that is too long. This line is your problem:

re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')

The longer your abbreviations dict grows, the longer the regex pattern grows. You'll have to either split it into two or more smaller regexes or use another approach.
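
One possible "other approach", sketched here under the assumption that every key is a single word made only of letters, digits or underscores (as 'SE' and 'NE' are): match any word with one small fixed pattern and look it up in the dict, so the pattern no longer grows with the dict. The name expand_by_lookup and the sample dict are made up for this illustration.

```python
import re

# One small, fixed pattern that matches any whole word.
WORD = re.compile(r'\b\w+\b')


def expand_by_lookup(msg, abbreviations):
    """Replace each word that is a key in `abbreviations`; leave the rest."""
    # dict.get falls back to the matched word itself when it is not a key.
    return WORD.sub(lambda m: abbreviations.get(m.group(), m.group()), msg)


if __name__ == '__main__':
    sample = {'SE': 'South East', 'NE': 'North East'}  # hypothetical subset
    print(expand_by_lookup('Head NE then SE', sample))
```

This would not handle keys containing spaces or punctuation; those would still need their own pattern.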

Edit, to answer a question below: you could do it like this:

from abbreviations import dct1, dct2, dct3
import re

for dct in (dct1, dct2, dct3):
    pattern = re.compile(r'\b(' + '|'.join(dct.keys()) + r')\b')
    msg = pattern.sub(lambda x: dct[x.group()], msg)

where dct1, dct2 and dct3 are your categories.
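
If the dict has no natural categories, a rough sketch of splitting it into chunks automatically could look like the following. The helper names and the chunk size of 500 are assumptions and have not been tried against the real 5300-entry dict; pick a size small enough that each pattern compiles. The keys are passed through re.escape in case any of them contain regex metacharacters.

```python
import re
from itertools import islice


def chunked(mapping, size):
    """Yield successive sub-dicts of `mapping` with at most `size` items."""
    items = iter(mapping.items())
    while True:
        chunk = dict(islice(items, size))
        if not chunk:
            return
        yield chunk


def expand_in_chunks(msg, abbreviations, chunk_size=500):
    """Apply the replacements one chunk at a time to keep each regex small."""
    for chunk in chunked(abbreviations, chunk_size):
        pattern = re.compile(
            r'\b(' + '|'.join(re.escape(key) for key in chunk) + r')\b'
        )
        # The lambda is used immediately, so it sees the current chunk.
        msg = pattern.sub(lambda m: chunk[m.group()], msg)
    return msg
```

One thing to watch for: text inserted by an earlier chunk can be matched again by a later chunk if it happens to contain a key, so the result can depend on how the dict is split.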

4 Comments

OK, so I moved the code above to one part of the script to find only 3 things in the list, and I still got the error?
Is it possible to split the dict up, and say look for road-use = {'RD': 'Road'} and Directions = {'NE': 'North East'}?
I'm guessing you don't have any context by which you can split the dict into the categories you suggest. You will either have to do that manually or split the dict into chunks.
Doing it manually isn't hard, as the dict is already in "sections" marked with #. If I changed it, how would I go about that?