I have an application that scans through several GB of data, sometimes up to 10 GB, looking for IPv4 addresses, IPv6 addresses and hostnames. Up until now I have used 3 regex patterns that I match against every line of text in the dataset. This works fine for small amounts of data, but the time needed for a big set is extremely large.
If I find a match I replace it with a value from a dict. I have already optimized this part by creating a database and by processing the files in parallel using the multiprocessing module. At this point I am looking at optimizing the actual search/replace functionality.
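Simplified, the parallel part of the pipeline looks roughly like this (the glob path, the output naming and the globals are placeholders, not my exact code; replace_line and the generator are shown further down):

import glob
import multiprocessing

def process_file(path):
    # each worker rewrites one dump line by line (replace_line is shown below)
    with open(path) as src, open(path + '.out', 'w') as dst:
        for line in src:
            dst.write(replace_line(line, generator))

if __name__ == '__main__':
    dumps = glob.glob('/data/dumps/*.log')  # placeholder path
    pool = multiprocessing.Pool()           # defaults to one worker per core
    pool.map(process_file, dumps)
    pool.close()
    pool.join()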
I've done some research and found potential solutions such as using a trie or the Rabin-Karp algorithm for detection, along with other search algorithms. My problem with these methods is that they expect a predefined list of patterns to look for, and in my case I cannot store all possible IPv4 and IPv6 addresses in memory, as there are far too many combinations.
My patterns:
ipv4_pattern = (
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
IPV4SEG = r'(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])'
IPV4ADDR = r'(?:(?:' + IPV4SEG + r'\.){3,3}' + IPV4SEG + r')'
IPV6SEG = r'(?:(?:[0-9a-fA-F]){1,4})'
IPV6GROUPS = (
    r'(?:' + IPV6SEG + r':){7,7}' + IPV6SEG,                  # 1:2:3:4:5:6:7:8
    r'(?:' + IPV6SEG + r':){1,7}:',                           # 1::  ...  1:2:3:4:5:6:7::
    r'(?:' + IPV6SEG + r':){1,6}:' + IPV6SEG,                 # 1::8  ...  1:2:3:4:5:6::8
    r'(?:' + IPV6SEG + r':){1,5}(?::' + IPV6SEG + r'){1,2}',  # 1::7:8  ...  1:2:3:4:5::7:8
    r'(?:' + IPV6SEG + r':){1,4}(?::' + IPV6SEG + r'){1,3}',  # 1::6:7:8  ...  1:2:3:4::6:7:8
    r'(?:' + IPV6SEG + r':){1,3}(?::' + IPV6SEG + r'){1,4}',  # 1::5:6:7:8  ...  1:2:3::5:6:7:8
    r'(?:' + IPV6SEG + r':){1,2}(?::' + IPV6SEG + r'){1,5}',  # 1::4:5:6:7:8  ...  1:2::4:5:6:7:8
    IPV6SEG + r':(?:(?::' + IPV6SEG + r'){1,6})',             # 1::3:4:5:6:7:8
    r':(?:(?::' + IPV6SEG + r'){1,7}|:)',                     # ::2:3:4:5:6:7:8  ::8  ::
    r'fe80:(?::' + IPV6SEG + r'){0,4}%[0-9a-zA-Z]{1,}',       # link-local with zone index, e.g. fe80::7:8%eth0
    r'::(?:ffff(?::0{1,4}){0,1}:){0,1}[^\s:]' + IPV4ADDR,     # IPv4-mapped, e.g. ::ffff:255.255.255.255
    r'(?:' + IPV6SEG + r':){1,4}:[^\s:]' + IPV4ADDR,          # IPv4-embedded, e.g. 2001:db8:3:4::192.0.2.33
)
IPV6ADDR = '|'.join(['(?:{})'.format(g) for g in IPV6GROUPS[::-1]])  # reverse rows for greedy match
ipv6_pattern = IPV6ADDR
hostname_pattern = r'(([a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63})'
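As a quick sanity check, all three patterns behave as expected on a made-up log line (the sample line is invented):

import re

sample = '2019-01-02 host1.example.com 192.168.0.1 -> 2001:db8::1 OK'
print(re.search(ipv4_pattern, sample).group(0))      # 192.168.0.1
print(re.search(ipv6_pattern, sample).group(0))      # 2001:db8::1
print(re.search(hostname_pattern, sample).group(0))  # host1.example.com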
For replacing I use:
def replace_line(line, generator):
    new_line = re.sub(ipv4_pattern, generator.ipvf, line)
    new_line = re.sub(ipv6_pattern, generator.ivps, new_line)
    new_line = re.sub(hostname_pattern, generator.hn, new_line)
    return new_line
The generator's methods look up the appropriate substitute for each match in the dictionary.
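Conceptually the generator does something like this (a simplified sketch of the lookup, not my exact class):

class Generator(object):
    def __init__(self, mapping):
        # mapping of original address/hostname -> anonymized substitute
        self.mapping = mapping

    def ipvf(self, match):
        # re.sub calls this with the match object for every IPv4 hit;
        # fall back to the original text if no substitute is known
        return self.mapping.get(match.group(0), match.group(0))

    # ivps (IPv6) and hn (hostnames) work the same way on their own mappings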
My input consists of log dumps containing raw, line-by-line text logs from various programs.
At this point it is quite clear to me that the regex search is the bottleneck of my program, and that every additional pattern I add to the list increases the runtime by a significant amount.
Could I somehow integrate the regex patterns into a more optimized search algorithm, or is there simply no way to optimize how I search for and replace these strings?
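The only concrete idea I have come up with so far is to collapse everything into one compiled alternation and dispatch on whichever named group matched, so that each line is scanned once instead of three times. Roughly (the group names are just for illustration):

import re

combined_pattern = re.compile(
    '(?P<ipv4>{})|(?P<ipv6>{})|(?P<host>{})'.format(
        ipv4_pattern, ipv6_pattern, hostname_pattern))

def replace_line_single_pass(line, generator):
    def dispatch(match):
        # hand the match to whichever handler's group actually matched
        if match.group('ipv4') is not None:
            return generator.ipvf(match)
        if match.group('ipv6') is not None:
            return generator.ivps(match)
        return generator.hn(match)
    return combined_pattern.sub(dispatch, line)

I realize this changes the semantics slightly where the patterns overlap (e.g. IPv4 literals embedded in IPv6 addresses), since only one replacement applies per match, but I don't know whether a single combined pass is actually cheaper or just the same work in one expression.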
It should also be mentioned that I am limited to Python 2.7 and cannot use external tools such as databases.
Thanks in advance