I have an application that scans through several GB of data, sometimes up to 10 GB, looking for IPv4 addresses, IPv6 addresses and hostnames. Up until this point I have used 3 regex patterns that I match against every line of text in the dataset. This works fine when I have a small amount of data to go through, but the amount of time needed for a big set is extremely large.

If I find a match, I replace it with a value from a dict. I have optimized this by creating a database and by processing the files in parallel using the multiprocess module. At this point I am looking at optimizing the actual search/replace functionality.
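Roughly, the parallel layout looks like this (a sketch only; process_file, the output naming and the file list are placeholders, and replace_line/generator are shown further down):

import multiprocessing

def process_file(path):
    # Scrub one log file line by line; workers inherit replace_line and
    # the generator via fork (POSIX). The output naming is a placeholder.
    with open(path) as src, open(path + '.scrubbed', 'w') as dst:
        for line in src:
            dst.write(replace_line(line, generator))

if __name__ == '__main__':
    pool = multiprocessing.Pool()                  # one worker per core
    pool.map(process_file, ['a.log', 'b.log'])     # placeholder file list
    pool.close()
    pool.join()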

I've done some research and found potential solutions such as a trie, the Rabin-Karp algorithm and other multi-pattern search algorithms. My problem with these methods is that they expect a predefined list of search strings, and in my case I cannot store all possible IPv4 and IPv6 addresses in memory; there are simply too many possibilities.

My patterns:

import re

ipv4_pattern = (
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')


IPV4SEG = r'(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])'
IPV4ADDR = r'(?:(?:' + IPV4SEG + r'\.){3,3}' + IPV4SEG + r')'
IPV6SEG = r'(?:(?:[0-9a-fA-F]){1,4})'
IPV6GROUPS = (
    r'(?:' + IPV6SEG + r':){7,7}' + IPV6SEG,                  # 1:2:3:4:5:6:7:8
    r'(?:' + IPV6SEG + r':){1,7}:',                           # 1::          1:2:3:4:5:6:7::
    r'(?:' + IPV6SEG + r':){1,6}:' + IPV6SEG,                 # 1::8         1:2:3:4:5:6::8
    r'(?:' + IPV6SEG + r':){1,5}(?::' + IPV6SEG + r'){1,2}',  # 1::7:8       1:2:3:4:5::7:8
    r'(?:' + IPV6SEG + r':){1,4}(?::' + IPV6SEG + r'){1,3}',  # 1::6:7:8     1:2:3:4::6:7:8
    r'(?:' + IPV6SEG + r':){1,3}(?::' + IPV6SEG + r'){1,4}',  # 1::5:6:7:8   1:2:3::5:6:7:8
    r'(?:' + IPV6SEG + r':){1,2}(?::' + IPV6SEG + r'){1,5}',  # 1::4:5:6:7:8 1:2::4:5:6:7:8
    IPV6SEG + r':(?:(?::' + IPV6SEG + r'){1,6})',             # 1::3:4:5:6:7:8
    r':(?:(?::' + IPV6SEG + r'){1,7}|:)',                     # ::2:3:4:5:6:7:8  ::
    r'fe80:(?::' + IPV6SEG + r'){0,4}%[0-9a-zA-Z]{1,}',       # fe80::7:8%eth0 (link-local with zone index)
    r'::(?:ffff(?::0{1,4}){0,1}:){0,1}' + IPV4ADDR,           # ::255.255.255.255  ::ffff:255.255.255.255
    r'(?:' + IPV6SEG + r':){1,4}:' + IPV4ADDR,                # 2001:db8:3:4::192.0.2.33
)
IPV6ADDR = '|'.join(['(?:{})'.format(g) for g in IPV6GROUPS[::-1]])  # reversed so the alternatives with an embedded IPv4 part are tried first
ipv6_pattern = IPV6ADDR

hostname_pattern = r'(([a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63})'

For replacing I use:

def replace_line(line, generator):
    new_line = re.sub(ipv4_pattern, generator.ipvf, line)
    new_line = re.sub(ipv6_pattern, generator.ivps, new_line)
    new_line = re.sub(hostname_pattern, generator.hn, new_line)
    return new_line

The generator looks up the appropriate substitute in the dictionary.
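The generator itself is essentially a dict-backed lookup. A minimal sketch of what it might look like (the lookup/creation logic here is an assumption; only the method names mirror the code above):

import itertools

class Generator(object):
    def __init__(self):
        self.mapping = {}                  # original value -> substitute
        self.counter = itertools.count(1)

    def _lookup(self, match, prefix):
        key = match.group(0)
        if key not in self.mapping:        # first sighting: mint a substitute
            self.mapping[key] = '%s_%d' % (prefix, next(self.counter))
        return self.mapping[key]

    # Callbacks passed to re.sub; each receives a MatchObject.
    def ipvf(self, match):
        return self._lookup(match, 'ipv4')

    def ivps(self, match):
        return self._lookup(match, 'ipv6')

    def hn(self, match):
        return self._lookup(match, 'host')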

My input is log dumps that contain line-by-line raw text logs from various programs.

Currently it is quite clear to me that the regex search is the bottleneck of my program, and that every additional pattern I add to the list increases the runtime by a significant amount.

Could I somehow integrate a regex pattern into a more optimized search algorithm or is there no way for me to optimize how I look for and replace strings?

It should also be mentioned that I am limited to Python 2.7 and cannot use external tools such as databases.

Thanks in advance

  • Can you provide examples, so we know exactly what kind of patterns, what data format and what kind of values you replace back, etc.? Commented May 14, 2020 at 8:54
  • I have added the code snippets that contain my patterns. Commented May 14, 2020 at 9:14

2 Answers


The first step to take is to use the re.compile() function.

I have had similar issues with regular expression matches taking up most of the run time of my code. Then I found that you can pre-compile regular expressions, which saves a lot of run time. This is probably especially true in your case, since your regular expressions are long. I would replace your last bit of code with something like:

ipv4_re = re.compile(ipv4_pattern)
ipv6_re = re.compile(ipv6_pattern)
host_re = re.compile(hostname_pattern)

def replace_line(line, generator):
    new_line = ipv4_re.sub(generator.ipvf, line)
    new_line = ipv6_re.sub(generator.ivps, new_line)
    new_line = host_re.sub(generator.hn, new_line)
    return new_line

In some of my cases, where regex matches were dominant, this saved an order of magnitude in run-time. It seems like your case is similar.
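One way to check whether this applies to your own data is a quick timeit comparison (a sketch; the sample line is made up):

import re
import timeit

sample = 'May 14 08:54:01 host connect from 192.168.0.1 to example.com'
ipv4_re = re.compile(ipv4_pattern)

# re.sub with a string pattern performs a cache lookup on every call;
# the precompiled object's sub method skips that step.
print(timeit.timeit(lambda: re.sub(ipv4_pattern, 'x', sample), number=100000))
print(timeit.timeit(lambda: ipv4_re.sub('x', sample), number=100000))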


3 Comments

Sadly not the case here. I already tried a solution using patterns = [ipv4_pattern, ipv6_pattern, hostname_pattern]; pattern = re.compile("|".join(patterns)) (sketched after these comments).
Did you also replace the calls to re.sub with calls to the sub method of the precompiled objects? This worked so well for me, so it's quite surprising. It could be that Python is already caching the compiled expressions, in which case this optimization has already been done for you. Did you try testing the loop with no-op code instead of the regex matching? Could it be that the problem is not the regex matching, but simply the large I/O loop? Did you try running a profiler?
You're right! I used the compiled pattern's sub method and it increased the speed by at least 2x! I had just compiled the pattern and passed it to the re.sub function before. Thanks!
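For reference, the combined single-pass variant from these comments looks roughly like this (a sketch; the named groups and the dispatch function are not from the question):

import re

# Wrap each pattern in a named group so one callback can tell which hit.
# IPv6 is tried before IPv4 so that embedded IPv4 tails are not split off.
combined = re.compile('(?P<ip6>%s)|(?P<ip4>%s)|(?P<host>%s)'
                      % (ipv6_pattern, ipv4_pattern, hostname_pattern))

def replace_line(line, generator):
    def dispatch(match):
        if match.group('ip6'):
            return generator.ivps(match)
        if match.group('ip4'):
            return generator.ipvf(match)
        return generator.hn(match)
    return combined.sub(dispatch, line)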

OK, you will certainly have to read all the data to look for matches; you can't avoid that. You can use divide-and-conquer, splitting the work across multiple threads/processes/machines, to speed things up. I believe the regex library for Python is optimized, but you could check whether skipping all the lines without digits or dots (characters that need to be present to match an IPv4 address) and applying the regexes only to the remaining lines speeds things up.
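A sketch of that prefilter idea (how much it helps depends on how many lines it can actually skip):

def replace_line_filtered(line, generator):
    # Every IPv4 address and hostname contains a dot, and every IPv6
    # address contains a colon; lines with neither cannot match at all.
    if '.' not in line and ':' not in line:
        return line
    return replace_line(line, generator)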

