I have an application that scans through several GB of data, sometimes up to 10 GB, looking for IPv4 addresses, IPv6 addresses and hostnames. Up until now I have used 3 regex patterns that I match against every line of text in the dataset. This works fine for small amounts of data, but the time needed for a big set is extremely large.
If I find a match I replace it with a value from a dict. I have already optimized this part by creating a database and by processing the files in parallel using the multiprocessing module. At this point I am looking at optimizing the actual search/replace functionality.
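Simplified, the parallel part of the pipeline looks roughly like this (the glob path, the output naming and the globals are placeholders, not my exact code; replace_line and the generator are shown further down):

import glob
import multiprocessing

def process_file(path):
    # each worker rewrites one dump line by line (replace_line is shown below)
    with open(path) as src, open(path + '.out', 'w') as dst:
        for line in src:
            dst.write(replace_line(line, generator))

if __name__ == '__main__':
    dumps = glob.glob('/data/dumps/*.log')  # placeholder path
    pool = multiprocessing.Pool()           # defaults to one worker per core
    pool.map(process_file, dumps)
    pool.close()
    pool.join()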
I've done some research and found potential solutions such as using a trie or the Rabin-Karp algorithm for detection, along with other search algorithms. My problem with these methods is that they expect a predefined list of patterns to look for, and in my case I cannot store all possible IPv4 and IPv6 addresses in memory, as there are far too many combinations.
My patterns:
ipv4_pattern = (
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
IPV4SEG = r'(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])'
IPV4ADDR = r'(?:(?:' + IPV4SEG + r'\.){3,3}' + IPV4SEG + r')'
IPV6SEG = r'(?:(?:[0-9a-fA-F]){1,4})'
IPV6GROUPS = (
    r'(?:' + IPV6SEG + r':){7,7}' + IPV6SEG,                  # 1:2:3:4:5:6:7:8
    r'(?:' + IPV6SEG + r':){1,7}:',                           # 1::  ...  1:2:3:4:5:6:7::
    r'(?:' + IPV6SEG + r':){1,6}:' + IPV6SEG,                 # 1::8  ...  1:2:3:4:5:6::8
    r'(?:' + IPV6SEG + r':){1,5}(?::' + IPV6SEG + r'){1,2}',  # 1::7:8  ...  1:2:3:4:5::7:8
    r'(?:' + IPV6SEG + r':){1,4}(?::' + IPV6SEG + r'){1,3}',  # 1::6:7:8  ...  1:2:3:4::6:7:8
    r'(?:' + IPV6SEG + r':){1,3}(?::' + IPV6SEG + r'){1,4}',  # 1::5:6:7:8  ...  1:2:3::5:6:7:8
    r'(?:' + IPV6SEG + r':){1,2}(?::' + IPV6SEG + r'){1,5}',  # 1::4:5:6:7:8  ...  1:2::4:5:6:7:8
    IPV6SEG + r':(?:(?::' + IPV6SEG + r'){1,6})',             # 1::3:4:5:6:7:8
    r':(?:(?::' + IPV6SEG + r'){1,7}|:)',                     # ::2:3:4:5:6:7:8  ::8  ::
    r'fe80:(?::' + IPV6SEG + r'){0,4}%[0-9a-zA-Z]{1,}',       # link-local with zone index, e.g. fe80::7:8%eth0
    r'::(?:ffff(?::0{1,4}){0,1}:){0,1}[^\s:]' + IPV4ADDR,     # IPv4-mapped, e.g. ::ffff:255.255.255.255
    r'(?:' + IPV6SEG + r':){1,4}:[^\s:]' + IPV4ADDR,          # IPv4-embedded, e.g. 2001:db8:3:4::192.0.2.33
)
IPV6ADDR = '|'.join(['(?:{})'.format(g) for g in IPV6GROUPS[::-1]])  # reverse rows for greedy match
ipv6_pattern = IPV6ADDR
hostname_pattern = r'(([a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63})'
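As a quick sanity check, all three patterns behave as expected on a made-up log line (the sample line is invented):

import re

sample = '2019-01-02 host1.example.com 192.168.0.1 -> 2001:db8::1 OK'
print(re.search(ipv4_pattern, sample).group(0))      # 192.168.0.1
print(re.search(ipv6_pattern, sample).group(0))      # 2001:db8::1
print(re.search(hostname_pattern, sample).group(0))  # host1.example.com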
For replacing I use:
def replace_line(line, generator):
    new_line = re.sub(ipv4_pattern, generator.ipvf, line)
    new_line = re.sub(ipv6_pattern, generator.ivps, new_line)
    new_line = re.sub(hostname_pattern, generator.hn, new_line)
    return new_line
The generator's methods look up the appropriate substitute for each match in the dictionary.
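Conceptually the generator does something like this (a simplified sketch of the lookup, not my exact class):

class Generator(object):
    def __init__(self, mapping):
        # mapping of original address/hostname -> anonymized substitute
        self.mapping = mapping

    def ipvf(self, match):
        # re.sub calls this with the match object for every IPv4 hit;
        # fall back to the original text if no substitute is known
        return self.mapping.get(match.group(0), match.group(0))

    # ivps (IPv6) and hn (hostnames) work the same way on their own mappings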
My input consists of log dumps containing raw, line-by-line text logs from various programs.
At this point it is quite clear to me that the regex search is the bottleneck of my program, and that every additional pattern I add to the list increases the runtime by a significant amount.
Could I somehow integrate the regex patterns into a more optimized search algorithm, or is there simply no way to optimize how I search for and replace these strings?
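The only concrete idea I have come up with so far is to collapse everything into one compiled alternation and dispatch on whichever named group matched, so that each line is scanned once instead of three times. Roughly (the group names are just for illustration):

import re

combined_pattern = re.compile(
    '(?P<ipv4>{})|(?P<ipv6>{})|(?P<host>{})'.format(
        ipv4_pattern, ipv6_pattern, hostname_pattern))

def replace_line_single_pass(line, generator):
    def dispatch(match):
        # hand the match to whichever handler's group actually matched
        if match.group('ipv4') is not None:
            return generator.ipvf(match)
        if match.group('ipv6') is not None:
            return generator.ivps(match)
        return generator.hn(match)
    return combined_pattern.sub(dispatch, line)

I realize this changes the semantics slightly where the patterns overlap (e.g. IPv4 literals embedded in IPv6 addresses), since only one replacement applies per match, but I don't know whether a single combined pass is actually cheaper or just the same work in one expression.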
It should also be mentioned that I am limited to Python 2.7 and cannot use external tools such as databases.
Thanks in advance