I want to process every line in my log file and extract the IP address if the line matches one of my patterns. There are several different types of messages; in the example below I am using p1 and p2.

I could read the file line by line and match each line against each pattern. But since there can be many more patterns, I would like to do it as efficiently as possible. I was hoping to compile those patterns into one object and do the match only once for each line:

import re
import sys

IP = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

p1 = 'Registration from ' + IP + ' - Wrong password'
p2 = 'Call from ' + IP + ' rejected because extension not found'

# Fails here: the group name 'ip' appears in both alternatives.
c = re.compile(r'(?:' + p1 + '|' + p2 + ')')

for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip'])

But the above code does not work; it complains that the group name ip is used twice.

What is the most elegant way to achieve my goal?

EDIT:

I have modified my code based on answer from @Dev Khadka.

But I am still struggling with how to properly handle the multiple ip matches. The code below prints all IPs that matched p1:

for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip1'])

But some lines don't match p1; they match p2. That is, I get:

1.2.3.4
None
2.3.4.5
...

How do I print the matching IP when I don't know whether it was p1, p2, ...? All I want is the IP; I don't care which pattern it matched.

You should provide your test data. Commented Oct 21, 2019 at 3:21

5 Answers

You can consider installing the excellent regex module, which supports many advanced regex features, including branch reset groups, which are designed to solve exactly the problem you outlined in this question. Branch reset groups are denoted by (?|...). Capture groups at the same position or with the same name in different alternatives of a branch reset group share a single capture group in the output.

Notice that in the example below the matching capture group becomes the named capture group, so you don't need to iterate over multiple groups searching for a non-empty group:

import sys

import regex  # third-party module: pip install regex

ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
# (?|...) is a branch reset group, so every alternative may reuse the name 'ip'.
pattern = regex.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        print(match['ip'])

Demo: https://repl.it/@blhsing/RegularEmbellishedBugs

1 Comment

This also prints 999.999.999.999, which is not actually a valid IP address. Should we validate the result with the ipaddress module?
2

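Why don't you check which group matched?

if match['ip1'] is not None:
    print(match['ip1'])
if match['ip2'] is not None:
    print(match['ip2'])

or something like:

names = ['ip1', 'ip2', 'ip3']
groups = match.groupdict()  # maps each group name to its text, or None
for n in names:
    if groups.get(n) is not None:
        print(groups[n])

or even

num = 1000   # can easily handle millions of patterns =)
groups = match.groupdict()
for i in range(num):
    name = 'ip%d' % i
    # .get() skips names that are not defined in the pattern
    if groups.get(name) is not None:
        print(groups[name])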

5 Comments

But what if I have 100 patterns? Can I do this in a loop? Can I iterate over match[i] in a for loop?
@MartinVegter see above
@MartinVegter can handle millions of patterns easily =)
I get an error: if match[name] is not None: IndexError: no such group
@MartinVegter try to use name in match instead
1

That's because you are using the same group name for two groups.

Try this; it gives the groups the names ip1 and ip2:

import re

IP = r'(?P<ip%d>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

p1 = 'Registration from ' + IP % 1 + ' - Wrong password'
p2 = 'Call from ' + IP % 2 + ' rejected because extension not found'

c = re.compile(r'(?:' + p1 + '|' + p2 + ')')
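
The snippet above stops at compiling the pattern. As a minimal sketch of how the IP could then be read back without knowing which alternative matched, the standard `Match.lastgroup` attribute gives the name of the last capture group that took part in the match. This assumes each alternative contains exactly one capture group; the `templates` list and the `extract_ip` helper are hypothetical names introduced for illustration.

```python
import re

IP = r'(?P<ip%d>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

# Hypothetical list of message templates; %s marks where the IP group goes.
templates = [
    'Registration from %s - Wrong password',
    'Call from %s rejected because extension not found',
]

# Number the group once per alternative: ip1, ip2, ...
combined = re.compile(
    '|'.join(t % (IP % i) for i, t in enumerate(templates, 1))
)


def extract_ip(line):
    """Return the IP captured by whichever alternative matched, or None."""
    match = combined.search(line)
    if match is None:
        return None
    # Only one named group takes part in any match, so lastgroup is its name.
    return match.group(match.lastgroup)
```

With this, a loop over `sys.stdin` only needs to call `extract_ip(line)` and print the result when it is not `None`.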

1

Named capture groups must have distinct names. Since all of your capture groups are meant to capture the same pattern, it's better not to use named capture groups in this case. Instead, use regular capture groups and iterate through the groups of the match object to print the first group that is not empty:

import re
import sys

ip_pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
pattern = re.compile('|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        # groups() holds None for every alternative that did not match
        print(next(filter(None, match.groups())))

Demo: https://repl.it/@blhsing/UnevenCheerfulLight
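
To make the "first non-empty group" step concrete, here is a small sketch of what `match.groups()` looks like for each kind of line. The `first_ip` helper and the sample lines are made up for illustration.

```python
import re

ip_pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found',
]
pattern = re.compile('|'.join(patterns).format(ip=ip_pattern))


def first_ip(line):
    """Return (groups tuple, first non-empty group) or None if no match."""
    match = pattern.search(line)
    if match is None:
        return None
    groups = match.groups()
    return groups, next(filter(None, groups))


# Illustrative lines, not taken from a real log.
print(first_ip('Registration from 1.2.3.4 - Wrong password'))
# (('1.2.3.4', None), '1.2.3.4')
print(first_ip('Call from 2.3.4.5 rejected because extension not found'))
# ((None, '2.3.4.5'), '2.3.4.5')
```

Each alternative contributes one slot to the tuple, and only the slot of the alternative that matched is filled.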

0

This adds IP address validation to the already accepted answer. Although the ipaddress and socket modules should be the ideal tools, this code parses the host by hand:

import regex as re  # third-party module: pip install regex
from io import StringIO


def valid_ip(address):
    # Valid means exactly four dot-separated integers, each in 0..255.
    try:
        host_bytes = [int(b) for b in address.split('.')]
    except ValueError:
        return False
    return len(host_bytes) == 4 and all(0 <= b <= 255 for b in host_bytes)


ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]

file = StringIO('''
Registration from 259.1.1.1 - Wrong password,
    Call from 1.1.2.2 rejected because extension not found
''')

pattern = re.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))

list1 = []  # IP addresses found
list2 = []  # Boolean results of valid_ip

for line in file:
    match = pattern.search(line)
    if match:
        list1.append(match['ip'])
        list2.append(valid_ip(match['ip']))

for ip, is_valid in zip(list1, list2):
    if is_valid:
        print(ip)
    else:
        print(f'{ip} is invalid IP')

Output:

259.1.1.1 is invalid IP
1.1.2.2
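
For comparison, here is a minimal sketch of the same check done with the standard ipaddress module mentioned above. The helper name `is_valid_ipv4` is made up for illustration. Note that the module's parsing rules may differ slightly from the hand-written check (for example, in how octets with leading zeros are treated, depending on the Python version).

```python
import ipaddress


def is_valid_ipv4(text):
    """Return True if text parses as an IPv4 address, else False."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # AddressValueError is a subclass of ValueError
        return False
    return True


for candidate in ['259.1.1.1', '1.1.2.2']:
    if is_valid_ipv4(candidate):
        print(candidate)
    else:
        print(f'{candidate} is invalid IP')
```

This could replace valid_ip in the loop above without other changes.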
