In many of my Python projects, I find myself having to go through a file, match each line against several regexes, and then perform some computation based on the groups that the matching regex extracted.

In pseudo-C code, this is pretty easy:

while (read(line))
{
    if (m=matchregex(regex1,line))
    {
         /* munch on the components extracted in regex1 by accessing m */
    }
    else if (m=matchregex(regex2,line))
    {
         /* munch on the components extracted in regex2 by accessing m */
    }
    else if ...
    ...
    else
    {
         error("Unrecognized line format");
    }
}

However, because Python does not allow a plain assignment in the condition of an if (assignment expressions with := only arrived in Python 3.8), this can't be done elegantly. One could first match the line against all the regexes and then run the if chain over the resulting match objects, but that is neither elegant nor efficient.

What I find myself doing instead is including code like this at the base level of every project:

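import re

im = None   # last match object from imps(), or None
img = None  # groups of the last match, or None

def imps(p, s):
    """Search s for pattern p and store the result in the globals im and img."""
    global im, img
    im = re.search(p, s)
    img = im.groups() if im else None
    return im is not None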

Then I can work like this:

with open(file, 'r') as f:
    for line in f.read().splitlines():
        if imps(regex1, line):
            pass  # munch on contents of img
        elif imps(regex2, line):
            pass  # munch on contents of img
        else:
            error('Unrecognised line: {}'.format(line))

That works, is reasonably compact, and easy to type. But it is hardly beautiful; it uses global variables and is not thread safe (which has not been an issue for me so far).
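
The globals, at least, can be avoided by keeping the last match on a small object instead. A rough sketch of that idea follows; the class name `Matcher`, the function `classify`, and the two patterns are made up for illustration and are not part of my actual projects:

```python
import re

class Matcher:
    """Remembers the most recent match so it can be used in if/elif chains."""

    def __init__(self):
        self.match = None
        self.groups = None

    def search(self, pattern, string):
        self.match = re.search(pattern, string)
        self.groups = self.match.groups() if self.match else None
        return self.match is not None

def classify(line):
    m = Matcher()  # one instance per call, so no state is shared
    if m.search(r"^(\w+)=(\d+)$", line):
        return ("assignment",) + m.groups
    elif m.search(r"^#(.*)$", line):
        return ("comment",) + m.groups
    raise ValueError("Unrecognised line: {}".format(line))
```

This keeps the if/elif shape, but it still needs a helper in every project.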

But I'm sure others have run across this problem before and come up with an equally compact, but more Pythonic and generally superior solution. What is it?

2 Answers

It depends on the needs of the code.

A common choice I use is something like this:

import re

# Note: order matters here. The first pattern to match ends the processing.
parse_regexps = [
    (re.compile(r"^foo"), handle_foo),
    (re.compile(r"^bar"), handle_bar),
]

for regexp, handler in parse_regexps:
    m = regexp.match(line)
    if m:
        handler(line)  # possibly other data too, like m.groups()
        break
else:
    error("Unrecognized format....")

This has the advantage of moving the handling code into clear and obvious functions, which makes testing and changes easy.
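
To make that concrete, here is a self-contained sketch of the same table-driven idea. The patterns, the handler bodies, and the `parse_line` wrapper are all made up for illustration; the handlers here receive the match object rather than the raw line, which is one possible variation:

```python
import re

def handle_foo(m):
    # m is the match object, so named groups are available to the handler
    return ("foo", m.group("value"))

def handle_bar(m):
    return ("bar", int(m.group("count")))

# Order matters: the first pattern that matches wins.
PARSE_REGEXPS = [
    (re.compile(r"foo=(?P<value>\w+)"), handle_foo),
    (re.compile(r"bar (?P<count>\d+)"), handle_bar),
]

def parse_line(line):
    for regexp, handler in PARSE_REGEXPS:
        m = regexp.match(line)
        if m:
            return handler(m)
    raise ValueError("Unrecognized line: {!r}".format(line))
```

Since `parse_line` returns whatever the handler returns, any work that has to happen for every valid line can go right after the call in the caller's loop.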

You can just use continue:

for line in file:
    m = re.match(re1, line)
    if m:
        ...  # do stuff with m
        continue

    m = re.match(re2, line)
    if m:
        ...  # do stuff with m
        continue

    raise BadLine

Another, less obvious, option is to have a function like this:

def match_any(subject, *regexes):
    """Return (index, match) for the first regex that matches, else (-1, None)."""
    for n, regex in enumerate(regexes):
        m = re.match(regex, subject)
        if m:
            return n, m
    return -1, None

and then:

for line in file:
    n, m = match_any(line, re1, re2)
    if n == 0:
        ...  # handle a re1 match using m
    elif n == 1:
        ...  # handle a re2 match using m
    else:
        raise BadLine
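
A further option, assuming Python 3.8 or later is available, is an assignment expression (`:=`), which allows the if/elif chain from the question to be written directly. A minimal sketch; the two patterns and the `parse_lines` function are hypothetical examples:

```python
import re

# Hypothetical patterns, for illustration only
ASSIGN = re.compile(r"(\w+)\s*=\s*(\d+)$")
COMMENT = re.compile(r"#\s*(.*)$")

def parse_lines(lines):
    settings, comments = {}, []
    for line in lines:
        if m := ASSIGN.match(line):
            settings[m.group(1)] = int(m.group(2))
        elif m := COMMENT.match(line):
            comments.append(m.group(1))
        else:
            raise ValueError("Unrecognized line: {!r}".format(line))
        # Code placed here runs for every valid line,
        # which the `continue` version would skip.
    return settings, comments
```

Each regex is only tried if the earlier ones failed, and no helper function or shared state is needed.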

1 Comment

I like the continue solution and would use it more often if it weren't for the common case where I need to do something with every valid line after matching.
