Python, Complex Regex Parser

Question

So I am looking at parsing through a code using regular expressions and am wondering if there is an easier way to do it than what I have so far. I'll start with an example of a string I would be parsing through:

T16F161A286161990200040000\r (It's data coming through a serial device)

Now first I need to check the confirmation code, which is the first 9 characters of the string. They need to be exactly T16F161A2. If those 9 characters match exactly, I need to check the next 3 characters, which need to be either 861 or 37F.

If those 3 characters are 37F I have it do something I still need to code, so we won't worry about that result.

However, if those 3 characters are 861, I need to check the 2 characters after them and see what they are. They can be 11, 14, 60, 61, F0, F1, or F2. Each of these does something different with the data that follows it.

Finally I need to loop through the remaining characters, pairing each 2 of them together.

For an example of how this works, here is the code I've thrown together to parse through the example string I posted above:

import re

test_string = "T16F161A286161990200040000\r"

if re.match('^T16F161A2.*', test_string):
    print("Match: ", test_string)
    test_string = re.sub('^T16F161A2', '', test_string)
    if re.match('^861.*', test_string):
        print("Found '861': ", test_string)
        test_string = re.sub('^861', '', test_string)
        if re.match('^61.*', test_string):
            print("Found '61' : ", test_string)
            test_string = re.sub('^61', '', test_string)
            for i in range(6):
                if re.match('^[0-9A-F]{2}', test_string):
                    temp = re.match('^[0-9A-F]{2}', test_string).group()
                    print("Found Code: ", temp)
                test_string = re.sub('^[0-9A-F]{2}', '', test_string)

Now as you can see in this code, after every step I am using re.sub() to remove the part of the string I had just been looking for. With that in mind my question is the following:

Is there a way to parse the string and find the data I need, while also keeping the string intact? Would it be more or less efficient than what I currently have?
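One possible way to keep the string intact, sketched here on the assumption that the layout is exactly as described above, is to pass a start offset to a compiled pattern's match() rather than deleting what was already matched. The names PAIR and read_pairs are made up for this illustration.

```python
import re

PAIR = re.compile(r"[0-9A-F]{2}")  # one two-character hex code


def read_pairs(message, start):
    """Collect consecutive hex pairs beginning at index `start`.

    `message` is never modified; only the offset moves forward.
    """
    pairs = []
    pos = start
    while True:
        m = PAIR.match(message, pos)  # match anchored at `pos`
        if m is None:
            break
        pairs.append(m.group())
        pos = m.end()
    return pairs


test_string = "T16F161A286161990200040000\r"
# 9 (confirmation code) + 3 ("861") + 2 ("61") = offset 14
print(read_pairs(test_string, 14))
# ['99', '02', '00', '04', '00', '00']
```

The loop stops at the trailing \r because it is not a hex pair, and test_string is still the full original string afterwards.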

Why are you even using regex for this? Since you know exactly where to look and what variants there are, just use slicing and a few if/elif statements.
– tobias_k, Commented Jul 31, 2017 at 13:18
@tobias_k Unless I am mistaken, Python doesn't have a case/switch statement as part of its language.
– Skitzafreak, Commented Jul 31, 2017 at 13:19
Whoops, wrong language. Anyway, just use a bunch of if/elif statements or a dict.
– tobias_k, Commented Jul 31, 2017 at 13:20
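The dict suggestion could look roughly like the sketch below. The handler functions and the HANDLERS and dispatch names are placeholders, since the question does not say what each code actually does.

```python
# Placeholder handlers: the real behaviour of each code is not given in the question.
def handle_61(pairs):
    return "61 got " + ",".join(pairs)


def handle_f0(pairs):
    return "F0 got " + ",".join(pairs)


# Map each 2-character code to its handler; "11", "14", "60", "F1", "F2" would be added likewise.
HANDLERS = {"61": handle_61, "F0": handle_f0}


def dispatch(message):
    s = message.rstrip("\r")  # drop the serial line terminator
    if s[:9] != "T16F161A2" or s[9:12] != "861":
        return None  # wrong confirmation code, or not an 861 message
    handler = HANDLERS.get(s[12:14])
    if handler is None:
        return None  # unknown 2-character code
    pairs = [s[i:i + 2] for i in range(14, len(s), 2)]
    return handler(pairs)


print(dispatch("T16F161A286161990200040000\r"))
# 61 got 99,02,00,04,00,00
```

The dict lookup stands in for the missing switch statement: adding a new code means adding one entry rather than another elif branch.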

Thomas Ayoub · Accepted Answer · 2017-07-31 13:29:03Z

2

You don't need a regex for this task; you can use if/else blocks and a few string operations:

test_string = "T16F161A286161990200040000\r"

def process(token):
  # does stuff depending on the code: 11, 14, 60, 61, F0, F1, or F2
  return

def stringToArray(token):
  # split the token into two-character chunks
  return [token[i:i+2] for i in range(0, len(token), 2)]



if not test_string.startswith('T16F161A2'):
  print("Does not match")
  quit()
else:
  print("Does match")

tempToken = test_string[9:]  # drop the 9-character confirmation code

if tempToken.startswith('861'):
  process(tempToken)  # does stuff with 11, 14, 60, 61, F0, F1, or F2
  tempToken = tempToken[5:]  # skip '861' plus the 2-character code

  print(stringToArray(tempToken))
else:
  pass



2 Comments

Skitzafreak Over a year ago

Okay, one small question about what you've posted here. I need to not include the \r as part of my parsed data for starters, so would I just change the for loop in stringToArray to len(tempToken - 2)?

Thomas Ayoub Over a year ago

No, change it to len(tempToken) - 1, because \r is only one character. @Skitzafreak
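As an alternative to adjusting the range bound, and assuming the terminator is always a single trailing \r, it could be stripped before splitting. The name string_to_pairs is hypothetical.

```python
def string_to_pairs(token):
    # Remove the trailing carriage return(s) first, then split into 2-character chunks.
    token = token.rstrip("\r")
    return [token[i:i + 2] for i in range(0, len(token), 2)]


print(string_to_pairs("990200040000\r"))
# ['99', '02', '00', '04', '00', '00']
```

This also behaves the same whether or not the \r is present, so the caller does not have to know.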

Hampus Londögård · Accepted Answer · 2017-07-31 13:18:21Z

0

Because you know the layout of the string, I'd recommend this instead:

Check the first 9 characters by comparing test_string[:9] == "T16F161A2".

I'd do the same for the second phase (test_string[9:12]). This comparison is much faster than a regex.

With known sizes you can slice the string as I did above. This won't "ruin" your string the way your current code does. If you still want a regex, apply it to the slice, i.e. re.search(pattern, test_string[9:12]).

Hope this helps you a bit at least. :)
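To check the speed claim on your own machine, a rough timing sketch with the standard timeit module might look like this. The function names by_slice and by_regex are invented here, and the numbers will vary by machine and Python version, so none are shown.

```python
import re
import timeit

test_string = "T16F161A286161990200040000\r"
HEADER = re.compile(r"T16F161A2(?:861|37F)")  # precompiled so only matching is timed


def by_slice(s):
    # Plain slice comparisons for the 9-character code and the 3-character type.
    return s[:9] == "T16F161A2" and s[9:12] in ("861", "37F")


def by_regex(s):
    # Same check expressed as a single anchored regex match.
    return HEADER.match(s) is not None


for fn in (by_slice, by_regex):
    seconds = timeit.timeit(lambda: fn(test_string), number=100_000)
    print(fn.__name__, round(seconds, 4))
```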


tretyose · Accepted Answer · 2017-07-31 13:25:47Z

0

Assuming the string is the same length every time and the data is located at the same indices, you can just use string slicing with []. To get the first 9 characters you would use test_string[:9]. You could assign the slices to variables to make checking easier:

confirmation_code = test_string[:9]
nextThree = test_string[9:12]
#check values

Slicing is built into Python, so it's safe to say it's pretty efficient.
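Slice end indices are exclusive, so s[:9] yields nine characters and s[9:12] the next three. One optional way to keep the offsets in one place is with named slice objects; the names CODE, KIND and SUBTYPE below are just illustrative.

```python
# Field positions, assuming the fixed layout described in the question.
CODE = slice(0, 9)       # confirmation code, 9 characters
KIND = slice(9, 12)      # "861" or "37F"
SUBTYPE = slice(12, 14)  # 2-character code after "861"

test_string = "T16F161A286161990200040000\r"
print(test_string[CODE], test_string[KIND], test_string[SUBTYPE])
# T16F161A2 861 61
```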


Simon Sagi · Accepted Answer · 2017-07-31 13:31:54Z

0

If you want to stick with regex, then this will do:

pattern = re.compile(r'^T16F161A2((861)|37F)(?(2)(11|14|60|61|F0|F1|F2)|[0-9A-F]{2})([0-9A-F]{12})$')
match_result = pattern.match(test_string.rstrip('\r'))  # '$' does not match before a trailing '\r'

You can then check whether match_result is a valid match object (if it is None, the string did not match the pattern). The match object will contain 4 groups:

- group 1: the 3-character code (861 or 37F)
- group 2: useless data (ignore this)
- group 3: the 2-character code if group 1 is 861 (None otherwise)
- group 4: the last 12 digits

To split the last 12 digits, a one-liner:

last_12_digits = match_result.group(4)
last_digits = [last_12_digits[i:i+2] for i in range(0, len(last_12_digits), 2)]
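To see what the conditional group (?(2)...|...) does, here is a small sketch that runs the same pattern on two inputs without the trailing \r. The 37F sample string is invented for the demonstration, since the question does not show a real 37F message.

```python
import re

pattern = re.compile(
    r'^T16F161A2((861)|37F)'                    # group 1: type, group 2: set only for 861
    r'(?(2)(11|14|60|61|F0|F1|F2)|[0-9A-F]{2})'  # group 3 is captured only if group 2 matched
    r'([0-9A-F]{12})$'                          # group 4: the last 12 hex digits
)

m_861 = pattern.match("T16F161A286161990200040000")
print(m_861.groups())
# ('861', '861', '61', '990200040000')

# Hypothetical 37F message: any two hex characters are accepted after 37F.
m_37f = pattern.match("T16F161A237FAB990200040000")
print(m_37f.groups())
# ('37F', None, None, '990200040000')
```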


tobias_k · Accepted Answer · 2017-07-31 13:32:47Z

0

You don't really need regular expressions for this. Since you know exactly what you are looking for and where it should be found in the string, you can just use slicing and a couple of if/elif/else statements. Something like this:

s = test_string.strip()
code, x, y, rest = s[:9], s[9:12], s[12:14], [s[i:i+2] for i in range(14, len(s), 2)]
# T16F161A2, 861, 61, ['99', '02', '00', '04', '00', '00']

if code == "T16F161A2":
    if x == "37F":
        ...  # handle 37F
    elif x == "861":
        if y == "11":
            ...
        elif y == "61":
            ...  # do stuff with rest
    else:
        ...  # invalid
else:
    ...  # invalid


dashiell · Accepted Answer · 2017-07-31 13:33:48Z

0

Perhaps something like:

import re

regex = r'^T16F161A2(861|37F)(11|14|60|61|F0|F1|F2)(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})$'
string = 'T16F161A286161990200040000'

print(re.match(regex, string).groups())

This makes use of capture groups and avoids having to create a bunch of new strings.
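A possible variation on the same idea, not part of the answer above, uses named groups and captures the payload once, then splits it with re.findall instead of repeating (.{2}) six times. The names FRAME, kind, subtype and payload are made up for this sketch.

```python
import re

FRAME = re.compile(
    r"^T16F161A2(?P<kind>861|37F)"         # 3-character type
    r"(?P<subtype>11|14|60|61|F0|F1|F2)"   # 2-character code
    r"(?P<payload>(?:[0-9A-F]{2})+)$"      # one or more hex pairs
)

m = FRAME.match("T16F161A286161990200040000")
if m:
    pairs = re.findall(r"[0-9A-F]{2}", m.group("payload"))
    print(m.group("kind"), m.group("subtype"), pairs)
    # 861 61 ['99', '02', '00', '04', '00', '00']
```

Named groups make the later code independent of group numbering, and the payload length is no longer fixed at six pairs.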


Serge Ballesta · Accepted Answer · 2017-07-31 13:45:47Z

The re module will not be as efficient as direct substring access, but it could save you from writing (and maintaining) some lines of code. If you want to use it, you should match the string as a whole:

import re

test_string = "T16F161A286161990200040000\r"

rx = re.compile(r'T16F161A2(?:(?:(37F)(.*))|(?:(861)(11|14|60|61|F0|F1|F2)(.*)))\r')
m = rx.match(test_string)      # => 5 groups, first 2 if 37F, last 3 if 861

if m is None:                  # string does not match:
    ...
elif m.group(1) is None:       # 861 type
    subtype = m.group(4)       # extract subtype
    # and group remaining characters by pairs
    elts = [ m.group(5)[i:i+2] for i in range(0, len(m.group(5)), 2) ]
    ...                        # process that
else:                          # 37F type
    ...
