Parse a string with key expressions

Question

I'm trying to parse strings that have this format:

sample = '<STATUS="OK" VERSION="B" MESSAGE="Connected in demo mode"><timestamp="1602765370" id="123">'

so that given a key I can get the associated value, for example:

parser('STATUS', sample)  # 'OK'
parser('MESSAGE', sample) # 'Connected in demo mode'

I've tried using re:

import re
def parser(key, string):
    return re.search(rf'(?<={key}=)\S+', string).group()

but the result is '"OK"' for the first example and just '"Connected' for the second. How can I avoid capturing the quotes and get the full string associated with each key? Thanks in advance.

Is that actually supposed to be XML? If so, did you try using an XML parser?
– Karl Knechtel, Commented Oct 15, 2020 at 13:07
I am receiving this data through a requests.request('GET', url) call and it comes as a string (larger than my example, but in this format: '<...><...><...>'). I've tried using xml.etree.ElementTree.fromstring(sample, parser=parser) but I get this error: 'xml.etree.ElementTree.ParseError: not well-formed (invalid token)'. I'm not familiar with XML, so I didn't go further with this approach. Do you think a Python XML parser is a better way to do this than re?
– tatarana, Commented Oct 15, 2020 at 13:38

maarten · Accepted Answer · 2020-10-16 08:31:55Z

If the values you want to retrieve are guaranteed to be double-quoted strings, then the definition below should work. It allows for escaped quotes in strings, returns None instead of raising an exception when the key is not present, and won't give false positives if your key is a suffix of an existing key.

import re
def parser(key, string):
    m = re.search(fr'(?<![A-Z]){key}="(.*?)(?<!\\)"', string)
    if m:
        return m.group(1)

The first part of the regex, (?<![A-Z]), is a negative look-behind that only matches when the key is not immediately preceded by a character in the A-Z range. It ensures that you don't get false positives when you query the string with a key that is a suffix of an existing key (e.g. US, which is a suffix of STATUS).

Returning the values without the quotes is simply a matter of including the quotes in the regex but outside the group that you retrieve. That's what happens in the expression "(.*?)(?<!\\)". The group that captures the value is (.*?). The (?<!\\) expression is a negative look-behind that ensures the closing " only matches when it is not preceded by a backslash.

Example:

sample = r'<STATUS="OK" VERSION="B" MESSAGE="User said \"hi!\""><timestamp="1602765370" id="123">'

[parser('STATUS', sample),
 parser('US', sample),
 parser('MESSAGE', sample)]

Output:

['OK', None, 'User said \\"hi!\\"']
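
The same idea can be pushed a little further. The following is only a sketch under a few assumptions, not part of the original answer: the name parser_strict is hypothetical, keys are assumed to consist of word characters (so (?<!\w) can replace (?<![A-Z]) and also protect lowercase keys such as id), and \" is assumed to be the only escape that needs to be turned back into a plain quote. The key is passed through re.escape so that regex metacharacters in it are treated literally.

```python
import re

def parser_strict(key, string):
    # Hypothetical variant of parser(), not from the original answer.
    # (?<!\w) rejects a match when a letter, digit or underscore precedes the key.
    # The value is a run of either escaped characters (\\.) or characters
    # that are neither a quote nor a backslash.
    pattern = rf'(?<!\w){re.escape(key)}="((?:\\.|[^"\\])*)"'
    m = re.search(pattern, string)
    if m is None:
        return None
    # Assumption: \" is the only escape sequence worth undoing.
    return m.group(1).replace('\\"', '"')

sample = r'<STATUS="OK" VERSION="B" MESSAGE="User said \"hi!\""><timestamp="1602765370" id="123">'

print([parser_strict('STATUS', sample),
       parser_strict('US', sample),
       parser_strict('MESSAGE', sample),
       parser_strict('id', sample)])
```

With this sample the list printed is ['OK', None, 'User said "hi!"', '123'].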

Wups · Accepted Answer · 2020-10-15 13:37:48Z

This returns everything between the double quotes that follow a given key.

import re

def get_value(key, string):
    return re.search(f'{key} *= *"(.*?)"', string).group(1)

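Add some error handling to make it more robust: re.search returns None when the key is not found, so calling .group(1) on the result raises an AttributeError.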
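
One possible shape for that error handling, as a minimal sketch: the default parameter and the use of re.escape are additions for illustration, not part of the answer above.

```python
import re

def get_value(key, string, default=None):
    # re.escape is an added precaution for keys containing regex metacharacters.
    m = re.search(rf'{re.escape(key)} *= *"(.*?)"', string)
    # re.search returns None when nothing matches, so check before calling .group.
    return m.group(1) if m else default

sample = '<STATUS="OK" VERSION="B" MESSAGE="Connected in demo mode"><timestamp="1602765370" id="123">'

print(get_value('MESSAGE', sample))         # Connected in demo mode
print(get_value('MISSING', sample))         # None
print(get_value('MISSING', sample, 'n/a'))  # n/a
```

Returning a default rather than raising is a design choice; a caller that must distinguish a missing key from an empty value could raise a KeyError instead.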

Jack Fleeting · Accepted Answer · 2020-10-15 13:27:10Z

Assuming this isn't XML/HTML (the sample is not valid as either), you can use this method, which avoids regex. It's a little convoluted, but it works, at least in this case:

keys = ['STATUS','MESSAGE']
targets = sample.split('><')[0].split('"')
for k,v in zip(targets[::2],targets[1::2]):
    for key in keys:
        if key in k:
            print(k.replace('<','').replace('=','').strip(),'---',v)

Output:

STATUS --- OK
MESSAGE --- Connected in demo mode
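
The same split-on-quotes idea can also collect every pair in one pass. This is a sketch only: to_dict is a hypothetical helper name, and it assumes that values never contain a double quote (escaped or not) and that keys are surrounded only by spaces, angle brackets and '='.

```python
def to_dict(string):
    # Splitting on '"' alternates between 'key=' fragments (even indices)
    # and values (odd indices).
    parts = string.split('"')
    # Strip the spaces, angle brackets and '=' that surround each key.
    return {k.strip(' <>='): v for k, v in zip(parts[::2], parts[1::2])}

sample = '<STATUS="OK" VERSION="B" MESSAGE="Connected in demo mode"><timestamp="1602765370" id="123">'

values = to_dict(sample)
print(values['STATUS'], '---', values['MESSAGE'])
```

Because keys are compared exactly after the lookup table is built, a query such as values.get('US') returns None rather than matching STATUS.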
