1

I'm trying to parse strings that have this format:

sample = '<STATUS="OK" VERSION="B" MESSAGE="Connected in demo mode"><timestamp="1602765370" id="123">'

so that given a key I can get the associated value, for example:

parser('STATUS', sample)  # 'OK'
parser('MESSAGE', sample) # 'Connected in demo mode'

I've tried using re:

import re
def parser(key, string):
    return re.search(f'(?<={key}=)\S+', string).group()

but the results are '"OK"' for the first example and just '"Connected' for the second. How can I avoid retrieving the quotes and get the full string associated with each value? Thanks in advance.

  • Is that actually supposed to be XML? If so, did you try using an XML parser? Commented Oct 15, 2020 at 13:07
  • I am receiving this data through a requests.request('GET', url) call, and it comes as a string (larger than my example, but in this format: '<...><...><...>'). I've tried using xml.etree.ElementTree.fromstring(sample, parser=parser), but I get this error: 'xml.etree.ElementTree.ParseError: not well-formed (invalid token)'. I'm not familiar with XML, so I didn't go further with this approach. Do you think Python's XML parsers are a better way to do it than re? Commented Oct 15, 2020 at 13:38

3 Answers

2

If the values you want to retrieve are guaranteed to be double quoted strings, then the definition below should work. It allows for escaped quotes in strings, won't raise an exception when the key is not present, and won't give false positives if your key is a suffix of an existing key.

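import re

def parser(key, string):
    # Return the double-quoted value that follows `key`, or None if absent.
    # re.escape keeps regex metacharacters in the key from being interpreted.
    m = re.search(fr'(?<![A-Z]){re.escape(key)}="(.*?)(?<!\\)"', string)
    if m:
        return m.group(1)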

The first part of the regex, (?<![A-Z]), is a negative look-behind that succeeds only when the character immediately before your key is not an uppercase letter in the A-Z range. It ensures that you don't get false positives when you query the string with a key that is a suffix of an existing key (e.g. US, which is a suffix of STATUS). Note that it only guards uppercase keys; a lowercase key such as id would need the range widened.

Returning the values without the quotes is simply a matter of including the quotes in the regex but outside of the group that you retrieve. That's what happens in the expression "(.*?)(?<!\\)". The group that captures the value is (.*?), which matches lazily so that it stops at the first acceptable closing quote. The (?<!\\) expression is a negative look-behind that ensures that the " at the end only matches when it is not preceded by a backslash.

Example:

sample = r'<STATUS="OK" VERSION="B" MESSAGE="User said \"hi!\""><timestamp="1602765370" id="123">'

[parser('STATUS', sample),
 parser('US', sample),
 parser('MESSAGE', sample)]

Output:

['OK', None, 'User said \\"hi!\\"']
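
Because the look-behind treats any quote preceded by a backslash as escaped, a value that ends in an escaped backslash (for example "C:\\") would not be terminated where expected, and the returned value still contains the backslashes. Here is a minimal sketch of one common alternative, assuming that a backslash always escapes exactly the next character (the question does not say how this format escapes things). The name parser_escaped, the widened [A-Za-z] look-behind, and the unescaping step are hypothetical additions, not part of the answer above:

```python
import re

def parser_escaped(key, string):
    """Return the unescaped value for `key`, or None if the key is absent."""
    # (?:[^"\\]|\\.)* consumes either an ordinary character or a backslash
    # followed by any character, so \" and \\ are always consumed as pairs.
    pattern = fr'(?<![A-Za-z]){re.escape(key)}="((?:[^"\\]|\\.)*)"'
    m = re.search(pattern, string)
    if m is None:
        return None
    # Assumption: a backslash simply escapes the character that follows it.
    return re.sub(r'\\(.)', r'\1', m.group(1))

sample = r'<STATUS="OK" PATH="C:\\" MESSAGE="User said \"hi!\"">'
print(parser_escaped('STATUS', sample))   # OK
print(parser_escaped('PATH', sample))     # C:\
print(parser_escaped('MESSAGE', sample))  # User said "hi!"
print(parser_escaped('US', sample))       # None
```

If the values can never contain backslashes or quotes, the simpler look-behind version is probably enough.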

2

This returns everything between the double quotes that follow a given key, and it tolerates spaces around the = sign.

import re

def get_value(key, string):
    return re.search(f'{key} *= *"(.*?)"', string).group(1)

Add some error handling to make it more robust: when the key is absent, re.search returns None and .group(1) raises an AttributeError.
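
A minimal sketch of what that error handling could look like, returning a caller-supplied default instead of raising. The default parameter, the re.escape call, and the (?<!\w) guard against matching the tail of a longer key are additions here, not part of the answer above:

```python
import re

def get_value(key, string, default=None):
    """Return the quoted value after `key`, or `default` if there is no match."""
    # re.escape guards against keys containing regex metacharacters;
    # (?<!\w) rejects a match that starts in the middle of a longer key.
    m = re.search(fr'(?<!\w){re.escape(key)} *= *"(.*?)"', string)
    return m.group(1) if m else default

sample = '<STATUS="OK" VERSION="B" MESSAGE="Connected in demo mode"><timestamp="1602765370" id="123">'
print(get_value('MESSAGE', sample))         # Connected in demo mode
print(get_value('id', sample))              # 123
print(get_value('MISSING', sample))         # None
print(get_value('MISSING', sample, 'n/a'))  # n/a
```

Raising a KeyError instead of returning a default would be an equally reasonable choice if a missing key should be treated as a bug.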

0

Assuming this isn't XML/HTML (sample is invalid for these), you can use this method, without using regex. It only looks at the first <...> group and assumes the values contain no double quotes. It's a little convoluted, but it works, at least in this case:

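keys = ['STATUS', 'MESSAGE']
# Keep only the first <...> group, then split on quotes: even-indexed
# pieces end with KEY=, odd-indexed pieces are the values.
targets = sample.split('><')[0].split('"')
for k, v in zip(targets[::2], targets[1::2]):
    for key in keys:
        if key in k:
            print(k.replace('<', '').replace('=', '').strip(), '---', v)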

Output:

STATUS --- OK
MESSAGE --- Connected in demo mode
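
The same split-on-quotes idea can be sketched as a helper that collects every pair from every <...> group into a dict, so values can then be looked up by key. This assumes values never contain a double quote and that keys are unique across groups; parse_pairs is a hypothetical name, not something from the answer above:

```python
def parse_pairs(string):
    """Collect every KEY="value" pair in `string` into a dict (no regex)."""
    pairs = {}
    parts = string.split('"')
    # Even-indexed parts hold the text before a value (ending in KEY=),
    # odd-indexed parts hold the value itself.
    for before, value in zip(parts[::2], parts[1::2]):
        key = before.strip(' <>=')
        if key:
            pairs[key] = value
    return pairs

sample = '<STATUS="OK" VERSION="B" MESSAGE="Connected in demo mode"><timestamp="1602765370" id="123">'
values = parse_pairs(sample)
print(values['STATUS'])     # OK
print(values['MESSAGE'])    # Connected in demo mode
print(values['timestamp'])  # 1602765370
```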
