I have data in text format, where key/value pairs are separated by a semicolon that may or may not be surrounded by whitespace, e.g. ";", "; ", or even " ; ". There is always a semicolon between pairs, and the string is terminated with a semicolon.

Keys and values are separated by whitespace.

This string is flat. There's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So for example,

'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'

Ultimately this winds up as

{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}

Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:

mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";

I'm thinking that a regex to split the string into a list would be a good start, then just iterate through the list by twos to build the dictionary. Something like

x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]

This requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so'], but I can't figure out a regex that produces a list in this form. The closest I have is

([^;[\s]*]+)

Which returns

['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']

Of course, it's easy enough to iterate by threes and pick the key/value pairs and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?

2 Answers

It might be easier to use findall() instead of split() here. This lets you use capture groups to pull out just the parts you want. Then you can unpack the groups, clean up the values, and so on:

import re
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
pairs = re.findall(r'(\S+?) (.+?);', s)

d = {}
for k, v in pairs:
    if v.isdigit():
        v = int(v)
    else:
        v = v.strip('"')
    d[k] = v
print(d)

result

{'cheese': 'stilton',
 'pigeons': 17,
 'color': 'blue',
 'why': 'because I said so'}

This, of course, assumes you aren't using ; anywhere in the data.
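
The pattern above also expects exactly one space between key and value, and isdigit() only recognizes unsigned integers, so the question's second sample (mass 6.02 ;) would come back as the string '6.02 '. Below is a minimal sketch of one way to loosen both points, still assuming that values never contain ; and that any unquoted value is meant to be an int or a float. The names PAIR, convert and parse_pairs are made up for illustration.

```python
import re

# Key, whitespace, lazily matched value, optional whitespace, then ';'.
PAIR = re.compile(r'([^\s;]+)\s+(.+?)\s*;')

def convert(raw):
    """Strip quotes from quoted strings; otherwise try int, then float."""
    if raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # leave anything unrecognized untouched

def parse_pairs(s):
    return {key: convert(value) for key, value in PAIR.findall(s)}

print(parse_pairs('mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'))
# {'mass': 6.02, 'mammal': 'gerbil', 'telephone': '+1 903 555-1212', 'size': 'A1'}
```

Trying int before float keeps 17 as an integer rather than turning it into 17.0.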

You may use

r'(\w+)\s+("[^"]*"|[^\s;]+)'

to match and extract your data with re.findall, then post-process the Group 2 values to strip the leading and trailing " when the first alternative matched, and finally create the dictionary entries.

Details

  • (\w+) - Group 1 (key): one or more word chars
  • \s+ - 1+ whitespace chars
  • ("[^"]*"|[^\s;]+) - Group 2: either a ", zero or more chars other than ", and a closing "; or one or more chars other than whitespace and ;
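
To see what the two groups capture before any post-processing, the pattern can be run against the question's second sample. This is only an illustration, and the variable names are arbitrary.

```python
import re

rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
s = 'mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'

# Each tuple is (Group 1, Group 2); quoted values still carry their quotes.
matches = re.findall(rx, s)
for key, val in matches:
    print(key, '->', val)
```

The spaces inside "+1 903 555-1212" survive because the first alternative consumes everything up to the closing quote in one go.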

Python demo:

import re
rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
result = {}
for key, val in re.findall(rx, s):
    # Quoted values: drop the surrounding double quotes.
    if val.startswith('"') and val.endswith('"'):
        val = val[1:-1]
    result[key] = val

print(result)
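
Note that this demo leaves 17 as the string '17', whereas the question wants a number. One possible way to convert the values is ast.literal_eval, which turns a quoted string into a str and a numeric literal into an int or float in a single call. The sketch below assumes every value is either a double-quoted string or a number written the way Python would write it; the function name parse is hypothetical.

```python
import ast
import re

rx = re.compile(r'(\w+)\s+("[^"]*"|[^\s;]+)')

def parse(s):
    result = {}
    for key, val in rx.findall(s):
        try:
            # '"blue"' -> 'blue', '17' -> 17, '6.02' -> 6.02
            result[key] = ast.literal_eval(val)
        except (ValueError, SyntaxError):
            # Not a valid Python literal: keep the raw text.
            result[key] = val
    return result

print(parse('cheese "stilton";pigeons 17; color "blue"; why "because I said so";'))
# {'cheese': 'stilton', 'pigeons': 17, 'color': 'blue', 'why': 'because I said so'}
```

Two caveats with this approach: backslashes inside quoted values are interpreted as Python escape sequences, and unquoted tokens such as True or None become Python constants rather than strings.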
