I have data in text format where key/value pairs are separated by a semicolon, which may or may not be surrounded by whitespace, e.g., ";" or "; ", or even " ; ". There will always be a semicolon between pairs, and the string is terminated with a semicolon.
Keys and values are separated by whitespace.
This string is flat. There's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So for example,
'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
Ultimately this winds up as
{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}
Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:
mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";
I'm thinking that a regex to split the string into a list would be a good start, then just iterate through the list by twos to build the dictionary. Something like
x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]
Which requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so']. But I can't figure out a regex that produces a list in this form. The closest I have is
([^;[\s]*]+)
Which returns
['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']
Of course, it's easy enough to iterate by threes and pick the key/value pairs and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?
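For comparison, one direction that avoids the delimiters entirely is to match whole pairs with re.findall instead of splitting. This is only a sketch under assumptions: the names PAIR and parse_pairs are made up for illustration, keys are assumed to contain no whitespace or semicolons, and quoted values are assumed never to contain an embedded double quote.

```python
import re

# Hypothetical pattern (an assumption, not a tested general solution):
# a key, whitespace, then either a double-quoted string or a bare token,
# then optional whitespace and the terminating semicolon.
PAIR = re.compile(r'\s*([^\s;]+)\s+(?:"([^"]*)"|([^\s;]+))\s*;')

def parse_pairs(s):
    """Build a dict from a flat 'key value;' string."""
    d = {}
    # findall yields one (key, quoted, bare) tuple per pair; the group
    # that did not take part in the match comes back as ''.
    for key, quoted, bare in PAIR.findall(s):
        if bare:
            # Unquoted values are numeric: try int first, then float.
            try:
                d[key] = int(bare)
            except ValueError:
                d[key] = float(bare)
        else:
            d[key] = quoted
    return d

print(parse_pairs('cheese "stilton";pigeons 17; color "blue"; why "because I said so";'))
print(parse_pairs('mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'))
```

Because the quoted alternative consumes everything up to the closing quote, spaces inside a value such as "because I said so" stay in one piece, which the split-based attempt above does not manage.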