1

So I have a key-value file whose format is similar to JSON, but different enough that Python's JSON parser won't accept it.

Example:

"Matt"
{
    "Location"    "New York"
    "Age"         "22"
    "Items"
    {
        "Banana"    "2"
        "Apple"     "5"
        "Cat"       "1"
    }
}

Is there an easy way to parse this text file and store the values into an array such that I could access the data using a format similar to Matt[Items][Banana]? There is only ever one pair per line, and an opening or closing brace denotes going down or up a level.

4
  • This is surprisingly messy, because items can have a variable amount of content. If you look at len(line.split("\t")), a length of 1 means the key's value is an object that follows on the next few lines, a length of 2 means a simple key-value pair of literals, and matching braces define the object boundaries for you. Writing an iterative/recursive parser based on this should work, but it's a lot more trouble than using an existing parser. I don't wanna write it for you :D Commented Oct 12, 2015 at 0:19
  • If you were to add a : between " and " or between " and {, and add a , after a " that is followed by a newline and another ", then you'd be pretty much right back to JSON. I.e., couldn't you automatically transform your incoming file into JSON? Commented Oct 12, 2015 at 0:22
  • Are there any existing parsers that could do this? I'd like to avoid writing my own parser if possible. Commented Oct 12, 2015 at 0:23
  • Dunno. But it hardly seems very hard to do, as long as your format doesn't stray too far from what you are showing. You will need to read one line ahead of the line you are adjusting. Write up a unit test, feed it some data and check expectations. Alternatively you could search by the name of the emitting app to see if someone has written this n Commented Oct 12, 2015 at 0:27
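
The recursive parser suggested in the comments can be sketched in a few lines. This is only a minimal sketch, assuming that every key and value is a double-quoted string and that braces only ever appear as block delimiters; the names TOKEN_RE, parse_block and parse_kv are made up for illustration and are not part of any library. It tokenizes with a regular expression instead of splitting on tabs, so the exact whitespace between key and value does not matter.

```python
import re

# A token is either a double-quoted string (backslash escapes allowed)
# or a single brace. findall() yields (string_content, brace) tuples.
TOKEN_RE = re.compile(r'"((?:[^"\\]|\\.)*)"|([{}])')


def parse_block(tokens, pos=0, nested=False):
    """Parse key/value pairs from tokens[pos:]; return (dict, next position)."""
    result = {}
    while pos < len(tokens):
        text, brace = tokens[pos]
        if brace == '}':
            if not nested:
                raise ValueError('unexpected closing brace')
            return result, pos + 1
        if brace == '{' or pos + 1 >= len(tokens):
            raise ValueError('expected a key followed by a value or a block')
        next_text, next_brace = tokens[pos + 1]
        if next_brace == '{':
            # The key is followed by a nested block: recurse into it.
            result[text], pos = parse_block(tokens, pos + 2, nested=True)
        elif next_brace == '}':
            raise ValueError('key %r has no value' % text)
        else:
            # Simple "key" "value" pair; escapes are kept as written, not decoded.
            result[text] = next_text
            pos += 2
    if nested:
        raise ValueError('missing closing brace')
    return result, pos


def parse_kv(source):
    """Hypothetical entry point: parse the whole text into nested dicts."""
    data, _ = parse_block(TOKEN_RE.findall(source))
    return data


sample = '''
"Matt"
{
    "Location"    "New York"
    "Age"         "22"
    "Items"
    {
        "Banana"    "2"
        "Apple"     "5"
        "Cat"       "1"
    }
}
'''

matt = parse_kv(sample)['Matt']
print(matt['Items']['Banana'])
# 2
```

All values stay strings, as in the input; anything outside quotes and braces is silently ignored by the tokenizer, which may or may not be what you want for malformed files.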

2 Answers

3

You could use re.sub to 'fix up' your string and then parse it. As long as every line contains either one quoted string (a key whose value is a nested block) or a pair of quoted strings (a key and its value), you can use that to determine where to place commas and colons.

import re
s = """"Matt"
{
    "Location"    "New York"
    "Age"         "22"
    "Items"
    {
        "Banana"    "2"
        "Apple"     "5"
        "Cat"       "1"
    }
}"""

# Put a colon after the first string in every line
s1 = re.sub(r'^\s*(".+?")', r'\1:', s, flags=re.MULTILINE)
# add a comma if the last non-whitespace character in a line is " or }
s2 = re.sub(r'(["}])\s*$', r'\1,', s1, flags=re.MULTILINE)

Once you've done that, you can use ast.literal_eval to turn it into a Python dict. I use that over JSON parsing because it allows for trailing commas, without which the decision of where to put commas becomes a lot more complicated:

import ast
data = ast.literal_eval('{' + s2 + '}')
print(data['Matt']['Items']['Banana'])
# 2
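
To illustrate the trailing-comma point, here is a small self-contained comparison (a sketch, not part of the original answer): json.loads rejects a trailing comma, while ast.literal_eval accepts it because Python dict literals allow one. The variable names are only for illustration.

```python
import ast
import json

text = '{"Banana": "2", "Apple": "5",}'  # note the trailing comma

try:
    json.loads(text)
    json_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    json_ok = False

# Python dict literals may end with a trailing comma, so this parses fine.
parsed = ast.literal_eval(text)

print(json_ok)
# False
print(parsed['Apple'])
# 5
```

One caveat: ast.literal_eval only understands Python literals, so unquoted JSON values such as true, false or null would fail. That is not a problem here, since every value in this format is a quoted string.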

1 Comment

Sweet. I like that this sidesteps the trailing-comma issue.
0

Not sure how robust this approach is outside of the example you've posted, but it does support escaped characters and deeper levels of structured data. It's probably not going to be fast enough for large amounts of data.

The approach converts your custom data format to JSON using a (very) simple parser that adds the required colons, commas and outer braces. The JSON data can then be converted to a native Python dictionary.

import json

# Define the data that needs to be parsed
data = '''
"Matt"
{
    "Location"    "New \\"York"
    "Age"         "22"
    "Items"
    {
        "Banana"    "2"
        "Apple"     "5"
        "Cat"
        {
            "foo"   "bar"
        }
    }
}
'''

# Convert the data from custom format to JSON
json_data = ''

# Define parser states
state = 'OUT'
key_or_value = 'KEY'

for c in data:
    # A character right after a backslash is copied as-is,
    # then we are back inside the quoted string
    if state == 'ESCAPED':
        json_data += c
        state = 'IN'

    # Handle quote characters
    elif c == '"':
        json_data += c

        if state == 'IN':
            state = 'OUT'
            if key_or_value == 'KEY':
                key_or_value = 'VALUE'
                json_data += ':'

            elif key_or_value == 'VALUE':
                key_or_value = 'KEY'
                json_data += ','

        else:
            state = 'IN'

    # Handle escaped characters (only meaningful inside a quoted string)
    elif c == '\\' and state == 'IN':
        state = 'ESCAPED'
        json_data += c

    # Handle braces (only outside of quoted strings)
    elif c == '{' and state == 'OUT':
        key_or_value = 'KEY'
        json_data += c

    elif c == '}' and state == 'OUT':
        # Strip trailing comma and add closing brace and comma
        json_data = json_data.rstrip().rstrip(',') + '},'

    else:
        json_data += c

# Strip trailing comma
json_data = json_data.rstrip().rstrip(',')

# Wrap the data in braces to form a dictionary
json_data = '{' + json_data + '}'

# Convert from JSON to a native Python dictionary
converted_data = json.loads(json_data)

print(converted_data['Matt']['Items']['Banana'])
