Pyparsing: Parsing nested plaintext data with a key=value

Question

I'm new to Python and am trying to use pyparsing to parse data that looks like this:

string2 = """
object1 {
        key1 = value1
        key2 = value2
        #key3 = value3
        key4 = value4
        #key5 = value5
        key6 = value6
        subobject1 {
            key1 = value1
            key2 = value2
            key3 = value3
        }
}
"""

I can parse the key=value pairs using this code:

from pyparsing import *

def parse_objects(source):
    LBRACE,EQ,RBRACE,HASH = map(Suppress, '{=}#')
    object_name = Word(printables)
    #disable = MatchFirst(map(Literal, '#'.split()))
    key = Word(printables)
    value = Word(printables)

    if LineStart() == HASH:
        key_and_value = Group(HASH + key('key') + EQ + value('value'))
    else:
        key_and_value = Group(key('key') + EQ + value('value'))

    collection = Forward()
    object_body = Group(LBRACE + ZeroOrMore(collection | key_and_value) + RBRACE)
    collection <<= Group(object_name + object_body)

    return collection.parseString(source)

collection = parse_objects(string2)
print(collection.dump())

But I also need to parse data in which some objects contain keys without values. For example:

object1 {
        key1 = value1
        key2
        #key3 = value3
        key4
        #key5 = value5
        key6 = value6
        subobject1 {
            key1 = value1
            key2 = value2
            key3 = value3
        }
}

I tried to change the code by adding a check for whether value is None. Something like this:

if value is None:
    key_and_value = Group(key('key'))
else:
    if LineStart() == HASH:
        key_and_value = Group(HASH + key('key') + EQ + value('value'))
    else:
        key_and_value = Group(key('key') + EQ + value('value'))

but I get an error:

Match W:(0123...) at loc 19(3,9)
Matched W:(0123...) -> ['key1']
Match W:(0123...) at loc 25(3,15)
Matched W:(0123...) -> ['value1']
Match W:(0123...) at loc 41(4,9)
Matched W:(0123...) -> ['key2']
Traceback (most recent call last):
  File "c:\Python27\my_projects\test_parser.py", line 86, in <module>
    collection = parse_objects(string2)
  File "c:\Python27\my_projects\test_parser.py", line 84, in parse_objects
    return collection.parseString(source)
  File "C:\Python27\lib\site-packages\pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected "}" (at char 41), (line:4, col:9)

I think pyparsing takes the key for a subobject name and then does not find the {. Can anyone give me some advice? Maybe I need to change my approach to the grammar? I appreciate any help.

Edit 1

@Jappy's solution works great for the data I posted above, where the subobject1 section is at the bottom of the main section. After analyzing my data, I found that the subobject1 section may be followed by more key=value pairs or bare keys, something like this:

string2 = """
object1 {
        key1 = value1
        key2
        #key3 = value3
        key4 = value4
        subobject1 {
            key1 = value1
            key2 = value2
            key3 = value3
        }        
        #key5 = value5
        key6 = v_a_l_u_e_6
        subobject2 {
            key1 = value1
        }
        key7 = value7
        key8
}
"""

The output is then the following:

[['object1', ['key1', 'value1'], ['key2', 'null'], ['#key3', 'value3'], ['key4', 'value4'], ['subobject1', ['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']], ['#key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', ['key1', 'value1']], ['key7', 'value7'], ['key8', 'null']]]
- objects: ['object1', ['key1', 'value1'], ['key2', 'null'], ['#key3', 'value3'],
['key4', 'value4'], ['subobject1', ['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']], ['#key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', ['key1', 'value1']], ['key7', 'value7'], ['key8', 'null']]
  - key_val_lines: [['key7', 'value7'], ['key8', 'null']]
    [0]:
      ['key7', 'value7']
      - key: 'key7'
      - val: 'value7'
    [1]:
      ['key8', 'null']
      - key: 'key8'
      - val: 'null'
  - obj_name: 'object1'
  - objects: ['subobject2', ['key1', 'value1']]
    - key_val_lines: [['key1', 'value1']]
      [0]:
        ['key1', 'value1']
        - key: 'key1'
        - val: 'value1'
    - obj_name: 'subobject2'

I changed the code like this:

ParserElement.inlineLiteralsUsing(Suppress)
name_expr = Word(printables, excludeChars='{}')
key_val_expr = '=' + Word(printables)

key_val_line = Group(name_expr('key') + (lineEnd().setParseAction(lambda t: 'null') | key_val_expr)('val'))
#key_val_lines = OneOrMore(key_val_line)('key_val_lines')

obj = Forward()
objects = Group('{' + OneOrMore(key_val_line | obj) + '}')
obj << Group(name_expr('obj_name') + objects('objects'))
#obj << Group(name_expr('obj_name') + '{' + OneOrMore(key_val_lines | obj) + '}')('objects')

o = obj.parseString(string2)
print(o.dump())

And the result is:

[['object1', [['key1', 'value1'], ['key2', 'null'], ['#key3', 'value3'], ['key4',
'value4'], ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]], ['#key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', [['key1', 'value1']]], ['key7', 'value7'], ['key8', 'null']]]]
[0]:
  ['object1', [['key1', 'value1'], ['key2', 'null'], ['#key3', 'value3'], ['key4', 'value4'], ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]], ['#key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', [['key1', 'value1']]], ['key7', 'value7'], ['key8', 'null']]]
  - obj_name: 'object1'
  - objects: [['key1', 'value1'], ['key2', 'null'], ['#key3', 'value3'], ['key4',
'value4'], ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]], ['#key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', [['key1', 'value1']]], ['key7', 'value7'], ['key8', 'null']]
    [0]:
      ['key1', 'value1']
      - key: 'key1'
      - val: 'value1'
    [1]:
      ['key2', 'null']
      - key: 'key2'
      - val: 'null'
    [2]:
      ['#key3', 'value3']
      - key: '#key3'
      - val: 'value3'
    [3]:
      ['key4', 'value4']
      - key: 'key4'
      - val: 'value4'
    [4]:
      ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]]
      - obj_name: 'subobject1'
      - objects: [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]
        [0]:
          ['key1', 'value1']
          - key: 'key1'
          - val: 'value1'
        [1]:
          ['key2', 'value2']
          - key: 'key2'
          - val: 'value2'
        [2]:
          ['key3', 'value3']
          - key: 'key3'
          - val: 'value3'
    [5]:
      ['#key5', 'value5']
      - key: '#key5'
      - val: 'value5'
    [6]:
      ['key6', 'v_a_l_u_e_6']
      - key: 'key6'
      - val: 'v_a_l_u_e_6'
    [7]:
      ['subobject2', [['key1', 'value1']]]
      - obj_name: 'subobject2'
      - objects: [['key1', 'value1']]
        [0]:
          ['key1', 'value1']
          - key: 'key1'
          - val: 'value1'
    [8]:
      ['key7', 'value7']
      - key: 'key7'
      - val: 'value7'
    [9]:
      ['key8', 'null']
      - key: 'key8'
      - val: 'null'

But I could not set a results name on the Group to replace the [0] index:

obj << Group(name_expr('obj_name') + objects('objects'))('section')

returns the wrong result.

What is the significance of the '#' character? Does it indicate a comment? Or a special kind of key?
– PaulMcG, Commented Jun 27, 2018 at 13:54
The '#' character means that the key is disabled in the configuration file. Next, I want to go through the list of keys and find which are active and which are disabled.
– Vik, Commented Jun 27, 2018 at 15:41
@PaulMcG: Maybe I can use pyparsing's SkipTo class? Something like this: https://stackoverflow.com/questions/44890040/pyparsing-a-field-that-may-or-may-not-contain-values? But I still don't understand it.
– Vik, Commented Jun 29, 2018 at 12:51
Alternative library: github.com/chimpler/pyhocon/blob/master/README.md
– OneCricketeer, Commented Jun 29, 2018 at 23:18
@cricket_007, thank you, it looks interesting. I'll take a look at this library.
– Vik, Commented Jul 3, 2018 at 12:35

Giuseppe Cianci · Accepted Answer · 2018-06-29 23:09:02Z

This should help you out. See comments for details.

from pyparsing import *

test_string ='''
object1 {
        key1 = value1
        key2
        #key3 = value3
        key4
        #key5 = value5
        key6 = value6
        subobject1 {
            key1 = value1
            key2 = value2
            key3 = value3
        }
}'''

# interpret inline 'string' as Suppress('string'), 
# instead of LBRACE,EQ,RBRACE,HASH = map(Suppress, '{=}#')
ParserElement.inlineLiteralsUsing(Suppress)  

# be sure to exclude special characters when using printables
name_expr = Word(printables, excludeChars='{}')
key_val_expr = '=' + Word(printables)

# p1('name') is equivalent to p1.setResultsName('name')
# p1 | p2 is equivalent to MatchFirst([p1, p2])
# if lineEnd() matches first, there is no value. 
# then use a parse action to return the string 'NONE' as value instead
# else, match a regular key_value
# also, you have to use Group because key_val_line is a repeating element
key_val_line = Group(name_expr('key') + (lineEnd().setParseAction(lambda t: 'NONE') | key_val_expr)('val'))
key_val_lines = OneOrMore(key_val_line)('key_val_lines')

obj = Forward()
obj << Group(name_expr('obj_name') + '{' + OneOrMore(key_val_lines | obj) + '}')('objects')

parse_results = obj.parseString(test_string)
print(parse_results.dump())

This prints the following:

[['object1', ['key1', 'value1'], ['key2', 'NONE'], ['#key3', 'value3'], ['key4', 'NONE'], ['#key5', 'value5'], ['key6', 'value6'], ['subobject1', ['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]]]
- objects: ['object1', ['key1', 'value1'], ['key2', 'NONE'], ['#key3', 'value3'], ['key4', 'NONE'], ['#key5', 'value5'], ['key6', 'value6'], ['subobject1', ['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]]
  - key_val_lines: [['key1', 'value1'], ['key2', 'NONE'], ['#key3', 'value3'], ['key4', 'NONE'], ['#key5', 'value5'], ['key6', 'value6']]
    [0]:
      ['key1', 'value1']
      - key: 'key1'
      - val: 'value1'
    [1]:
      ['key2', 'NONE']
      - key: 'key2'
      - val: 'NONE'
    [2]:
      ['#key3', 'value3']
      - key: '#key3'
      - val: 'value3'
    [3]:
      ['key4', 'NONE']
      - key: 'key4'
      - val: 'NONE'
    [4]:
      ['#key5', 'value5']
      - key: '#key5'
      - val: 'value5'
    [5]:
      ['key6', 'value6']
      - key: 'key6'
      - val: 'value6'
  - obj_name: 'object1'
  - objects: ['subobject1', ['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]
    - key_val_lines: [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]
      [0]:
        ['key1', 'value1']
        - key: 'key1'
        - val: 'value1'
      [1]:
        ['key2', 'value2']
        - key: 'key2'
        - val: 'value2'
      [2]:
        ['key3', 'value3']
        - key: 'key3'
        - val: 'value3'
    - obj_name: 'subobject1'

Thank you @Jeppi, your solution was very helpful. I changed it a little, though, because I again ran into errors when parsing my data. See Edit 1 for details. Thanks again.

PaulMcG · Accepted Answer · 2018-07-03 14:19:16Z

Recursive parsers are not an easy first start with pyparsing, and your optional bits make things more complicated too. I think this code mostly does what you want - hopefully it will be more meaningful to you now that you have done some of your own wrestling with pyparsing thus far:

import pyparsing as pp

LBRACE, RBRACE, EQ = map(pp.Suppress, "{}=")
# convert parsed '#' to a bool that you can test on
disabled_marker = pp.Literal("#").addParseAction(lambda: True)
identifier = pp.pyparsing_common.identifier
key = identifier()

# try to parse a numeric value first, might be interesting
# pyparsing_common.number will auto-convert string to float or int at parse time,
# so you won't have to detect and do the conversion later
value = pp.pyparsing_common.number | pp.Word(pp.printables)

obj_item = pp.Forward()
obj_expr = pp.Group(identifier("name")
                    + pp.Group(LBRACE
                               + pp.ZeroOrMore(obj_item)
                               + RBRACE)("attributes"))

key_with_value = pp.Group(pp.Optional(disabled_marker)("disabled")
                          + key("key") + EQ + value("value"))
# use empty() to inject a None for the value
key_without_value = pp.Group(pp.Optional(disabled_marker)("disabled")
                             + key("key") 
                             + pp.empty().addParseAction(lambda: [None])("value"))

# now define an item that can be used in an object - this order is important!
obj_item <<= obj_expr | key_with_value | key_without_value

To parse your string2 input:

zz = obj_expr.parseString(string2)
print(zz[0].dump())

Gives:

['object1', [['key1', 'value1'], ['key2', None], [True, 'key3', 'value3'], ['key4', 'value4'], ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]], [True, 'key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', [['key1', 'value1']]], ['key7', 'value7'], ['key8', None]]]
- attributes: [['key1', 'value1'], ['key2', None], [True, 'key3', 'value3'], ['key4', 'value4'], ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]], [True, 'key5', 'value5'], ['key6', 'v_a_l_u_e_6'], ['subobject2', [['key1', 'value1']]], ['key7', 'value7'], ['key8', None]]
  [0]:
    ['key1', 'value1']
    - key: 'key1'
    - value: 'value1'
  [1]:
    ['key2', None]
    - key: 'key2'
    - value: None
  [2]:
    [True, 'key3', 'value3']
    - disabled: True
    - key: 'key3'
    - value: 'value3'
  [3]:
    ['key4', 'value4']
    - key: 'key4'
    - value: 'value4'
  [4]:
    ['subobject1', [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]]
    - attributes: [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]
      [0]:
        ['key1', 'value1']
        - key: 'key1'
        - value: 'value1'
      [1]:
        ['key2', 'value2']
        - key: 'key2'
        - value: 'value2'
      [2]:
        ['key3', 'value3']
        - key: 'key3'
        - value: 'value3'
    - name: 'subobject1'
  [5]:
    [True, 'key5', 'value5']
    - disabled: True
    - key: 'key5'
    - value: 'value5'
  [6]:
    ['key6', 'v_a_l_u_e_6']
    - key: 'key6'
    - value: 'v_a_l_u_e_6'
  [7]:
    ['subobject2', [['key1', 'value1']]]
    - attributes: [['key1', 'value1']]
      [0]:
        ['key1', 'value1']
        - key: 'key1'
        - value: 'value1'
    - name: 'subobject2'
  [8]:
    ['key7', 'value7']
    - key: 'key7'
    - value: 'value7'
  [9]:
    ['key8', None]
    - key: 'key8'
    - value: None
- name: 'object1'

EDIT: I removed the Dict constructs, as they actually make the output more difficult to process.
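
Since the goal stated in the comments is to find which keys are active and which are disabled, here is a minimal sketch of how the named results might be walked. The grammar is repeated from above so the block runs on its own; the collect_keys helper, the dotted path format, and the shortened sample input are illustrative assumptions, not part of the original answer.

```python
import pyparsing as pp

# Grammar repeated from the answer above.
LBRACE, RBRACE, EQ = map(pp.Suppress, "{}=")
disabled_marker = pp.Literal("#").addParseAction(lambda: True)
identifier = pp.pyparsing_common.identifier
key = identifier()
value = pp.pyparsing_common.number | pp.Word(pp.printables)

obj_item = pp.Forward()
obj_expr = pp.Group(identifier("name")
                    + pp.Group(LBRACE
                               + pp.ZeroOrMore(obj_item)
                               + RBRACE)("attributes"))
key_with_value = pp.Group(pp.Optional(disabled_marker)("disabled")
                          + key("key") + EQ + value("value"))
key_without_value = pp.Group(pp.Optional(disabled_marker)("disabled")
                             + key("key")
                             + pp.empty().addParseAction(lambda: [None])("value"))
obj_item <<= obj_expr | key_with_value | key_without_value


def collect_keys(obj, path=""):
    """Hypothetical helper: return (active, disabled) lists of dotted key paths."""
    active, disabled = [], []
    prefix = path + obj["name"] + "."
    for item in obj["attributes"]:
        if "attributes" in item:
            # A nested object carries its own "attributes" group, so recurse.
            sub_active, sub_disabled = collect_keys(item, prefix)
            active += sub_active
            disabled += sub_disabled
        elif item.get("disabled"):
            # "disabled" is only present when the optional '#' was matched.
            disabled.append(prefix + item["key"])
        else:
            active.append(prefix + item["key"])
    return active, disabled


# Shortened sample in the same shape as string2.
sample = """
object1 {
        key1 = value1
        key2
        #key3 = value3
        subobject1 {
            key1 = value1
        }
        key8
}
"""

active, disabled = collect_keys(obj_expr.parseString(sample)[0])
print("active:  ", active)
print("disabled:", disabled)
```

Testing for the "attributes" name is one way to tell a nested object from a plain key; checking for "name" would work just as well with this grammar.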

You are right, parsing this custom data structure was not so simple. I looked at many examples and did not find a solution that fit. Your trick with disabled_marker is very cool, and the separation into key_with_value and key_without_value looks reasonable. Your tips helped me understand how pyparsing works. Thank you again.

PaulMcG · Accepted Answer · 2018-07-01 05:41:41Z

@Jeppi's answer has some excellent suggestions. I would add:

Word(printables) is always a risky construct, since it will match as much non-whitespace as there is. For instance, if a line contained "color=red" with no spaces, it would be interpreted as a key "color=red" with no value. You would be better off defining key with something like Word(alphanums) or Word(alphas, alphanums+"_"). To allow for the possible leading '#', use Word(alphas+'#', alphanums+"_").
Your idea about conditionalizing the presence of '#' with if LineStart() == HASH is interesting, but not how pyparsing works. At this point in the code, you are still building the parser itself, which happens separate from any input text. The actual determination of whether a particular line starts with '#' happens during parsing, which is done later when your code calls collection.parseString. That is, you build up all the parser bits, and then point them at the source text. Any "if character X is present" logic needs to be represented using some alternation or optional construct in the parser itself, not with Python if-then code.
Consider using pyparsing's Optional class for elements that may or may not be present. This applies to the possible key-value without a value, and could also be another way to handle the possible leading '#' character in key names.
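
The points above can be sketched on single lines of input. This is only an illustration under assumptions: the names greedy_key and line are made up here, and the grammar handles one key line at a time rather than the full nested format.

```python
import pyparsing as pp

# Word(printables) swallows every non-whitespace character, '=' included,
# so "color=red" comes back as one token.
greedy_key = pp.Word(pp.printables)
greedy_result = greedy_key.parseString("color=red").asList()
print(greedy_result)

# A narrower key stops at '='. The "may or may not be present" logic lives
# in the grammar itself (Optional), not in Python if statements.
EQ = pp.Suppress("=")
key = pp.Word(pp.alphas, pp.alphanums + "_")
value = pp.Word(pp.printables)
line = (pp.Optional(pp.Literal("#"))("disabled")
        + key("key")
        + pp.Optional(EQ + value("value")))

for text in ["color=red", "key2", "#key3 = value3"]:
    result = line.parseString(text)
    # get() returns None when the optional value was not matched
    print(text, "->", result.get("key"), result.get("value"), "disabled" in result)
```

The parser is built once, before any input is seen; whether a given line has a '#' or a value is decided only when parseString runs.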

Thank you for your explanations. I will definitely use Word(alphas, alphanums) plus special characters. The explanation of '#' and of pyparsing's logic is very useful for me. I used @Jappy's solution and changed my code; could you also look at my Edit 1 and comment on it?
– Vik
