
I have two somewhat related questions regarding parsing a JSON like data format using pyparsing. The goal is to parse this data and convert the result to JSON.

1) The first type of data looks like

mystr = """
    DataName = {
        fieldA = {
            fieldB = 10
            fieldC = "absf"
        }
    }
    DataName = {
        fieldA = {
            fieldB = 11
            fieldC = "bsf"
        }
    }
"""

I'm wondering what the best way to set up the grammar is, in order to parse mystr into a list of dictionaries that would look like

expected_result = [{"DataName": {"fieldA": {"fieldB": 10, "fieldC": "absf"}}},
                   {"DataName": {"fieldA": {"fieldB": 11, "fieldC": "bsf"}}}]

My first attempt is as follows

from pyparsing import *
LBRACE, RBRACE, EQUAL = map(Suppress, "{}=")
field = Word(alphas + '[]')
string = dblQuotedString().setParseAction(removeQuotes)
number = pyparsing_common.number()

value = (string | number)
jobject = Forward()
memberDef = Group(field + EQUAL + value)
members = delimitedList(memberDef ^ jobject, delim=LineEnd())
jobject << Dict(field + EQUAL + LBRACE + Optional(members) + RBRACE)

members.parseString(mystr)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-70cbdee9640b> in <module>()
----> 1 members.parseString(mystr)

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseString(self, instring, parseAll)
   1204             instring = instring.expandtabs()
   1205         try:
-> 1206             loc, tokens = self._parse( instring, 0 )
   1207             if parseAll:
   1208                 loc = self.preParse( instring, loc )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2525         # pass False as last arg to _parse for first element, since we already
   2526         # pre-parsed the string as part of our And pre-parsing
-> 2527         loc, resultlist = self.exprs[0]._parse( instring, loc, doActions, callPreParse=False )
   2528         errorStop = False
   2529         for e in self.exprs[1:]:

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2587         for e in self.exprs:
   2588             try:
-> 2589                 loc2 = e.tryParse( instring, loc )
   2590             except ParseException as err:
   2591                 err.__traceback__ = None

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in tryParse(self, instring, loc)
   1112     def tryParse( self, instring, loc ):
   1113         try:
-> 1114             return self._parse( instring, loc, doActions=False )[0]
   1115         except ParseFatalException:
   1116             raise ParseException( instring, loc, self.errmsg, self)

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2799     def parseImpl( self, instring, loc, doActions=True ):
   2800         if self.expr is not None:
-> 2801             return self.expr._parse( instring, loc, doActions, callPreParse=False )
   2802         else:
   2803             raise ParseException("",loc,self.errmsg,self)

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2799     def parseImpl( self, instring, loc, doActions=True ):
   2800         if self.expr is not None:
-> 2801             return self.expr._parse( instring, loc, doActions, callPreParse=False )
   2802         else:
   2803             raise ParseException("",loc,self.errmsg,self)

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2542                     raise ParseSyntaxException( ParseException(instring, len(instring), self.errmsg, self) )
   2543             else:
-> 2544                 loc, exprtokens = e._parse( instring, loc, doActions )
   2545             if exprtokens or exprtokens.haskeys():
   2546                 resultlist += exprtokens

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   3004     def parseImpl( self, instring, loc, doActions=True ):
   3005         try:
-> 3006             loc, tokens = self.expr._parse( instring, loc, doActions, callPreParse=False )
   3007         except (ParseException,IndexError):
   3008             if self.defaultValue is not _optionalNotMatched:

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2525         # pass False as last arg to _parse for first element, since we already
   2526         # pre-parsed the string as part of our And pre-parsing
-> 2527         loc, resultlist = self.exprs[0]._parse( instring, loc, doActions, callPreParse=False )
   2528         errorStop = False
   2529         for e in self.exprs[1:]:

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2587         for e in self.exprs:
   2588             try:
-> 2589                 loc2 = e.tryParse( instring, loc )
   2590             except ParseException as err:
   2591                 err.__traceback__ = None

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in tryParse(self, instring, loc)
   1112     def tryParse( self, instring, loc ):
   1113         try:
-> 1114             return self._parse( instring, loc, doActions=False )[0]
   1115         except ParseFatalException:
   1116             raise ParseException( instring, loc, self.errmsg, self)

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1070             if self.mayIndexError or loc >= len(instring):
   1071                 try:
-> 1072                     loc,tokens = self.parseImpl( instring, preloc, doActions )
   1073                 except IndexError:
   1074                     raise ParseException( instring, len(instring), self.errmsg, self )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2799     def parseImpl( self, instring, loc, doActions=True ):
   2800         if self.expr is not None:
-> 2801             return self.expr._parse( instring, loc, doActions, callPreParse=False )
   2802         else:
   2803             raise ParseException("",loc,self.errmsg,self)

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in _parseNoCache(self, instring, loc, doActions, callPreParse)
   1076                 loc,tokens = self.parseImpl( instring, preloc, doActions )
   1077
-> 1078         tokens = self.postParse( instring, loc, tokens )
   1079
   1080         retTokens = ParseResults( tokens, self.resultsName, asList=self.saveAsList, modal=self.modalResults )

/home/matthew/anaconda3/lib/python3.5/site-packages/pyparsing.py in postParse(self, instring, loc, tokenlist)
   3247                 tokenlist[ikey] = _ParseResultsWithOffset(tok[1],i)
   3248             else:
-> 3249                 dictvalue = tok.copy() #ParseResults(i)
   3250                 del dictvalue[0]
   3251                 if len(dictvalue)!= 1 or (isinstance(dictvalue,ParseResults) and dictvalue.haskeys()):

AttributeError: 'str' object has no attribute 'copy'

This does not work, but I am unclear why. As I understand it, mystr is a delimitedList of two jobjects (the DataName entries), where each jobject contains one jobject (fieldA), which in turn consists of one members list holding two memberDefs. What am I missing here?

Alternatively, I could define my grammar as follows

value = Forward()
jobject = Forward()
value << (string | number | Group(jobject))
memberDef = Group(field + EQUAL + value)
members = delimitedList(memberDef, delim=LineEnd())
jobject << Dict(LBRACE + Optional(members) + RBRACE)
res = members.parseString(mystr)

I can then iterate through the results and generate dictionaries; however, this feels like a bit of a kludge.

list_of_dicts = []
for pair in res:
    list_of_dicts.append({pair[0]: pair[1].asDict()})

print(list_of_dicts)
[{'DataName': {'fieldA': {'fieldC': 'absf', 'fieldB': 10.0}}}, {'DataName': {'fieldA': {'fieldC': 'bsf', 'fieldB': 11.0}}}]

2) The data format also includes text like the following.

mystr2 = """
fieldA = {
    someFieldA[] = {
    }
    someFieldB[] = {
        "typeA", "typeB"
    }
    someFieldC[] = {
        fieldData = {
            data = 10
        }
        fieldData = {
            data = 12
        }
    }
    someFieldD = "bsf"
}
fieldA = {
}
"""

I would like to parse this into a list of dictionaries as follows

expected_result2 = [{"fieldA": {"someFieldA": [],
                               "someFieldB": ["typeA", "typeB"],
                               "someFieldC":[{"fieldData": {"data": 10}},
                                             {"fieldData": {"data": 12}}],
                               "someFieldD": "bsf"}},
                    {"fieldA": {}}]

I attempted to address this by adding an array type to the grammar

value = Forward()
jobject = Forward()

arrayElements = delimitedList(string)
array = Group(LBRACE + Optional(arrayElements, []) + RBRACE)

value << (string | number | Group(jobject) | array)
memberDef = Group(field + EQUAL + value)
members = delimitedList(memberDef, delim=LineEnd())
jobject << Dict(LBRACE + Optional(members) + RBRACE)

res2 = members.parseString(mystr2)
print(res2)
 [['fieldA', [['someFieldA[]', []], ['someFieldB', ['typeA', 'typeB']], ['someFieldC[]', [['fieldData', [['data', 10.0]]], ['fieldData', [['data', 12.0]]]]], ['someFieldD', 'bsf']]], ['fieldA', []]]

This returns a ParseResults object; however, I am unsure how to transform it into something like expected_result2. In addition, nothing in the grammar above distinguishes between elements of the form

Data = {
}

and

Data[] = {
}

which should map to {"Data": {}} and {"Data": []} respectively.
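One possible way to post-process such a nested result without changing the grammar is sketched below in plain Python. It assumes the ParseResults has already been turned into nested lists of [key, value] pairs (for example with asList()), that a key ending in "[]" marks a list, and that such a list holds either plain strings or further [key, value] pairs. The helper name convert_pair and the literal nested are illustrative only and not part of pyparsing.

```python
def convert_pair(key, value):
    """Return (name, converted_value) for one [key, value] pair."""
    if key.endswith("[]"):
        name = key[:-2]  # drop the list marker from the key
        if all(isinstance(item, str) for item in value):
            # empty list, or a comma separated list of strings
            return name, list(value)
        # list of single-key dictionaries, one per repeated member
        return name, [dict([convert_pair(k, v)]) for k, v in value]
    if isinstance(value, list):
        # plain "name = { ... }" block: build a dictionary from its pairs
        return key, dict(convert_pair(k, v) for k, v in value)
    return key, value  # scalar


# Hand-written stand-in for the nested lists printed for res2 above
nested = [
    ["fieldA", [
        ["someFieldA[]", []],
        ["someFieldB[]", ["typeA", "typeB"]],
        ["someFieldC[]", [["fieldData", [["data", 10.0]]],
                          ["fieldData", [["data", 12.0]]]]],
        ["someFieldD", "bsf"],
    ]],
    ["fieldA", []],
]

result = [dict([convert_pair(k, v)]) for k, v in nested]
print(result)
```

Because the marker stays in the key, Data = {} and Data[] = {} can be told apart at conversion time even though both parse to an empty group.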

Edit

There was a typo in mystr2 above: someFieldB[] = { had been improperly written as someFieldB = {.

A grammar that accounts for the significance of [] is shown below.

LBRACE, RBRACE, EQUAL = map(Suppress, "{}=")
field = Word(alphas)
string = dblQuotedString().setParseAction(removeQuotes)
number = pyparsing_common.number()
scalar_value = (string | number)

value_list = Forward()
jobject = Forward()

memberDef1 = Group(field + EQUAL + scalar_value)
memberDef2 = Group(field + EQUAL + jobject)
memberDef3 = Group(field + "[]" + EQUAL + LBRACE + value_list + RBRACE)
memberDef = memberDef1 | memberDef2 | memberDef3

value_list << (delimitedList(string, ",") | ZeroOrMore(memberDef2))
members = delimitedList(memberDef, delim=LineEnd())
jobject << Dict(LBRACE + Optional(members, '{}') + RBRACE)
res = members.parseString(mystr2)

This appears to parse properly; however, I am still unclear how to transform res into a list of dictionaries.

Edit 2

An actual example illustrating the grammar is included below

HistoricalDataRequest = {
    securities[] = {
        "SPY US Equity", "TLT US Equity"
    }
    fields[] = {
        "PX_LAST"
    }
    startDate = "20150629"
    endDate = "20150630"
    overrides[] = {
    }
}

HistoricalDataResponse = {
    securityData = {
        security = "SPY US Equity"
        eidData[] = {
        }
        sequenceNumber = 0
        fieldExceptions[] = {
        }
        fieldData[] = {
            fieldData = {
                date = 2015-06-29
                PX_LAST = 205.420000
            }
            fieldData = {
                date = 2015-06-30
                PX_LAST = 205.850000
            }
        }
    }
}

HistoricalDataResponse = {
    securityData = {
        security = "TLT US Equity"
        eidData[] = {
        }
        sequenceNumber = 1
        fieldExceptions[] = {
        }
        fieldData[] = {
            fieldData = {
                date = 2015-06-29
                PX_LAST = 118.280000
            }
            fieldData = {
                date = 2015-06-30
                PX_LAST = 117.460000
            }
        }
    }
}
  • There is inconsistency between the original data and what you want to have as output. On one hand, you want to convert fieldB = 10\n fieldC = "absf" to a dictionary. On the other hand, you want to convert an identical construct DataName = {...}\n DataName = {...} into a list of one-element dictionaries. Does it happen that DataName indeed gets repeated, or is it DataNameA, DataNameB in the original data? Commented May 23, 2017 at 21:16
  • Ah, actually I see now: the first example is a subcase of the second example, with the root element sort of having [] at the end. Commented May 23, 2017 at 21:22
  • Yes that's correct, sorry for the ambiguity. The [] indicates a list, which from analyzing the format can either be a comma separated list of strings as in someFieldB = {\n"typeA", "typeB"\n} or a list of \n separated dictionaries as in someFieldC[] = {\n fieldData = {\n data = 10\n }\n fieldData = {\n data = 12\n}\n Commented May 23, 2017 at 21:46
  • If the "[]" suffix is significant for determining the definition of an array of values, then you shouldn't bury it in the definition of field. Instead, declare field + EQUAL + value separately from field + "[]" + EQUAL + value_list. Then define value_list as delimitedList(scalar_value) | OneOrMore(jobject), where scalar_value = string | number. Commented May 23, 2017 at 23:28
  • @PaulMcGuire Thanks, great library by the way. I have edited my question to include your feedback. Is there a way to allow value_list to be empty and to set the default in this case to []? i.e. strings of the form myValue[] = {}? I attempted to use ZeroOrMore instead of OneOrMore but this doesn't appear to support a default parameter? Commented May 24, 2017 at 0:57

1 Answer


Ok, with some finagling and shenanigans, I think I have contrived a parser that can give you JSON-able dicts from this format.

from pyparsing import *

LBRACE, RBRACE, EQUAL = map(Suppress, "{}=")
field = Word(alphas, alphas+'_')
# was field = Word(alphas)
string = dblQuotedString().setParseAction(removeQuotes)
number = pyparsing_common.number()
date_expr = Regex(r'\d\d\d\d-\d\d-\d\d')
scalar_value = (string | date_expr | number)
# was scalar_value = (string | number)

list_marker = Suppress("[]")
value_list = Forward()
jobject = Forward()

memberDef1 = Group(field + EQUAL + scalar_value)
memberDef2 = Group(field + EQUAL + jobject)
memberDef3 = Group(field + list_marker + EQUAL + LBRACE + value_list + RBRACE)
memberDef = memberDef1 | memberDef2 | memberDef3

value_list <<= (delimitedList(scalar_value, ",") | ZeroOrMore(Group(Dict(memberDef2))))
value_list.setParseAction(lambda t: [ParseResults(t[:])])

members = OneOrMore(memberDef)
jobject <<= Dict(LBRACE + ZeroOrMore(memberDef) + RBRACE)
# force empty jobject to be a dict
jobject.setParseAction(lambda t: t or {})

# was parser = members
parser = OneOrMore(Group(Dict(memberDef)))

tests = [mystr, mystr2]

import pprint
import json
for test in tests:
    print(test)
    res = parser.parseString(test)
    for res_dict in res:
        pprint.pprint(res_dict.asDict())
        # or convert to JSON using:
        # print(json.dumps(res_dict.asDict(), indent=2))
    print('')

prints (adding empty jobject for someFieldE and empty list for someFieldF):

{'DataName': {'fieldA': {'fieldB': 10, 'fieldC': 'absf'}}}
{'DataName': {'fieldA': {'fieldB': 11, 'fieldC': 'bsf'}}}

{'fieldA': {'someFieldA': [],
            'someFieldB': ['typeA', 'typeB'],
            'someFieldC': [{'a': {'data': 10}}, {'a': {'data': 12}}],
            'someFieldD': 'bsf',
            'someFieldE': {},
            'someFieldF': []}}

I worked around the multiple dict keys by wrapping Groups around Dicts, so that the duplicate keys would be isolated into separate ParseResults. The parse action on value_list is there so that empty lists return an empty ParseResults in a list. I had to force empty jobjects to become dicts, because an empty ParseResults has no keys, and so won't return a dict from asDict().
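The effect of wrapping each Dict in a Group can be seen in isolation with a much smaller grammar. The sketch below is only an illustration, not the parser above: key, value, pair and the sample text are made-up names, and values are kept as strings to keep it short.

```python
import pyparsing as pp

key = pp.Word(pp.alphas)
value = pp.Word(pp.nums)
pair = pp.Group(key + pp.Suppress("=") + value)

text = "a = 1 a = 2"

# One Dict over all pairs: both pairs share a single key namespace,
# so asDict() can only hold one entry for the repeated key 'a'.
merged = pp.Dict(pp.OneOrMore(pair)).parseString(text).asDict()

# One Group(Dict(...)) per pair: each pair lands in its own ParseResults,
# so each can be converted to its own dictionary.
separate_parser = pp.OneOrMore(pp.Group(pp.Dict(pair)))
separate = [r.asDict() for r in separate_parser.parseString(text)]

print(merged)
print(separate)
```

This is the same pattern as parser = OneOrMore(Group(Dict(memberDef))) above, where repeated top-level names such as DataName would otherwise collide.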

(Edit: To accommodate your posted example, I had to add '_' as a valid field name character, and also define a new date_expr type for the date-like field values.)
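To see why date_expr is listed before number in scalar_value, here is a rough illustration using only the standard re module. NUMBER is a simplified stand-in for pyparsing_common.number, FIELD mirrors Word(alphas, alphas+'_'), and classify is a hypothetical helper; none of them are part of the parser above.

```python
import re

DATE = re.compile(r"\d\d\d\d-\d\d-\d\d")
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")  # simplified stand-in for number
FIELD = re.compile(r"[A-Za-z][A-Za-z_]*")   # letters, then letters or '_'


def classify(text):
    """Try the date pattern first, then the number pattern."""
    if DATE.fullmatch(text):
        return "date"
    if NUMBER.fullmatch(text):
        return "number"
    return "other"


# A number pattern tried first would consume only the leading year,
# leaving "-06-29" behind as unparsed input.
print(NUMBER.match("2015-06-29").group())
print(classify("2015-06-29"), classify("205.420000"))
print(bool(FIELD.fullmatch("PX_LAST")))
```

With a first-match alternation such as string | date_expr | number, the more specific date pattern has to come before the more general number pattern for the same reason.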


3 Comments

Thank you for this very detailed response. Upon further inspection of my data, I realised a field can actually contain spaces as well, e.g. Security Description = "COQ7 Comdty". I attempted to solve this adding a ' ' to the bodyChars of Word, i.e. field = Word(alphas, alphas+'_'+' ') however this then matches the trailing space before the =. I have been fumbling around with Regex but can't seem to get the right expression
Probably the simplest is to add str.rstrip as a parse action to field to strip the trailing spaces: field = Word(printables+' ', excludeChars='='); field.addParseAction(tokenMap(str.rstrip))
Or do this to convert to nice Pythonic identifiers: field.addParseAction(tokenMap(str.rstrip), tokenMap(str.lower), tokenMap(str.replace, ' ', '_'))
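For reference, the string operations chained in the comment above amount to the following plain-Python steps. normalize_field is a hypothetical name used only for this illustration; in the grammar the same steps would run as parse actions on field.

```python
def normalize_field(raw):
    """Strip trailing spaces, lowercase, and replace spaces with underscores."""
    return raw.rstrip().lower().replace(" ", "_")


# The trailing space before "=" is removed before the inner space is replaced
print(normalize_field("Security Description "))
```

The order matters: stripping first keeps the trailing space from turning into a trailing underscore.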
