This is my first try at using pyparsing, and I am having a hard time setting it up. I want to use pyparsing to parse lexc files. The lexc format is used to declare a lexicon that is compiled into finite-state transducers.

Special characters:

:    divides 'upper' and 'lower' sides of a 'data' declaration
;    terminates entry
#    reserved LEXICON name. end-of-word or final state
' '  (space) universal delimiter
!    introduces comment to the end of the line
<    introduces xfst-style regex
>    closes xfst-style regex
%    escape character: %: %; %# %  %! %< %> %%
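For reference, un-escaping amounts to dropping each '%' and keeping the character that follows it. A quick plain-Python sketch (my own helper, independent of any parser):

```python
import re

def lexc_unescape(s):
    # '%X' stands for the literal character X, including '%%' for '%'
    return re.sub(r'%(.)', r'\1', s, flags=re.DOTALL)

print(lexc_unescape("sour% cream"))  # sour cream
print(lexc_unescape("%:%:%:"))       # :::
```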

There are multiple levels to parse.

Universally, anything from unescaped ! to a newline is a comment. This could be handled separately at each level.

At the document level, there are three different sections:

Multichar_Symbols    Optional one-time declaration
LEXICON              Usually many of these
END                  Anything after this is ignored

At the Multichar_Symbols level, anything separated by whitespace is a declaration. This section ends at the first declaration of a LEXICON.

Multichar_Symbols the+first-one thesecond_one
third_one ! comment that this one is special
+Pl       ! plural

At the LEXICON level, the LEXICON's name is declared as:

LEXICON the_name ! whitespace delimited

After the name declaration, a LEXICON is composed of entries: data continuation ;. The semicolon delimits entries. data is optional.

At the data level, there are three possible forms:

  1. upper:lower,

  2. simple (which is exploded to upper and lower as simple:simple),

  3. <xfst-style regex>.

Examples:

! # is a reserved continuation that means "end of word".
dog+Pl:dogs # ;  ! upper:lower continuation ;
cat # ;          ! automatically exploded to "cat:cat # ;" by interpreter
Num ;            ! no data, only a continuation to LEXICON named "Num"
<[1|2|3]+> # ;    ! xfst-style regex enclosed in <>

Everything after END is ignored

A complete lexc file might look like this:

! Comments begin with !

! Multichar_Symbols (separated by whitespace, terminated by first declared LEXICON)
Multichar_Symbols +A +N +V  ! +A is adjectives, +N is nouns, +V is verbs
+Adv  ! This one is for adverbs
+Punc ! punctuation
! +Cmpar ! This is broken for now, so I commented it out.

! The bulk of lexc is made up of LEXICONs, which contain entries that point to
! other LEXICONs. "Root" is a reserved lexicon name, and the start state.
! "#" is also a reserved lexicon name, and the end state.

LEXICON Root  ! Root is a reserved lexicon name, if it is not declared, then the first LEXICON is assumed to be the root
big Adj ;
bigly Adv ;  ! Not sure if this is a real word...
dog Noun ;
cat Noun ;
crow Noun ;
crow Verb ;
Num ;        ! This continuation class generates numbers using xfst-style regex

! NB all the following are reserved characters

sour% cream Noun ;  ! escaped space
%: Punctuation ;    ! escaped :
%; Punctuation ;    ! escaped ;
%# Punctuation ;    ! escaped #
%! Punctuation ;    ! escaped !
%% Punctuation ;    ! escaped %
%< Punctuation ;    ! escaped <
%> Punctuation ;    ! escaped >

%:%:%::%: # ; ! Should map ::: to :

LEXICON Adj
+A: # ;      ! # is a reserved lexicon name which means end-of-word (final state).
! +Cmpar:er # ;  ! Broken, so I commented it out.

LEXICON Adv
+Adv: # ;

LEXICON Noun
+N+Sg: # ;
+N+Pl:s # ;

LEXICON Num
<[0|1|2|3|4|5|6|7|8|9]> Num ; ! This is an xfst regular expression and a cyclic continuation
# ; ! After the first cycle, this makes sense, but as it is, this is bad.

LEXICON Verb
+V+Inf: # ;
+V+Pres:s # ;

LEXICON Punctuation
+Punc: # ;

END

This text is ignored because it is after END

So there are multiple different levels at which to parse. What is the best way to set this up in pyparsing? Are there any examples of this kind of hierarchical language that I could follow as a model?

1 Answer

The strategy when using pyparsing is to break up the parsing problem into small parts, and then compose them into the larger ones.

Beginning with your first high-level structure definition:

Multichar_Symbols    Optional one-time declaration
LEXICON              Usually many of these
END                  Anything after this is ignored

your eventual overall parser will look like:

parser = (Optional(multichar_symbols_section)('multichar_symbols')
          + Group(OneOrMore(lexicon_section))('lexicons') 
          + END)

The names in parentheses after each part will give us labels to make it easy to access the different parts of the parsed results.

Going into deeper detail, let's look at how to define the parser for the lexicon_section.

First, define the punctuation and special keywords:

COLON,SEMI = map(Suppress, ":;")
HASH = Literal('#')
LEXICON, END = map(Keyword, "LEXICON END".split())

Your identifiers and values can contain '%'-escaped characters, so we need to build them up from pieces:

# use regex and Combine to handle % escapes
escaped_char = Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = Word(printables, excludeChars=':%;')
xfst_regex = Regex(r'<.*?>')
ident = Combine(OneOrMore(escaped_char | ident_lit_part | xfst_regex))
value_expr = ident()
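As a quick check of the escape handling (repeating the definitions just given so this snippet runs on its own):

```python
import pyparsing as pp

# same definitions as above
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part))

print(ident.parseString("sour% cream")[0])  # sour cream
print(ident.parseString("%:%:%:")[0])       # :::
print(ident.parseString("dog+Pl")[0])       # dog+Pl
```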

With these pieces, we can now define an individual lexicon declaration:

# handle the following lexicon declarations:
#    name ;
#    name:value ;
#    name value ;
#    name value # ;
lexicon_decl = Group(ident("name") 
                     + Optional(Optional(COLON) 
                                + value_expr("value") 
                                + Optional(HASH)('hash'))
                     + SEMI)

This part is a little messy: it turns out that value can come back as a plain string, as a results structure (a pyparsing ParseResults), or may be missing entirely. We can use a parse action to normalize all these forms into a single string form.

# use a parse action to normalize the parsed values
def fixup_value(tokens):
    if 'value' in tokens[0]:
        # pyparsing makes this a nested element, just take zero'th value
        if isinstance(tokens[0].value, ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        # no value was given, expand 'name' as if 'name:name' had been parsed
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

Now the value will be cleaned up at parse time, so no additional code is needed after calling parseString.
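For instance, parsing a couple of declarations now yields normalized name/value fields (the earlier definitions are repeated so this snippet stands alone):

```python
import pyparsing as pp

COLON, SEMI = map(pp.Suppress, ":;")
HASH = pp.Literal('#')
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part))
value_expr = ident()

lexicon_decl = pp.Group(ident("name")
                        + pp.Optional(pp.Optional(COLON)
                                      + value_expr("value")
                                      + pp.Optional(HASH)('hash'))
                        + SEMI)

def fixup_value(tokens):
    if 'value' in tokens[0]:
        if isinstance(tokens[0].value, pp.ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        # no value was given, expand 'name' as if 'name:name' had been parsed
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

d = lexicon_decl.parseString("dog+Pl:dogs # ;")[0]
print(d.name, '->', d.value, bool(d.hash))   # dog+Pl -> dogs True

d = lexicon_decl.parseString("Num ;")[0]
print(d.name, '->', d.value)                 # Num -> Num
```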

We are finally ready to define a whole LEXICON section:

# TBD - make name optional, define as 'Root'
lexicon_section = Group(LEXICON 
                        + ident("name") 
                        + ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations"))

A last bit of housekeeping - we need to ignore comments. We can call ignore on the top-most parser expression, and comments will be ignored throughout the entire parser:

# ignore comments anywhere in our parser
comment = '!' + Optional(restOfLine)
parser.ignore(comment)
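One piece not yet covered is the multichar_symbols_section. A possible sketch (my own guess at that section's grammar, assuming symbols are simply whitespace-delimited tokens terminated by the first LEXICON keyword):

```python
import pyparsing as pp

LEXICON = pp.Keyword("LEXICON")
MULTICHAR = pp.Suppress(pp.Keyword("Multichar_Symbols"))
symbol = pp.Word(pp.printables)
# collect symbols until the first LEXICON declaration
multichar_symbols_section = MULTICHAR + pp.Group(
    pp.ZeroOrMore(symbol, stopOn=LEXICON))("symbols")
multichar_symbols_section.ignore('!' + pp.Optional(pp.restOfLine))

result = multichar_symbols_section.parseString(
    "Multichar_Symbols +A +N  ! comment\n+Adv\nLEXICON Root")
print(result.symbols.asList())   # ['+A', '+N', '+Adv']
```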

Here is all that code in a single copy-pasteable section:

import pyparsing as pp

# define punctuation and special words
COLON,SEMI = map(pp.Suppress, ":;")
HASH = pp.Literal('#')
LEXICON, END = map(pp.Keyword, "LEXICON END".split())

# use regex and Combine to handle % escapes
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
xfst_regex = pp.Regex(r'<.*?>')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part | xfst_regex))
value_expr = ident()


# handle the following lexicon declarations:
#    name ;
#    name:value ;
#    name value ;
#    name value # ;
lexicon_decl = pp.Group(ident("name") 
                     + pp.Optional(pp.Optional(COLON) 
                                + value_expr("value") 
                                + pp.Optional(HASH)('hash'))
                     + SEMI)

# use a parse action to normalize the parsed values
def fixup_value(tokens):
    if 'value' in tokens[0]:
        # pyparsing makes this a nested element, just take zero'th value
        if isinstance(tokens[0].value, pp.ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        # no value was given, expand 'name' as if 'name:name' had been parsed
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

# define a whole LEXICON section
# TBD - make name optional, define as 'Root'
lexicon_section = pp.Group(LEXICON 
                        + ident("name") 
                        + pp.ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations"))

# this part still TBD - just put in a placeholder for now
multichar_symbols_section = pp.empty()

# tie it all together
parser = (pp.Optional(multichar_symbols_section)('multichar_symbols')
          + pp.Group(pp.OneOrMore(lexicon_section))('lexicons') 
          + END)

# ignore comments anywhere in our parser
comment = '!' + pp.Optional(pp.restOfLine)
parser.ignore(comment)
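As an end-to-end sanity check, here is the same parser run over a minimal lexc file (the definitions are repeated so this snippet stands alone; the tiny sample input is my own):

```python
import pyparsing as pp

COLON, SEMI = map(pp.Suppress, ":;")
HASH = pp.Literal('#')
LEXICON, END = map(pp.Keyword, "LEXICON END".split())

escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
xfst_regex = pp.Regex(r'<.*?>')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part | xfst_regex))
value_expr = ident()

lexicon_decl = pp.Group(ident("name")
                        + pp.Optional(pp.Optional(COLON)
                                      + value_expr("value")
                                      + pp.Optional(HASH)('hash'))
                        + SEMI)

def fixup_value(tokens):
    if 'value' in tokens[0]:
        if isinstance(tokens[0].value, pp.ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

lexicon_section = pp.Group(LEXICON + ident("name")
                           + pp.ZeroOrMore(lexicon_decl,
                                           stopOn=LEXICON | END)("declarations"))

multichar_symbols_section = pp.empty()
parser = (pp.Optional(multichar_symbols_section)('multichar_symbols')
          + pp.Group(pp.OneOrMore(lexicon_section))('lexicons')
          + END)
parser.ignore('!' + pp.Optional(pp.restOfLine))

sample = """
LEXICON Root
dog Noun ;    ! a noun
LEXICON Noun
+N+Pl:s # ;
END
this trailing text is ignored
"""
result = parser.parseString(sample)
for lex in result.lexicons:
    print(lex.name, len(lex.declarations))
```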

Parsing your posted 'Root' LEXICON sample (loaded into a string named lexicon_sample), we can dump the results using dump():

result = lexicon_section.parseString(lexicon_sample)[0]
print(result.dump())

Giving:

['LEXICON', 'Root', ['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']]
- declarations: [['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']]
  [0]:
    ['big', 'Adj']
    - name: 'big'
    - value: 'Adj'
  [1]:
    ['bigly', 'Adv']
    - name: 'bigly'
    - value: 'Adv'
  [2]:
    ['dog', 'Noun']
    - name: 'dog'
    - value: 'Noun'
  ...
  [13]:
    ['<', 'Punctuation']
    - name: '<'
    - value: 'Punctuation'
  [14]:
    ['>', 'Punctuation']
    - name: '>'
    - value: 'Punctuation'
  [15]:
    [':::', ':', '#']
    - hash: '#'
    - name: ':::'
    - value: ':'
- name: 'Root'

This code shows how to iterate over the parts of the section and get the named parts:

# try out a lexicon against the posted sample
result = lexicon_section.parseString(lexicon_sample)[0]
print(result.dump())

print('Name:', result.name)
print('\nDeclarations')
for decl in result.declarations:
    print("{name} -> {value}".format_map(decl), "(END)" if decl.hash else '')

Giving:

Name: Root

Declarations
big -> Adj 
bigly -> Adv 
dog -> Noun 
cat -> Noun 
crow -> Noun 
crow -> Verb 
Num -> Num 
sour cream -> Noun 
: -> Punctuation 
; -> Punctuation 
# -> Punctuation 
! -> Punctuation 
% -> Punctuation 
< -> Punctuation 
> -> Punctuation 
::: -> : (END)

Hopefully this will give you enough to take it from here.


3 Comments

Wow! This is a much more thorough answer than I expected! I will have more time to look at it on Monday. Thanks!
I don't understand what the line value_expr = ident() is doing. What is the difference b/w ident and value_expr? They both appear to be the same kind of object.
It's a fine distinction; value_expr = ident would do just as well. The difference is that ident() returns a copy of ident (short form of value_expr = ident.copy()), so if you wanted to attach a parse action or some other feature to an-ident-expression-that-is-used-as-a-right-hand-side-value, you could safely do it on value_expr and ident would not be affected.
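To make that concrete, a tiny illustration (using a throwaway Word expression, not the grammar above):

```python
import pyparsing as pp

ident = pp.Word(pp.alphas)
value_expr = ident()   # same as ident.copy()
value_expr.setParseAction(lambda t: t[0].upper())

print(ident.parseString("abc")[0])       # abc  - original unaffected
print(value_expr.parseString("abc")[0])  # ABC  - only the copy has the action
```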
