This is my first try at using pyparsing, and I am having a hard time setting it up. I want to use pyparsing to parse lexc files. The lexc format is used to declare a lexicon that is compiled into finite-state transducers.
Special characters:
: divides 'upper' and 'lower' sides of a 'data' declaration
; terminates entry
# reserved LEXICON name. end-of-word or final state
' ' (space) universal delimiter
! introduces comment to the end of the line
< introduces xfst-style regex
> closes xfst-style regex
% escape character: %: %; %# "% " (escaped space) %! %< %> %%
There are multiple levels to parse.
Universally, anything from unescaped ! to a newline is a comment. This could be handled separately at each level.
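To make the comment rule concrete, here is a minimal sketch in plain Python (standard-library `re`, not pyparsing). The name `strip_comment` is purely illustrative, and the sketch assumes the input is a single line without its trailing newline.

```python
import re

# Code is any run of escaped pairs (% followed by any character) or
# characters that are neither '!' nor '%'. The match stops at the first
# unescaped '!', which is where the comment starts.
_CODE = re.compile(r'(?:%.|[^!%])*')


def strip_comment(line):
    """Return the part of `line` before the first unescaped '!'."""
    return _CODE.match(line).group(0)
```

Because `%!` is consumed as a pair, an escaped exclamation mark never starts a comment. The same "escaped pair or ordinary character" pattern could presumably be reused for the other reserved characters.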
At the document level, there are three different sections:
Multichar_Symbols Optional one-time declaration
LEXICON Usually many of these
END Anything after this is ignored
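The document-level split could be sketched as follows, again in plain Python rather than pyparsing. All names here (`tokenize`, `split_sections`) are hypothetical. The sketch assumes that `Multichar_Symbols`, `LEXICON` and `END` only ever appear as keywords, that an unescaped `;` is a token even without surrounding whitespace, and that anything inside `<...>` is one token.

```python
import re

# Everything up to the first unescaped '!' on a line.
_CODE = re.compile(r'(?:%.|[^!%])*')
# A token is an <xfst regex>, a lone ';', or a run of escaped pairs
# and ordinary (non-space, non-reserved) characters.
_TOKEN = re.compile(r'<(?:%.|[^>%])*>|;|(?:%.|[^\s;%<>])+')


def tokenize(text):
    """Strip comments line by line and return a flat token list."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(_TOKEN.findall(_CODE.match(line).group(0)))
    return tokens


def split_sections(text):
    """Return (multichar_symbols, {lexicon_name: entry_tokens})."""
    multichar, lexicons = [], {}
    current = multichar
    stream = iter(tokenize(text))
    for tok in stream:
        if tok == 'END':
            break                      # everything after END is ignored
        if tok == 'Multichar_Symbols':
            current = multichar
        elif tok == 'LEXICON':
            # The next token is the lexicon's name.
            current = lexicons.setdefault(next(stream, ''), [])
        else:
            current.append(tok)
    return multichar, lexicons
```

The entry tokens of each LEXICON are left flat here. Grouping them into entries and interpreting the data part are separate levels.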
At the Multichar_Symbols level, anything separated by whitespace is a declaration. This section ends at the first declaration of a LEXICON.
Multichar_Symbols the+first-one thesecond_one
third_one ! comment that this one is special
+Pl ! plural
At the LEXICON level, the LEXICON's name is declared as:
LEXICON the_name ! whitespace delimited
After the name declaration, a LEXICON is composed of entries of the form: data continuation ;. Each entry is terminated by a semicolon, and data is optional.
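As a sketch of the entry level, assuming the body of a LEXICON has already been split into tokens with `;` as its own token, entries can be grouped like this. `group_entries` is a hypothetical helper, not part of any library.

```python
def group_entries(tokens):
    """Group a flat token list into (data, continuation) pairs.

    `data` is None when an entry consists of a continuation only.
    """
    entries, pending = [], []
    for tok in tokens:
        if tok != ';':
            pending.append(tok)
            continue
        if len(pending) == 1:          # continuation only
            entries.append((None, pending[0]))
        elif len(pending) == 2:        # data followed by continuation
            entries.append((pending[0], pending[1]))
        else:
            raise ValueError(f'malformed entry: {pending!r}')
        pending = []
    if pending:
        raise ValueError(f'entry not terminated by ";": {pending!r}')
    return entries
```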
At the data level, there are three possible forms:
upper:lower, simple (which is exploded to upper and lower as simple:simple), and <xfst-style regex>.
Examples:
! # is a reserved continuation that means "end of word".
dog+Pl:dogs # ; ! upper:lower continuation ;
cat # ; ! automatically exploded to "cat:cat # ;" by interpreter
Num ; ! no data, only a continuation to LEXICON named "Num"
<[1|2|3]+> # ; ! xfst-style regex enclosed in <>
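The three data forms can be told apart with a small escape-aware check. The following is a sketch in plain Python, not pyparsing. The function names and the `('pair', ...)` / `('regex', ...)` tuple shapes are assumptions made for illustration only.

```python
import re

# One side of a data declaration: escaped pairs, or characters that are
# neither an unescaped ':' nor a bare '%'.
_SIDE = r'(?:%.|[^:%])*'
_PAIR = re.compile(rf'({_SIDE}):({_SIDE})')
_SIMPLE = re.compile(_SIDE)


def unescape(text):
    """Drop the '%' from every escaped pair, e.g. '%:' -> ':'."""
    return re.sub(r'%(.)', r'\1', text)


def parse_data(data):
    """Classify one data token as an xfst regex or an upper/lower pair."""
    if data.startswith('<') and data.endswith('>'):
        return ('regex', data[1:-1])
    pair = _PAIR.fullmatch(data)
    if pair:                           # upper:lower
        return ('pair', unescape(pair.group(1)), unescape(pair.group(2)))
    if _SIMPLE.fullmatch(data):        # simple, exploded to simple:simple
        return ('pair', unescape(data), unescape(data))
    raise ValueError(f'cannot parse data: {data!r}')
```

For example, `parse_data('%:%:%::%:')` splits on the only unescaped colon and yields an upper side of `:::` and a lower side of `:`.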
Everything after END is ignored
A complete lexc file might look like this:
! Comments begin with !
! Multichar_Symbols (separated by whitespace, terminated by first declared LEXICON)
Multichar_Symbols +A +N +V ! +A is adjectives, +N is nouns, +V is verbs
+Adv ! This one is for adverbs
+Punc ! punctuation
! +Cmpar ! This is broken for now, so I commented it out.
! The bulk of lexc is made up of LEXICONs, which contain entries that point to
! other LEXICONs. "Root" is a reserved lexicon name, and the start state.
! "#" is also a reserved lexicon name, and the end state.
LEXICON Root ! Root is a reserved lexicon name; if it is not declared, the first LEXICON is assumed to be the root
big Adj ; ! This
bigly Adv ; ! Not sure if this is a real word...
dog Noun ;
cat Noun ;
crow Noun ;
crow Verb ;
Num ; ! This continuation class generates numbers using xfst-style regex
! NB all the following are reserved characters
sour% cream Noun ; ! escaped space
%: Punctuation ; ! escaped :
%; Punctuation ; ! escaped ;
%# Punctuation ; ! escaped #
%! Punctuation ; ! escaped !
%% Punctuation ; ! escaped %
%< Punctuation ; ! escaped <
%> Punctuation ; ! escaped >
%:%:%::%: # ; ! Should map ::: to :
LEXICON Adj
+A: # ; ! # is a reserved lexicon name which means end-of-word (final state).
! +Cmpar:er # ; ! Broken, so I commented it out.
LEXICON Adv
+Adv: # ;
LEXICON Noun
+N+Sg: # ;
+N+Pl:s # ;
LEXICON Num
<[0|1|2|3|4|5|6|7|8|9]> Num ; ! This is an xfst regular expression and a cyclic continuation
# ; ! After the first cycle, this makes sense, but as it is, this is bad.
LEXICON Verb
+V+Inf: # ;
+V+Pres:s # ;
LEXICON Punctuation
+Punc: # ;
END
This text is ignored because it is after END
So there are multiple different levels at which to parse. What is the best way to set this up in pyparsing? Are there any examples of this kind of hierarchical language that I could follow as a model?