Parsing a lightweight language in Python

Question

Say I define a string in Python like the following:

my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"

I would like to parse that string in Python in a way that allows me to index the different structures of the language.

For example, the output could be a dictionary parsing_result that allows me to index the different elements in a structured manner.

For example, the following:

parsing_result['something']['names']

would hold a list of strings: ['name1', 'name2']

whereas parsing_result['something']['options'] would hold a dictionary so that:

parsing_result['something']['options']['opt2'] holds the string "text"
parsing_result['something_else']['options']['opt1'] holds the string "58"

My first question is: How do I approach this problem in Python? Are there any libraries that simplify this task?

For a working example, I am not necessarily interested in a solution that parses the exact syntax I defined above (although that would be fantastic), but anything close to it would be great.

Update

It looks like the right general solution is to use a lexer and a parser such as ply (thank you @Joran), but the documentation is a bit intimidating. Is there an easier way of getting this done when the syntax is lightweight?
I found this thread, which gives the following regular expression for splitting a string on its outer commas:
```
r = re.compile(r'(?:[^,(]|\([^)]*\))+')
r.findall(s)
```
But this assumes that the grouping characters are () (and not {}). I am trying to adapt it, but it doesn't look easy.

You need a parser and a lexer. Try ply for Python (that's the one I usually use). It's a fair bit of work defining a language.
– Joran Beasley, Commented Jul 30, 2013 at 19:23
If the language is sufficiently lightweight, you can use regular expressions. The language implied by your example is such a language, I believe.
– Robᵩ, Commented Jul 30, 2013 at 19:29
If the sentences in your language are also sentences in Python, you can use ast.parse().
– Robᵩ, Commented Jul 30, 2013 at 19:32
I'm curious whether this is just an exercise, or are you trying to accomplish a greater goal?
– aglassman, Commented Jul 30, 2013 at 19:52
@aglassman I have to deal with this particular language that someone else has defined. Moving forward, I want to learn to parse my own single-line languages.
– Amelio Vazquez-Reina, Commented Jul 30, 2013 at 19:55
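
The ast.parse() suggestion does not apply to the brace syntax directly, but a rough sketch shows how it could work if the braces are first swapped for parentheses, so that each block reads like a Python call. The helper names `parse_with_ast` and `_as_text` are made up for illustration, and the brace swap assumes that braces never appear inside values and that every value is a bare name or a plain literal.

```python
import ast


def _as_text(node):
    """Return a bare name or literal as a string (hypothetical helper)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise ValueError("unsupported value: " + ast.dump(node))


def parse_with_ast(text):
    # Assumption: swapping braces for parentheses yields valid Python calls.
    source = text.replace("{", "(").replace("}", ")")
    body = ast.parse(source, mode="eval").body
    # Several comma-separated calls parse as a tuple; a single call does not.
    calls = body.elts if isinstance(body, ast.Tuple) else [body]
    result = {}
    for call in calls:
        if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
            raise ValueError("expected blocks of the form name{...}")
        result[call.func.id] = {
            # Positional arguments become the list of names.
            "names": [_as_text(arg) for arg in call.args],
            # Keyword arguments become the options dictionary.
            "options": {kw.arg: _as_text(kw.value) for kw in call.keywords},
        }
    return result


my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parsing_result = parse_with_ast(my_string)
print(parsing_result["something"]["names"])
print(parsing_result["something_else"]["options"]["opt1"])
```

This inherits Python's own rules, so input such as a name placed after an option (a positional argument after a keyword argument) is rejected with a SyntaxError.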

Community · Accepted Answer · 2020-06-20 09:12:55Z

6

I highly recommend pyparsing:

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions.

The Python representation of the grammar is quite readable, owing to the self-explanatory class names, and the use of '+', '|' and '^' operator definitions. The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.

Sample code (Hello world from the pyparsing docs):

from pyparsing import Word, alphas
greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
hello = "Hello, World!"
print (hello, "->", greet.parseString( hello ))

Output:

Hello, World! -> ['Hello', ',', 'World', '!']

Edit: Here's a solution to your sample language:

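from pyparsing import Group, Suppress, Word, alphas, delimitedList, nestedExpr, nums
import json

# An identifier is any run of letters, digits, and underscores.
identifier = Word(alphas + nums + "_")
# A key=value pair; the "=" itself is dropped from the results.
expression = identifier("lhs") + Suppress("=") + identifier("rhs")
# Comma-separated items inside the braces: key=value pairs or bare names.
struct_vals = delimitedList(Group(expression | identifier))
# A block name followed by its {...} body.
structure = Group(identifier + nestedExpr(opener="{", closer="}", content=struct_vals("vals")))
# The whole input is a comma-separated list of blocks.
grammar = delimitedList(structure)

my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parse_result = grammar.parseString(my_string)
result_list = parse_result.asList()

def list_to_dict(l):
    # Map each block name to {item: value}; bare names get None as their value.
    d = {}
    for struct in l:
        d[struct[0]] = {}
        for ident in struct[1]:
            if len(ident) == 2:
                d[struct[0]][ident[0]] = ident[1]
            elif len(ident) == 1:
                d[struct[0]][ident[0]] = None
    return d

print(json.dumps(list_to_dict(result_list), indent=2))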

Output (pretty-printed as JSON; key order may vary with the Python version):

{
  "something_else": {
    "opt1": "58", 
    "name3": null
  }, 
  "something": {
    "opt1": "2", 
    "opt2": "text", 
    "name2": null, 
    "name1": null
  }
}
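
The dictionary above mixes bare names and options together. To get closer to the names/options layout from the question, the nested list can be regrouped with plain Python. This is only a sketch: it assumes `result_list` has the shape that `parse_result.asList()` returns in the solution above (hardcoded here so the snippet runs without pyparsing), and `to_names_and_options` is a hypothetical helper name.

```python
# Shape assumed to match parse_result.asList() from the solution above.
result_list = [
    ["something", [["name1"], ["name2"], ["opt1", "2"], ["opt2", "text"]]],
    ["something_else", [["name3"], ["opt1", "58"]]],
]


def to_names_and_options(structs):
    """Regroup [name, items] pairs into {'names': [...], 'options': {...}}."""
    result = {}
    for struct_name, items in structs:
        entry = {"names": [], "options": {}}
        for item in items:
            if len(item) == 1:      # bare identifier, e.g. ['name1']
                entry["names"].append(item[0])
            elif len(item) == 2:    # key/value pair, e.g. ['opt1', '2']
                entry["options"][item[0]] = item[1]
        result[struct_name] = entry
    return result


parsing_result = to_names_and_options(result_list)
print(parsing_result["something"]["names"])
print(parsing_result["something"]["options"]["opt2"])
```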

Use the pyparsing API as your guide to exploring the functionality of pyparsing and understanding the nuances of my solution. I've found that the quickest way to master this library is trying it out on some simple languages you think up yourself.

edited Jun 20, 2020 at 9:12

CommunityBot

answered Jul 30, 2013 at 20:58

butch

4 Comments

Amelio Vazquez-Reina Over a year ago

Thanks butch. This looks extremely promising. The example you provided is also very helpful, but I am having a hard time seeing how to define "nested levels", such as the ones in my mini-language (i.e. the items inside vs outside the {})

butch Over a year ago

Added a solution to your sample language. Let me know if you have any questions about specific pyparsing functions or classes used in my solution.

Amelio Vazquez-Reina Over a year ago

Thanks so much. To the best of your knowledge, are there any grammars that can be expressed with ply that cannot be expressed with pyparsing? If not, are there any limitations of pyparsing in relation to ply?

butch Over a year ago

My experience with lex/yacc is not extensive and I haven't played with ply yet, so it's probably best I only comment on pyparsing. I know that pyparsing is capable of parsing any context free grammar that you could express in Backus–Naur Form (BNF). Does that help?

aglassman · Accepted Answer · 2013-07-30 20:03:14Z

2

As @Joran Beasley said, you really want to use a parser and a lexer. They are not easy to wrap your head around at first, so start with a very simple tutorial on them.
If you are really trying to write a lightweight language, then go with a parser and lexer, and learn about context-free grammars.
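
To give a feel for the lexer/parser split without any library, here is a minimal hand-written sketch for the language in the question. The grammar in the docstring is inferred from the single example string, so treat it as an assumption; the class and function names are illustrative only.

```python
import re

IDENT = re.compile(r"\w+")


def tokenize(text):
    """Lexer: split the input into identifiers and single non-space symbols."""
    return re.findall(r"\w+|\S", text)


class Parser:
    """Recursive-descent parser for the (assumed) grammar:

        program := block ("," block)*
        block   := IDENT "{" item ("," item)* "}"
        item    := IDENT ["=" IDENT]
    """

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, symbol):
        if self.peek() != symbol:
            raise ValueError("expected {!r}, got {!r}".format(symbol, self.peek()))
        self.pos += 1

    def identifier(self):
        token = self.peek()
        if token is None or not IDENT.fullmatch(token):
            raise ValueError("expected identifier, got {!r}".format(token))
        self.pos += 1
        return token

    def item(self, entry):
        key = self.identifier()
        if self.peek() == "=":          # key=value goes into the options
            self.expect("=")
            entry["options"][key] = self.identifier()
        else:                           # a bare identifier goes into the names
            entry["names"].append(key)

    def block(self, result):
        name = self.identifier()
        entry = result[name] = {"names": [], "options": {}}
        self.expect("{")
        self.item(entry)
        while self.peek() == ",":
            self.expect(",")
            self.item(entry)
        self.expect("}")

    def program(self):
        result = {}
        self.block(result)
        while self.peek() == ",":
            self.expect(",")
            self.block(result)
        if self.peek() is not None:
            raise ValueError("unexpected trailing input: {!r}".format(self.peek()))
        return result


my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parsing_result = Parser(my_string).program()
print(parsing_result["something"]["options"]["opt2"])
```

Each method mirrors one grammar rule, which is the main appeal of this style: extending the language usually means adding or changing one rule and its method.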

If you are really just trying to write a program to strip data out of some text, then regular expressions would be the way to go.

If this is not a programming exercise, and you are just trying to get structured data in text format into Python, check out JSON.
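
As a rough illustration of the JSON route, the same data could be written as JSON and loaded with the standard-library json module. The names/options layout below is an assumption based on the question, not a format anyone in this thread defined.

```python
import json

# The data from the question, rewritten by hand as JSON (assumed layout).
text = """
{
  "something": {"names": ["name1", "name2"], "options": {"opt1": "2", "opt2": "text"}},
  "something_else": {"names": ["name3"], "options": {"opt1": "58"}}
}
"""

parsing_result = json.loads(text)
print(parsing_result["something"]["names"])
print(parsing_result["something_else"]["options"]["opt1"])
```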

answered Jul 30, 2013 at 20:03

aglassman

nio · Accepted Answer · 2013-07-30 20:39:01Z

Here is a version of the regular expression modified to match {} instead of ():

import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')

print(r.findall(s))

You'll get a list of separate 'named blocks' as a result:

`['something{name1, name2, opt1=2, opt2=text}', ' something_else{name3, opt1=58}']`

Here is more complete code that can parse your simple example. It is still only a starting point: you should, for example, catch exceptions to detect syntax errors, and restrict block names and parameter names to a stricter set of valid characters:

import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
# Split the input on commas that are outside {...}.
r = re.compile(r'(?:[^,{]|{[^}]*})+')

# Block name followed by its {...} argument list.
rblock = re.compile(r'\s*(\w+)\s*{(.*)}\s*')
# Parameter name with an optional "=value" part.
rparam = re.compile(r'\s*([^=\s]+)\s*(=\s*([^,]+))?')

blocks = r.findall(s)

for block in blocks:
    resb = rblock.match(block)
    blockname = resb.group(1)
    blockargs = resb.group(2)
    print("block name=", blockname)
    print("args:")
    for arg in blockargs.split(","):
        resp = rparam.match(arg)
        paramname = resp.group(1)
        paramval = resp.group(3)
        if paramval is None:
            print("param name =\"{0}\" no value".format(paramname))
        else:
            print("param name =\"{0}\" value=\"{1}\"".format(paramname, str(paramval)))
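
The loop above only prints what it finds. A variant that collects the results into a dictionary and raises on malformed input could look like the sketch below. The function name `parse_blocks` and the stricter patterns (word characters only for names and values, escaped braces) are assumptions added for illustration, not part of the code above.

```python
import re

BLOCKS = re.compile(r'(?:[^,{]|\{[^}]*\})+')          # split on commas outside {}
BLOCK = re.compile(r'\s*(\w+)\s*\{([^{}]*)\}\s*')     # name{args}
PARAM = re.compile(r'\s*(\w+)\s*(?:=\s*(\w+)\s*)?')   # name or name=value


def parse_blocks(text):
    """Return {block: {'names': [...], 'options': {...}}} or raise ValueError."""
    result = {}
    for block in BLOCKS.findall(text):
        match = BLOCK.fullmatch(block)
        if match is None:
            raise ValueError("malformed block: {!r}".format(block))
        name, args = match.groups()
        entry = {"names": [], "options": {}}
        for arg in args.split(","):
            param = PARAM.fullmatch(arg)
            if param is None:
                raise ValueError("malformed parameter: {!r}".format(arg))
            key, value = param.groups()
            if value is None:
                entry["names"].append(key)      # bare name
            else:
                entry["options"][key] = value   # name=value
        result[name] = entry
    return result


s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parsing_result = parse_blocks(s)
print(parsing_result["something"]["names"])
print(parsing_result["something_else"]["options"]["opt1"])
```

Using fullmatch instead of match means trailing garbage inside a block or parameter is reported rather than silently ignored. Note that findall still skips characters it cannot match between blocks, so this is not a complete validator.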
