2

Say I define a string in Python like the following:

my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"

I would like to parse that string in Python in a way that allows me to index the different structures of the language.

For example, the output could be a dictionary parsing_result that allows me to index the different elements in a structured manner.

For instance, the following:

parsing_result['something']['names']

would hold a list of strings: ['name1', 'name2']

whereas parsing_result['something']['options'] would hold a dictionary so that:

  • parsing_result['something']['options']['opt2'] holds the string "text"
  • parsing_result['something_else']['options']['opt1'] holds the string "58"
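To make the target concrete, here is a hand-written sketch of the structure described above for the sample string. It assumes the result is keyed by block name first, with hypothetical 'names' and 'options' keys under each block; nothing here is produced by a parser.

```python
# Hypothetical target structure, written out by hand for the sample string.
# Assumption: each block name maps to a dict with 'names' and 'options'.
parsing_result = {
    "something": {
        "names": ["name1", "name2"],
        "options": {"opt1": "2", "opt2": "text"},
    },
    "something_else": {
        "names": ["name3"],
        "options": {"opt1": "58"},
    },
}

# Indexing the structure as described in the bullets above.
print(parsing_result["something"]["names"])
print(parsing_result["something"]["options"]["opt2"])
print(parsing_result["something_else"]["options"]["opt1"])
```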

My first question is: How do I approach this problem in Python? Are there any libraries that simplify this task?

For a working example, I am not necessarily interested in a solution that parses the exact syntax I defined above (although that would be fantastic), but anything close to it would be great.

Update

  1. It looks like the right general solution is to use a lexer and a parser such as ply (thank you @Joran), but the documentation is a bit intimidating. Is there an easier way to get this done when the syntax is lightweight?

  2. I found this thread where the following regular expression is provided to partition a string around outer commas:

    r = re.compile(r'(?:[^,(]|\([^)]*\))+')
    r.findall(s)
    

    But this assumes that the grouping characters are () (and not {}). I am trying to adapt it, but it doesn't look easy.

6
  • 2
    You need a parser and a lexer ... try ply for Python (that's the one I usually use ...) ... it's a fair bit of work defining a language. Commented Jul 30, 2013 at 19:23
  • 1
    If the language is sufficiently lightweight, you can use regular expressions. The language implied by your example is such a language, I believe. Commented Jul 30, 2013 at 19:29
  • 1
    If the sentences in your language are also sentences in Python, you can use ast.parse(). Commented Jul 30, 2013 at 19:32
  • I'm curious whether this is just an exercise, or are you trying to accomplish a greater goal? Commented Jul 30, 2013 at 19:52
  • @aglassman. I have to deal with this particular language that someone else has defined. Moving forward, I want to learn to parse my own single-line languages. Commented Jul 30, 2013 at 19:55

3 Answers

6

I highly recommend pyparsing:

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions.

The Python representation of the grammar is quite readable, owing to the self-explanatory class names, and the use of '+', '|' and '^' operator definitions. The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.

Sample code (Hello world from the pyparsing docs):

from pyparsing import Word, alphas
greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
hello = "Hello, World!"
print (hello, "->", greet.parseString( hello ))

Output:

Hello, World! -> ['Hello', ',', 'World', '!']

Edit: Here's a solution to your sample language:

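import json

from pyparsing import Group, Suppress, Word, alphas, delimitedList, nestedExpr, nums

# An identifier is any run of letters, digits, and underscores.
identifier = Word(alphas + nums + "_")
# A key=value option; the "=" itself is dropped from the results.
expression = identifier("lhs") + Suppress("=") + identifier("rhs")
# The comma-separated items inside the braces: options or bare names.
struct_vals = delimitedList(Group(expression | identifier))
# A block name followed by its brace-delimited contents.
structure = Group(identifier + nestedExpr(opener="{", closer="}", content=struct_vals("vals")))
grammar = delimitedList(structure)

my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parse_result = grammar.parseString(my_string)
result_list = parse_result.asList()

def list_to_dict(l):
    """Map each block name to its items; bare names get the value None."""
    d = {}
    for struct in l:
        d[struct[0]] = {}
        for ident in struct[1]:
            if len(ident) == 2:
                d[struct[0]][ident[0]] = ident[1]
            elif len(ident) == 1:
                d[struct[0]][ident[0]] = None
    return d

print(json.dumps(list_to_dict(result_list), indent=2))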

Output (pretty-printed as JSON; key order may differ depending on the Python version):

{
  "something": {
    "name1": null,
    "name2": null,
    "opt1": "2",
    "opt2": "text"
  },
  "something_else": {
    "name3": null,
    "opt1": "58"
  }
}

Use the pyparsing API as your guide to exploring the functionality of pyparsing and understanding the nuances of my solution. I've found that the quickest way to master this library is trying it out on some simple languages you think up yourself.
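If you want the 'names' / 'options' layout from the question rather than a flat dictionary, a small post-processing step is enough. The sketch below is not part of pyparsing: `to_names_and_options` is a hypothetical helper, and `sample_result_list` is a hand-written stand-in that assumes the same nested-list shape that `list_to_dict` iterates over (block name, then a list of one- or two-element items).

```python
def to_names_and_options(result_list):
    """Split each block's items into bare names and key=value options."""
    parsed = {}
    for block_name, items in result_list:
        names, options = [], {}
        for item in items:
            if len(item) == 1:      # bare identifier, e.g. ['name1']
                names.append(item[0])
            elif len(item) == 2:    # key/value pair, e.g. ['opt1', '2']
                options[item[0]] = item[1]
        parsed[block_name] = {"names": names, "options": options}
    return parsed


# Assumed shape of parse_result.asList() for the sample string.
sample_result_list = [
    ["something", [["name1"], ["name2"], ["opt1", "2"], ["opt2", "text"]]],
    ["something_else", [["name3"], ["opt1", "58"]]],
]

parsing_result = to_names_and_options(sample_result_list)
print(parsing_result["something"]["names"])
print(parsing_result["something_else"]["options"]["opt1"])
```

Keeping this step separate from the grammar means the grammar stays short, and the output layout can change without touching the parsing rules.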


4 Comments

Thanks butch. This looks extremely promising. The example you provided is also very helpful, but I am having a hard time seeing how to define "nested levels", such as the ones in my mini-language (i.e. the items inside vs outside the {})
Added a solution to your sample language. Let me know if you have any questions about specific pyparsing functions or classes used in my solution.
Thanks so much. To the best of your knowledge, are there any grammars that can be expressed with ply that cannot be expressed with pyparsing? If not, are there any limitations of pyparsing in relation to ply?
My experience with lex/yacc is not extensive and I haven't played with ply yet, so it's probably best I only comment on pyparsing. I know that pyparsing is capable of parsing any context free grammar that you could express in Backus–Naur Form (BNF). Does that help?
2

As stated by @Joran Beasley, you'd really want to use a parser and a lexer. They are not easy to wrap your head around at first, so you'd want to start off with a very simple tutorial on them.

If you are really trying to write a lightweight language, then you're going to want to go with a parser/lexer and learn about context-free grammars.

If you are really just trying to write a program to strip data out of some text, then regular expressions would be the way to go.

If this is not a programming exercise, and you are just trying to get structured data in text format into python, check out JSON.
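As a rough illustration of the JSON route, the same information could be written as JSON text and loaded with the standard library. The layout of `text` below (block names mapping to 'names' and 'options') is only an assumed encoding of the sample data, not something defined by the question.

```python
import json

# Assumed JSON encoding of the first block of the sample data.
text = """
{
  "something": {
    "names": ["name1", "name2"],
    "options": {"opt1": 2, "opt2": "text"}
  }
}
"""

data = json.loads(text)
print(data["something"]["names"])
print(data["something"]["options"]["opt1"])
```

One difference from the string-based approaches: JSON carries types, so `opt1` comes back as an integer rather than the string "2".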


2

Here is the regular expression, modified to work with {} instead of ():

import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')

print(r.findall(s))

You'll get a list of separate 'named blocks' as a result:

['something{name1, name2, opt1=2, opt2=text}', ' something_else{name3, opt1=58}']

Here is more complete code that parses your simple example. In real use you should also catch exceptions to detect syntax errors, and restrict the valid block names and parameter names more tightly:

import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
# Split the input into top-level blocks around the outer commas.
r = re.compile(r'(?:[^,{]|{[^}]*})+')

# Block name followed by its brace-enclosed argument list.
rblock = re.compile(r'\s*(\w+)\s*{(.*)}\s*')
# Parameter name with an optional "= value" part.
rparam = re.compile(r'\s*([^=\s]+)\s*(=\s*([^,]+))?')

blocks = r.findall(s)

for block in blocks:
    resb = rblock.match(block)
    blockname = resb.group(1)
    blockargs = resb.group(2)
    print("block name=", blockname)
    print("args:")
    for arg in blockargs.split(","):
        resp = rparam.match(arg)
        paramname = resp.group(1)
        paramval = resp.group(3)
        if paramval is None:
            print('param name ="{0}" no value'.format(paramname))
        else:
            print('param name ="{0}" value="{1}"'.format(paramname, paramval))
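The same idea can be wrapped in a function that returns a dictionary and reports syntax errors, along the lines suggested above. This is only a sketch: `parse_blocks` and the three pattern names are hypothetical, and it assumes that block names, parameter names, and values consist solely of word characters (`\w`), which is stricter than the patterns used earlier. An empty `{}` is treated as an error here.

```python
import re

# Top-level blocks, split around commas that are outside braces.
SPLIT_RE = re.compile(r'(?:[^,{]|\{[^}]*\})+')
# A block: name, then a brace-enclosed argument list with no nested braces.
BLOCK_RE = re.compile(r'\s*(\w+)\s*\{([^{}]*)\}\s*')
# A parameter: name with an optional "= value" part (word characters only).
PARAM_RE = re.compile(r'\s*(\w+)\s*(?:=\s*(\w+)\s*)?')


def parse_blocks(text):
    """Return {block: {'names': [...], 'options': {...}}} or raise ValueError."""
    result = {}
    for chunk in SPLIT_RE.findall(text):
        block = BLOCK_RE.fullmatch(chunk)
        if block is None:
            raise ValueError("malformed block: %r" % chunk)
        names, options = [], {}
        for arg in block.group(2).split(","):
            param = PARAM_RE.fullmatch(arg)
            if param is None:
                raise ValueError("malformed parameter: %r" % arg)
            key, value = param.groups()
            if value is None:
                names.append(key)      # bare name such as "name1"
            else:
                options[key] = value   # key=value pair such as "opt1=2"
        result[block.group(1)] = {"names": names, "options": options}
    return result


s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parsing_result = parse_blocks(s)
print(parsing_result["something"]["options"]["opt2"])
```

Using `fullmatch` instead of `match` means trailing garbage in a block or parameter is rejected rather than silently ignored.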
