Converting a string to a tree structure in Python

Question

I have a string in python of the following form:

line a
line b
  line ba
  line bb
    line bba
  line bc
line c
  line ca
    line caa
line d

You get the idea. It takes a form very similar to Python code itself: there is a line, and the indented lines below it form a block headed by the most recent line with a lesser indentation.

What I need to do is parse this string into a tree structure, such that each root-level line is a key of a dictionary and its value is a dictionary representing all of its sublines. So the above would be:

{
'line a' => {},
'line b' => {
  'line ba' => {},
  'line bb' => {
    'line bba' => {}
    },
  'line bc' => {}
  },
'line c' => {
  'line ca' => {
    'line caa' => {}
    },
  },
'line d' => {}
}

Here's what I've got:

def parse_message_to_tree(message):
    buf = StringIO(message)
    return parse_message_to_tree_helper(buf, 0)

def parse_message_to_tree_helper(buf, prev):
    ret = {}
    for line in buf:
        line = line.rstrip()
        index = len(line) - len(line.lstrip())
        print (line + " => " + str(index))
        if index > prev:
            ret[line.strip()] = parse_message_to_tree_helper(buf, index)
        else:
            ret[line.strip()] = {}

    return ret

The print shows lines that are left stripped and indexes of 0. I didn't think lstrip() was a mutator, but either way the index should still be accurate.

Any suggestions are helpful.

EDIT: Not sure what was going wrong before, but I tried again and it is closer to working, but still not quite right. Here's what I have now:

{'line a': {},
 'line b': {},
 'line ba': {'line bb': {},
             'line bba': {'line bc': {},
                          'line c': {},
                          'line ca': {},
                          'line caa': {},
                          'line d': {}}}}

Have you seen the autovivifying tree hack yet? It might save you some keystrokes: tree = lambda: defaultdict(tree); t = tree(); t['a']['b']['c'] = "bla"
– a p, Commented Aug 19, 2015 at 17:46
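
The one-liner in the comment above can be expanded into a small runnable sketch. The helper `to_plain_dict` is an illustrative name that does not appear in the thread; it is only there so the result prints like an ordinary nested dictionary.

```python
from collections import defaultdict

def tree():
    """Return a dict whose missing keys are filled with new trees on first access."""
    return defaultdict(tree)

def to_plain_dict(node):
    """Recursively convert nested defaultdicts to ordinary dicts (illustrative helper)."""
    return {key: to_plain_dict(child) for key, child in node.items()}

t = tree()
t['line b']['line bb']['line bba']  # merely accessing the path creates every level
t['line a']                         # a leaf: an empty tree

result = to_plain_dict(t)
print(result)
```

Note that this only removes the need to create intermediate dictionaries by hand; deciding which path a given line belongs to still requires tracking indentation.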

Anand S Kumar · Accepted Answer · 2015-08-19 17:53:18Z

As already noted, str.lstrip() is not a mutator, and the index comes out correct on my system as well.

The issue is that by the time you notice the indentation has increased, line already refers to the more-indented line. For example, in the first case the index increases at line ba, so line points to line ba, and then in your if branch you do:

ret[line.strip()] = parse_message_to_tree_helper(buf, index)

This is wrong because it stores whatever parse_message_to_tree_helper() returns under line ba rather than under its actual parent, line b.

Also, once you recurse into the function, you do not come back out until the input has been read completely, whereas placing a line at the right level of the dictionary depends on returning from the recursion as soon as the indentation decreases.
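
The second point can be seen in isolation with a minimal sketch. Here `drain` is a hypothetical stand-in for the recursive helper: like the original, it loops over the shared buffer until nothing is left, so the caller's loop has nothing to resume with.

```python
from io import StringIO

buf = StringIO("line b\n  line ba\nline c\n")

def drain(lines):
    # Stand-in for the recursive helper: it iterates until the input is exhausted.
    return [line.strip() for line in lines]

first = next(buf).strip()  # the outer loop reads 'line b'
inner = drain(buf)         # the nested call consumes every remaining line
leftover = list(buf)       # nothing is left for the outer loop to process

print(first, inner, leftover)
```

Because the outer and inner loops share one iterator, 'line c' ends up handled by the nested call even though it belongs at the root level.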

I am not sure whether any built-in library will do this for you, but here is code I was able to come up with (based heavily on yours):

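from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

def parse_message_to_tree(message):
    buf = StringIO(message)
    # The helper returns (tree, pending line, pending indent); only the tree is needed here.
    return parse_message_to_tree_helper(buf, 0, None)[0]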

def parse_message_to_tree_helper(buf, prev, prevline):
    ret = {}
    index = -1
    for line in buf:
        line = line.rstrip()
        index = len(line) - len(line.lstrip())
        print (line + " => " + str(index))
        if index > prev:
            ret[prevline.strip()],prevline,index = parse_message_to_tree_helper(buf, index, line)
            if index < prev:
                return ret,prevline,index
            continue
        elif not prevline:
            ret[line.strip()] = {}
        else:
            ret[prevline.strip()] = {}
        if index < prev:
            return ret,line,index
        prevline = line
    if index == -1:
        ret[prevline.strip()] = {}
        return ret,None,index
    if prev == index:
        ret[prevline.strip()] = {}
    return ret,None,0

Example/demo:

>>> print(s)
line a
line b
  line ba
  line bb
    line bba
  line bc
line c
  line ca
    line caa
>>> def parse_message_to_tree(message):
...     buf = StringIO(message)
...     return parse_message_to_tree_helper(buf, 0, None)[0]
...
>>> def parse_message_to_tree_helper(buf, prev, prevline):
...     ret = {}
...     index = -1
...     for line in buf:
...         line = line.rstrip()
...         index = len(line) - len(line.lstrip())
...         print (line + " => " + str(index))
...         if index > prev:
...             ret[prevline.strip()],prevline,index = parse_message_to_tree_helper(buf, index, line)
...             if index < prev:
...                 return ret,prevline,index
...             continue
...         elif not prevline:
...             ret[line.strip()] = {}
...         else:
...             ret[prevline.strip()] = {}
...         if index < prev:
...             return ret,line,index
...         prevline = line
...     if index == -1:
...         ret[prevline.strip()] = {}
...         return ret,None,index
...     if prev == index:
...         ret[prevline.strip()] = {}
...     return ret,None,0
...
>>> pprint.pprint(parse_message_to_tree(s))
line a => 0
line b => 0
  line ba => 2
  line bb => 2
    line bba => 4
  line bc => 2
line c => 0
  line ca => 2
    line caa => 4
{'line a': {},
 'line b': {'line ba': {}, 'line bb': {'line bba': {}}, 'line bc': {}},
 'line c': {'line ca': {'line caa': {}}}}
>>> s = """line a
... line b
...   line ba
...   line bb
...     line bba
...   line bc
... line c
...   line ca
...     line caa
... line d"""
>>> pprint.pprint(parse_message_to_tree(s))
line a => 0
line b => 0
  line ba => 2
  line bb => 2
    line bba => 4
  line bc => 2
line c => 0
  line ca => 2
    line caa => 4
line d => 0
{'line a': {},
 'line b': {'line ba': {}, 'line bb': {'line bba': {}}, 'line bc': {}},
 'line c': {'line ca': {'line caa': {}}},
 'line d': {}}

You should still test the code for remaining bugs or missed cases.
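
For comparison, here is a sketch of a different recursive approach, not the code above. It first splits the text into (indent, label) pairs and then recurses over list positions, so each call can return as soon as the indentation drops. The names `parse_lines` and `build` are illustrative, and the sketch assumes space indentation that is consistent within each block.

```python
def parse_lines(text):
    # Pre-compute (indent, label) for every non-blank line.
    rows = [(len(l) - len(l.lstrip()), l.strip())
            for l in text.splitlines() if l.strip()]

    def build(pos, indent):
        """Collect siblings starting at rows[pos]; return (dict, next position)."""
        node = {}
        while pos < len(rows) and rows[pos][0] >= indent:
            depth, label = rows[pos]
            # Peek at the next row: a deeper indent means it starts this row's children.
            next_depth = rows[pos + 1][0] if pos + 1 < len(rows) else depth
            if next_depth > depth:
                node[label], pos = build(pos + 1, next_depth)
            else:
                node[label] = {}
                pos += 1
        return node, pos

    return build(0, 0)[0]

sample = """line a
line b
  line ba
  line bb
    line bba
  line bc
line c
  line ca
    line caa
line d"""

parsed = parse_lines(sample)
print(parsed)
```

Returning the next position from each call plays the same role as returning the pending line and indent in the answer's helper: it tells the caller where to resume.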

thanks. I'd figured out that I needed to include the previous line in the parameters, but I hadn't gotten to the pop out issue yet.

gmoshkin · Accepted Answer · 2015-08-19 17:24:06Z

lstrip() isn't a mutator; see the documentation:

string.lstrip(s[, chars])

Return a copy of the string with leading characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the beginning of the string this method is called on.

And your code seems to work with that sample text on my machine.
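
A quick check makes the point concrete: Python strings are immutable, so lstrip() returns a new string and the original keeps its leading whitespace. The variable names below are illustrative.

```python
line = "    line bba"

stripped = line.lstrip()            # a new string; `line` itself is unchanged
indent = len(line) - len(stripped)  # the same indent calculation used in the question

print(repr(line), repr(stripped), indent)
```

This is why computing the indent after calling lstrip() on the same variable is safe, as long as the result of lstrip() has not been assigned back to it.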

2 Comments

ewok Over a year ago

Odd. It wasn't working before, but it does seem to be closer to working now. The issue I'm having now is that it doesn't "pop back up" on line c and line d, so line c ends up in line bba's map.

ewok Over a year ago

Actually, it's more poorly structured than that:

{'line a': {}, 'line b': {}, 'line ba': {'line bb': {}, 'line bba': {'line bc': {}, 'line ca': {}, 'line c': {}, 'line d': {}, 'line caa': {}}}}

Rishi · Accepted Answer · 2015-08-19 18:44:54Z

Another answer, using a stack instead of recursion. It took a few iterations to get to this version, and it seems to handle several possible input scenarios, but I can't guarantee the total absence of bugs! It's a tricky problem indeed. Hopefully my comments illustrate a correct line of thought. Thanks for sharing the problem.

text = '''line a
line b
  line ba
  line bb
    line bba
  line bc
line c
  line ca
    line caa
line d'''

root_tree = {}
stack = []
prev_indent, prev_tree = -1, root_tree

for line in text.splitlines():

    # compute current line's indent and strip the line
    origlen = len(line)
    line = line.lstrip()
    indent = origlen - len(line)
    print(indent, line)

    # no matter what, every line has its own tree, so let's create it.
    tree = {}  

    # where to attach this new tree is dependent on indent, prev_indent
    # assume: stack[-1] was the right attach point for the previous line
    # then: let's adjust the stack to make that true for the current line

    if indent < prev_indent:
        while stack[-1][0] >= indent:
            stack.pop()
    elif indent > prev_indent:
        stack.append((prev_indent, prev_tree))

    # at this point: stack[-1] is the right attach point for the current line
    parent_indent, parent_tree = stack[-1]
    assert parent_indent < indent

    # attach the current tree
    parent_tree[line] = tree

    # update state
    prev_indent, prev_tree = indent, tree

print(len(stack))
print(stack)
print(root_tree)
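
The same stack idea can be wrapped in a reusable function. The following is a simplified variant rather than the code above: the stack holds the chain of ancestors as (indent, children) pairs, and every line pops entries until the top has a smaller indent. The name `parse_indented` is illustrative, and blank lines are skipped as an assumption about the input.

```python
def parse_indented(text):
    root = {}
    # Each entry is (indent, children dict); the root sits at indent -1 so it is never popped.
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue  # assumption: blank lines carry no structure
        indent = len(raw) - len(raw.lstrip())
        # Drop siblings and deeper entries until the top is a true ancestor.
        while stack[-1][0] >= indent:
            stack.pop()
        children = {}
        stack[-1][1][raw.strip()] = children
        stack.append((indent, children))
    return root

sample = """line a
line b
  line ba
  line bb
    line bba
  line bc
line c
  line ca
    line caa
line d"""

parsed = parse_indented(sample)
print(parsed)
```

Pushing every line, rather than only on an indent increase, removes the need to track the previous line separately, at the cost of a few extra pops.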
