4

I made a program that extracts the text from a HTML file. It recurses down the HTML document and returns the list of tags. For eg,

input < li >no way < b > you < /b > are doing this < /li >

output ['no','way','you','are'...].

Here is a highly simplified pseudocode for this:

def get_leaves(node):
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)

def calling_fn():
    list_of_leaves=[] #which is now in global scope
    get_leaves(rootnode)
    print list_of_leaves    

I am now using list_of_leaves in a global scope from the calling function. The calling_fn() declares this variable, get_leaves() appends to this.

My question is, how do I modify my function so that I am able to do something like list_of_leaves=get_leaves(rootnode), ie without using a global variable?

I dont want each instance of the function to duplicate the list, as the list can get quite big.

Please dont critisize the design of this particular pseudocode, as I simplified this. It is meant for another purpose: extracting tokens along with associated tags using BeautifulSoup

4
  • 2
    Have you considered a HTML parser like BeautifulSoup Commented Jul 12, 2011 at 8:11
  • 1
    At least from your pseudocode it doesn't seem list_of_leaves is global. Commented Jul 12, 2011 at 8:15
  • Please don't consider this a criticism -- it's not, but check out the output of this python expression (after importing re, of course): re.split(r'<.*?>', "< li >no way < b > you < /b > are doing this < /li >") Commented Jul 12, 2011 at 8:15
  • 2
    I AM using BeautifulSoup. And using regexes for HTML parsing turned out to be a very touchy topic, my account got temporarily suspended due to negative votes Commented Jul 12, 2011 at 9:00

5 Answers 5

10

You can pass the result list as optional argument.

def get_leaves(node, list_of_leaves=None):
    list_of_leaves = [] if list_of_leaves is None else list_of_leaves
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i, list_of_leaves)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)

def calling_fn():
    result = [] 
    get_leaves(rootnode, list_of_leaves=result)
    print result

Python objects are always passed by reference. This has been discussed before here. Some of the built-in types are immutable (e.g. int, string), so you cannot modify them in place (a new string is created when you concatenate two strings and assign them to a variable). Instance of mutable types (e.g. list) can be modified in place. We are taking advantage of this by passing the original list for accumulating result in our recursive calls.

For extracting text from HTML in a real application, using a mature library like BeautifulSoup or lxml.html is always a much better option (as others have suggested).

Sign up to request clarification or add additional context in comments.

3 Comments

I think you would be better of using : if list_of_leaves is None: list_of_leaves = []. For example the code: x = [] ;a = x or 'test' (assigns test to a). So that would mean the result list (in the calling function) would be still empty and what you are populating is a separate list in the method 'get_leaves'
Does this pass list_of_leaves by reference? Could you explain it?
@aitchnyu: explanation added.
8

No need to pass an accumulator to the function or accessing it through a global name if you turn get_leaves() into a generator:

def get_leaves(node):
    for child in getchildren(node):
        if leafnode(child):
            for each in get_leaves(child):
                yield each
        else:
            yield process_leaf(child)

def calling_fn():
    list_of_leaves = list(get_leaves(rootnode))
    print list_of_leaves

2 Comments

This is the best answer, as it lets you consume the data without storing it all up-front at all.
I think this is the most elegant way to do this.
2

Use a decent HTML parser like BeautifulSoup instead of trying smarter than existing software.

Comments

2

@pillmincher's generator answer is the best, but as another alternative, you can turn your function into a class:

class TagFinder:
    def __init__(self):
        self.leaves = []

    def get_leaves(self, node):
        kids = getchildren(node)
        for i in kids:
            if leafnode(i):
                self.get_leaves(i)
            else:
                a = process_leaf(i)
                self.list_of_leaves.append(a)
def calling_fn():
    finder = TagFinder()
    finder.get_leaves(rootnode)
    print finder.list_of_leaves

Your code likely involves a number of helper functions anyway, like leafnode, so a class also helps group them all together into one unit.

Comments

1

As a general question about recursion, this is a good one. It is common to have a recursive function that accumulates data into some collection. Either the collection needs to be a global variable (bad) or it is passed to the recursive function. When collections are passed in almost every language, only a reference is passed so you do not have to worry about space. Someone just posted an answer showing how to do this.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.