I have a nested (unordered) HTML list that looks like this:

<ul>
    <li><a href="Page1_Level1.html">Page1_Level1</a> 
    <ul>
        <li><a href="Page1_Level2.html">Page1_Level2</a> 
            <ul>
                <li><a href="Page1_Level3.html">Page1_Level3</a></li>
            </ul>
            <ul>
                <li><a href="Page2_Level3.html">Page2_Level3</a></li>
            </ul>
            <ul>
                <li><a href="Page3_Level3.html">Page3_Level3</a></li>
            </ul>
        </li>
    </ul>
    </li>
    <li><a href="Page2_Level1.html">Page2_Level1</a> 
    <ul>
        <li><a href="Page2_Level2.html">Page2_Level2</a></li>
    </ul>
    </li>
</ul>

How do I form a nested list out of it in Python? For example:

["Page1_Level1.html", ["Page1_Level2.html", ["Page1_Level3.html", "Page2_Level3.html", "Page3_Level3.html"]], "Page2_Level1.html", ["Page2_Level2.html"]]

I presume libraries like Beautiful Soup and HTMLParser have facilities to do this, but I haven't been able to figure it out. Thanks for any help or pointers!

2 Answers

4

You can take a recursive approach:

from pprint import pprint
from bs4 import BeautifulSoup

text = """your html goes here"""

def find_li(element):
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]


soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)

Prints (under Python 2; Python 3 omits the u prefixes):

[{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},
                                                 {u'Page2_Level3.html': []},
                                                 {u'Page3_Level3.html': []}]}]},
 {u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
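The result above is a list of dicts rather than the plain list-of-lists shape asked for in the question. If that exact shape is needed, a variation along these lines might work. This is only a sketch: the name nested_hrefs and the sample markup are made up for illustration, and it assumes every <li> contains a link.

```python
from bs4 import BeautifulSoup


def nested_hrefs(element):
    """Return hrefs as a flat list, with child links grouped in sublists."""
    result = []
    for ul in element.find_all('ul', recursive=False):
        for li in ul.find_all('li', recursive=False):
            # Assumption: each <li> has a link; li.a is the first <a> inside it.
            result.append(li.a['href'])
            children = nested_hrefs(li)
            if children:
                # Sibling <ul> blocks under the same <li> end up in one sublist.
                result.append(children)
    return result


# Hypothetical sample markup, smaller than the one in the question.
sample = (
    "<ul>"
    "<li><a href='a.html'>a</a>"
    "<ul><li><a href='b.html'>b</a></li></ul>"
    "<ul><li><a href='c.html'>c</a></li></ul>"
    "</li>"
    "<li><a href='d.html'>d</a></li>"
    "</ul>"
)

soup = BeautifulSoup(sample, 'html.parser')
print(nested_hrefs(soup))
```

Leaf items get no empty sublist here, which matches the example in the question.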

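FYI, I had to use html.parser here: unlike lxml and html5lib, it does not wrap the fragment in <html> and <body> tags, so the top-level <ul> stays a direct child of soup and the recursive=False search can find it.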


1

Here is an outline of a possible solution:

# the variable 'markup' contains the HTML string
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, 'html.parser')
for a in soup.descendants:
    # construct a nested list while going through the descendants;
    # id(a.parent) identifies which node each descendant belongs to
    print(id(a), id(a.parent) if a.parent else None, a)
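To make the "construct a nested list while walking the document" idea concrete, here is a rough single-pass sketch. Note that it swaps in the standard library's html.parser.HTMLParser instead of soup.descendants, since start/end tag events make the nesting easy to track with a stack. The names NestedHrefParser and html_to_nested_list are hypothetical, and the sketch assumes every <ul> is properly closed and every <li> has its link before any nested <ul>.

```python
from html.parser import HTMLParser


class NestedHrefParser(HTMLParser):
    """Collect hrefs into nested lists that mirror the <ul> nesting."""

    def __init__(self):
        super().__init__()
        self.root = []   # final nested result
        self.stack = []  # stack[-1] is the list for the <ul> we are inside

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            if not self.stack:
                # Top-level <ul>: its links go straight into the result.
                self.stack.append(self.root)
                return
            parent = self.stack[-1]
            if parent and isinstance(parent[-1], list):
                # A sibling <ul> in the same <li> already opened a sublist.
                child = parent[-1]
            else:
                child = []
                parent.append(child)
            self.stack.append(child)
        elif tag == "a" and self.stack:
            href = dict(attrs).get("href")
            if href is not None:
                self.stack[-1].append(href)

    def handle_endtag(self, tag):
        if tag == "ul" and self.stack:
            self.stack.pop()


def html_to_nested_list(markup):
    parser = NestedHrefParser()
    parser.feed(markup)
    parser.close()
    return parser.root


# Hypothetical sample markup for illustration.
sample = (
    "<ul>"
    "<li><a href='a.html'>a</a>"
    "<ul><li><a href='b.html'>b</a></li></ul>"
    "<ul><li><a href='c.html'>c</a></li></ul>"
    "</li>"
    "<li><a href='d.html'>d</a></li>"
    "</ul>"
)
print(html_to_nested_list(sample))
```

Links that appear outside any <ul> are ignored in this sketch, and an empty nested <ul> would leave an empty sublist behind.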
