How to parse nested HTML blocks into lists with Python BeautifulSoup?

Question

I am trying to convert a structure like this (some nested XML/HTML)

<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
    ...
  </div>
  ...
</div>
...

Clarification: it can be formatted like <div>a comment<div>an answer</div> or in any other way (not prettified, etc.)

(which has multiple nodes at different depths)

into the corresponding list structure, which has parent <ul> tags (i.e. an ordinary HTML list)

<ul>
  <li>1
    <ul>
      <li>2</li>
      ...
   </ul>
  </li>
  ...
</ul>

I tried to use BeautifulSoup like this:

from bs4 import BeautifulSoup as BS

bs = BS(source_xml, 'html.parser')
for i in bs.find_all('div'):
    i.name = 'li'

# but this only renames the div tags to li tags; I still need to add the ul tags

I can iterate through the nesting levels like this, but I still can't figure out how to group the tags located on the same level so that I can wrap them in a ul tag:
for i in bs.find_all('div', recursive=False):
    # how to wrap the following iterated items in 'ul' tag?
    for j in i.find_all('div', recursive=False):
         ...

How can one add <ul> tags in the right places? (I don't care about pretty printing etc.; I need a valid HTML structure with ul and li tags. Thanks.)

I'm not sure if it's exactly what you need, but Beautiful Soup's prettify might help you. It will basically take poorly structured HTML and pretty-print it with everything on its own line.
– Cody Reichert, Commented Oct 7, 2014 at 20:42

user3761405 · Accepted Answer · 2014-10-09 23:16:02Z

Depending on how the HTML is formatted, you can scan it line by line: a line with an opening tag but no closing tag marks the beginning of a ul, a line with both an opening and a closing tag is an li, and a line with only a closing tag marks the end of a ul. Something similar to the code below. To make this more robust, you could use BeautifulSoup's NavigableString.

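from bs4 import BeautifulSoup

x = """<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
  </div>
</div>"""

xs = x.split("\n")

html = ""  # accumulates the converted markup
for tag in xs:
    if "<div" in tag and "</div" in tag:
        # opening and closing tag on the same line: a leaf item
        soup = BeautifulSoup(tag, "html.parser")
        html = "{}\n{}".format(html, "<li>{}</li>".format(soup.text.strip()))
    elif "<div" in tag:
        # opening tag only: start a new list, keeping the text after the tag
        html = "{}\n{}".format(html, "<ul>\n<li>{}</li>".format(tag[tag.find(">") + 1:]))
    elif "</div" in tag:
        # closing tag only: end the current list
        html = "{}\n{}".format(html, "</ul>")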
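
The NavigableString suggestion can be taken further by walking the parsed tree instead of the raw lines, which does not depend on how the source is formatted. The following is only a minimal sketch, not part of the original answer: the helper name divs_to_list is hypothetical, and it assumes every div's own text sits in its direct string children and that only div elements matter.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def divs_to_list(parent, soup):
    """Build a <ul> from the direct <div> children of `parent`.

    Hypothetical helper; returns None when `parent` has no <div> children.
    """
    divs = parent.find_all("div", recursive=False)
    if not divs:
        return None
    ul = soup.new_tag("ul")
    for div in divs:
        li = soup.new_tag("li")
        # A div's own text lives in its direct NavigableString children;
        # text inside nested divs is handled by the recursive call below.
        own_text = " ".join(
            child.strip()
            for child in div.children
            if isinstance(child, NavigableString) and child.strip()
        )
        if own_text:
            li.append(own_text)
        nested = divs_to_list(div, soup)
        if nested is not None:
            li.append(nested)
        ul.append(li)
    return ul


source = "<div>a comment<div>an answer</div><div>an answer<div>a reply</div></div></div>"
soup = BeautifulSoup(source, "html.parser")
result = divs_to_list(soup, soup)
print(result)
```

Each group of sibling divs becomes one ul, and each nested ul is placed inside the li of its parent, so the result is a valid HTML list regardless of the whitespace in the input.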

Bob Over a year ago

Unfortunately, you are not truly parsing it with BeautifulSoup but with "if '<div' in line" and so on. Thanks anyway: your answer, combined with Cody's comment above, made me think of calling .prettify() first and then iterating over the lines and checking which tags are on them. This can actually work.
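
The prettify-then-scan idea from this comment might look roughly like the sketch below. It is an illustration only: the function name prettified_divs_to_list is hypothetical, and it assumes that prettify() puts every div tag and every piece of text on its own line and that the input contains nothing but div elements and text.

```python
from bs4 import BeautifulSoup


def prettified_divs_to_list(source):
    """Convert nested <div> markup to nested <ul>/<li> by scanning prettified lines.

    Hypothetical helper; assumes the markup contains only <div> tags and text.
    """
    pretty = BeautifulSoup(source, "html.parser").prettify()
    out = ["<ul>"]
    # One flag per currently open <div>: has a nested <ul> been opened in its <li>?
    has_sublist = []
    for line in pretty.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("</div"):
            # Close the nested list first if this item opened one.
            if has_sublist.pop():
                out.append("</ul>")
            out.append("</li>")
        elif line.startswith("<div"):
            # The first child div of an item opens that item's nested list.
            if has_sublist and not has_sublist[-1]:
                out.append("<ul>")
                has_sublist[-1] = True
            has_sublist.append(False)
            out.append("<li>")
        else:
            out.append(line)  # a line of text
    out.append("</ul>")
    return "".join(out)


source = "<div>a comment<div>an answer</div><div>an answer<div>a reply</div></div></div>"
print(prettified_divs_to_list(source))
```

Because prettify() normalizes the layout first, the line scan no longer depends on how the original markup was indented, which was the weak point of checking raw lines.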
