I am trying to convert a structure like this (some nested XML/HTML):

<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
    ...
  </div>
  ...
</div>
...

Clarification: it can be formatted like <div>a comment<div>an answer</div> or in any other way (not prettified, etc.)

(which has multiple nodes at different depths)

into the corresponding list structure with parent <ul> tags (i.e. an ordinary HTML list):

<ul>
  <li>1
    <ul>
      <li>2</li>
      ...
   </ul>
  </li>
  ...
</ul>

I tried to use BeautifulSoup like this:

from bs4 import BeautifulSoup as BS

bs = BS(source_xml, 'html.parser')  # source_xml holds the markup shown above
for i in bs.find_all('div'):
    i.name = 'li'

# but this only renames div tags to li tags; I still need to add ul tags

I can iterate through the nesting levels like this, but I still can't figure out how to group the tags located on the same level so that I can wrap them in a ul tag:
for i in bs.find_all('div', recursive=False):
    # how to wrap the following iterated items in 'ul' tag?
    for j in i.find_all('div', recursive=False):
         ...

How can one add <ul> tags in the right places? (I don't care about pretty printing etc.; I just need a valid HTML structure with ul and li tags. Thanks.)

    I'm not sure if it's exactly what you need, but Beautiful Soup prettify might help you. It will basically take poorly structured HTML and pretty print with everything on its own line. Commented Oct 7, 2014 at 20:42

1 Answer

Depending on how the HTML is formatted, you can just look at each line: an opening tag with no closing tag marks the beginning of a ul, an opening and a closing tag together make an li, and a lone closing tag marks the end of a ul. Something similar to the code below. To make this more robust, you could use BeautifulSoup's NavigableString.

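from bs4 import BeautifulSoup

x = """<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
  </div>
</div>"""

html = ""  # accumulates the converted markup

# Assumes each tag sits on its own line, as in the sample above.
for tag in x.split("\n"):
    if "<div" in tag and "</div" in tag:
        # Opening and closing tag on one line: a leaf item.
        soup = BeautifulSoup(tag, "html.parser")
        html = "{}\n{}".format(html, "<li>{}</li>".format(soup.text))
    elif "<div" in tag:
        # Opening tag only: start a new list.
        html = "{}\n{}".format(html, "<ul>\n<li>{}</li>".format(tag[tag.find(">") + 1:]))
    elif "</div" in tag:
        # Closing tag only: end the current list.
        html = "{}\n{}".format(html, "</ul>")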
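
As a rough sketch of the more robust NavigableString route mentioned above, one could walk the parsed tree recursively instead of matching strings line by line. The helper name divs_to_list and the one-line sample input are made up for illustration, and the sketch assumes that only div elements and their directly owned text matter:

```python
from html import escape

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def divs_to_list(parent):
    """Return a <ul> string for the direct <div> children of `parent` ('' if none).

    Hypothetical helper, not part of BeautifulSoup.
    """
    items = []
    for div in parent.find_all("div", recursive=False):
        # Collect only the text owned by this div, not text inside nested divs.
        text = "".join(
            child for child in div.children if isinstance(child, NavigableString)
        ).strip()
        # Nested divs become a nested <ul> inside the same <li>.
        items.append("<li>{}{}</li>".format(escape(text, quote=False), divs_to_list(div)))
    return "<ul>{}</ul>".format("".join(items)) if items else ""


# Sample input deliberately not prettified, as allowed in the question.
source = "<div>a comment<div>an answer</div><div>an answer<div>a reply</div></div></div>"
soup = BeautifulSoup(source, "html.parser")
result = divs_to_list(soup)
print(result)
```

Because the recursion follows the tree rather than the line layout, the same output should come out whether or not the input is prettified. Each group of sibling divs gets exactly one ul, and every nested ul stays inside its parent li.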

1 Comment

Unfortunately, you are not truly parsing it with BeautifulSoup but with "if '<div' in line" etc. Thanks anyway: your answer combined with Cody's comment above made me think of calling .prettify() first, then iterating over the lines and checking which tags are on each of them. This can actually work.
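
A variation on that idea that does not depend on line layout at all is to react to parser events for opening and closing tags. The sketch below uses the standard library's html.parser rather than BeautifulSoup, which is a swapped-in technique and not what the answer proposes. The class name DivToList and the sample input are hypothetical, and every tag other than div is simply ignored:

```python
from html import escape
from html.parser import HTMLParser


class DivToList(HTMLParser):
    """Rewrite nested <div> elements as nested <ul>/<li> lists (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.out = []
        # One flag per nesting level: has a <ul> already been opened there?
        self.ul_open = [False]

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if not self.ul_open[-1]:
            # First div on this level: open the list that groups its siblings.
            self.out.append("<ul>")
            self.ul_open[-1] = True
        self.out.append("<li>")
        self.ul_open.append(False)

    def handle_endtag(self, tag):
        if tag != "div" or len(self.ul_open) < 2:
            return
        if self.ul_open.pop():
            # This div had child divs, so close their list first.
            self.out.append("</ul>")
        self.out.append("</li>")

    def handle_data(self, data):
        text = data.strip()
        if text and len(self.ul_open) > 1:
            self.out.append(escape(text, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        tail = "</ul>" if self.ul_open[0] else ""
        return "".join(self.out) + tail


parser = DivToList()
parser.feed("<div>a comment<div>an answer</div><div>an answer<div>a reply</div></div></div>")
converted = parser.result()
print(converted)
```

The stack of flags answers the "where does the ul go" question: a ul is opened by the first div seen on a level and closed when the parent div closes, so siblings always share one list.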
