1

In an ideal world, I'm trying to figure out how to load an html document into a list which is elements, for example:

elements=[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]

I've played a little with beautifulsoup, but can't see a way to do this.

Is this currently doable, or do I nee to write a parser.

2 Answers 2

0

In an ideal world(definition: One where the website you want to read has well-formed XHTML), you can toss it to an XML parser like lxml and you'll get something much like that back. Very short version:

  • Elements are lists, and the entries in the list are subelements, in proper order
  • Elements are dictionaries, which have the "key=value" attributes from the element.
  • Elements have a text attribute, which is the text between the opening element and it's first sub-element
  • Elements have a tail attribute, which is the text after the closing element.

Once you have a tree in a shape like that, then you can probably write a 3-line function that rebuilds it the way you want.


XHTML is basically restricted HTML - a combination between that and XML. In theory, sites should give your browser XHTML, since it's better in every way, but most browsers are a lot more permissive, and therefore don't provide the stricter set.

Some of the problems most sites have are for example the omitting of closing tags. XML parsers tend to error out on those.

Sign up to request clarification or add additional context in comments.

Comments

0

You can use recursion:

html = """
<html>
  <body>
     <h1>This is the first heading.</h1>
     <p>Someone made a paragraph. A short one.</p>
     <table>
       <tr>
         <td>a table cell</td>
       <tr>
     </table>
  </body>
</html>
"""
import bs4
def to_list(d):
   return [d.name, *[to_list(i) if not isinstance(i, bs4.element.NavigableString) else i for i in d.contents if i != '\n']]

_, *r = to_list(bs4.BeautifulSoup(html).body)
print(r)

Output:

[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell'], ['tr']]]]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.