How can I load an HTML file into a multilevel array of elements in Python?

Question

In an ideal world, I'd like to load an HTML document into a nested list of elements, for example:

elements=[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]

I've played a little with BeautifulSoup, but can't see a way to do this.

Is this currently doable, or do I need to write a parser?

Gloweye · Accepted Answer · 2019-11-11 12:24:08Z

In an ideal world (definition: one where the website you want to read has well-formed XHTML), you can toss it to an XML parser like lxml and you'll get something much like that back. Very short version:

Elements behave like lists: the entries are the sub-elements, in document order.
Elements behave like dictionaries for attributes: the "key=value" pairs from the tag are available through get() and attrib.
Elements have a text attribute, which is the text between the opening tag and its first sub-element.
Elements have a tail attribute, which is the text after the element's closing tag, up to the next element.
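
A minimal sketch of those four properties. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs; lxml.etree follows the same Element interface for the calls shown here. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not from the question).
root = ET.fromstring('<p class="intro">Hello <b>bold</b> world</p>')

first_child = root[0]            # list-like: index into the sub-elements
print(len(root), first_child.tag)

print(root.get("class"))         # dict-like: look up a single attribute
print(root.attrib)               # all attributes as a dictionary

print(repr(root.text))           # text before the first sub-element
print(repr(first_child.tail))    # text after the closing </b> tag
```

Note that the text after a sub-element belongs to that sub-element's tail, not to the parent's text.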

Once you have a tree in a shape like that, then you can probably write a 3-line function that rebuilds it the way you want.
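
Such a function might look roughly like the sketch below. It assumes well-formed input, uses the standard library's xml.etree.ElementTree (the same code should work on an lxml tree), and ignores tail text and attributes; the name element_to_list is made up for this example.

```python
import xml.etree.ElementTree as ET

def element_to_list(el):
    """Return [tag, text (if any), child lists...]; tail text and attributes are dropped."""
    text = (el.text or "").strip()
    return [el.tag] + ([text] if text else []) + [element_to_list(child) for child in el]

# Well-formed sample input modelled on the question's example.
xhtml = (
    "<body>"
    "<h1>This is the first heading.</h1>"
    "<table><tr><td>a table cell</td></tr></table>"
    "</body>"
)

# Convert each child of <body>, so the outer 'body' entry is left out.
elements = [element_to_list(child) for child in ET.fromstring(xhtml)]
print(elements)
```

If you need text that follows a nested tag (for example "world" in "Hello <b>bold</b> world"), you would also have to append each child's tail.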

XHTML is basically restricted HTML: HTML that also follows XML's well-formedness rules. In theory, sites should give your browser XHTML, since it's better in every way, but most browsers are a lot more permissive, and therefore most sites don't bother to provide the stricter form.

One problem many sites have, for example, is omitted closing tags. XML parsers tend to error out on those.
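
To illustrate, here is a small sketch of a strict XML parser rejecting an unclosed tag. It uses the standard library's xml.etree.ElementTree; lxml's XML parser should reject the same input by default, with its own exception type. The broken markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Typical "tag soup": the <p> elements are never closed.
broken = "<body><p>First paragraph<p>Second paragraph</body>"

try:
    ET.fromstring(broken)
    parsed = True
except ET.ParseError as exc:
    parsed = False
    print("not well-formed:", exc)
```

A lenient HTML parser, such as the ones BeautifulSoup can use, will try to repair this kind of input instead of raising.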

Ajax1234 · Accepted Answer · 2019-11-11 13:44:26Z

You can use recursion:

import bs4

html = """
<html>
  <body>
     <h1>This is the first heading.</h1>
     <p>Someone made a paragraph. A short one.</p>
     <table>
       <tr>
         <td>a table cell</td>
       </tr>
     </table>
  </body>
</html>
"""

def to_list(d):
   # Build [tag name, child, child, ...], recursing into nested tags.
   result = [d.name]
   for child in d.contents:
      if isinstance(child, bs4.element.NavigableString):
         if child.strip():  # skip whitespace-only text between tags
            result.append(str(child))
      else:
         result.append(to_list(child))
   return result

# Name the parser explicitly; drop the leading 'body' entry from the result.
_, *r = to_list(bs4.BeautifulSoup(html, 'html.parser').body)
print(r)

Output:

[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]
