Parsing and editing HTML files using Python

Question

The issue is as follows: I have a basic, auto-generated HTML file that is a dump from an object database. The information is table-based. The file structure is the same for each generation, and the content is generally coherent. I have to process this file further, add remarks, etc., so I would like to edit the HTML a bit: for example, add an extra table cell with a writable text field for remarks, and maybe a final button to generate some additional output. Now the questions:

I chose to write a Python script to handle these changes to the file. Is this the right choice, or can you suggest something better?

For now I'm dealing with it as follows:

1) Make a working copy of the base file

2) Open the working copy and read it into a string in Python:

content = content_file.read()

3) Run this through an html.parser object:

ModifyHtmlParser.feed(content)

4) Using overridden base-class methods of the HTML parser, I search for the interesting tags:

def handle_starttag(self, tag, attrs):
    # Called by the parser once for every opening tag in the input.
    if tag == "tr":
        print("Table row start!")
        offset = self.getpos()  # (line number, column) of the tag
        tagText = self.get_starttag_text()  # raw text of the start tag

As a result I get an immutable subset of the input and can mark tags, but for now I feel like I'm heading into a dead end... Any ideas on how I should rework my approach? Would any particular library be useful?

Ming · Accepted Answer · 2015-06-28 09:41:26Z


I would recommend the following general approach:

1) Load and parse the HTML into a convenient in-memory tree representation using any of the existing libraries for such tasks.
2) Find the relevant nodes in the tree. (Most libraries from step 1 provide some form of XPath and/or CSS selectors. Both let you find all nodes that satisfy a particular rule. In your case, the rule is probably "tr which ...".)
3) Process the found nodes individually (most libraries from step 1 let you edit the tree in place).
4) Write out either the modified tree or a newly generated tree.

Here is one particular example for how you could implement the above. (The exact choice of libraries is somewhat flexible. You have multiple options here.)

There are multiple options for an HTML parsing and representation library. The most common recommendation I hear these days is lxml.
lxml provides both CSS selector support and XPath support.
See the lxml etree documentation.
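
As a rough illustration of the four steps with lxml, here is a minimal sketch. It assumes the goal is one extra cell with a text input per data row; the function name `add_remarks_column`, the `remark` field name, and the `//tr[td]` rule are placeholders chosen for this example, not something taken from the actual dump.

```python
import lxml.html
from lxml import etree


def add_remarks_column(html_text):
    """Return html_text with an extra remarks cell appended to every data row."""
    # Step 1: parse the HTML into an in-memory tree.
    root = lxml.html.fromstring(html_text)

    # Step 2: find the relevant nodes; here, every <tr> that contains a <td>.
    for row in root.xpath("//tr[td]"):
        # Step 3: edit the tree in place (hypothetical cell layout).
        cell = etree.SubElement(row, "td")
        etree.SubElement(cell, "input", type="text", name="remark")

    # Step 4: serialize the modified tree back to a string.
    return lxml.html.tostring(root, encoding="unicode")


sample = "<table><tr><td>object 1</td></tr><tr><td>object 2</td></tr></table>"
result = add_remarks_column(sample)
print(result)
```

For a real file you would read the working copy into a string, pass it to the function, and write the returned string back out. The XPath expression is the place to encode the "tr which ..." rule for your particular table.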


2 Comments

Tomas Over a year ago

Hey, thank you for the help. So in general, as I see it, there is no out-of-the-box solution available in Python without using additional libraries?

Ming Over a year ago

In my opinion, the benefit of a 3rd party library like LXML (or Beautiful Soup, or any of a number of alternatives, really) outweighs the cost of the added dependency. You can definitely do this with just the standard library HTTP library and HTML parser, but the end result will be a lot less maintainable.
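
For comparison, here is a rough sketch of what a standard-library-only version might look like. It assumes a streaming approach: an `html.parser.HTMLParser` subclass re-emits everything it reads and injects one cell before each closing `</tr>`. The names `RemarkInjector` and `inject_remarks` and the cell markup are made up for this example. It also assumes every row has an explicit `</tr>` and does not handle processing instructions, which hints at why this route tends to be harder to maintain.

```python
from html.parser import HTMLParser


class RemarkInjector(HTMLParser):
    """Copy the input through, adding one cell before each </tr>."""

    # Hypothetical markup for the extra cell.
    REMARK_CELL = '<td><input type="text" name="remark"></td>'

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; separately,
        # so they can be written back out instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "tr":
            self.out.append(self.REMARK_CELL)
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def inject_remarks(html_text):
    parser = RemarkInjector()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


result = inject_remarks("<table><tr><td>a &amp; b</td></tr></table>")
print(result)
```

Every kind of token needs its own handler just to reproduce the input, before any real editing logic is added. A tree-based library does that bookkeeping for you.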
