The issue is as follows: I have a basic, auto-generated HTML file that is a dump from an object database. It contains table-based information. The structure of the file is the same for each generation, and the content is generally consistent. I have to process this file further (add remarks, etc.), so I want to edit the HTML a bit: for example, add an extra table cell with a writable text field for remarks, and maybe a final button to generate some additional output. Now the questions:

I chose to write a Python script to handle these changes to the file. Is this the right choice, or can you suggest something better?

For now I'm dealing with it as follows:

1) Make a working copy of the base file

2) Open the working copy and read it into a string in Python:

content = content_file.read()

3) Run this through an html.parser object:

parser = ModifyHtmlParser()
parser.feed(content)

4) Using overridden base-class methods of the HTML parser, I search for the interesting tags:

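from html.parser import HTMLParser

class ModifyHtmlParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called by the parser for every opening tag.
        if tag == "tr":
            print("Table row start!")
            offset = self.getpos()  # (line number, column) of this tag
            tagText = self.get_starttag_text()  # raw text of the start tag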

As a result I get an immutable subset of the input plus the marked tags, and for now I feel like I'm heading into a dead end... Any ideas on how I should rework my approach? Could any particular library be useful?

1 Answer

I would recommend you use the following general approach.

  1. Load and parse the HTML into a convenient in-memory tree representation using any of the existing libraries for such tasks.
  2. Find the relevant nodes in the tree. (Most libraries from step 1 provide some form of XPath and/or CSS selectors. Both let you find all nodes that satisfy a particular rule. In your case, the rule is probably "tr which ...".)
  3. Process the found nodes individually (most libraries from step 1 let you edit the tree in place).
  4. Write out either the modified tree or a newly generated tree.

Here is one particular example for how you could implement the above. (The exact choice of libraries is somewhat flexible. You have multiple options here.)

  1. There are multiple options for an HTML parsing and representation library. The most common recommendation I hear these days is LXML.
  2. LXML provides both CSS selector support and XPath support.
  3. See the LXML etree documentation.
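
As a rough illustration of the four steps with LXML, here is a minimal sketch. It assumes the third-party lxml package is installed and that every row should receive one extra cell; the function name `add_remark_cells`, the field names `remark_0`, `remark_1`, ... and the sample table are made up for the example, not taken from your file.

```python
# Requires the third-party package lxml (pip install lxml).
from lxml import etree, html


def add_remark_cells(source):
    """Return the HTML in `source` with a text-input cell appended to each row."""
    tree = html.fromstring(source)                      # step 1: parse into a tree
    for index, row in enumerate(tree.xpath(".//tr")):   # step 2: find the rows
        # step 3: edit the tree in place
        cell = etree.SubElement(row, "td")
        field = etree.SubElement(cell, "input")
        field.set("type", "text")
        field.set("name", f"remark_{index}")            # hypothetical field name
    return html.tostring(tree, encoding="unicode")      # step 4: serialize


# Made-up sample input standing in for the generated dump.
sample = "<table><tr><td>alpha</td></tr><tr><td>beta</td></tr></table>"
result = add_remark_cells(sample)
print(result)
```

For a real file you would read the working copy into `source` and write the returned string back out. A narrower XPath rule than `.//tr` (for example, one that skips header rows) is probably what you want in practice.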
2 Comments

Hey, thank you for the help. So in general, as I understand it, there is no out-of-the-box solution available in Python without using additional libraries?
In my opinion, the benefit of a third-party library like LXML (or Beautiful Soup, or any of a number of alternatives, really) outweighs the cost of the added dependency. You can definitely do this with just the standard library's HTML parser, but the end result will be a lot less maintainable.
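
For comparison, a standard-library-only version might look roughly like the sketch below. It re-emits everything the parser reports and slips an extra cell in before each closing `</tr>`. The class name `RemarkCellWriter`, the `REMARK_CELL` markup and the sample input are invented for the example, and it assumes every row has an explicit `</tr>`.

```python
from html.parser import HTMLParser

# Hypothetical markup for the extra cell.
REMARK_CELL = '<td><input type="text" name="remark"></td>'


class RemarkCellWriter(HTMLParser):
    """Re-emit the input as parsed, adding a remark cell before each </tr>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "tr":
            self.out.append(REMARK_CELL)
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


writer = RemarkCellWriter()
writer.feed("<table><tr><td>a &amp; b</td></tr></table>")
writer.close()
result = "".join(writer.out)
print(result)
```

Every kind of token needs its own pass-through handler, and rows without an explicit closing tag are silently missed, which is the maintainability cost mentioned above.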
