How to replace HTML comments with custom <comment> elements

Question

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the commenting is giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but anything inside the html tag and the last comment after the closing html tag I was not.

Here is my code:

file = open ("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I convert it to a byte stream, via pickle, serializing it, applying regex, and then deseralize it back to a BeautifulSoup object? Would this work or just cause more problems?

I tried using pickle on the child tag object but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.

Would not result a bad-formed xml file once all changes applied? — Birei
– Birei, Commented Feb 18, 2015 at 17:21
I don't know but Chilkat has a (not free) HTML-to-XML conversion python library which converts all HTML comments to <comment> and the XML file looks good. — user3621633
– user3621633, Commented Feb 18, 2015 at 17:27

Zero Piraeus · Accepted Answer · 2015-02-18 19:12:45Z

5

You have a couple of problems:

You can't modify child.text. it's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

re.sub() doesn't modify anything in-place. Your line

re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

would have had to be

child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

... but that wouldn't work anyway, because of point 1.

Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.

Here's a solution that works:

import bs4

with open("example.html") as f:
    soup = bs4.BeautifulSoup(f)

for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")
    tag.string = comment.strip()
    comment.replace_with(tag)

This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just a shorthand for e.g.

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.

Here's the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>

edited Feb 18, 2015 at 19:12

answered Feb 18, 2015 at 18:51

Zero Piraeus

59.7k28 gold badges158 silver badges164 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

user3621633 Over a year ago

Thank you so much for your help. I'm just reading through this now. If you can, would you please explain or elaborate each line in the python code, starting with the for. Again, I'm still new to Beautiful Soup. Thank you in advance! I'm not at all familiar with the usage of lamda.

Zero Piraeus Over a year ago

@user3621633 I've added an explanation of what the code does now.

user3621633 Over a year ago

This worked! Thank you. The only thing I would add is I needed it in XML format. So, I think I have to apply BeautifulSoup with xml just before writing out to the output file. If I use xml the first time, the HTML comments come out as HTML entities (i.e. <) and I'm not sure if BeautifulSoup can work with those.

Collectives™ on Stack Overflow

How to replace HTML comments with custom <comment> elements

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related