I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.
A sample HTML file looks something like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<!-- here is a comment inside the head tag -->
</head>
<body>
...
<!-- Comment inside body tag -->
<!-- Another comment inside body tag -->
<!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
</body>
</html>
<!-- This comment is the last line of the file -->
I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the commenting is giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but anything inside the html tag and the last comment after the closing html tag I was not.
Here is my code:
file = open ("sample.html", "r")
soup = BeautifulSoup(file, "xml")
for child in soup.children:
# This takes care of the first two HTML comments
if isinstance(child, bs4.Comment):
child.replace_with("<comment>" + child.strip() + "</comment>")
# This should find all nested HTML comments and replace.
# It looks like it works but the changes are not finalized
if isinstance(child, bs4.Tag):
re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)
re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)
# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))
This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?
Could I convert it to a byte stream, via pickle, serializing it, applying regex, and then deseralize it back to a BeautifulSoup object? Would this work or just cause more problems?
I tried using pickle on the child tag object but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.
Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.
xmlfile once all changes applied?Chilkathas a (not free) HTML-to-XML conversion python library which converts all HTML comments to<comment>and the XML file looks good.