7

I am trying to build a Python script that takes an XML document and removes all of the comment blocks from it.

I tried something along the lines of:

tree = ElementTree()
tree.parse(file)
commentElements = tree.findall('//comment()')

for element in commentElements:
    element.parentNode.remove(element)

Doing this yields a weird error from Python: "KeyError: '()'".

I know there are ways to easily edit the file using other methods (like sed), but I have to do it in a Python script.

  • Could you maybe add a little example XML document? Commented May 3, 2012 at 17:51
  • '//comment()' does not seem to be a valid search path format and is causing the KeyError. Can you please include that XML sample and expand on the pattern you are trying to catch? Commented May 3, 2012 at 18:27
  • comment() is an XPath node test that is not supported by ElementTree. Try lxml, which has full support for XPath 1.0. Commented May 3, 2012 at 18:35
  • lxml also implements the etree interface, AFAIK. Commented May 3, 2012 at 19:25

4 Answers

12

comment() is an XPath node test that is not supported by ElementTree.

You can use comment() with lxml. This library is quite similar to ElementTree and it has full support for XPath 1.0.

Here is how you can remove comments with lxml:

from lxml import etree

XML = """<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
</root>"""

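tree = etree.fromstring(XML)

# comment() is an XPath node test that selects every comment node
comments = tree.xpath('//comment()')

for c in comments:
    # Detach each comment from its parent element
    p = c.getparent()
    p.remove(c)

print(etree.tostring(tree, encoding="unicode"))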

Output:

<root>
  <x>TEXT 1</x>
  <y>TEXT 2 </y>
</root>
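
If lxml is not an option, the standard library may already be enough. The following is a minimal sketch, assuming that the default parser of xml.etree.ElementTree discards comments while parsing (which, to my knowledge, is its default behavior), so that parsing and re-serializing drops them. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
</root>"""

# The default parser does not build nodes for comments,
# so they are already gone from the resulting tree.
root = ET.fromstring(XML)

# Serialize back to a str; no comment nodes are left to write out.
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that this approach does not let you inspect the comments first; they simply never reach the tree.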

8

Use strip_tags() from lxml.etree

from lxml import etree
XML = """<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
  </root>"""

tree = etree.fromstring(XML)
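print(etree.tostring(tree, encoding="unicode"))
# Passing etree.Comment strips the comment nodes but keeps the text that follows them
etree.strip_tags(tree, etree.Comment)
print(etree.tostring(tree, encoding="unicode"))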

Output:

<root>
<!-- COMMENT 1 -->
<x>TEXT 1</x>
<y>TEXT 2 <!-- COMMENT 2 --></y>
</root>
<root>

<x>TEXT 1</x>
<y>TEXT 2 </y>
</root>
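
One difference between the two lxml approaches shows up when text follows a comment inside an element. The sketch below uses a made-up input with the word TAIL after the comment (not part of the original example) to compare remove() with strip_tags(). As far as I understand lxml, remove() takes the comment's tail text with it, while strip_tags() leaves that text in the parent.

```python
from lxml import etree

# Hypothetical input: the text " TAIL" follows the comment inside <y>
XML = "<y>TEXT 2 <!-- COMMENT 2 --> TAIL</y>"

# Variant 1: remove() detaches the comment together with its tail text
tree_a = etree.fromstring(XML)
for c in tree_a.xpath('//comment()'):
    c.getparent().remove(c)
removed = etree.tostring(tree_a, encoding="unicode")

# Variant 2: strip_tags() drops the comment but keeps the tail text in <y>
tree_b = etree.fromstring(XML)
etree.strip_tags(tree_b, etree.Comment)
stripped = etree.tostring(tree_b, encoding="unicode")

print(removed)
print(stripped)
```

If the text after a comment matters in your documents, strip_tags() (or the remove_comments parser option) is probably the safer choice.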


6

The same as

https://stackoverflow.com/a/3317008/1458574

from lxml import etree
import sys

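# Read bytes so that an XML encoding declaration in the file is honored
with open(sys.argv[1], "rb") as f:
    XML = f.read()
# remove_comments=True tells the parser to discard comments while parsing
parser = etree.XMLParser(remove_comments=True)
tree = etree.fromstring(XML, parser=parser)
print(etree.tostring(tree, encoding="unicode"))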

1 Comment

remove_comments=True works fine, but it's not used in the linked answer. So why do you say that it is "the same"?

3

This is the solution I implemented using minidom:

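from xml import dom

def removeCommentNodes(self):
    # Visit every element of the parsed document (self.dom)
    for tag in self.dom.getElementsByTagName("*"):
        # Iterate over a copy: childNodes is live, so removing while looping skips nodes
        for n in list(tag.childNodes):
            if n.nodeType == dom.Node.COMMENT_NODE:
                tag.removeChild(n)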

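I first retrieve every element in the document, then check each element's child nodes and remove any that are comments. (self.dom is a reference to the parsed XML document.) Note that comments outside the root element are not children of any element, so this method does not touch them.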
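
For reference, here is a self-contained sketch of the same minidom idea that runs outside a class. The function name remove_comment_nodes and the sample XML are my own, not from the code above. It walks the tree recursively from the Document node, so it should also catch comments that sit outside the root element.

```python
from xml.dom import minidom, Node

def remove_comment_nodes(node):
    """Recursively remove every comment node below `node` (illustrative helper)."""
    # Iterate over a copy, because childNodes changes as nodes are removed
    for child in list(node.childNodes):
        if child.nodeType == Node.COMMENT_NODE:
            node.removeChild(child)
        else:
            remove_comment_nodes(child)

XML = """<!-- TOP LEVEL COMMENT --><root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
</root>"""

doc = minidom.parseString(XML)
remove_comment_nodes(doc)   # start at the Document so top-level comments are included
result = doc.toxml()
print(result)
```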
