
I want to remove elements of a certain tag value and then write out the .xml file WITHOUT any tags for those deleted elements; is my only option to create a new tree?

There are two options to remove/delete an element:

clear() Resets an element. This function removes all subelements, clears all attributes, and sets the text and tail attributes to None.

At first I used this. It does remove the data from the element, but I'm still left with an empty element:

# Remove all elements from the tree that are NOT "job" or "make" or "build" elements
log = open("debug.log", "w")
for el in root.iter("*"):

    if el.tag != "job" and el.tag != "make" and el.tag != "build":
        print("removed = ", el.tag, el.attrib, file=log)
        el.clear()
    else:
        print("NOT", el.tag, el.attrib, file=log)

log.close()
tree.write("make_and_job_tree.xml", short_empty_elements=False)

The problem is that xml.etree.ElementTree.ElementTree.write() still writes out empty tags no matter what:

...The keyword-only short_empty_elements parameter controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.

Why isn't there an option to just not print out those empty tags! Whatever.

So then I thought I might try

remove(subelement) Removes subelement from the element. Unlike the find* methods this method compares elements based on the instance identity, not on tag value or contents.

But this only operates on the child elements.

So I'd have to do something like:

for el in root.iter("*"):
    for subel in el:
        if subel.tag != "make" and subel.tag != "job" and subel.tag != "build":
            el.remove(subel)

But there's a big problem here: I'm invalidating the iterator by removing elements, right?

Is it enough to simply check whether the element is empty by adding if subel?

if subel and subel.tag != "make" and subel.tag != "job" and subel.tag != "build":

Or do I have to get a new iterator to the tree elements every time I invalidate it?

Remember: I just wanted to write out the xml file with no tags for the empty elements.

Here's an example.

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Let's say I want to remove any mention of neighbor. Ideally, I'd want this output after the removal:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
    </country>
</data>

The problem is that when I run the code using clear() (see the first code block above) and write it to a file, I get this:

<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor></neighbor><neighbor></neighbor></country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor></neighbor></country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor></neighbor><neighbor></neighbor></country>
</data>

Notice neighbor still appears.

I know I could easily run a regex over the output but there's gotta be a way (or another Python api) that does this on the fly instead of requiring me to touch my .xml file again.

  • Can you add a sample of your xml and what you want as output? Also are you open to using lxml? Commented Jun 23, 2016 at 22:44
  • @PadraicCunningham if lxml is in Python, yes. I don't care which API I use. I'll update with before and after of what I'm looking for. Commented Jun 23, 2016 at 22:48
  • is python a requirement? Commented Jun 23, 2016 at 23:04
  • @vtd-xml-author no. I just chose Python cause debugging is easy and I already used it. What do you have in mind? Commented Jun 23, 2016 at 23:05
  • @PadraicCunningham how can I make this question linked to my question? It's the question that answers my question. EDIT: actually it doesn't answer it. It merely explains that one way to do it is not valid. Commented Jun 23, 2016 at 23:08

3 Answers

import lxml.etree as et

xml = et.parse("test.xml")

for node in xml.xpath("//neighbor"):
    node.getparent().remove(node)

xml.write("out.xml", encoding="utf-8", xml_declaration=True)

Using ElementTree, elements hold no reference to their parent, so we first find the parents of the neighbor nodes, then find the neighbor children of each parent and remove them:

from xml.etree import ElementTree as et

xml = et.parse("test.xml")

for parent in xml.getroot().findall(".//neighbor/.."):
    for child in parent.findall("./neighbor"):
        parent.remove(child)

xml.write("out.xml", encoding="utf-8", xml_declaration=True)

Both will give you:

<?xml version='1.0' encoding='utf-8'?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        </country>
</data>
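Note that each </country> closing tag in the output above is indented one level too deep. The whitespace that used to precede it was stored in the tail of the removed neighbor element, so it disappears along with that element. If the layout matters, one option is to re-indent the tree before writing. The following is only a sketch, assuming Python 3.9 or later (where ET.indent is available); SAMPLE and result are illustrative names.

```python
import xml.etree.ElementTree as ET

# Illustrative input, trimmed down from the question's example.
SAMPLE = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>"""

root = ET.fromstring(SAMPLE)

# Remove every <neighbor> from its direct parent.
for parent in root.findall(".//neighbor/.."):
    for child in parent.findall("neighbor"):
        parent.remove(child)

# ET.indent (Python 3.9+) rewrites the whitespace text/tail values in place,
# so the closing tags line up again after the removal.
ET.indent(root, space="    ")

result = ET.tostring(root, encoding="unicode")
print(result)
```

ET.indent discards the original whitespace layout, so this is only appropriate when whitespace-only text between elements carries no meaning in your document.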

Using your attribute logic and modifying the xml a bit like below:

x = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

Using lxml:

import lxml.etree as et

xml = et.fromstring(x)

for node in xml.xpath("//neighbor[not(@make) and not(@job) and not(@build)]"):
    node.getparent().remove(node)
print(et.tostring(xml))

Would give you:

 <data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        </country>
</data>

The same logic in ElementTree:

from xml.etree import ElementTree as et

xml = et.parse("test.xml").getroot()

atts = {"build", "job", "make"}

for parent in xml.findall(".//neighbor/.."):
    for child in parent.findall("./neighbor"):
        if not atts.issubset(child.attrib):
            parent.remove(child)
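Because ElementTree elements keep no reference to their parent, another way to reach each parent is to build a child-to-parent dictionary up front. This is a minimal sketch rather than part of the approach above: SAMPLE, REQUIRED, parent_of and kept are illustrative names, and the XML is a cut-down version of the modified document.

```python
import xml.etree.ElementTree as ET

# Illustrative input: only one <neighbor> carries all three attributes.
SAMPLE = """<data>
    <country name="Singapore">
        <neighbor name="Costa Rica" make="foo" build="bar" job="blah"/>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

REQUIRED = {"build", "job", "make"}

root = ET.fromstring(SAMPLE)

# Record the parent of every element, since Element has no parent pointer.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Materialise the matches first so removal never disturbs a live iterator.
for node in list(root.iter("neighbor")):
    if not REQUIRED.issubset(node.attrib):
        parent_of[node].remove(node)

kept = [n.get("name") for n in root.iter("neighbor")]
print(kept)
```

The map is a snapshot, so it would need rebuilding if elements were moved between parents afterwards.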

If you are using iter:

from xml.etree import ElementTree as et

xml = et.parse("test.xml")

for parent in xml.getroot().iter("*"):
    parent[:] = (child for child in parent if child.tag != "neighbor")

You can see we get the exact same output:

In [30]: !cat /home/padraic/untitled6/test.xml
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">#
      <neighbor name="Austria" direction="E"/>
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <year>2008</year>
      <neighbor name="Austria" direction="E"/>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
In [31]: paste
def test():
    import lxml.etree as et
    xml = et.parse("/home/padraic/untitled6/test.xml")
    for node in xml.xpath("//neighbor"):
        node.getparent().remove(node)
    a = et.tostring(xml)
    from xml.etree import ElementTree as et
    xml = et.parse("/home/padraic/untitled6/test.xml")
    for parent in xml.getroot().iter("*"):
        parent[:] = (child for child in parent if child.tag != "neighbor")
    b = et.tostring(xml.getroot())
    assert  a == b

## -- End pasted text --

In [32]: test()

12 Comments

Can you format the word "neighbor" more distinctly? When I first read your answer I thought you meant neighbor not "the tag called neighbor". I think the code format is appropriate. I'll try modifying your post first but for some reason my edits are never approved.
@Adrian, if I were using neighbour in a general context I would spell it correctly ;)
I didn't know you could use not() and chain several conditions with and inside an XPath predicate like that.
Yes, lxml has full XPath 1.0 support as well as a few extras: lxml.de/extensions.html#xpath-extension-functions
It is a generator expression, so the items are evaluated lazily. As for the [:] syntax, it selects all the nodes in the parent, so the assignment replaces the parent's children in place. If you wrote parent = [...] instead, you would only rebind the name parent to a new list without changing the content of the parent element.
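The difference between slice assignment and rebinding described in that last comment can be sketched as follows. This is an illustration only; the one-line document and the variable names are made up for the example, and it uses a list comprehension instead of the generator expression from the answer.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<country><rank>1</rank><neighbor/><year>2008</year><neighbor/></country>"
)

# Rebinding: `alias` ends up naming a brand-new list; the element is untouched.
alias = root
alias = [child for child in alias if child.tag != "neighbor"]
tags_after_rebinding = [child.tag for child in root]

# Slice assignment: replaces the element's children in place.
root[:] = [child for child in root if child.tag != "neighbor"]
tags_after_slice = [child.tag for child in root]

print(tags_after_rebinding)
print(tags_after_slice)
```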

Whenever you need to modify XML documents, consider XSLT as well, the special-purpose language in the XSL family, which also includes XPath. XSLT is designed specifically to transform XML files. Python users are not quick to recommend it, but it avoids the need for loops or nested if/then logic in general-purpose code. Python's lxml module can run XSLT 1.0 scripts using the libxslt processor.

The transformation below runs the identity transform to copy the document as is, and then applies an empty template match on <neighbor> to remove it:

XSLT Script (save as an .xsl file to be loaded just like source .xml, both of which are well-formed xml files)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM TO COPY XML AS IS -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- EMPTY TEMPLATE TO REMOVE NEIGHBOR WHEREVER IT EXISTS -->  
  <xsl:template match="neighbor"/>

</xsl:transform>

Python Script

import lxml.etree as et

# LOAD XML AND XSL DOCUMENTS
xml  = et.parse("Input.xml")
xslt = et.parse("Script.xsl")

# TRANSFORM TO NEW TREE
transform = et.XSLT(xslt)
newdom = transform(xml)

# CONVERT TO STRING
tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)

# OUTPUT TO FILE
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)


The trick here is to find the parent (the country node), and delete the neighbor from there. In this example, I am using ElementTree because I am somewhat familiar with it:

import xml.etree.ElementTree as ET

if __name__ == '__main__':
    with open('debug.log') as f:
        doc = ET.parse(f)

        for country in doc.findall('.//country'):
            for neighbor in country.findall('neighbor'):
                country.remove(neighbor)

        ET.dump(doc)  # Display
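On the question's worry about invalidating the iterator: the loop above is safe because findall() returns a list, so removing children does not disturb what is being looped over. Looping over the element itself while removing from it is a different matter. The sketch below shows the contrast on a made-up one-line document; XML, unsafe, safe, left_unsafe and left_safe are illustrative names.

```python
import xml.etree.ElementTree as ET

XML = (
    "<country>"
    "<neighbor name='A'/><neighbor name='B'/>"
    "<neighbor name='C'/><neighbor name='D'/>"
    "</country>"
)

# Unsafe: removing while iterating the element itself skips siblings,
# so some <neighbor> elements survive the loop.
unsafe = ET.fromstring(XML)
for child in unsafe:
    if child.tag == "neighbor":
        unsafe.remove(child)
left_unsafe = [c.get("name") for c in unsafe]

# Safe: findall() returns a list, so removal does not disturb the loop.
safe = ET.fromstring(XML)
for child in safe.findall("neighbor"):
    safe.remove(child)
left_safe = [c.get("name") for c in safe]

print(left_unsafe)
print(left_safe)
```

Wrapping the element in list(...) before looping gives the same protection when you are not filtering by tag.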
