I have a list of nodes that I would like to remove from an XML document, but I am running into an issue when removing the elements and writing the modified document to a new XML file.

Here is the Python program I wrote (I am using ElementTree):

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('autogen_test.xml')
root = tree.getroot()
keeper_data = ['4294905264']
instances = tree.findall('./DIMENSION/DIMENSION_NODE/DIMENSION_NODE')
removeList = list()
for instance in instances:
    #print instance
    data1 = instance.find('./DVAL/DVAL_ID')
    if data1.attrib.get("ID") not in keeper_data:
        removeList.append(instance)
for tag in removeList:
    parent = tree.findall('./DIMENSION/DIMENSION_NODE/DIMENSION_NODE')
    parent.remove(tag)
tree.write("out.xml")

My sample XML is below (this is a standard input and I cannot modify it):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DIMENSIONS SYSTEM "dimensions.dtd">
<DIMENSIONS>
   <NUM_DVALS>88816</NUM_DVALS>
   <DIMENSION NAME="Brand" SRC_FILE="" SRC_TYPE="INTERNAL">
      <DIMENSION_ID ID="4294905334"/>
      <DIMENSION_NODE>
         <DVAL TYPE="EXACT">
            <DVAL_ID ID="2"/>
            <SYN DISPLAY="TRUE" SEARCH="FALSE" CLASSIFY="FALSE">Brand</SYN>
         </DVAL>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905325"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">hanes</SYN>
            </DVAL>
         </DIMENSION_NODE>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905315"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">lee</SYN>
            </DVAL>
         </DIMENSION_NODE>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905281"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">levi's</SYN>
            </DVAL>
         </DIMENSION_NODE>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905264"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">braun</SYN>
            </DVAL>
         </DIMENSION_NODE>
        </DIMENSION_NODE>
   </DIMENSION>
   </DIMENSIONS>

Even after iterating through the list and finding all the nodes to remove, tree.write("out.xml") always writes out the original XML. Basically, I need to remove the identified nodes from the original XML.

Expected Output:

<DIMENSIONS>
   <NUM_DVALS>88816</NUM_DVALS>
   <DIMENSION NAME="Brand" SRC_FILE="" SRC_TYPE="INTERNAL">
      <DIMENSION_ID ID="4294905334" />
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905264" />
               <SYN CLASSIFY="TRUE" DISPLAY="TRUE" SEARCH="TRUE">braun</SYN>
            </DVAL>
         </DIMENSION_NODE>
        </DIMENSION_NODE>
   </DIMENSION>
   </DIMENSIONS>
  • Code doesn't work: parent is never defined. Commented Jul 22, 2015 at 22:07
  • @Noelkd Thanks. I edited the code to define the parent node, but I still run into the same issue. Commented Jul 22, 2015 at 22:32
  • Indentation is broken now :( As far as I can understand you're trying to get the data about node 4294905264? Commented Jul 22, 2015 at 23:33
  • @Noelkd Yes I am expecting the data from 4294905264. Commented Jul 23, 2015 at 0:30

1 Answer

All the DIMENSION_NODE elements to be deleted share the same parent DIMENSION_NODE, so it is more efficient to look that parent up once, before looping through removeList. More importantly, you need the parent DIMENSION_NODE rather than the child DIMENSION_NODE elements, so the correct XPath is ./DIMENSION/DIMENSION_NODE. You also need find(), which returns a single element, rather than findall(), which returns a plain list; calling remove() on that list does not change the tree. In short, replace your second for loop with the following code:

parent = tree.find('./DIMENSION/DIMENSION_NODE')
for tag in removeList:
    parent.remove(tag)  
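
To make the difference concrete, here is a minimal sketch on a tiny stand-in document (the tags a, b and c are made up for illustration and are not from the question). It shows that removing an item from the list returned by findall() leaves the tree alone, while remove() on the parent element actually detaches the child:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document, much smaller than the XML in the question.
root = ET.fromstring("<a><b><c id='1'/><c id='2'/></b></a>")

# findall() returns a plain Python list of Element objects.
matches = root.findall("./b/c")
matches.remove(matches[0])            # shrinks the list only
still_in_tree = len(root.findall("./b/c"))   # the tree still has both <c> nodes

# find() returns the single parent Element; its remove() edits the tree.
parent = root.find("./b")
parent.remove(parent[0])              # detaches the first <c> from the tree
ids_left = [c.get("id") for c in root.iter("c")]
```

This is why out.xml kept coming out identical to the input: the original loop only ever modified a temporary list.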

Here is a full working example for demonstration (you only need to replace the value of source with the actual XML):

import xml.etree.ElementTree as ET

source = """replace with the XML in question"""

root = ET.fromstring(source)
keeper_data = ['4294905264']
instances = root.findall('.//DIMENSION/DIMENSION_NODE/DIMENSION_NODE')
removeList = list()
for instance in instances:
    data1 = instance.find('./DVAL/DVAL_ID')
    if data1.attrib.get("ID") not in keeper_data:
        removeList.append(instance)
parent = root.find('.//DIMENSION/DIMENSION_NODE')
for tag in removeList:
    parent.remove(tag)

print(ET.tostring(root))

Given the XML from the question as the value of the source variable, the output is:

<DIMENSIONS>
   <NUM_DVALS>88816</NUM_DVALS>
   <DIMENSION NAME="Brand" SRC_FILE="" SRC_TYPE="INTERNAL">
      <DIMENSION_ID ID="4294905334" />
      <DIMENSION_NODE>
         <DVAL TYPE="EXACT">
            <DVAL_ID ID="2" />
            <SYN CLASSIFY="FALSE" DISPLAY="TRUE" SEARCH="FALSE">Brand</SYN>
         </DVAL>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905264" />
               <SYN CLASSIFY="TRUE" DISPLAY="TRUE" SEARCH="TRUE">braun</SYN>
            </DVAL>
         </DIMENSION_NODE>
        </DIMENSION_NODE>
   </DIMENSION>
</DIMENSIONS>
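
If the nodes to delete might not all share one parent (for example, with several DIMENSION elements in the file), one possible variation is to build a child-to-parent map first, since ElementTree elements do not keep a reference to their parent. The sketch below is only an illustration under assumptions: the helper name prune_nodes and the trimmed-down sample are hypothetical, and it assumes that a node with no DVAL_ID should be left in place.

```python
import xml.etree.ElementTree as ET

def prune_nodes(root, keep_ids):
    """Remove nested DIMENSION_NODE elements whose DVAL_ID is not in keep_ids.

    Hypothetical helper; returns the number of removed nodes.
    """
    # Elements have no parent pointer, so map every child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Collect first, remove afterwards, so the tree is not mutated mid-search.
    doomed = []
    for node in root.findall("./DIMENSION/DIMENSION_NODE/DIMENSION_NODE"):
        dval_id = node.find("./DVAL/DVAL_ID")
        # Assumption: a node without a DVAL_ID is kept rather than removed.
        if dval_id is not None and dval_id.get("ID") not in keep_ids:
            doomed.append(node)

    for node in doomed:
        parent_of[node].remove(node)
    return len(doomed)

# Trimmed-down sample with the same nesting as the XML in the question.
sample = """<DIMENSIONS>
  <DIMENSION NAME="Brand">
    <DIMENSION_NODE>
      <DVAL><DVAL_ID ID="2"/></DVAL>
      <DIMENSION_NODE><DVAL><DVAL_ID ID="4294905325"/></DVAL></DIMENSION_NODE>
      <DIMENSION_NODE><DVAL><DVAL_ID ID="4294905264"/></DVAL></DIMENSION_NODE>
    </DIMENSION_NODE>
  </DIMENSION>
</DIMENSIONS>"""

root = ET.fromstring(sample)
removed = prune_nodes(root, {"4294905264"})
remaining = [e.get("ID") for e in root.iter("DVAL_ID")]
```

When working from a file as in the question, the same function should apply to ET.parse('autogen_test.xml').getroot(), followed by tree.write("out.xml", encoding="UTF-8", xml_declaration=True). Note that, as far as I know, the standard ElementTree parser does not keep the DOCTYPE line, so it would have to be written back separately if the consumer of out.xml requires it.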