I have written code that should remove from tes.xml every country whose rank is not in the list lis, and write the result to output.xml. However, the output still contains countries whose ranks are not in the list.

tes.xml

<?xml version="1.0"?>
<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
    <country>
      <state>
        <rank updated="yes">789045</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
    <country>
      <state>
        <rank updated="yes">67846464</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
  </continents>  
</data>

code:

import xml.etree.ElementTree as ET
tree = ET.parse('tes.xml')

lis = ["123456"]
root = tree.getroot()
print('root is', root)
print(type(root))

for continent in root.findall('.//continents'):
    for country in continent:
        rank = country.find('state/rank').text
        print(rank)
        if rank not in lis:
            continent.remove(country)

tree.write('output.xml')

Console output: not all ranks from the XML are even printed. 67846464 is skipped, so that country is never checked and ends up in output.xml even though its rank is not in the list.

root is <Element 'data' at 0x7f5929a9d8b0>
<class 'xml.etree.ElementTree.Element'>
123456
789045

Current output: contains two ranks, 123456 and 67846464

<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E" />
        <neighbor name="Switzerland" direction="W" />
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
    <country>
      <state>
        <rank updated="yes">67846464</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N" />
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
  </continents>  
</data>

Expected output: only 123456 should appear, since 67846464 is not in the list

<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E" />
        <neighbor name="Switzerland" direction="W" />
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
  </continents>  
</data>

2 Answers

I got it to work fine with BeautifulSoup. I just stuck the XML code in as a string:

input = """
<?xml version="1.0"?>
<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
    <country>
      <state>
        <rank updated="yes">789045</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
    <country>
      <state>
        <rank updated="yes">67846464</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
  </continents>  
</data>
"""

And here's the real coding part:

from bs4 import BeautifulSoup

lis = ["123456"]

# Turn the XML into one big BS object
soup = BeautifulSoup(input, "lxml")

# Parse through to find all <country> tags.  
# From each, grab the <rank> value.  If the rank value
# is not in the list, delete the respective <country> tag.
for country in soup.find_all("country"):
    rank = country.find("rank").text
    if rank not in lis:
        country.decompose()

print(soup.prettify())

This gives me the expected output with only the matching country. When I change lis to ["123456", "67846464"], I get the expected two countries in the output.

2 Comments

But what if the XML is very large, say 60,000 countries? Can I still put the whole XML in the input variable?
I only pasted it in as a string to get the code working. If the XML file is that large, you'll want to read it in some other way. Offhand, I don't know how well BeautifulSoup itself handles huge input.
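
For a large file, one option is to skip the pasted string entirely and parse the file from disk. Below is a rough sketch using the standard library's ElementTree rather than BeautifulSoup, assuming the file has the same layout as tes.xml. The function name filter_countries and its parameters are hypothetical, and dropping countries that have no rank at all is an assumption of this sketch, not something the question specifies.

```python
import xml.etree.ElementTree as ET


def filter_countries(src_path, dst_path, keep_ranks):
    """Write a copy of src_path keeping only countries whose rank is in keep_ranks.

    Hypothetical helper: assumes each <country> sits directly under a
    <continents> element and stores its rank at state/rank.
    """
    keep = set(keep_ranks)  # set lookups stay cheap with many ranks
    tree = ET.parse(src_path)
    for continent in tree.getroot().iter('continents'):
        # findall() returns a separate list, so removing while looping is safe
        for country in continent.findall('country'):
            rank = country.find('state/rank')
            # Assumption: a country without a rank is dropped too
            if rank is None or rank.text not in keep:
                continent.remove(country)
    tree.write(dst_path)
    return tree
```

It would be called as filter_countries('tes.xml', 'output.xml', lis). The whole tree is still held in memory, which should be fine for tens of thousands of small country elements.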

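The problem in your code is that you are removing elements from continent while iterating over it. Each removal shifts the following children down by one position, so the loop skips the element right after every removed one (here, the country with rank 67846464). Iterate over the list returned by findall instead, which is not affected by the removals: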

for continent in root.findall('.//continents'):
    for country in continent.findall('./country'):
        if country.find('state/rank').text not in lis:
            continent.remove(country)
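
To see the difference side by side, here is a minimal sketch on a trimmed-down version of the XML. The SAMPLE string and the remaining_ranks helper are made up for illustration; the two loops are the one from the question and the one above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for tes.xml (illustrative only)
SAMPLE = """<data>
  <continents>
    <country><state><rank>123456</rank></state></country>
    <country><state><rank>789045</rank></state></country>
    <country><state><rank>67846464</rank></state></country>
  </continents>
</data>"""

lis = ["123456"]


def remaining_ranks(root):
    """Return the ranks still present in the tree, in document order."""
    return [rank.text for rank in root.iter('rank')]


# Question's loop: iterates over the live child list while removing from it
buggy_root = ET.fromstring(SAMPLE)
for continent in buggy_root.findall('.//continents'):
    for country in continent:
        if country.find('state/rank').text not in lis:
            continent.remove(country)

# Fixed loop: findall() returns a separate list, so removal is safe
fixed_root = ET.fromstring(SAMPLE)
for continent in fixed_root.findall('.//continents'):
    for country in continent.findall('./country'):
        if country.find('state/rank').text not in lis:
            continent.remove(country)

print(remaining_ranks(buggy_root))
print(remaining_ranks(fixed_root))
```

The first list still contains 67846464, matching the output in the question, while the second contains only 123456.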
