I have this XML and I want to parse it into a pandas DataFrame:

<DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>   
  <CPE>PT0002000022161425NP</CPE>
  <CPE>PT0002000022161458JH</CPE>
  <CPE>PT0002000022161471ZP</CPE>
  <CPE>PT0002000022161505SL</CPE>
</DISTRITO>

and this is my Python code:

from lxml import objectify
from lxml import etree
import pandas as pd

path = '/TestFile.xml'
xml = objectify.parse(open(path))
root = xml.getroot()
data = []

for i in root:     
    el_data = {}
    for child in root.getchildren():        
        el_data[child.tag] = child.pyval
       # print el_data
        data.append(el_data)

df = pd.DataFrame(data)

The problem is that the result only contains the value of the last "CPE" node, repeated on every row:

                    CPE NOME_DISTRITO
0  PT0002000022161505SL      BRAGANCA
1  PT0002000022161505SL      BRAGANCA
2  PT0002000022161505SL      BRAGANCA
3  PT0002000022161505SL      BRAGANCA
4  PT0002000022161505SL      BRAGANCA

I dug a little into my XML file and found that this happens when several nodes share the same name. For example, if my file were this:

  <DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>   
  <CPE1>PT0002000022161425NP</CPE1>
  <CPE2>PT0002000022161458JH</CPE2>
  <CPE3>PT0002000022161471ZP</CPE3>
  <CPE4>PT0002000022161505SL</CPE4>
</DISTRITO>

there wouldn't be any problem. I have been searching a lot but I can't find a solution. Could you help me find another way to parse this file? I can't get it to work right.

Thank you guys!

1 Answer

You have two problems. First, you overwrite values whenever a key is repeated in the inner loop. Second, you append a reference to the same dict object on every iteration, so any change you make shows up in every entry of the list; that is why you only see the last value each time.
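Both effects can be reproduced without any XML. The following is a minimal sketch in which the `pairs` list is a hypothetical stand-in for the `(child.tag, child.pyval)` values from the question:

```python
# Hypothetical stand-in for the (tag, value) pairs yielded by the XML children.
pairs = [
    ("NOME_DISTRITO", "BRAGANCA"),
    ("CPE", "PT0002000022161425NP"),
    ("CPE", "PT0002000022161505SL"),
]

shared = {}
aliased = []
for tag, value in pairs:
    shared[tag] = value     # a repeated "CPE" key overwrites the earlier value
    aliased.append(shared)  # appends a reference to the same dict, not a copy

# Every entry is the very same object, so every entry shows the final state.
print(all(entry is shared for entry in aliased))
print(aliased[0])
```

Because all list entries point at one dict, the DataFrame built from such a list repeats the last CPE on every row.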

You need to create the dict inside the inner loop so that you add a new object each time:

for child in root.getchildren():
    data.append({child.tag: child.pyval})

The above will give you all the values. I am not sure what exact format you want, as I don't quite follow what your loops are supposed to be doing. This may be close to what you want:

from lxml import objectify
import pandas as pd

x = """<DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>
  <CPE>PT0002000022161425NP</CPE>
  <CPE>PT0002000022161458JH</CPE>
  <CPE>PT0002000022161471ZP</CPE>
  <CPE>PT0002000022161505SL</CPE>
</DISTRITO>"""

# objectify exposes each child's Python value through .pyval
root = objectify.fromstring(x)

# One row per child element: (tag, value)
df = pd.DataFrame((child.tag, child.pyval) for child in root.getchildren())

print(df)

Which would give you:

               0                     1
0  NOME_DISTRITO              BRAGANCA
1            CPE  PT0002000022161425NP
2            CPE  PT0002000022161458JH
3            CPE  PT0002000022161471ZP
4            CPE  PT0002000022161505SL
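
If the goal is one row per CPE with the district name repeated alongside it (an assumption about the desired layout, since the question does not say), a sketch along these lines may help. It swaps in the standard library's `xml.etree.ElementTree` instead of `lxml.objectify` so that it runs without extra dependencies; the `rows` name is only illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>
  <CPE>PT0002000022161425NP</CPE>
  <CPE>PT0002000022161458JH</CPE>
  <CPE>PT0002000022161471ZP</CPE>
  <CPE>PT0002000022161505SL</CPE>
</DISTRITO>"""

root = ET.fromstring(xml_text)

# The district name appears once; read it a single time.
district = root.findtext("NOME_DISTRITO").strip()

# Build a fresh dict for every <CPE> element so that no two rows share an object.
rows = [
    {"NOME_DISTRITO": district, "CPE": cpe.text.strip()}
    for cpe in root.findall("CPE")
]

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give a frame with the columns `NOME_DISTRITO` and `CPE` and one row per CPE.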
1 Comment

Thanks for your response. I just want to understand a bit about etree, objectify, and generally how to parse XML into a data frame with Python. I have a huge XML file and I want to convert it to a tabular file so it is easier to read. Thanks!! You helped me a lot. :)
