Parsing XML to dataframe in python with same nodes

Question

I have this XML and I want to parse it into a pandas DataFrame:

<DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>   
  <CPE>PT0002000022161425NP</CPE>
  <CPE>PT0002000022161458JH</CPE>
  <CPE>PT0002000022161471ZP</CPE>
  <CPE>PT0002000022161505SL</CPE>
</DISTRITO>

and this is my Python code:

from lxml import objectify
from lxml import etree
import pandas as pd

path = '/TestFile.xml'
xml = objectify.parse(open(path))
root = xml.getroot()
data = []

for i in root:     
    el_data = {}
    for child in root.getchildren():        
        el_data[child.tag] = child.pyval
       # print el_data
        data.append(el_data)

df = pd.DataFrame(data)

The problem is that the result only contains the value of the last <CPE> node:

                    CPE NOME_DISTRITO
0  PT0002000022161505SL      BRAGANCA
1  PT0002000022161505SL      BRAGANCA
2  PT0002000022161505SL      BRAGANCA
3  PT0002000022161505SL      BRAGANCA
4  PT0002000022161505SL      BRAGANCA

I dug into my XML file a little and found that this happens when several nodes share the same name. For example, if my file looked like this:

<DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>   
  <CPE1>PT0002000022161425NP</CPE1>
  <CPE2>PT0002000022161458JH</CPE2>
  <CPE3>PT0002000022161471ZP</CPE3>
  <CPE4>PT0002000022161505SL</CPE4>
</DISTRITO>

there wouldn't be any problem. I have searched a lot but can't find a solution. Could you help me find another way to parse this file? I can't get it to work correctly.

Thank you guys!

Padraic Cunningham · Accepted Answer · 2016-07-13 20:41:12Z


You have two problems. First, the inner loop overwrites values whenever a key (tag) is repeated. Second, you append a reference to the same dict on every iteration, so any change you make is reflected in every entry, which is why you only see the last value each time.
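To make both effects visible without any XML library, here is a minimal plain-Python sketch. The `pairs` list is only an assumed stand-in for the (tag, value) pairs that the loop over `root.getchildren()` would see; the variable names are illustrative and not taken from the question.

```python
# Stand-in for the (tag, value) pairs produced by iterating the XML children.
pairs = [("NOME_DISTRITO", "BRAGANCA"),
         ("CPE", "PT0002000022161425NP"),
         ("CPE", "PT0002000022161458JH")]

# Buggy pattern: one dict shared by every append, repeated keys overwritten.
shared = {}
buggy = []
for tag, value in pairs:
    shared[tag] = value      # a second "CPE" replaces the first
    buggy.append(shared)     # appends a reference, not a copy

# Every entry is the same object, so all rows show the final state.
print(all(row is shared for row in buggy))  # True
print(buggy[0]["CPE"])                      # PT0002000022161458JH

# Fixed pattern: a fresh dict per element keeps every value.
fixed = [{tag: value} for tag, value in pairs]
print(fixed[1]["CPE"])                      # PT0002000022161425NP
```

Because `buggy` holds three references to one dict, a DataFrame built from it repeats the same row, which matches the output shown in the question.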

You would need to create the dict inside the inner loop so that you add a new object each time:

for child in root.getchildren():
    data.append({child.tag: child.pyval})

The above will give you all the values. I am not sure exactly what format you want, as I don't quite follow what your loops are supposed to do. This may be close to what you want:

x = """<DISTRITO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <NOME_DISTRITO>BRAGANCA</NOME_DISTRITO>
  <CPE>PT0002000022161425NP</CPE>
  <CPE>PT0002000022161458JH</CPE>
  <CPE>PT0002000022161471ZP</CPE>
  <CPE>PT0002000022161505SL</CPE>
</DISTRITO>"""

from lxml import objectify
import pandas as pd

# objectify exposes each child's text as a Python value via .pyval
root = objectify.fromstring(x)

# One row per child element: (tag, value)
df = pd.DataFrame((child.tag, child.pyval) for child in root.getchildren())

print(df)

Which would give you:

               0                     1
0  NOME_DISTRITO              BRAGANCA
1            CPE  PT0002000022161425NP
2            CPE  PT0002000022161458JH
3            CPE  PT0002000022161471ZP
4            CPE  PT0002000022161505SL

edited Jul 13, 2016 at 20:41

answered Jul 13, 2016 at 19:54


1 Comment

Juliana Rivera Over a year ago

Thanks for your response. I just want to understand a bit about etree, objectify and, in general, how to parse XML into a data frame with Python. I have a huge XML file and I want to convert it to a tabular file so that it is easier to read. Thanks!! You helped me a lot. :)
