Converting complex XML file to Pandas dataframe/CSV - Python

Question

I'm currently in the middle of converting a complex XML file to csv or pandas df. I have zero experience with xml data format and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?

There are lots of elements in the data that I do not need so I won't include those here.

For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.

<RefData>
  <Attributes>
    <Id>1011</Id>
    <FullName>xxxx</FullName>
    <ShortName>xx</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>000</PolicyID>
  <TradeDetails>
    <UniqueTradeId>000</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/2/2019</Date>
    </TradeDetails>
</RefData>
<RefData>
  <Attributes>
    <Id>1012</Id>
    <FullName>xxx2</FullName>
    <ShortName>x2</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>002</PolicyID>
  <TradeDetails>
    <UniqueTradeId>0022</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/3/2019</Date>
    </TradeDetails>
</RefData>

I would be needing everything in the tag.

Ideally I want the headers and output to look like this:

I would sincerely appreciate any help I can get on this. Thanks a mil.

Hi thanks for sharing however I tried this 2 days ago but it didn't work for me. the structuring of the xml file used in the post is quite different from mine. — Roseline Adediran
– Roseline Adediran, Commented Mar 26, 2020 at 11:49

Valdi_Bo · Accepted Answer · 2020-03-26 12:38:57Z

One correction concerning your input XML file: It has to contain a single main element (of any name) and within it your RefData elements.

So the input file actually contains:

<Main>
  <RefData>
    ...
  </RefData>
  <RefData>
    ...
  </RefData>
</Main>

To process the input XML file, you can use lxml package, so to import it start from:

from lxml import etree as et

Then I noticed that you actually don't need the whole parsed XML tree, so the usually applied scheme is to:

read the content of each element as soon as it has been parsed,
save the content (text) of any child elements in any intermediate data structure (I chose a list of dictionaries),
drop the source XML element (not needed any more),
after the reading loop, create the result DataFrame from the above intermediate data structure.

So my code looks like below:

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
    rows.append({'id':   elem.findtext('Attributes/Id'),
        'fullname':      elem.findtext('Attributes/FullName'),
        'shortname':     elem.findtext('Attributes/ShortName'),
        'country':       elem.findtext('Attributes/Country'),
        'currency':      elem.findtext('Attributes/Currency'),
        'Policy ID':     elem.findtext('PolicyID'),
        'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
        'Booking':       elem.findtext('TradeDetails/Booking'),
        'Date':          elem.findtext('TradeDetails/Date')
    })
    elem.clear()
    elem.getparent().remove(elem)
df = pd.DataFrame(rows)

To fully comprehend details, search the Web for description of lxml and each method used.

For your sample data the result is:

     id fullname shortname country currency Policy ID UniqueTradeId Booking      Date
0  1011     xxxx        xx      UK      GBP       000           000      UK 12/2/2019 
1  1012     xxx2        x2      UK      GBP       002          0022      UK 12/3/2019

Probably the last step to perform is to save the above DataFrame in a CSV file, but I suppose you know how to do it.

Jack Fleeting · Accepted Answer · 2020-03-26 12:58:07Z

1

Another way to do it, using lxml and xpath:

   from lxml import etree
   dat = """[your FIXED xml]"""
   doc = etree.fromstring(dat)
   columns = []
   rows = []
   to_delete = ["TradeDetails",'Attributes']
   body = doc.xpath('.//RefData')
   for el in body[0].xpath('.//*'):
      columns.append(el.tag)

   for b in body:    
        items = b.xpath('.//*')
        row = []
        for item in items:
           if item.tag not in to_delete:
               row.append(item.text)
        rows.append(row)
   for col in to_delete:
      if col in columns:
         columns.remove(col)

    pd.DataFrame(rows,columns=columns)

Output is the dataframe indicated in your question.

edited Mar 26, 2020 at 12:58

answered Mar 26, 2020 at 12:45

Jack Fleeting

25k6 gold badges27 silver badges49 bronze badges

Collectives™ on Stack Overflow

Converting complex XML file to Pandas dataframe/CSV - Python

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related