2

I'm currently in the middle of converting a complex XML file to csv or pandas df. I have zero experience with xml data format and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?

There are lots of elements in the data that I do not need so I won't include those here.

For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.

<RefData>
  <Attributes>
    <Id>1011</Id>
    <FullName>xxxx</FullName>
    <ShortName>xx</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>000</PolicyID>
  <TradeDetails>
    <UniqueTradeId>000</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/2/2019</Date>
    </TradeDetails>
</RefData>
<RefData>
  <Attributes>
    <Id>1012</Id>
    <FullName>xxx2</FullName>
    <ShortName>x2</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>002</PolicyID>
  <TradeDetails>
    <UniqueTradeId>0022</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/3/2019</Date>
    </TradeDetails>
</RefData>

I would be needing everything in the tag.

Ideally I want the headers and output to look like this:

enter image description here

I would sincerely appreciate any help I can get on this. Thanks a mil.

2
  • medium.com/@robertopreste/… Commented Mar 26, 2020 at 11:38
  • Hi thanks for sharing however I tried this 2 days ago but it didn't work for me. the structuring of the xml file used in the post is quite different from mine. Commented Mar 26, 2020 at 11:49

2 Answers 2

4

One correction concerning your input XML file: It has to contain a single main element (of any name) and within it your RefData elements.

So the input file actually contains:

<Main>
  <RefData>
    ...
  </RefData>
  <RefData>
    ...
  </RefData>
</Main>

To process the input XML file, you can use lxml package, so to import it start from:

from lxml import etree as et

Then I noticed that you actually don't need the whole parsed XML tree, so the usually applied scheme is to:

  • read the content of each element as soon as it has been parsed,
  • save the content (text) of any child elements in any intermediate data structure (I chose a list of dictionaries),
  • drop the source XML element (not needed any more),
  • after the reading loop, create the result DataFrame from the above intermediate data structure.

So my code looks like below:

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
    rows.append({'id':   elem.findtext('Attributes/Id'),
        'fullname':      elem.findtext('Attributes/FullName'),
        'shortname':     elem.findtext('Attributes/ShortName'),
        'country':       elem.findtext('Attributes/Country'),
        'currency':      elem.findtext('Attributes/Currency'),
        'Policy ID':     elem.findtext('PolicyID'),
        'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
        'Booking':       elem.findtext('TradeDetails/Booking'),
        'Date':          elem.findtext('TradeDetails/Date')
    })
    elem.clear()
    elem.getparent().remove(elem)
df = pd.DataFrame(rows)

To fully comprehend details, search the Web for description of lxml and each method used.

For your sample data the result is:

     id fullname shortname country currency Policy ID UniqueTradeId Booking      Date
0  1011     xxxx        xx      UK      GBP       000           000      UK 12/2/2019 
1  1012     xxx2        x2      UK      GBP       002          0022      UK 12/3/2019

Probably the last step to perform is to save the above DataFrame in a CSV file, but I suppose you know how to do it.

Sign up to request clarification or add additional context in comments.

Comments

1

Another way to do it, using lxml and xpath:

   from lxml import etree
   dat = """[your FIXED xml]"""
   doc = etree.fromstring(dat)
   columns = []
   rows = []
   to_delete = ["TradeDetails",'Attributes']
   body = doc.xpath('.//RefData')
   for el in body[0].xpath('.//*'):
      columns.append(el.tag)

   for b in body:    
        items = b.xpath('.//*')
        row = []
        for item in items:
           if item.tag not in to_delete:
               row.append(item.text)
        rows.append(row)
   for col in to_delete:
      if col in columns:
         columns.remove(col)

    pd.DataFrame(rows,columns=columns)

Output is the dataframe indicated in your question.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.