Create a Pandas DataFrame from a Subset of Nested XML Data?

Question

I have a whole bunch of large XML files that contain thousands of records that look like this:

XML Sample:

<Report:Report xmlns:Report ="http://someplace.com">
 <Id root="1234567890"/>
 <Records value="10"/>
 <ReportDate>2020-06-20</ReportDate>
 <Record>
  <Id root="001"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="12345"/>
     <Status code="1"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="A"/>
     <Secondary code="B"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="789AB"/>
     <Status code="2"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Secondary code="D"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="003"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="CDEFG"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="E"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
</Report:Report>

The originals have hundreds of child elements at various depths under each record element - so I've simplified it a little here whilst still preserving the core problem. My aim is to read the XML to a pandas dataframe so that I have something like this to work with:

Record Id | Number | Status | Primary | Secondary
-------------------------------------------------
001       | 12345  | 1      | A       | B
-------------------------------------------------
002       | 789AB  | 2      |         | D
-------------------------------------------------
003       | CDEFG  |        | E       |

As you can see, most of the data is five levels deep and not every element is present in every record - but I need to be able to handle the missing elements as shown in the table above.

I have started to play around with lxml but I have literally no idea what I am doing! I know that I can (very clumsily) extract attributes or text by iterating over the tree as follows:

from lxml import etree as et
xtree = et.parse('file1.xml')
xroot = xtree.getroot()

for n in xroot.iter('Primary'):
    print(n.attrib['code'])

But... after this I've run out of steam. I'm not sure how to proceed and construct code so that I can be sure that any translated data actually corresponds with the record it originates with.

Can any kind soul offer any guidance to lead me out of the dark valley of XML and towards the sunlit uphills of pandas?

Any help would be extremely appreciated.

balderman · Accepted Answer · 2020-08-12 12:46:13Z

1

See below

import xml.etree.ElementTree as ET

xml = '''<Report:Report xmlns:Report ="http://someplace.com">
 <Id root="1234567890"/>
 <Records value="10"/>
 <ReportDate>2020-06-20</ReportDate>
 <Record>
  <Id root="001"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="12345"/>
     <Status code="1"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="A"/>
     <Secondary code="B"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="789AB"/>
     <Status code="2"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Secondary code="D"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="003"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="CDEFG"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="E"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
</Report:Report>'''

data = []
root = ET.fromstring(xml)
records = root.findall('.//Record')
for record in records:
  entry = {'id': record.find('./Id').attrib['root']}
  entry['Number'] = record.find('./Site/SiteData/SiteDataInfo1/Name').attrib['code']
  status = record.find('./Site/SiteData/SiteDataInfo1/Status')
  entry['Status'] = status.attrib['code'] if status is not None else ''
  primary = record.find('.//Primary')
  entry['Primary'] = primary.attrib['code'] if primary is not None else ''
  secondary = record.find('.//Secondary')
  entry['Secondary'] = secondary.attrib['code'] if secondary is not None else ''
  data.append(entry)

for entry in data:
  print(entry)

answered Aug 12, 2020 at 12:46

balderman

24k8 gold badges39 silver badges60 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

James Graham Over a year ago

You sir. are a wonderful human being. That worked like a dream! Thank you very much.

balderman Over a year ago

I am glad it works for you. I suggest that you will take some time in order to read the code and understand it.

Parfait Over a year ago

Consider a slight re-factor of code for clarity: rextester.com/LIUVZB29303

Rob Raymond · Accepted Answer · 2020-08-12 13:33:40Z

My normal approach is use xmlplain and then json_normalize

so.xml is just your sample xml saved to a file.

import pandas as pd
import xmlplain
from collections import OrderedDict 

with open("so.xml") as f: js = xmlplain.xml_to_obj(f, strip_space=True, fold_dict=True)
df = pd.json_normalize(js['Report:Report'])
# work out columns that are info that do not form records
rootcols = [k for r in js['Report:Report'] for k in r.keys() for v in [r[k]] if not isinstance(v, OrderedDict)]
rootcols = [c for c in df.columns if c.split(".")[0] in rootcols]
# fill the columns that are "info" columns"
df.loc[:,rootcols] = df.loc[:,rootcols].fillna(method="ffill").fillna(method="bfill")
# drop rows that don't hold records
df = (df.dropna(how="all", subset=[c for c in df.columns if c not in rootcols])
 .reset_index(drop=True)
 # cleanup column names
 .rename(columns={c:c.replace("Record.Site.SiteData.","") for c in df.columns})
)

print(df.to_string(index=False))

output

        @xmlns:Report    Id.@root Records.@value  ReportDate Record.Id.@root SiteDataInfo1.Name.@code SiteDataInfo1.Status.@code SiteDataInfo2.Primary.@code SiteDataInfo2.Secondary.@code
 http://someplace.com  1234567890             10  2020-06-20             001                    12345                          1                           A                             B
 http://someplace.com  1234567890             10  2020-06-20             002                    789AB                          2                         NaN                             D
 http://someplace.com  1234567890             10  2020-06-20             003                    CDEFG                        NaN                           E                           NaN

Collectives™ on Stack Overflow

Create a Pandas DataFrame from a Subset of Nested XML Data?

2 Answers 2

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related