I have a whole bunch of large XML files that contain thousands of records that look like this:
XML Sample:
<Report:Report xmlns:Report="http://someplace.com">
  <Id root="1234567890"/>
  <Records value="10"/>
  <ReportDate>2020-06-20</ReportDate>
  <Record>
    <Id root="001"/>
    <Site>
      <SiteData>
        <SiteDataInfo1>
          <Name code="12345"/>
          <Status code="1"/>
        </SiteDataInfo1>
        <SiteDataInfo2>
          <Primary code="A"/>
          <Secondary code="B"/>
        </SiteDataInfo2>
      </SiteData>
    </Site>
  </Record>
  <Record>
    <Id root="002"/>
    <Site>
      <SiteData>
        <SiteDataInfo1>
          <Name code="789AB"/>
          <Status code="2"/>
        </SiteDataInfo1>
        <SiteDataInfo2>
          <Secondary code="D"/>
        </SiteDataInfo2>
      </SiteData>
    </Site>
  </Record>
  <Record>
    <Id root="003"/>
    <Site>
      <SiteData>
        <SiteDataInfo1>
          <Name code="CDEFG"/>
        </SiteDataInfo1>
        <SiteDataInfo2>
          <Primary code="E"/>
        </SiteDataInfo2>
      </SiteData>
    </Site>
  </Record>
</Report:Report>
The originals have hundreds of child elements at various depths under each record element, so I've simplified it a little here whilst still preserving the core problem. My aim is to read the XML into a pandas DataFrame so that I have something like this to work with:
Record Id | Number | Status | Primary | Secondary
-------------------------------------------------
001       | 12345  | 1      | A       | B
-------------------------------------------------
002       | 789AB  | 2      |         | D
-------------------------------------------------
003       | CDEFG  |        | E       |
As you can see, most of the data is five levels deep and not every element is present in every record - but I need to be able to handle the missing elements as shown in the table above.
I have started to play around with lxml but I have literally no idea what I am doing! I know that I can (very clumsily) extract attributes or text by iterating over the tree as follows:
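from lxml import etree as et
# Parse the file and get the root <Report:Report> element
xtree = et.parse('file1.xml')
xroot = xtree.getroot()
# Print the code attribute of every <Primary> element in the whole document
for n in xroot.iter('Primary'):
    print(n.attrib['code'])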
But after this I've run out of steam. I'm not sure how to structure the code so that every extracted value stays attached to the record it came from.
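To make the problem concrete, here is a minimal sketch of the kind of thing I think is needed, not a tested solution for the real files. It loops over each Record and searches only inside that record, so a value can't end up in another record's row, and it stores None when an element is absent. It uses the standard library's xml.etree.ElementTree on a trimmed inline copy of the sample so that it runs on its own; as far as I can tell lxml's etree offers the same find(), iter() and get() calls. The FIELDS mapping and the names record_to_row and extract_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the sample; a real file would be read with
# ET.parse('file1.xml').getroot() instead.
SAMPLE = """<Report:Report xmlns:Report="http://someplace.com">
  <Id root="1234567890"/>
  <Record>
    <Id root="001"/>
    <Site><SiteData>
      <SiteDataInfo1><Name code="12345"/><Status code="1"/></SiteDataInfo1>
      <SiteDataInfo2><Primary code="A"/><Secondary code="B"/></SiteDataInfo2>
    </SiteData></Site>
  </Record>
  <Record>
    <Id root="002"/>
    <Site><SiteData>
      <SiteDataInfo1><Name code="789AB"/><Status code="2"/></SiteDataInfo1>
      <SiteDataInfo2><Secondary code="D"/></SiteDataInfo2>
    </SiteData></Site>
  </Record>
  <Record>
    <Id root="003"/>
    <Site><SiteData>
      <SiteDataInfo1><Name code="CDEFG"/></SiteDataInfo1>
      <SiteDataInfo2><Primary code="E"/></SiteDataInfo2>
    </SiteData></Site>
  </Record>
</Report:Report>"""

# Hypothetical mapping (an assumption, not part of the real files):
# output column -> (path relative to one Record, attribute to read).
# "Id" matches only the Record's direct child; ".//X" searches at any depth.
FIELDS = {
    "Record Id": ("Id", "root"),
    "Number": (".//Name", "code"),   # the Number column comes from Name's code
    "Status": (".//Status", "code"),
    "Primary": (".//Primary", "code"),
    "Secondary": (".//Secondary", "code"),
}


def record_to_row(record):
    """Build one dict per Record, using None where an element is missing."""
    row = {}
    for column, (path, attr) in FIELDS.items():
        node = record.find(path)  # find() returns None when nothing matches
        row[column] = node.get(attr) if node is not None else None
    return row


def extract_rows(root):
    """Collect one row per Record element found under the root."""
    return [record_to_row(record) for record in root.iter("Record")]


rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

If that is on the right track, I'd expect pandas.DataFrame(rows) to turn the list of dicts into one row per record, with the missing values left empty. I don't know how well this would scale to hundreds of child elements per record or to the size of the real files.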
Can any kind soul offer any guidance to lead me out of the dark valley of XML and towards the sunlit uplands of pandas?
Any help would be extremely appreciated.