I have a whole bunch of large XML files that contain thousands of records that look like this:

XML Sample:

<Report:Report xmlns:Report ="http://someplace.com">
 <Id root="1234567890"/>
 <Records value="10"/>
 <ReportDate>2020-06-20</ReportDate>
 <Record>
  <Id root="001"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="12345"/>
     <Status code="1"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="A"/>
     <Secondary code="B"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="789AB"/>
     <Status code="2"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Secondary code="D"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="003"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="CDEFG"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="E"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
</Report:Report>

The originals have hundreds of child elements at various depths under each record element - so I've simplified it a little here whilst still preserving the core problem. My aim is to read the XML to a pandas dataframe so that I have something like this to work with:

Record Id | Number | Status | Primary | Secondary
-------------------------------------------------
001       | 12345  | 1      | A       | B
-------------------------------------------------
002       | 789AB  | 2      |         | D
-------------------------------------------------
003       | CDEFG  |        | E       | 

As you can see, most of the data is five levels deep and not every element is present in every record - but I need to be able to handle the missing elements as shown in the table above.

I have started to play around with lxml but I have literally no idea what I am doing! I know that I can (very clumsily) extract attributes or text by iterating over the tree as follows:

from lxml import etree as et
xtree = et.parse('file1.xml')
xroot = xtree.getroot()

for n in xroot.iter('Primary'):
    print(n.attrib['code'])

But after this I've run out of steam. I'm not sure how to proceed and structure the code so that each extracted value is guaranteed to stay matched to the record it came from.

Can any kind soul offer any guidance to lead me out of the dark valley of XML and towards the sunlit uphills of pandas?

Any help would be extremely appreciated.

2 Answers

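See below. The idea is to loop over each Record element and look up every field relative to that record, falling back to an empty string when an optional element is missing.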

import xml.etree.ElementTree as ET

xml = '''<Report:Report xmlns:Report ="http://someplace.com">
 <Id root="1234567890"/>
 <Records value="10"/>
 <ReportDate>2020-06-20</ReportDate>
 <Record>
  <Id root="001"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="12345"/>
     <Status code="1"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="A"/>
     <Secondary code="B"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="789AB"/>
     <Status code="2"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Secondary code="D"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="003"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="CDEFG"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="E"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
</Report:Report>'''

data = []
root = ET.fromstring(xml)
# Record elements carry no namespace prefix, so a plain tag name matches them
records = root.findall('.//Record')
for record in records:
    # Every lookup is relative to the current record, so values stay tied to it
    entry = {'id': record.find('./Id').attrib['root']}
    entry['Number'] = record.find('./Site/SiteData/SiteDataInfo1/Name').attrib['code']
    # Optional elements: find() returns None when absent, so fall back to ''
    status = record.find('./Site/SiteData/SiteDataInfo1/Status')
    entry['Status'] = status.attrib['code'] if status is not None else ''
    primary = record.find('.//Primary')
    entry['Primary'] = primary.attrib['code'] if primary is not None else ''
    secondary = record.find('.//Secondary')
    entry['Secondary'] = secondary.attrib['code'] if secondary is not None else ''
    data.append(entry)

# One dict per record
for entry in data:
    print(entry)
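
Since the real files have hundreds of child elements, it may be easier to describe the columns in one mapping instead of writing a find() call per field. The following is only a sketch under that assumption: FIELDS, record_to_row and records_to_rows are names made up for this illustration, and the paths and column names simply follow the sample XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: column name -> (path relative to a Record, attribute to read).
# Paths follow the sample XML; extend the mapping for the real files.
FIELDS = {
    "Record Id": ("Id", "root"),
    "Number": ("Site/SiteData/SiteDataInfo1/Name", "code"),
    "Status": ("Site/SiteData/SiteDataInfo1/Status", "code"),
    "Primary": ("Site/SiteData/SiteDataInfo2/Primary", "code"),
    "Secondary": ("Site/SiteData/SiteDataInfo2/Secondary", "code"),
}


def record_to_row(record, fields=FIELDS, missing=""):
    """Build one row dict from a single Record element."""
    row = {}
    for column, (path, attribute) in fields.items():
        node = record.find(path)
        # A missing element or attribute yields the placeholder instead of raising
        row[column] = node.get(attribute, missing) if node is not None else missing
    return row


def records_to_rows(root, fields=FIELDS):
    """Return one row per Record, so every value stays tied to its own record."""
    return [record_to_row(record, fields) for record in root.iter("Record")]


# Shortened copy of the sample, used only to exercise the sketch
sample = """<Report:Report xmlns:Report="http://someplace.com">
 <Record>
  <Id root="001"/>
  <Site><SiteData>
   <SiteDataInfo1><Name code="12345"/><Status code="1"/></SiteDataInfo1>
   <SiteDataInfo2><Primary code="A"/><Secondary code="B"/></SiteDataInfo2>
  </SiteData></Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site><SiteData>
   <SiteDataInfo1><Name code="789AB"/><Status code="2"/></SiteDataInfo1>
   <SiteDataInfo2><Secondary code="D"/></SiteDataInfo2>
  </SiteData></Site>
 </Record>
</Report:Report>"""

rows = records_to_rows(ET.fromstring(sample))
for row in rows:
    print(row)
```

With pandas installed, pd.DataFrame(rows) should turn this list of dicts into a dataframe with one column per key, and the empty strings mark the elements that were missing from a record.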
3 Comments

You, sir, are a wonderful human being. That worked like a dream! Thank you very much.
I am glad it works for you. I suggest you take some time to read the code and understand it.
Consider a slight re-factor of code for clarity: rextester.com/LIUVZB29303
My normal approach is to use xmlplain and then json_normalize.

so.xml is just your sample xml saved to a file.

import pandas as pd
import xmlplain
from collections import OrderedDict 

with open("so.xml") as f:
    js = xmlplain.xml_to_obj(f, strip_space=True, fold_dict=True)
df = pd.json_normalize(js['Report:Report'])
# work out columns that are info that do not form records
rootcols = [k for r in js['Report:Report'] for k in r.keys() for v in [r[k]] if not isinstance(v, OrderedDict)]
rootcols = [c for c in df.columns if c.split(".")[0] in rootcols]
# fill the columns that are "info" columns
# (ffill()/bfill() replace fillna(method=...), which is deprecated in recent pandas)
df.loc[:,rootcols] = df.loc[:,rootcols].ffill().bfill()
# drop rows that don't hold records
df = (df.dropna(how="all", subset=[c for c in df.columns if c not in rootcols])
 .reset_index(drop=True)
 # cleanup column names
 .rename(columns={c:c.replace("Record.Site.SiteData.","") for c in df.columns})
)

print(df.to_string(index=False))

output

        @xmlns:Report    Id.@root Records.@value  ReportDate Record.Id.@root SiteDataInfo1.Name.@code SiteDataInfo1.Status.@code SiteDataInfo2.Primary.@code SiteDataInfo2.Secondary.@code
 http://someplace.com  1234567890             10  2020-06-20             001                    12345                          1                           A                             B
 http://someplace.com  1234567890             10  2020-06-20             002                    789AB                          2                         NaN                             D
 http://someplace.com  1234567890             10  2020-06-20             003                    CDEFG                        NaN                           E                           NaN
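
The dotted column names in that output come from flattening nested dicts into one level. As a rough illustration of the idea (not pandas' actual implementation), here is a stdlib-only sketch. The flatten helper is hypothetical, and the nested dict is a hand-written stand-in for one parsed record, assuming attributes appear under "@"-prefixed keys as the output above suggests.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into child dicts, carrying the path built so far
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hand-written stand-in for one parsed record (assumed shape, not real xmlplain output)
record = {
    "Record": {
        "Id": {"@root": "001"},
        "Site": {"SiteData": {"SiteDataInfo1": {"Name": {"@code": "12345"}}}},
    }
}

flat = flatten(record)
# Same cleanup as the rename step: drop the common path prefix from the keys
short = {k.replace("Record.Site.SiteData.", ""): v for k, v in flat.items()}
print(flat)
print(short)
```

An element missing from a record simply produces no key for that record, which is why the corresponding cell ends up as NaN once the rows are combined.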
