I have a heavily nested XML file that I am trying to convert to CSV, and I would like to extract every element it contains.
I used "create a pandas dataframe from a nested xml file" as the basis for my solution. My problem is that the resulting dataframe contains a single row repeated over and over, and that row holds the last of the many URLs in the XML file.
My XML (trimmed to show a single expanded element):
<?xml version='1.0' encoding='UTF-8'?>
<REPORT>
  <ELEMENT1>
    ...
  </ELEMENT1>
  <ELEMENT2>
    ...
  </ELEMENT2>
  <ELEMENT3>
    ...
  </ELEMENT3>
  <ELEMENT4>
    ...
  </ELEMENT4>
  <RESULTS>
    <APPLICATION>
      <ID>123456</ID>
      <NAME>https://www.example.com/subsite/</NAME>
      <LIST>
        <LEVEL>
          <UNIQUE_ID>123abc</UNIQUE_ID>
          <ID>6666666</ID>
          <URL>https://www.example.com/subsite/</URL>
          <STATUS>ACTIVE</STATUS>
          <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
          <TIMES_DETECTED>9</TIMES_DETECTED>
          <PAYLOADS>
            <PAYLOAD>
              <NUM>1</NUM>
              <REQUEST>
                <METHOD>POST</METHOD>
                <URL>https://www.example.com/subsite/</URL>
                <HEADERS>
                  <HEADER>
                    <key>Host</key>
                    <value>randomdata</value>
                  </HEADER>
                  <HEADER>
                    <key>Content-Type</key>
                    <value>randomdata</value>
                  </HEADER>
                </HEADERS>
              </REQUEST>
              <RESPONSE>
                <CONTENTS base64="true">randomdata</CONTENTS>
              </RESPONSE>
            </PAYLOAD>
          </PAYLOADS>
          <IGNORED>false</IGNORED>
        </LEVEL>
      </LIST>
      <BURP_ISSUES_LIST/>
      <BUGCROWD_SUBMISSIONS_LIST/>
      <SENSITIVE_CONTENT_LIST/>
      <INFORMATION_GATHERED_LIST>
        <INFORMATION_GATHERED>
          <UNIQUE_ID>835ogk</UNIQUE_ID>
          <ID>9459856</ID>
          <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
          <DATA base64="true">randomdata</DATA>
        </INFORMATION_GATHERED>
        <INFORMATION_GATHERED>
          <UNIQUE_ID>475hgv</UNIQUE_ID>
          <ID>4569852</ID>
          <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
          <DATA base64="true">randomdata</DATA>
        </INFORMATION_GATHERED>
        <INFORMATION_GATHERED>
          <UNIQUE_ID>849ikg</UNIQUE_ID>
          <ID>326614</ID>
          <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
          <DATA base64="true">randomdata</DATA>
        </INFORMATION_GATHERED>
      </INFORMATION_GATHERED_LIST>
    </APPLICATION>
    <APPLICATION2>
    </APPLICATION2>
  </RESULTS>
  <ELEMENT5>
    ...
  </ELEMENT5>
</REPORT>
My code is below:
from lxml import etree as et
import pandas as pd

# Raw strings so that the backslash in '\f' is not read as a form-feed escape
file_input = r'D:\file.xml'
file_output = r'D:\file.csv'

trees = et.parse(file_input)
d = []
for reportdata in trees.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL'):
    inner = {}
    for elem in reportdata.xpath('//*'):
        try:
            if len(elem.text.strip()) > 0:
                inner[elem.tag] = elem.text
        except:
            pass
    d.append(inner)

df = pd.DataFrame(d)
df.to_csv(file_output, sep="|", index = None)
With this code, the dataframe contains 2,000 identical rows.
I have tried substituting the following line:
for reportdata in trees.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL'):
with:
for reportdata in trees.xpath('//REPORT'):
but then I get only a single row in my dataframe. I have also replaced it with:
for reportdata in trees.xpath('//*'):
but in that case the script runs for around 5 hours and again returns around 100,000 copies of the same row.
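One likely explanation for all three results can be sketched on a tiny stand-in document. In XPath, an expression that starts with `//` is absolute, so `reportdata.xpath('//*')` selects every element in the whole document no matter which node it is called on. Because `inner` is keyed by tag name, later elements overwrite earlier ones, and every row ends up with the last values in the file. The sample data and the `run` helper below are hypothetical, and the relative `.//*` variant at the end is a suggested change, not part of the original code.

```python
from lxml import etree as et

# Tiny stand-in document (hypothetical data, not the real report)
xml = b"""<REPORT><RESULTS>
<APPLICATION><LIST>
<LEVEL><URL>https://a.example/</URL><STATUS>ACTIVE</STATUS></LEVEL>
<LEVEL><URL>https://b.example/</URL><STATUS>FIXED</STATUS></LEVEL>
</LIST></APPLICATION>
</RESULTS></REPORT>"""
root = et.fromstring(xml)


def run(outer_xpath, inner_xpath):
    """Same loop structure as the question, with both XPath expressions as parameters."""
    rows = []
    for node in root.xpath(outer_xpath):
        inner = {}
        for elem in node.xpath(inner_xpath):
            if elem.text and elem.text.strip():
                inner[elem.tag] = elem.text  # a repeated tag overwrites the earlier value
        rows.append(inner)
    return rows


# The three variants from the question: the inner '//*' always scans the whole document
rows_url = run('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL', '//*')
rows_report = run('//REPORT', '//*')
rows_all = run('//*', '//*')

# Suggested change: loop over LEVEL and use a relative path ('.' = the current node)
rows_fixed = run('//REPORT/RESULTS/APPLICATION/LIST/LEVEL', './/*')

print(len(rows_url), len(rows_report), len(rows_all))
print(rows_fixed)
```

On this sample, the first three variants produce one row per outer match, and all of those rows are identical, while the relative version yields a distinct row per `LEVEL`. The same reasoning would explain the 5-hour run: with `//*` as both the outer and the inner expression, every element triggers a scan of the whole document, so the work grows with the square of the element count.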
Ideally, I would like to extract every data element from the XML file. How can I fix my code? Thank you.
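To make "every data element" concrete, here is a minimal sketch of one possible target layout, assuming one CSV row per `LEVEL` element. Tags such as `ID` and `URL` occur at several depths and `HEADER` repeats, so the column names are built from the path relative to `LEVEL`, with an index added for repeated siblings. The `flatten` helper, the `APPLICATION/...` column names, and the second `LEVEL` in the sample are hypothetical.

```python
from collections import Counter

from lxml import etree as et


def flatten(elem, prefix=""):
    """Return {relative/path: text} for every descendant of elem with non-blank text."""
    row = {}
    children = [c for c in elem if isinstance(c.tag, str)]  # skip comments and PIs
    totals = Counter(c.tag for c in children)
    seen = Counter()
    for child in children:
        seen[child.tag] += 1
        name = child.tag
        if totals[name] > 1:  # repeated sibling: add a 1-based index
            name = f"{name}[{seen[child.tag]}]"
        path = f"{prefix}/{name}" if prefix else name
        text = (child.text or "").strip()
        if text:
            row[path] = text
        row.update(flatten(child, path))
    return row


# Trimmed from the XML in the question; the second LEVEL is made up for illustration
xml = b"""<REPORT><RESULTS><APPLICATION>
<ID>123456</ID><NAME>https://www.example.com/subsite/</NAME>
<LIST>
<LEVEL><ID>6666666</ID><URL>https://www.example.com/subsite/</URL>
<PAYLOADS><PAYLOAD><REQUEST><METHOD>POST</METHOD>
<HEADERS>
<HEADER><key>Host</key><value>randomdata</value></HEADER>
<HEADER><key>Content-Type</key><value>randomdata</value></HEADER>
</HEADERS></REQUEST></PAYLOAD></PAYLOADS></LEVEL>
<LEVEL><ID>7777777</ID><URL>https://www.example.com/other/</URL></LEVEL>
</LIST></APPLICATION></RESULTS></REPORT>"""
root = et.fromstring(xml)

rows = []
for app in root.xpath('/REPORT/RESULTS/APPLICATION'):
    for level in app.xpath('./LIST/LEVEL'):  # relative to the current APPLICATION
        row = {
            "APPLICATION/ID": app.findtext("ID"),
            "APPLICATION/NAME": app.findtext("NAME"),
        }
        row.update(flatten(level))
        rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` and written with `to_csv` exactly as in the code above. pandas takes the union of all keys as columns and leaves missing values as NaN, so levels with fewer nested elements still fit in the same table.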