I have a heavily nested XML file that I am trying to convert to CSV, and I would like to pull every possible element from it.

I am using the question "create a pandas dataframe from a nested xml file" as the basis for my solution. My problem is that the resulting dataframe contains the same row repeated over and over: the row holding the last URL (of many) in the XML file.

My XML (trimmed to show a single expanded element):

<?xml version='1.0' encoding='UTF-8'?>
<REPORT>
    <ELEMENT1>
    ...
    </ELEMENT1>
    <ELEMENT2>
    ...
    </ELEMENT2>
    <ELEMENT3>
    ...
    </ELEMENT3>
    <ELEMENT4>
    ...
    </ELEMENT4>
    <RESULTS>
        <APPLICATION>
            <ID>123456</ID>
            <NAME>https://www.example.com/subsite/</NAME>
            <LIST>
                <LEVEL>
                    <UNIQUE_ID>123abc</UNIQUE_ID>
                    <ID>6666666</ID>
                    <URL>https://www.example.com/subsite/</URL>
                    <STATUS>ACTIVE</STATUS>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <TIMES_DETECTED>9</TIMES_DETECTED>
                    <PAYLOADS>
                        <PAYLOAD>
                            <NUM>1</NUM>
                            <REQUEST>
                                <METHOD>POST</METHOD>
                                <URL>https://www.example.com/subsite/</URL>
                                <HEADERS>
                                    <HEADER>
                                        <key>Host</key>
                                        <value>randomdata</value>
                                    </HEADER>
                                    <HEADER>
                                        <key>Content-Type</key>
                                        <value>randomdata</value>
                                    </HEADER>
                                </HEADERS>
                            </REQUEST>
                            <RESPONSE>
                                <CONTENTS base64="true">randomdata</CONTENTS>
                            </RESPONSE>
                        </PAYLOAD>
                    </PAYLOADS>
                    <IGNORED>false</IGNORED>
                </LEVEL>
            </LIST>
            <BURP_ISSUES_LIST/>
            <BUGCROWD_SUBMISSIONS_LIST/>
            <SENSITIVE_CONTENT_LIST/>
            <INFORMATION_GATHERED_LIST>
                <INFORMATION_GATHERED>
                    <UNIQUE_ID>835ogk</UNIQUE_ID>
                    <ID>9459856</ID>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <DATA base64="true">randomdata</DATA>
                </INFORMATION_GATHERED>
                <INFORMATION_GATHERED>
                    <UNIQUE_ID>475hgv</UNIQUE_ID>
                    <ID>4569852</ID>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <DATA base64="true">randomdata</DATA>
                </INFORMATION_GATHERED>
                <INFORMATION_GATHERED>
                    <UNIQUE_ID>849ikg</UNIQUE_ID>
                    <ID>326614</ID>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <DATA base64="true">randomdata</DATA>
                </INFORMATION_GATHERED>
            </INFORMATION_GATHERED_LIST>
        </APPLICATION>
        <APPLICATION2>
        </APPLICATION2>
    </RESULTS>
    <ELEMENT5>
    ...
    </ELEMENT5>
</REPORT>

My code is below:

from lxml import etree as et
import pandas as pd

file_input = r'D:\file.xml'    # raw string, so '\f' is not read as a form-feed escape
file_output = r'D:\file.csv'
trees = et.parse(file_input)

d = []
for reportdata in trees.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL'):   
    inner = {}
    for elem in reportdata.xpath('//*'):       
        try:
            if len(elem.text.strip()) > 0:     
                inner[elem.tag] = elem.text
        except:
            pass
    d.append(inner)

df = pd.DataFrame(d)

df.to_csv(file_output, sep="|", index = None)

The result of this code is a dataframe containing 2000 identical rows.

I have tried substituting the following line:

for reportdata in trees.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL'): 

with:

for reportdata in trees.xpath('//REPORT'): 

However, that gives me only a single row in my dataframe. I have also tried replacing it with:

for reportdata in trees.xpath('//*'): 

However, the script then runs for around 5 hours and again returns around 100,000 copies of the same row.

Ideally I would like to pull all possible data elements from the XML file. How can I fix my code? Thank you.
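A likely cause of the repeated row is the inner `xpath('//*')` call. In lxml, a path that starts with `/` is evaluated from the document root no matter which element it is called on, so every pass of the outer loop walks the whole file. Because the dictionary is keyed by tag name, later elements overwrite earlier ones, and each pass ends with the same "last value wins" row. The sketch below shows the difference between the absolute `//*` and the relative `.//*` on a tiny stand-in document (the two-LEVEL sample and its URLs are hypothetical, only shaped like the report above).

```python
from lxml import etree as et

# Hypothetical stand-in data, shaped like the report in the question.
xml = b"""<REPORT><RESULTS><APPLICATION><LIST>
<LEVEL><ID>1</ID><URL>https://a.example/</URL></LEVEL>
<LEVEL><ID>2</ID><URL>https://b.example/</URL></LEVEL>
</LIST></APPLICATION></RESULTS></REPORT>"""
root = et.fromstring(xml)

levels = root.xpath('/REPORT/RESULTS/APPLICATION/LIST/LEVEL')
first = levels[0]

# '//*' is absolute: it starts at the document root, whichever element it is called on.
absolute_tags = [e.tag for e in first.xpath('//*')]

# './/*' is relative: it returns only the descendants of `first`.
relative_tags = [e.tag for e in first.xpath('.//*')]

print(len(absolute_tags))  # every element in the document
print(relative_tags)       # only the children of the first LEVEL
```

Note also that the outer loop in the question selects the `URL` elements themselves, which have no children, so a relative path would need to start from `LEVEL` (or `APPLICATION`) instead.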

  • It's going to be difficult to answer your question without knowing what file.xml looks like. Commented Mar 28, 2020 at 22:44
  • Upload the XML please Commented Mar 29, 2020 at 13:49
  • Thank you, I've added the XML. Commented Mar 29, 2020 at 15:41
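One possible way to get one row per `LEVEL` is sketched below, under some assumptions. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs; the relative paths work the same way in lxml. The helper names `flatten` and `levels_to_rows`, the dotted column names, and the `_2`, `_3` suffixes for repeated sibling tags are all illustrative choices, not something taken from the question, and the sample is a cut-down copy of the XML above.

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix=""):
    """Return {dotted.path: text} for every descendant of elem with non-blank text."""
    row = {}
    seen = {}
    for child in elem:
        # Number repeated sibling tags (HEADER, HEADER_2, ...) so they do not overwrite each other.
        seen[child.tag] = seen.get(child.tag, 0) + 1
        name = child.tag if seen[child.tag] == 1 else f"{child.tag}_{seen[child.tag]}"
        path = f"{prefix}{name}"
        text = (child.text or "").strip()
        if text:
            row[path] = text
        # Recurse with a relative prefix, so only this child's subtree is visited.
        row.update(flatten(child, prefix=path + "."))
    return row

def levels_to_rows(root):
    """Build one dict per LEVEL, carrying the parent APPLICATION's ID and NAME along."""
    rows = []
    for app in root.findall("./RESULTS/APPLICATION"):
        for level in app.findall("./LIST/LEVEL"):
            row = {"APPLICATION.ID": app.findtext("ID"),
                   "APPLICATION.NAME": app.findtext("NAME")}
            row.update(flatten(level))
            rows.append(row)
    return rows

# Cut-down copy of the XML from the question.
sample = """<REPORT><RESULTS><APPLICATION>
<ID>123456</ID><NAME>https://www.example.com/subsite/</NAME>
<LIST><LEVEL>
<UNIQUE_ID>123abc</UNIQUE_ID><URL>https://www.example.com/subsite/</URL>
<PAYLOADS><PAYLOAD><REQUEST><METHOD>POST</METHOD><HEADERS>
<HEADER><key>Host</key><value>randomdata</value></HEADER>
<HEADER><key>Content-Type</key><value>randomdata</value></HEADER>
</HEADERS></REQUEST></PAYLOAD></PAYLOADS>
</LEVEL></LIST></APPLICATION></RESULTS></REPORT>"""

rows = levels_to_rows(ET.fromstring(sample))
for key, value in rows[0].items():
    print(key, "=", value)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and written with `to_csv` as in the question. The `INFORMATION_GATHERED` entries could be handled by a second loop of the same shape, probably into a separate CSV, since they are a different kind of record.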
