Duplication of single line when creating a pandas dataframe from a nested xml file

Asked 5 years, 8 months ago

Modified 5 years, 8 months ago

Viewed 55 times

I have a heavily nested XML file that I am trying to convert to CSV, and I would like to pull every possible element from it.

I am using "create a pandas dataframe from a nested xml file" as the basis for my solution. My challenge is that my dataframe only ever contains the same row (holding the last of the many URLs in the XML file), repeated over and over.

My XML (trimmed to show a single expanded element):

<?xml version='1.0' encoding='UTF-8'?>
<REPORT>
    <ELEMENT1>
    ...
    </ELEMENT1>
    <ELEMENT2>
    ...
    </ELEMENT2>
    <ELEMENT3>
    ...
    </ELEMENT3>
    <ELEMENT4>
    ...
    </ELEMENT4>
    <RESULTS>
        <APPLICATION>
            <ID>123456</ID>
            <NAME>https://www.example.com/subsite/</NAME>
            <LIST>
                <LEVEL>
                    <UNIQUE_ID>123abc</UNIQUE_ID>
                    <ID>6666666</ID>
                    <URL>https://www.example.com/subsite/</URL>
                    <STATUS>ACTIVE</STATUS>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <TIMES_DETECTED>9</TIMES_DETECTED>
                    <PAYLOADS>
                        <PAYLOAD>
                            <NUM>1</NUM>
                            <REQUEST>
                                <METHOD>POST</METHOD>
                                <URL>https://www.example.com/subsite/</URL>
                                <HEADERS>
                                    <HEADER>
                                        <key>Host</key>
                                        <value>randomdata</value>
                                    </HEADER>
                                    <HEADER>
                                        <key>Content-Type</key>
                                        <value>randomdata</value>
                                    </HEADER>
                                </HEADERS>
                            </REQUEST>
                            <RESPONSE>
                                <CONTENTS base64="true">randomdata</CONTENTS>
                            </RESPONSE>
                        </PAYLOAD>
                    </PAYLOADS>
                    <IGNORED>false</IGNORED>
                </LEVEL>
            </LIST>
            <BURP_ISSUES_LIST/>
            <BUGCROWD_SUBMISSIONS_LIST/>
            <SENSITIVE_CONTENT_LIST/>
            <INFORMATION_GATHERED_LIST>
                <INFORMATION_GATHERED>
                    <UNIQUE_ID>835ogk</UNIQUE_ID>
                    <ID>9459856</ID>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <DATA base64="true">randomdata</DATA>
                </INFORMATION_GATHERED>
                <INFORMATION_GATHERED>
                    <UNIQUE_ID>475hgv</UNIQUE_ID>
                    <ID>4569852</ID>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <DATA base64="true">randomdata</DATA>
                </INFORMATION_GATHERED>
                <INFORMATION_GATHERED>
                    <UNIQUE_ID>849ikg</UNIQUE_ID>
                    <ID>326614</ID>
                    <TIME>14 Mar 2020 04:47AM GMT+0200</TIME>
                    <DATA base64="true">randomdata</DATA>
                </INFORMATION_GATHERED>
            </INFORMATION_GATHERED_LIST>
        </APPLICATION>
        <APPLICATION2>
        </APPLICATION2>
    </RESULTS>
    <ELEMENT5>
    ...
    </ELEMENT5>
</REPORT>

My code is below:

from lxml import etree as et
import pandas as pd

# Raw strings stop '\f' in the Windows paths from being read as a form-feed escape
file_input = r'D:\file.xml'
file_output = r'D:\file.csv'
trees = et.parse(file_input)

d = []
for reportdata in trees.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL'):   
    inner = {}
    for elem in reportdata.xpath('//*'):       
        try:
            if len(elem.text.strip()) > 0:     
                inner[elem.tag] = elem.text
        except:
            pass
    d.append(inner)

df = pd.DataFrame(d)

df.to_csv(file_output, sep="|", index = None)

The result of this code is a dataframe containing 2000 copies of the same row.
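One likely cause, sketched here on a hypothetical two-entry miniature of the report (the inline XML and the `collect` helper are illustrative assumptions, not part of the original file or code): in lxml, an XPath that starts with `//` is evaluated from the document root regardless of which element it is called on, so `reportdata.xpath('//*')` walks the whole document on every pass. Because the dictionary is keyed by tag, later elements overwrite earlier ones and every row ends up holding the last values in the file.

```python
from lxml import etree as et

# Hypothetical miniature of the report: two LEVEL entries with different URLs.
xml = b"""<REPORT><RESULTS><APPLICATION><LIST>
<LEVEL><URL>https://www.example.com/a/</URL><STATUS>ACTIVE</STATUS></LEVEL>
<LEVEL><URL>https://www.example.com/b/</URL><STATUS>FIXED</STATUS></LEVEL>
</LIST></APPLICATION></RESULTS></REPORT>"""
root = et.fromstring(xml)
levels = root.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL')


def collect(elem, path):
    """Mimic the inner loop of the question: keep non-empty text, keyed by tag."""
    out = {}
    for e in elem.xpath(path):
        if e.text and e.text.strip():
            out[e.tag] = e.text  # a repeated tag overwrites the earlier value
    return out


# '//*' is absolute: both calls scan the whole document and return the same dict.
absolute_rows = [collect(level, '//*') for level in levels]

# './/*' is relative: each call only sees the descendants of its own LEVEL.
relative_rows = [collect(level, './/*') for level in levels]

print(absolute_rows)
print(relative_rows)
```

Note that the outer loop in the question selects `URL` elements, which have no children, so switching the inner expression to `.//*` alone would return nothing; the outer loop would also need to select the enclosing `LEVEL` elements.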

I have tried substituting the following line:

for reportdata in trees.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL/URL'):

with:

for reportdata in trees.xpath('//REPORT'):

However, that gives me only a single row in my dataframe. I have also replaced it with:

for reportdata in trees.xpath('//*'):

In that case the script runs for around 5 hours and again returns around 100,000 copies of the exact same row.

Ideally, I would like to pull all possible data elements from the XML file. Could I please have some advice on how to fix my code? Thank you.
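A minimal sketch of one way to get one row per `LEVEL`, assuming each `LEVEL` element is the record of interest. The inline XML is a hypothetical cut-down of the sample above, and `flatten` is an illustrative helper, not an lxml or pandas function. It walks only the children of the record it is given and names each column by its path relative to that record, so that the nested `URL` under `REQUEST` does not overwrite the top-level `URL`, and repeated paths such as the second `HEADER/key` get a numeric suffix.

```python
from lxml import etree as et
import pandas as pd

# Hypothetical cut-down of the report with two LEVEL records.
xml = b"""<REPORT><RESULTS><APPLICATION><LIST>
<LEVEL><ID>1</ID><URL>https://www.example.com/a/</URL>
  <PAYLOADS><PAYLOAD><REQUEST><METHOD>POST</METHOD>
    <URL>https://www.example.com/a/login</URL>
    <HEADERS>
      <HEADER><key>Host</key><value>x</value></HEADER>
      <HEADER><key>Content-Type</key><value>y</value></HEADER>
    </HEADERS>
  </REQUEST></PAYLOAD></PAYLOADS>
</LEVEL>
<LEVEL><ID>2</ID><URL>https://www.example.com/b/</URL></LEVEL>
</LIST></APPLICATION></RESULTS></REPORT>"""
root = et.fromstring(xml)


def flatten(elem, prefix='', row=None):
    """Return {relative/path: text} for every non-empty descendant of elem."""
    if row is None:
        row = {}
    for child in elem:
        if not isinstance(child.tag, str):  # skip comments and processing instructions
            continue
        key = prefix + child.tag
        text = (child.text or '').strip()
        if text:
            # A path seen before in this record gets a [2], [3], ... suffix.
            unique, n = key, 1
            while unique in row:
                n += 1
                unique = f'{key}[{n}]'
            row[unique] = text
        flatten(child, key + '/', row)  # descend, extending the column path
    return row


rows = [flatten(level)
        for level in root.xpath('//REPORT/RESULTS/APPLICATION/LIST/LEVEL')]
df = pd.DataFrame(rows)
print(df.T)
```

The resulting frame can then be written out as in the question, for example with `df.to_csv(file_output, sep="|", index=False)`. Records that lack a given element simply get an empty cell in that column.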

edited Mar 29, 2020 at 15:40

asked Mar 28, 2020 at 19:19

Recycle_Bin28

It's going to be difficult to answer your question without knowing what file.xml looks like.
– Jack Fleeting, Mar 28, 2020 at 22:44

Upload the XML please
– balderman, Mar 29, 2020 at 13:49

Thank you, I've added the XML.
– Recycle_Bin28, Mar 29, 2020 at 15:41
