Extract data from XML using ElementTree in Python

Question

I have the following XML file, which I need to parse to extract data into a CSV file. The file describes two boxes (box_id), each packed into a different parent object (parent_box_id), together with the contents of each box (element sgtin -> info_sgtin).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<doc>
    <info id_reference="2">
        <data_down>
            <tree>
                <box_id>046071598600870568</box_id>
                <parent_box_id>046071598600875594</parent_box_id>
            </tree>
            <tree>
                <box_id>046071598600870575</box_id>
                <parent_box_id>046071598600875595</parent_box_id>
            </tree>
            <tree>
                <sgtin>
                    <info_sgtin>
                        <sgtin>04607008133585B0SE1HVHBGR3A</sgtin>
                        <box_id>046071598600870568</box_id>
                        <gtin>04607008133585</gtin>
                        <series_number>026A</series_number>
                    </info_sgtin>
                </sgtin>
                <parent_box_id>046071598600870568</parent_box_id>
            </tree>
            <tree>
                <sgtin>
                    <info_sgtin>
                        <sgtin>046070081335856F7P78HBVBEH2</sgtin>
                        <box_id>046071598600870568</box_id>
                        <gtin>04607008133585</gtin>
                        <series_number>026A</series_number>
                    </info_sgtin>
                </sgtin>
                <parent_box_id>046071598600870568</parent_box_id>
            </tree>
            <tree>
                <sgtin>
                    <info_sgtin>
                        <sgtin>046070081335854T61H7CSXDE9W</sgtin>
                        <box_id>046071598600870575</box_id>
                        <gtin>04607008133585</gtin>
                        <series_number>026A</series_number>
                    </info_sgtin>
                </sgtin>
                <parent_box_id>046071598600870575</parent_box_id>
            </tree>
        </data_down>
    </info>
</doc>

For this I decided to use ElementTree in Python, but the problem is that my XML file has two variants of the tree tag.

First I iterate through all the details and capture the box_id value, but then I have to go back to the parent item and get the parent_box_id in which this box_id is packed.

In other words I want to get the data in the following way:

parent_box_id       box_id              sgtin                           series_number
046071598600875594  046071598600870568  04607008133585B0SE1HVHBGR3A     026A
046071598600875594  046071598600870568  046070081335856F7P78HBVBEH2     026A
046071598600875595  046071598600870575  046070081335854T61H7CSXDE9W     026A

But I can't figure out how to get parent_box_id value. Would appreciate any support from the community.

Here is the code that I have:

import csv
import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# open in 'w' mode so the file is truncated on every run
with open('result.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for alist in root.iter('info_sgtin'):
        sgtin = alist.find('sgtin').text
        box_id = alist.find('box_id').text
        series = alist.find('series_number').text

        # parent_box_id is still missing from the row
        writer.writerow([sgtin, box_id, series])

Does the parent_box_id need to be matched with the box_id inside the first two tree elements and with the rest of the data?
– Zaraki Kenpachi, Commented Feb 25, 2020 at 9:36
Hi @ZarakiKenpachi. Yes, you are right: at the beginning of the XML we have the relations between box_id and parent_box_id, but the details section also has box_id and parent_box_id tags, which hold the same value (equal to box_id). For those, we have to look up the real parent_box_id value from the beginning of the file.
– Denik Gorbunov, Commented Feb 25, 2020 at 12:31

Zaraki Kenpachi · Accepted Answer · 2020-02-25 12:51:53Z

1

You need to loop over every <tree> tag and check if there is data that you need. Then collect it.

import xml.etree.ElementTree as ET


root = ET.parse('data.xml')

# collect parent data: box_id -> parent_box_id, taken only from <tree>
# elements that have <box_id> as a direct child
parent_data = {}
for item in root.iter('tree'):
    box_id_match = item.find('box_id')
    parent_box_id_match = item.find('parent_box_id')
    if box_id_match is not None and parent_box_id_match is not None:
        parent_data[box_id_match.text] = parent_box_id_match.text

data = []
for item in root.iter('tree'):
    sgtin = item.find('sgtin/info_sgtin/sgtin')
    box_id = item.find('sgtin/info_sgtin/box_id')
    series_number = item.find('sgtin/info_sgtin/series_number')
    # collect valid data
    if sgtin is not None and box_id is not None and series_number is not None:
        parent_box_id = parent_data.get(box_id.text)
        data.append([parent_box_id, box_id.text, sgtin.text, series_number.text])

for row in data:
    print(row)

Output:

['046071598600875594', '046071598600870568', '04607008133585B0SE1HVHBGR3A', '026A']
['046071598600875594', '046071598600870568', '046070081335856F7P78HBVBEH2', '026A']
['046071598600875595', '046071598600870575', '046070081335854T61H7CSXDE9W', '026A']
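
Since the question asks for a CSV file, the collected rows can be written out with the csv module. This is only a minimal sketch: the helper name write_rows and the header row are assumptions made for illustration, while the ';' delimiter and quoting settings are taken from the code in the question. It writes to an in-memory buffer so it can be run without touching the disk.

```python
import csv
import io

# Rows in the same shape as the `data` list built above
rows = [
    ['046071598600875594', '046071598600870568', '04607008133585B0SE1HVHBGR3A', '026A'],
    ['046071598600875594', '046071598600870568', '046070081335856F7P78HBVBEH2', '026A'],
    ['046071598600875595', '046071598600870575', '046070081335854T61H7CSXDE9W', '026A'],
]


def write_rows(rows, fileobj):
    """Write a header line plus the data rows as ';'-separated CSV (hypothetical helper)."""
    writer = csv.writer(fileobj, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['parent_box_id', 'box_id', 'sgtin', 'series_number'])
    writer.writerows(rows)


buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, pass a handle opened with open('result.csv', 'w', newline='') instead of the StringIO buffer.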

edited Feb 25, 2020 at 12:51

answered Feb 25, 2020 at 9:27

Zaraki Kenpachi


4 Comments

Denik Gorbunov Over a year ago

you actually specified parent_box_id value from the info_sgtin details (which is equal to box_id in this case), but it has to be looked up from the beginning of the file, where there is a relationship between box_id and parent_box_id.

Denik Gorbunov Over a year ago

When you try to collect parent_data, you have the same tags in all "tree" elements, so box_id_match and parent_box_id_match return values not only for the first two elements, where the aggregation is actually defined, but also for the details (where box_id is equal to parent_box_id).

Zaraki Kenpachi Over a year ago

@DenikGorbunov that is not true. Parent data will be collected only from tags defined directly below the <tree> tag, like <tree> --> <box_id>. The rest of the data will be handled as data to collect. Test it.

Denik Gorbunov Over a year ago

Thank you so much for your effort, with all your comments I understood how to get data from specific tags and check if they exist. I transformed the code itself, added some additional calls and I'm really happy that it finally works and solves the requirement. Appreciate your support!
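
The point debated in the comments above can be checked directly: in ElementTree, find() with a bare tag name only looks at direct children, so a box_id nested inside info_sgtin is not matched from the tree element. The snippet below is a small sketch with shortened, made-up values (B1, P1) standing in for the real identifiers.

```python
import xml.etree.ElementTree as ET

# A "mapping" tree: box_id is a direct child of <tree>
mapping = ET.fromstring(
    "<tree><box_id>B1</box_id><parent_box_id>P1</parent_box_id></tree>"
)

# A "detail" tree: box_id only appears nested inside sgtin/info_sgtin
detail = ET.fromstring(
    "<tree><sgtin><info_sgtin><box_id>B1</box_id></info_sgtin></sgtin>"
    "<parent_box_id>B1</parent_box_id></tree>"
)

direct_in_mapping = mapping.find("box_id")   # direct child, so it is found
direct_in_detail = detail.find("box_id")     # no direct child, so None
nested_in_detail = detail.find(".//box_id")  # descendant search finds it

print(direct_in_mapping.text, direct_in_detail, nested_in_detail.text)
```

This is why the first loop in the answer only picks up the two tree elements that define the box_id to parent_box_id relation.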

sim · Accepted Answer · 2020-02-25 09:35:42Z

0

Here's a solution using XPath with lxml (first collecting the mapping between box_id and parent_box_id from the immediate children of tree). Is that what you are looking for? I am not sure, since 046071598600875595 is listed in your desired output as parent_box_id for box_id 046071598600870575 and I don't know where this is coming from.

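from lxml import etree

fp = 'test.xml'  # path to the XML file (assumed here; use your own)
parser = etree.XMLParser()
root = etree.parse(fp, parser)

# map each box_id that is a direct child of <tree> to its sibling parent_box_id
parent_ids = {elem.text: elem.xpath("following-sibling::parent_box_id")[0].text
              for elem in root.xpath("//*/tree/box_id")}

for alist in root.iter('info_sgtin'):
    sgtin = alist.find('sgtin').text
    box_id = alist.find('box_id').text
    series = alist.find('series_number').text
    print(sgtin, parent_ids[box_id], box_id, series)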

Output:

04607008133585B0SE1HVHBGR3A 046071598600875594 046071598600870568 026A
046070081335856F7P78HBVBEH2 046071598600875594 046071598600870568 026A
046070081335854T61H7CSXDE9W 046071598600875594 046071598600870575 026A

If your files are large and it makes sense to iterate through them only once, you could use etree.iterparse with tag=["box_id"] or tag=["tree"]. In the former case, check which siblings you observe (sgtin, gtin and series_number, or parent_box_id). If you find parent_box_id, add a new mapping to your lookup table (a dictionary that links box_ids to parent_box_ids). If you find sgtin and the others, write out the data you collect from the siblings and get the parent_box_id from your lookup table.

Of course, the iterative solution as described only works if the structure is such that the box_id to parent_box_id mappings always precede the collections of sgtin, box_id, gtin and series_number.
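
The single-pass idea described above could look roughly like the sketch below. Note that it swaps in the standard library's xml.etree.ElementTree.iterparse and filters on the tag name by hand, rather than using the tag= argument of lxml's iterparse; the function name rows_single_pass and the shortened sample values (B1, P1, S1) are made up for illustration. It relies on the same assumption: every mapping tree appears before the detail trees that refer to it.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample with the same structure as the XML in the question
SAMPLE = b"""<doc><info id_reference="2"><data_down>
<tree><box_id>B1</box_id><parent_box_id>P1</parent_box_id></tree>
<tree>
  <sgtin><info_sgtin>
    <sgtin>S1</sgtin><box_id>B1</box_id>
    <gtin>G1</gtin><series_number>026A</series_number>
  </info_sgtin></sgtin>
  <parent_box_id>B1</parent_box_id>
</tree>
</data_down></info></doc>"""


def rows_single_pass(source):
    """Build output rows in one pass; mappings must precede the details."""
    parents = {}  # lookup table: box_id -> parent_box_id
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tree":
            continue
        box_id = elem.findtext("box_id")  # direct child only
        if box_id is not None:
            # mapping tree: remember which parent this box is packed in
            parents[box_id] = elem.findtext("parent_box_id")
        else:
            info = elem.find("sgtin/info_sgtin")
            if info is not None:
                inner_box = info.findtext("box_id")
                rows.append([parents.get(inner_box), inner_box,
                             info.findtext("sgtin"),
                             info.findtext("series_number")])
        elem.clear()  # drop the processed subtree's children to limit memory use
    return rows


rows = rows_single_pass(io.BytesIO(SAMPLE))
print(rows)
```

If a detail tree shows up before its mapping, parents.get returns None for that row, which is the limitation mentioned above.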

edited Feb 25, 2020 at 9:35

answered Feb 25, 2020 at 9:12

sim


3 Comments

Denik Gorbunov Over a year ago

Hi @sim, thanks for the details provided. Regarding your question ("I am not sure since 046071598600875595 is listed in your desired output as parent_box_id for box_id 046071598600870575 and I don't know where this is coming from"): as you can see, at the beginning of this XML there is a structure with the same <tree> tag that describes the relationships between box_id and parent_box_id elements, and these details come from that part. These relations are not always at the beginning of the file; they can appear in any part of it as well ...

sim Over a year ago

@DenikGorbunov: The sample data you provided does not contain the parent_box_id 046071598600875595. I believe the output from my post is correct, given that I understood the question correctly. The code posted works regardless of where the <tree> elements that contain the parent_box_id to box_id mappings are located. You'd have to decide, based on your specific application needs, whether it is too costly to iterate over the files twice. Zaraki's solution has the advantage that the parent_box_id element is not required to follow box_id.

Denik Gorbunov Over a year ago

You are right, that was my mistake, sorry for this typo. I edited the initial XML file that I used as an example.

dabingsou · Accepted Answer · 2020-02-25 14:13:55Z

Try this.

from simplified_scrapy import SimplifiedDoc
html = '''
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<doc>
    <info id_reference="2">
        <data_down>
            <tree>
                <box_id>046071598600870568</box_id>
                <parent_box_id>046071598600875594</parent_box_id>
            </tree>
            <tree>
                <box_id>046071598600870575</box_id>
                <parent_box_id>046071598600875594</parent_box_id>
            </tree>
            <tree>
                <sgtin>
                    <info_sgtin>
                        <sgtin>04607008133585B0SE1HVHBGR3A</sgtin>
                        <box_id>046071598600870568</box_id>
                        <gtin>04607008133585</gtin>
                        <series_number>026A</series_number>
                    </info_sgtin>
                </sgtin>
                <parent_box_id>046071598600870568</parent_box_id>
            </tree>
            <tree>
                <sgtin>
                    <info_sgtin>
                        <sgtin>046070081335856F7P78HBVBEH2</sgtin>
                        <box_id>046071598600870568</box_id>
                        <gtin>04607008133585</gtin>
                        <series_number>026A</series_number>
                    </info_sgtin>
                </sgtin>
                <parent_box_id>046071598600870568</parent_box_id>
            </tree>
            <tree>
                <sgtin>
                    <info_sgtin>
                        <sgtin>046070081335854T61H7CSXDE9W</sgtin>
                        <box_id>046071598600870575</box_id>
                        <gtin>04607008133585</gtin>
                        <series_number>026A</series_number>
                    </info_sgtin>
                </sgtin>
                <parent_box_id>046071598600870575</parent_box_id>
            </tree>
        </data_down>
    </info>
</doc>
'''
doc = SimplifiedDoc(html)
boxIds = doc.selects('data_down>tree').notContains('<sgtin>')
dic = {}
for box in boxIds:
    dic[box.box_id.html]=box.parent_box_id.html
datas=[]
boxs = doc.selects('data_down>info_sgtin')
for box in boxs:
    datas.append([dic[box.box_id.html],box.box_id.html,box.sgtin.html,box.series_number.html])

print (datas)

Result:

[['046071598600875594', '046071598600870568', '04607008133585B0SE1HVHBGR3A', '026A'], ['046071598600875594', '046071598600870568', '046070081335856F7P78HBVBEH2', '026A'], ['046071598600875594', '046071598600870575', '046070081335854T61H7CSXDE9W', '026A']]

Thanks @dabingsou, this is a really great solution as well. I will take note of this one.
