Parse an XML file with multiple root elements in Python

Question

I have an XML file and need to extract some of its tags. The file contains data like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein1">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria1" direction="E"/>
        <neighbor name="Switzerland1" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia1" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

I need to parse this, so I used:

import xml.etree.ElementTree as ET
tree = ET.parse("myfile.xml")
root = tree.getroot()

This code raises an error at line 2: xml.etree.ElementTree.ParseError: junk after document element:

I think this is because the file contains multiple XML root elements. Do you have any idea how I should parse this?

"I have a xml file..." No, you don't. Where does the file come from? Is there a possibility of fixing the issue on that side? (It shouldn't be too hard to parse it, but if there's any way to avoid the invalid XML in the first place, that would be better.) — user94559
– user94559, Commented Aug 3, 2017 at 5:15
Together it is not a valid XML file. But you can split it before <?xml version="1.0"?> and parse the parts separately. — Klaus D.
– Klaus D., Commented Aug 3, 2017 at 5:16
@smarx What do you mean by "is there a possibility..."? I have given only sample data from the file; it contains many more root elements like this. @KlausD. I am looking for a better option. — ggupta
– ggupta, Commented Aug 3, 2017 at 6:04
@ggupta I mean do you control the app that created that file, and can you fix it so it produces valid XML? — user94559
– user94559, Commented Aug 3, 2017 at 12:02
Then just split the file on the <?xml ... lines and parse each section (now a valid XML document) separately. — user94559
– user94559, Commented Aug 3, 2017 at 13:35

h3half · Accepted Answer · 2025-02-11 16:36:51Z

10

There's a simple trick I've used to parse such pseudo-XML (Wazuh rule files, for what it's worth): temporarily wrap it inside a fake element, <whatever></whatever>, so that a single root sits over all these "roots".

In your case, rather than having invalid XML like this:

<data> ... </data>
<data> ... </data>

Just before passing it to the parser, temporarily rewrite it as:

<whatever>
    <data> ... </data>
    <data> ... </data>
</whatever>

Then you parse it as usual and iterate over the <data> elements.

import xml.etree.ElementTree as etree
from pathlib import Path

file = Path('rules/0020-syslog_rules.xml')
# Wrap the file contents in a single artificial root element
data = b'<rules>' + file.read_bytes() + b'</rules>'
root = etree.fromstring(data)
groups = root.findall('group')  # list of Element objects
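
The sample file in the question differs from a Wazuh rule file in one respect: each chunk starts with its own <?xml version="1.0"?> declaration, and a declaration is only allowed at the very start of a document, so wrapping the raw text would probably still fail to parse. Here is a minimal sketch of the same wrapping trick, assuming the declarations can simply be dropped. The function name `parse_multi_root` and the wrapper element name `wrapper` are hypothetical, chosen for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0"?>
XML_DECLARATION = re.compile(r'<\?xml\s[^>]*\?>')

def parse_multi_root(text, wrapper='wrapper'):
    """Parse text holding several XML documents under one artificial root.

    `wrapper` is a hypothetical element name; pick one that does not
    clash with tags already present in the data.
    """
    # Drop every declaration, then wrap what is left in a single root
    body = XML_DECLARATION.sub('', text)
    return ET.fromstring('<{0}>{1}</{0}>'.format(wrapper, body))

sample = '''<?xml version="1.0"?>
<data>
    <country name="Liechtenstein"><rank>1</rank></country>
</data>
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein1"><rank>1</rank></country>
</data>
'''

root = parse_multi_root(sample)
names = [country.get('name') for country in root.iter('country')]
print(len(root.findall('data')), names)
```

Each original root ends up as a <data> child of the artificial root, so the usual findall and iter calls work across all of them.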

edited Feb 11 at 16:36

h3half


answered Jan 22, 2019 at 19:56

kravietz


Bill Bell · Accepted Answer · 2017-08-03 18:22:30Z

4

This code fills in details for one approach, if you want them.

The code accumulates lines in accumulated_xml until it encounters the beginning of another XML document or the end of the file. When it has a complete XML document, it calls display, which uses the lxml library to parse the document and report some of its contents.

>>> from lxml import etree
>>> def display(alist):
...     tree = etree.fromstring(''.join(alist))
...     for country in tree.xpath('.//country'):
...         print(country.attrib['name'], country.find('rank').text, country.find('year').text)
...         print([neighbour.attrib['name'] for neighbour in country.xpath('neighbor')])
... 
>>> accumulated_xml = []
>>> with open('temp.xml') as temp:
...     while True:
...         line = temp.readline()
...         if line:
...             if line.startswith('<?xml'):
...                 if accumulated_xml:
...                     display (accumulated_xml)
...                     accumulated_xml = []
...             else:
...                 accumulated_xml.append(line.strip())
...         else:
...             display (accumulated_xml)
...             break
... 
Liechtenstein 1 2008
['Austria', 'Switzerland']
Singapore 4 2011
['Malaysia']
Panama 68 2011
['Costa Rica', 'Colombia']
Liechtenstein1 1 2008
['Austria1', 'Switzerland1']
Singapore 4 2011
['Malaysia1']
Panama 68 2011
['Costa Rica', 'Colombia']
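
The same line-by-line split can also be written as a generator, which keeps the splitting separate from whatever is done with each document. This is only a sketch: it uses the standard library's xml.etree.ElementTree rather than lxml, assumes every document starts on a line beginning with '<?xml', and the name `iter_documents` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def iter_documents(lines):
    """Yield one parsed root element per XML document found in `lines`.

    A new document is assumed to start at every line beginning with '<?xml'.
    """
    chunk = []
    for line in lines:
        if line.startswith('<?xml') and chunk:
            # The previous document is complete, so parse and emit it
            yield ET.fromstring(''.join(chunk))
            chunk = []
        # Skip blank lines that appear before a document has started
        if chunk or line.strip():
            chunk.append(line)
    if chunk:
        yield ET.fromstring(''.join(chunk))

# An in-memory stand-in for the file from the question
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<data><country name="Liechtenstein"/></data>\n'
    '<?xml version="1.0"?>\n'
    '<data><country name="Liechtenstein1"/></data>\n'
)
roots = list(iter_documents(sample))
names = [c.get('name') for root in roots for c in root.iter('country')]
print(names)
```

With a real file, the open file object can be passed in place of the StringIO, since both yield lines when iterated.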

answered Aug 3, 2017 at 18:22

Bill Bell


2 Comments

ggupta Over a year ago

Thanks for this. I was using the same approach; I wonder why there is no Python library for this.

Bill Bell Over a year ago

Whenever I use this way of splitting a file I think there must be a better way of expressing it in Python.

stovfl · Accepted Answer · 2017-08-04 08:35:42Z

3

Question: ... any idea, how should i parse this?

Read through the whole file and split it into valid XML chunks at each <?xml ... line.
This creates myfile_01, myfile_02 ... myfile_nn.

n = 0
out_fh = None
with open('myfile.xml') as in_fh:
    for line in in_fh:
        # Each XML declaration starts a new output file
        if line.startswith('<?xml'):
            if out_fh:
                out_fh.close()
            n += 1
            out_fh = open('myfile_{:02}'.format(n), 'w')

        # Skip anything that precedes the first declaration
        if out_fh:
            out_fh.write(line)

if out_fh:
    out_fh.close()
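
Once the chunk files exist, each one is a valid XML document and can be parsed on its own. Here is a minimal sketch, assuming the files follow the two-digit myfile_nn naming used above. The helper name `parse_chunk_files` is hypothetical, and the temporary directory stands in for wherever the chunks were written.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def parse_chunk_files(directory, pattern='myfile_[0-9][0-9]'):
    """Parse every chunk file in `directory`; return roots in file-name order.

    The default pattern only matches two-digit suffixes (an assumption).
    """
    paths = sorted(Path(directory).glob(pattern))
    return [ET.parse(str(path)).getroot() for path in paths]

# Demonstration with two throwaway chunk files
with tempfile.TemporaryDirectory() as tmp:
    for n, name in enumerate(['Liechtenstein', 'Liechtenstein1'], start=1):
        Path(tmp, 'myfile_{:02}'.format(n)).write_text(
            '<?xml version="1.0"?>\n'
            '<data><country name="{}"/></data>\n'.format(name))
    roots = parse_chunk_files(tmp)

names = [c.get('name') for root in roots for c in root.iter('country')]
print(names)
```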

If you want all <country> elements in one XML tree:

import re
from xml.etree import ElementTree as ET

with open('myfile.xml') as fh:
    # Collect every <country>...</country> block, ignoring the repeated roots
    countries = re.findall('<country.*?</country>', fh.read(), re.S)

root = ET.fromstring('<?xml version="1.0"?><data>{}</data>'.format(''.join(countries)))

Tested with Python: 3.4.2

edited Aug 4, 2017 at 8:35

answered Aug 3, 2017 at 20:33

stovfl


2 Comments

ggupta Over a year ago

Thanks for the suggestions; I used the same approach.

ggupta Over a year ago

I was just looking for a way to parse the file, not any specific tag. Your previous answer was helpful for me; thanks for modifying it.
