Parsing large combined XML document with Python

Question

I have one large file (400 MB) that contains hundreds of XML documents, each with its own XML declaration. I am trying to parse each document using ElementTree in Python, but I am having trouble splitting the file into individual documents so that I can extract the information. Here is an example of what the file looks like:

<?xml version="1.0"?>
<data>
    <more>
       <p></p>
    </more>
</data>
<?xml version="1.0"?>
<different data>
    <etc>
       <p></p>
    </etc>
</different data>
<?xml version="1.0"?>
<continues.....>

Ideally I would like to read through each XML declaration, parse the data, and continue on with the next XML document. Any suggestions will help.

Martijn Pieters · Accepted Answer · 2013-03-26 18:52:26Z


You'll need to read in the documents separately; here is a generator function that'll yield complete XML documents from a given file object:

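def xml_documents(fileobj):
    """Yield each complete XML document in fileobj as a single string."""
    document = []
    for line in fileobj:
        # A new XML declaration marks the start of the next document,
        # so emit the lines collected so far.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)

    # Emit the final document, which has no declaration after it.
    if document:
        yield ''.join(document)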
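
To see what the generator yields, it can be run against a small in-memory file. The sketch below repeats the function so that it runs on its own; the sample content and the `<other>` tag name are made up for illustration. It also assumes, as the generator itself does, that every XML declaration starts on its own line.

```python
import io


def xml_documents(fileobj):
    # Same generator as in the answer, repeated so this snippet is self-contained.
    document = []
    for line in fileobj:
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)


# Hypothetical sample: two small documents concatenated in one stream.
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<data>\n'
    '    <more><p></p></more>\n'
    '</data>\n'
    '<?xml version="1.0"?>\n'
    '<other>\n'
    '    <etc><p></p></etc>\n'
    '</other>\n'
)

# list() is fine for a tiny sample; for a 400 MB file, iterate instead
# so that only one document is held in memory at a time.
chunks = list(xml_documents(sample))
print(len(chunks))
```

Each yielded string starts with its own declaration, so it can be passed to a parser as a standalone document.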

Then use ElementTree.fromstring() to load and parse these:

from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        # fromstring() returns the root Element of the parsed document
        root = ElementTree.fromstring(xml)
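
With hundreds of documents in one file, a single malformed one would stop the loop with an exception. One possible way to keep going is sketched below. The helper name `parse_documents` and the sample strings are hypothetical, not part of the answer; the helper accepts any iterable of XML strings, such as the output of the generator.

```python
from xml.etree import ElementTree


def parse_documents(chunks):
    """Yield (index, root) pairs; root is None when a chunk fails to parse."""
    for index, xml in enumerate(chunks):
        try:
            yield index, ElementTree.fromstring(xml)
        except ElementTree.ParseError:
            # Skip the malformed document but report its position.
            yield index, None


# Hypothetical input: one well-formed document and one with a mismatched tag.
chunks = [
    '<?xml version="1.0"?>\n<data><more><p>one</p></more></data>\n',
    '<?xml version="1.0"?>\n<broken><etc></broken>\n',
]

results = [
    (index, None if root is None else root.tag)
    for index, root in parse_documents(chunks)
]
print(results)
```

Comparing with `is None` rather than testing the element's truth value is deliberate, since an Element without children evaluates as false.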

