I have one large file (400 MB) that contains hundreds of XML documents concatenated together, each with its own XML declaration. I am trying to parse each document with ElementTree in Python, but I am having trouble splitting the file into individual documents so that each one can be parsed. Here is an example of what the file looks like:

<?xml version="1.0"?>
<data>
    <more>
       <p></p>
    </more>
</data>
<?xml version="1.0"?>
<different data>
    <etc>
       <p></p>
    </etc>
</different data>
<?xml version="1.0"?>
<continues.....>

Ideally I would like to read up to each XML declaration, parse that document, and then continue with the next one. Any suggestions would help.

1 Answer

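ElementTree cannot parse a file that contains more than one document, so you'll need to split the file and parse each document separately. Here is a generator function that yields complete XML documents from a given file object, assuming each XML declaration starts on its own line: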

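def xml_documents(fileobj):
    """Yield each complete XML document in fileobj as a single string."""
    document = []
    for line in fileobj:
        # An XML declaration marks the start of the next document,
        # so emit whatever has been collected so far.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)

    # Emit the final document, which has no declaration after it.
    if document:
        yield ''.join(document)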

Then use ElementTree.fromstring() to parse each one; it returns the root Element of that document:

from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        # fromstring() returns the root Element, not an ElementTree
        root = ElementTree.fromstring(xml)
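
If some of the documents in the file might be malformed, one option is to catch ElementTree.ParseError per document so that a single bad document does not abort the whole run. Here is a minimal sketch of that idea, run on a small in-memory sample. The parse_documents helper, the sample data, and the skip-on-error policy are assumptions made for illustration; the generator is repeated only so the example runs on its own:

```python
import io
from xml.etree import ElementTree


def xml_documents(fileobj):
    """Yield each complete XML document in fileobj as a single string."""
    document = []
    for line in fileobj:
        # A new declaration ends the document collected so far.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)


def parse_documents(fileobj):
    """Hypothetical helper: yield (index, root, error) per document.

    Exactly one of root and error is None for each document.
    """
    for index, xml in enumerate(xml_documents(fileobj)):
        try:
            root = ElementTree.fromstring(xml)
        except ElementTree.ParseError as exc:
            yield index, None, exc
        else:
            yield index, root, None


# Made-up sample: three documents, the second one is never closed.
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<data>\n    <more><p>first</p></more>\n</data>\n'
    '<?xml version="1.0"?>\n'
    '<broken>\n'
    '<?xml version="1.0"?>\n'
    '<other>\n    <etc><p>third</p></etc>\n</other>\n'
)

results = list(parse_documents(sample))
for index, root, error in results:
    if error is None:
        print(index, root.tag, root.findtext('.//p'))
    else:
        print(index, 'skipped:', error)
```

Because the generator reads the file line by line, only one document is held in memory at a time, which matters for a 400 MB input. With a real file, pass the object returned by open() instead of the StringIO sample.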