I have one large text file that consists of concatenated XML files (I will call each of them an 'XML subfile').

I know that each new XML section starts when I come to the string

<?xml version = "1.0"?>

The goal is to parse each of the XML subfiles, but as a first step I need to separate them.

My idea is to split the text file into separate XML files that I can then parse. (Other ideas are welcome.)

How can I "loop" through the text file and split it up? I cannot read the file as a whole because it is too large, and I cannot loop over the lines because the file is technically one line (it contains no newlines).

Any idea how to solve this in Python 3?

PS: Looks like this was a similar question, but the link is dead:

Link to other post

  • Do you mean that your subfiles are separated by actual double double quotes (two " symbols)? Or by empty line? Commented Jun 14, 2018 at 12:36
  • No, it is all in one line and the quotes are not indicative of a new sub file. The new sub file starts whenever I find the literal <?xml...> Commented Jun 14, 2018 at 14:29
  • Do you plan on getting back to your question at some point? Commented Jun 19, 2018 at 10:47
  • Works like charm thx Commented Jun 19, 2018 at 18:55

1 Answer

Assuming the input file is too large to load into memory in full, it makes sense to stream it.

The ideal tool is a generator that breaks the stream of incoming lines from the file into chunks at certain points, i.e. whenever a line is equal to your "splitting" line.

This can be generalized as a function that can split any iterable into groups. itertools.groupby lends itself to the task: all we need to do is increment an index when we hit the "split here" value, and use that index as the group key:

from itertools import groupby

def split_chunks(values, split_val):
    '''splits a list of values into chunks at a certain value'''

    index = 0
    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

Test - let's split a list of numbers into chunks at 0:

for i, numbers in split_chunks([0, 1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9], 0):
    print(list(numbers))

prints

[0, 1, 2, 3]
[0, 4, 5, 6]
[0, 7, 8, 9]

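Each chunk starts with the split value itself, and groupby never produces an empty group, so there is no empty chunk before the first 0 in the input. This differs from splitting a string: 'abcabc'.split('a') drops the separator and returns an empty string as its first element.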

So this works. Using it with "lines in a large text file" instead of "numbers" is simple:

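import xml.etree.ElementTree as ET

# Lines read from a file keep their trailing newline, so the split line must include it.
# It also has to match the declaration exactly as it appears in the file.
XML_DECLARATION = '<?xml version="1.0"?>\n'

with open('large_container_file', 'r', encoding='utf8') as container_file:
    for doc_num, doc in split_chunks(container_file, XML_DECLARATION):
        print(f'processing sub-document #{doc_num}')
        tree = ET.fromstringlist(doc)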

Make sure you open the container file with the correct encoding.

Since generators only do work when you advance the iteration, reading of large_container_file pauses while you process the current tree, so memory usage should stay fairly low regardless of the input file size (roughly one sub-document at a time).
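
The lazy behaviour described above can be observed with a small sketch. It repeats split_chunks so that it runs on its own; noisy and log are hypothetical helpers added only to record when each value is pulled from the source.

```python
from itertools import groupby

def split_chunks(values, split_val):
    '''splits an iterable of values into chunks at a certain value'''
    index = 0
    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index
    return groupby(values, chunk_index)

def noisy(values, log):
    '''hypothetical helper: yields values and records each read in log'''
    for value in values:
        log.append(f'read {value}')
        yield value

log = []
for key, chunk in split_chunks(noisy([0, 1, 0, 2], log), 0):
    log.append(f'start chunk {key}')
    items = list(chunk)  # consuming the chunk is what pulls values from the source
    log.append(f'chunk {key} = {items}')

print('\n'.join(log))
```

The log shows that the source is only read one value ahead of the chunk being consumed: the 0 that opens the second chunk is read while the first chunk is being finished, and nothing after it is touched until the second chunk is iterated.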


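doc is a lazy iterator in this scenario (a groupby group, not a list), which is good because it is very memory-efficient. But in contrast to a list, you can't easily check its contents up front. With the groupby-based split_chunks shown here a chunk always holds at least one line, but the first chunk will not be valid XML if anything (even a blank line) precedes the first '<?xml version="1.0"?>' line.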

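ET.fromstringlist() is happy with lazy iterators, but it raises ET.ParseError when the iterator yields nothing. It raises the same exception when there is an error in the XML, so what I would do is add a try:

try:
    tree = ET.fromstringlist(doc)
except ET.ParseError as err:
    # raised for an empty chunk as well as for malformed XML
    print(f'skipping sub-document #{doc_num}: {err}')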

Alternatively you can call list() up-front and then check if there are any lines:

lines = list(doc)
if lines:
    tree = ET.fromstringlist(lines)
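
The question states that the container file has no newlines at all, in which case iterating over lines would return the whole file as a single line. A minimal sketch for that case, assuming the declaration string only ever appears at the start of a sub-document, reads fixed-size chunks and cuts the buffer at each occurrence of the declaration. iter_documents, chunk_size, and the sample data are hypothetical names chosen for illustration, not part of any library.

```python
import io
import xml.etree.ElementTree as ET

def iter_documents(stream, delimiter, chunk_size=64 * 1024):
    '''Yield pieces of a text stream, cutting before each occurrence of delimiter.'''
    buffer = ''
    search_from = 1  # index 0 is skipped: it may hold the delimiter that opens this piece
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        pos = buffer.find(delimiter, search_from)
        while pos != -1:
            piece, buffer = buffer[:pos], buffer[pos:]
            if piece.strip():
                yield piece
            pos = buffer.find(delimiter, 1)
        # Resume near the tail next time, in case a delimiter straddles two chunks.
        search_from = max(1, len(buffer) - len(delimiter) + 1)
    if buffer.strip():
        yield buffer

# Demo on an in-memory stream; the tiny chunk_size forces delimiters to straddle chunks.
declaration = '<?xml version = "1.0"?>'
sample = declaration + '<a>1</a>' + declaration + '<b><c/></b>'
docs = list(iter_documents(io.StringIO(sample), declaration, chunk_size=5))
tags = [ET.fromstring(doc).tag for doc in docs]
print(tags)
```

With a real file, the same generator could be fed from open('large_container_file', encoding='utf8'), passing the declaration exactly as it appears in the file. Memory use is then bounded by the largest sub-document plus one chunk.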