Assuming the input file is large and you don't want to load it into memory in full, it makes sense to stream it.
The ideal tool is a generator that breaks the stream of incoming lines into chunks at certain points, i.e. whenever a line is equal to your "splitting" line.
This can be generalized as a function that splits any iterable into groups. itertools.groupby lends itself to the task: all we need to do is increment an index whenever we hit the "split here" value, and use that index as the group key:
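from itertools import groupby

def split_chunks(values, split_val):
    '''Splits an iterable of values into chunks, starting a new chunk at each split_val.'''
    index = 0

    def chunk_index(val):
        # groupby calls this once per value, in order, so the running
        # counter can double as the group key.
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)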
Test - let's split a list of numbers into chunks at 0:
for i, numbers in split_chunks([0, 1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9], 0):
    print(list(numbers))
prints
[0, 1, 2, 3]
[0, 4, 5, 6]
[0, 7, 8, 9]
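Each chunk starts with the split value itself. Note that no empty chunk is produced for the "nothing" before the first 0: groupby only emits a group once it has seen a value for it. This differs from splitting a string, where 'abcabc'.split('a') returns a leading empty string.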
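Two details of this approach can be checked with a small sketch (split_chunks is repeated so the snippet runs on its own; the variable names are made up for illustration). Values that come before the first split value end up in a chunk with index 0, and each chunk has to be consumed before moving on to the next one, because all chunks share the same underlying iterator.

```python
from itertools import groupby

def split_chunks(values, split_val):
    index = 0

    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

# Values before the first split value form their own chunk with index 0.
leading = [(i, list(chunk)) for i, chunk in split_chunks([8, 9, 0, 1, 2, 0, 3], 0)]
print(leading)  # [(0, [8, 9]), (1, [0, 1, 2]), (2, [0, 3])]

# Pitfall: storing the lazy chunks and reading them later loses data,
# because advancing groupby invalidates the previous chunk.
stale = [chunk for _, chunk in split_chunks([0, 1, 0, 2], 0)]
stale_contents = [list(chunk) for chunk in stale]
print(stale_contents)  # not [[0, 1], [0, 2]]
```

If the chunks are needed later, converting each one with list() inside the loop avoids the pitfall, at the cost of holding that chunk in memory.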
So this works. Using it with "lines in a large text file" instead of "numbers" is simple:
import xml.etree.ElementTree as ET

with open('large_container_file', 'r', encoding='utf8') as container_file:
    # Lines read from a file keep their trailing newline, so the split value needs one too.
    for doc_num, doc in split_chunks(container_file, '<?xml version="1.0"?>\n'):
        print(f'processing sub-document #{doc_num}')
        tree = ET.fromstringlist(doc)
Make sure you open the container file with the correct encoding.
Since generators only do work when you advance the iteration, reading of the large_container_file stops while you process the current tree, so memory usage should be fairly low independently of the input file size.
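To try this without a real file, here is a minimal sketch that uses io.StringIO as a stand-in for the container file. The sample content and the names sample and roots are assumptions for illustration only. Lines read from a file-like object keep their trailing newline, so the split value in this sketch includes '\n'.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

def split_chunks(values, split_val):
    index = 0

    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

# Hypothetical container content: two XML documents concatenated in one stream.
sample = (
    '<?xml version="1.0"?>\n'
    '<doc><item>a</item></doc>\n'
    '<?xml version="1.0"?>\n'
    '<doc><item>b</item><item>c</item></doc>\n'
)

roots = []
with io.StringIO(sample) as container_file:
    for doc_num, doc in split_chunks(container_file, '<?xml version="1.0"?>\n'):
        # Each chunk is parsed while the rest of the stream is still unread.
        root = ET.fromstringlist(doc)
        roots.append((doc_num, root.tag, [item.text for item in root]))

print(roots)  # [(1, 'doc', ['a']), (2, 'doc', ['b', 'c'])]
```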
doc is a lazy iterator in this scenario (one of groupby's group objects), which is good, because it is very memory-efficient. But in contrast to a list, you can't easily find out what it contains before you consume it. groupby never yields an empty chunk, but anything that precedes the first '<?xml version="1.0"?>' line in the file, such as blank lines, ends up in a chunk of its own, and that chunk contains no XML.
ET.fromstringlist() is happy with lazy iterators, but it will throw when the chunk contains no XML element. It will also throw when there is an error in the XML, so what I would do is add a try:
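try:
    tree = ET.fromstringlist(doc)
except ET.ParseError:
    # Raised for chunks without an XML element and for malformed XML;
    # skip the chunk (this sits inside the for loop above).
    continue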
Alternatively you can call list() up-front and then check if there are any non-blank lines:
lines = list(doc)
if any(line.strip() for line in lines):
    tree = ET.fromstringlist(lines)
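Putting the pieces together, one possible shape is a small generator that yields only the sub-documents that parse and skips the rest. This is a sketch under assumptions: iter_documents, its declaration parameter and the sample data are hypothetical names, not part of any library.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

def split_chunks(values, split_val):
    index = 0

    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

def iter_documents(lines, declaration='<?xml version="1.0"?>\n'):
    """Yield (chunk index, root element) for every chunk that parses as XML."""
    for doc_num, doc in split_chunks(lines, declaration):
        try:
            root = ET.fromstringlist(doc)
        except ET.ParseError:
            # No XML element in the chunk, or malformed XML: skip it.
            continue
        yield doc_num, root

# Hypothetical input: a blank line up front, then one good, one broken, one good document.
sample = io.StringIO(
    '\n'
    '<?xml version="1.0"?>\n<a/>\n'
    '<?xml version="1.0"?>\n<b><unclosed></b>\n'
    '<?xml version="1.0"?>\n<c/>\n'
)

parsed = [(num, root.tag) for num, root in iter_documents(sample)]
print(parsed)  # [(1, 'a'), (3, 'c')]
```

Catching only ET.ParseError, rather than using a bare except, keeps unrelated problems such as I/O or decoding errors visible.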