
EDIT: For anyone coming to this in the future: the solution I used was to switch to cElementTree. Not only does it run with less memory, it is significantly faster.

This works on files up to about 600 MB in size; larger than that and I run out of memory (I have a 16 GB machine). What can I do to read the file in pieces, or read in a certain percentage of the XML at a time, or is there a less memory-intensive approach?

import csv
import xml.etree.ElementTree as ET
import time
import sys

def main(argv):
    start_time = time.time()

    #file_name = 'sample.xml'
    file_name = argv
    root = ET.ElementTree(file=file_name).getroot()
    csv_file_name = '.'.join(file_name.split('.')[:-1]) + ".txt"
    print '\n'
    print 'Output file:'
    print csv_file_name

    with open(csv_file_name, 'w') as file_:
        writer = csv.writer(file_, delimiter="\t")
        header = [ <the names of the tags here> ]
        writer.writerow(header)
        tags = [
            <bunch of xml tags here>
        ]

        #write the values
        for index in range(3, len(root)):
            row = []
            for tag in tags:
                searchQuery = "tags" + tag
                found = root[index].find(searchQuery)
                if found is None or found.text is None:
                    row.append("")
                else:
                    row.append(found.text)
            writer.writerow(row)

        print '\nNumber of elements is: %s' % len(root)

    print '\nTotal run time: %s seconds' % (time.time() - start_time)

if __name__ == "__main__":
    main(sys.argv[1])
  • Have you tried cElementTree (the C implementation)? Just replace your ET import statement with: import xml.etree.cElementTree as ET Commented Jun 9, 2014 at 18:53
  • Such a simple fix; this seems to use a dramatically smaller amount of memory. Please post this as an answer so I can accept it. Commented Jun 9, 2014 at 19:25
  • This doesn't answer the question, which was how to read the XML data in chunks rather than loading the full file into memory. That said, it's good to know the C implementation is also more efficient in terms of memory consumption. Commented Jun 9, 2014 at 19:30
  • While it doesn't answer the question, it clearly solves the problem I was having. Commented Jun 9, 2014 at 19:32
  • Happy to know your problem is solved. Commented Jun 9, 2014 at 19:34

3 Answers


A few hints:

  • use lxml, it is very performant
  • use iterparse, which can process your document piece by piece

However, iterparse can surprise you, and you may still end up with high memory consumption. To avoid that, you have to clear references to already-processed elements, as described in my favourite article about effective lxml usage.

Sample script fastiterparse.py using optimized iterparse:

Install docopt and lxml

$ pip install lxml docopt

Write the script:

"""For all elements with given tag prints value of selected attribute
Usage:
    fastiterparse.py <xmlfile> <tag> <attname>
    fastiterparse.py -h
"""
from lxml import etree
from functools import partial

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def printattname(elem, attname):
    print elem.attrib[attname]

def main(fname, tag, attname):

    fun = partial(printattname, attname=attname)
    with open(fname) as f:
        context = etree.iterparse(f, events=("end",), tag=tag)
        fast_iter(context, fun)

if __name__ == "__main__":
    from docopt import docopt
    args = docopt(__doc__)
    main(args["<xmlfile>"], args["<tag>"], args["<attname>"])

Try to call it:

$ python fastiterparse.py                                               
Usage:
    fastiterparse.py <xmlfile> <tag> <attname>
    fastiterparse.py -h

Use it (on your file):

$ python fastiterparse.py large.xml ElaboratedRecord id
rec26872
rec25887
rec26873
rec26874

Conclusion (use the fast_iter approach)

The main takeaway is the fast_iter function (or at least remembering to clear unused elements, delete them, and finally delete the context).

Measurements may show that in some cases the script runs a bit slower than without the clear and del, but the difference is not significant. The advantage comes when memory is the limitation: once an unoptimized version starts swapping, the optimized version becomes faster, and if you run out of memory entirely, there are not many other options.



Use cElementTree instead of ElementTree.

Replace your ET import statement with: import xml.etree.cElementTree as ET
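A caveat for later Python versions: the separate cElementTree import was deprecated in Python 3.3 and removed in 3.9, where plain ElementTree uses the C accelerator automatically. A guarded import keeps code working across versions; a minimal sketch, with made-up sample XML for illustration:

```python
# Prefer the C implementation where it exists as a separate module;
# fall back to ElementTree (which is itself C-accelerated on modern Pythons).
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Tiny made-up document just to show the API is unchanged by the swap.
root = ET.fromstring("<root><child>hello</child></root>")
text = root.find("child").text
```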



Use ElementTree.iterparse to parse your XML data incrementally. See the documentation for details.
