
EDIT: For anyone coming to this in the future: the solution I used was to switch to cElementTree. Not only does it run with less memory, it is significantly faster.

This works on files up to about 600 MB in size; larger than that and I run out of memory (I have a 16 GB machine). What can I do to read the file in pieces, or read in a certain percentage of the XML at a time, or is there a less memory-intensive approach?

import csv
import xml.etree.ElementTree as ET
import time
import sys

def main(argv):
    start_time = time.time()

    #file_name = 'sample.xml'
    file_name = argv
    root = ET.ElementTree(file=file_name).getroot()
    csv_file_name = '.'.join(file_name.split('.')[:-1]) + ".txt"
    print '\n'
    print 'Output file:'
    print csv_file_name

    with open(csv_file_name, 'w') as file_:
        writer = csv.writer(file_, delimiter="\t")
        header = [ <the names of the tags here> ]
        writer.writerow(header)
        tags = [
            <bunch of xml tags here>
        ]

        #write the values
        for index in range(3, len(root)):
            row = []
            for tag in tags:
                searchQuery = "tags" + tag
                found = root[index].find(searchQuery)
                if found is None or found.text is None:
                    row.append("")
                else:
                    row.append(found.text)
            writer.writerow(row)

        print '\nNumber of elements is: %s' % len(root)

    print '\nTotal run time: %s seconds' % (time.time() - start_time)

if __name__ == "__main__":
    main(sys.argv[1])
  • Have you tried cElementTree (the C implementation)? Just replace your ET import statement with: import xml.etree.cElementTree as ET Commented Jun 9, 2014 at 18:53
  • Such a simple fix; this seems to use a dramatically smaller amount of memory. Please post this as an answer so I can accept it. Commented Jun 9, 2014 at 19:25
  • This doesn't answer the question, which was how to read the XML data in chunks rather than loading the full file into memory. That said, it's good to know the C implementation is also more efficient in terms of memory consumption. Commented Jun 9, 2014 at 19:30
  • While it doesn't answer the question, it clearly solves the problem I was having. Commented Jun 9, 2014 at 19:32
  • Happy to know your problem is solved. Commented Jun 9, 2014 at 19:34

3 Answers


A few hints:

  • use lxml, it is very performant
  • use iterparse, which can process your document piece by piece

However, iterparse can surprise you, and you may still end up with high memory consumption. To avoid that, you have to clear references to already-processed elements, as described in my favourite article about effective lxml usage.

Sample script fastiterparse.py using optimized iterparse:

Install docopt and lxml

$ pip install lxml docopt

Write the script:

"""For all elements with given tag prints value of selected attribute
Usage:
    fastiterparse.py <xmlfile> <tag> <attname>
    fastiterparse.py -h
"""
from lxml import etree
from functools import partial

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def printattname(elem, attname):
    print elem.attrib[attname]

def main(fname, tag, attname):

    fun = partial(printattname, attname=attname)
    with open(fname) as f:
        context = etree.iterparse(f, events=("end",), tag=tag)
        fast_iter(context, fun)

if __name__ == "__main__":
    from docopt import docopt
    args = docopt(__doc__)
    main(args["<xmlfile>"], args["<tag>"], args["<attname>"])

Try to call it:

$ python fastiterparse.py                                               
Usage:
    fastiterparse.py <xmlfile> <tag> <attname>
    fastiterparse.py -h

Use it (on your file):

$ python fastiterparse.py large.xml ElaboratedRecord id
rec26872
rec25887
rec26873
rec26874

Conclusion (use the fast_iter approach)

The main takeaway is the fast_iter function (or at least remembering to clear unused elements, delete them, and finally delete the context).

Measurements may show that in some cases the script runs a bit slower than without the clear and del, but the difference is not significant. The advantage comes when memory is the limitation: once an unoptimized version starts swapping, the optimized version becomes faster, and if you run out of memory entirely, there are not many other options.



Use cElementTree instead of ElementTree.

Replace your ET import statement with: import xml.etree.cElementTree as ET
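A caveat for later Python versions: the separate cElementTree import was deprecated in Python 3.3 and removed in 3.9, where plain ElementTree uses the C accelerator automatically. A guarded import keeps code working across versions; a minimal sketch, with made-up sample XML for illustration:

```python
# Prefer the C implementation where it exists as a separate module;
# fall back to ElementTree (which is itself C-accelerated on modern Pythons).
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Tiny made-up document just to show the API is unchanged by the swap.
root = ET.fromstring("<root><child>hello</child></root>")
text = root.find("child").text
```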



Use ElementTree.iterparse to parse your XML data incrementally. See the documentation for details.
