What is a good XML stream parser for Python? [closed]

Question

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

Closed 6 years ago.

Are there any XML parsers for Python that can parse file streams? My XML files are too big to fit in memory, so I need to parse the stream.

Ideally I wouldn't have to have root access to install things, so lxml is not a very good option.

I have been using xml.etree.ElementTree but I am convinced it is broken.

klactose · Accepted Answer · 2019-11-14 23:23:24Z

23

Here's a good answer about using xml.etree.ElementTree.iterparse on huge XML files. lxml has the same method. The key to stream parsing with iterparse is to manually clear and remove nodes that have already been processed; otherwise you will eventually run out of memory.
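A minimal sketch of that clearing pattern, assuming entries shaped like the sample further down in this answer. The iter_entries helper and the inline sample data are illustrative and not taken from the linked answer:

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield one dict per <entry>, discarding each element once processed."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {
                "a": elem.findtext("a"),
                "b": elem.find("b").get("foo"),
            }
            # Drop the children already attached to the root, so the
            # tree does not keep growing as more entries are parsed.
            root.clear()


# Illustrative in-memory stream; a real file object opened in binary
# mode can be passed the same way.
sample = io.BytesIO(
    b"<root>"
    b"<entry><a>value 0</a><b foo='bar' /></entry>"
    b"<entry><a>value 1</a><b foo='baz' /></entry>"
    b"</root>"
)
entries = list(iter_entries(sample))
```

Because the function is a generator, the caller can process entries one at a time instead of collecting them into a list as the demo line does.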

Another option is xml.sax. I find the official manual too formal and short on examples, so it needs some clarification in the context of this question. The default parser module, xml.sax.expatreader, implements the incremental parsing interface xml.sax.xmlreader.IncrementalParser. That is to say, xml.sax.make_parser() provides a suitable stream parser.

For instance, take an XML stream like:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <entry><a>value 0</a><b foo='bar' /></entry>
  <entry><a>value 1</a><b foo='baz' /></entry>
  <entry><a>value 2</a><b foo='quz' /></entry>
  ...
</root>

It can be handled in the following way.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.sax


class StreamHandler(xml.sax.handler.ContentHandler):

  lastEntry = None
  lastName  = None


  def startElement(self, name, attrs):
    self.lastName = name
    if name == 'entry':
      self.lastEntry = {}
    elif name != 'root':
      self.lastEntry[name] = {'attrs': attrs, 'content': ''}

  def endElement(self, name):
    if name == 'entry':
      print({
        'a' : self.lastEntry['a']['content'],
        'b' : self.lastEntry['b']['attrs'].getValue('foo')
      })
      self.lastEntry = None
    elif name == 'root':
      raise StopIteration

  def characters(self, content):
    if self.lastEntry:
      self.lastEntry[self.lastName]['content'] += content


if __name__ == '__main__':
  # use default ``xml.sax.expatreader``
  parser = xml.sax.make_parser()
  parser.setContentHandler(StreamHandler())
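  # feed the parser with small chunks to simulate a stream
  with open('data.xml') as f:
    while True:
      buffer = f.read(16)
      if not buffer:
        break  # end of file reached without seeing </root>
      try:
        parser.feed(buffer)
      except StopIteration:
        break  # the handler signalled that </root> was reached
  # if you can provide a file-like object it's as simple as
  with open('data.xml') as f:
    try:
      parser.parse(f)
    except StopIteration:
      pass  # raised by the handler at </root>, as above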

edited Nov 14, 2019 at 23:23

klactose

answered Mar 19, 2014 at 11:37

saaj

3 Comments

oHo Over a year ago

Thank you Saaj. I have finally found an answer to my own question thanks to your answer. See my more elaborate answer: stackoverflow.com/a/44398623/938111

user128511 Over a year ago

Dumb question but what is the time.sleep(2) for?

saaj Over a year ago

@gman It's a good question, in fact. It's hard to remember what the intention behind the else branch was. It was probably related to experimenting with simulating slow input. But I copy-pasted the snippet and ran it with raise RuntimeError in place of the time.sleep call. It ran successfully, so it's a dead branch. Removed it.

Petr Viktorin · Accepted Answer · 2011-10-07 22:45:16Z

12

Are you looking for xml.sax? It's right in the standard library.
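As a rough illustration of how little setup it needs, here is a minimal sketch using xml.sax.parse with a file-like object. The EntryCounter handler and the inline sample document are hypothetical, made up for this example:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class EntryCounter(ContentHandler):
    """Hypothetical handler: counts <entry> elements without storing them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "entry":
            self.count += 1


handler = EntryCounter()
# Illustrative in-memory stream; an open file object works the same way.
stream = io.BytesIO(b"<root><entry/><entry/><entry/></root>")
xml.sax.parse(stream, handler)
```

The handler only sees events as they arrive, so nothing accumulates in memory beyond what the handler itself chooses to keep.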

answered Oct 7, 2011 at 22:45

Petr Viktorin

John Machin · Accepted Answer · 2011-10-08 00:39:43Z

1

Use xml.etree.cElementTree. It's much faster than xml.etree.ElementTree. Neither of them is broken. Your files are broken (see my answer to your other question).

answered Oct 8, 2011 at 0:39

John Machin

6 Comments

Aillyn Over a year ago

Indeed, it is much faster. And yes, my files were broken.

mcepl Over a year ago

Guy was asking about streaming parser.

John Machin Over a year ago

@mcepl: Guy wanted to parse huge files; guy can do that with iterparse(). What/where is your answer?

mcepl Over a year ago

Isn't iterparse() building the tree as well? (“Note that iterparse still builds a tree, just like parser.” effbot.org/zone/element-iterparse.htm) And my answer was bumping the one by Petr Viktorin.

Marco Over a year ago

Just FYI: in 2019, cElementTree is simply an alias for ElementTree.
