Parsing Large XML file with Python lxml and Iterparse

Question

I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    print(elem.xpath('desc/text()'))
    elem.clear()
    # drop already-processed siblings so memory use stays flat
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context

When I run it, I get something similar to:

[]
['description1']
[]
['description2']

The empty lists appear because the parser also matches the item tags that are children of the url tag, and those obviously have no description field for xpath to extract. My hope was to parse out the items one by one and then process the child fields as required. I'm still learning the lxml library, so I'm curious whether there is a way to pull out the main items while leaving any nested items alone when they are encountered.

Nicolae Dascalu · Accepted Answer · 2011-08-25 00:28:23Z

4

The entire XML is parsed anyway by the core implementation. etree.iterparse is just a generator-style view that provides simple filtering by tag name (see the docstring: http://lxml.de/api/lxml.etree.iterparse-class.html). If you want more complex filtering, you have to do it yourself.

One solution is to register for the start event as well:

etree.iterparse(MYFILE, events=("start", "end"), tag="item")

and keep a boolean so you know whether an end event closes a top-level "item" or a nested "item/url/item".
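
A minimal sketch of that idea follows. It is not taken from the answer itself: it uses a depth counter rather than a single boolean (an assumption made here because it also handles deeper nesting), the helper name iter_top_level_items is hypothetical, and the question's fragment is wrapped in an assumed <root> element so that it is a well-formed document.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample: the question's fragment inside an assumed <root> wrapper.
SAMPLE = b"""<root>
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>
</root>"""


def iter_top_level_items(source, tag="item"):
    """Yield only the outermost <tag> elements, skipping nested ones."""
    depth = 0  # number of <tag> elements currently open
    for event, elem in etree.iterparse(source, events=("start", "end"), tag=tag):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth > 0:
            # End of a nested item: leave it in place for its parent to use.
            continue
        yield elem
        # The caller is done with this item; release it and earlier siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]


descriptions = [
    item.findtext("desc") for item in iter_top_level_items(BytesIO(SAMPLE))
]
print(descriptions)
```

Because nested items are never cleared before their parent ends, a lookup such as item.findtext("url/item") should still work inside the loop. Each element is only valid until the generator advances, so pull out the values you need before moving on to the next item.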
