2

I'm attempting to write a parser using lxml and the iterparse method to step through a very large xml file containing many items.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse( MYFILE, tag='item' )

for event, elem in context :
      print elem.xpath( 'description/text( )' )
      elem.clear( )
      while elem.getprevious( ) is not None :
            del elem.getparent( )[0]

del context

When I run it, I get something similar to:

[]
['description1']
[]
['description2']

The blank sets are because it also pulls out the item tags that are children to the url tag, and they obviously have no description field to extract with xpath. My hope was to parse out each of the items 1 by 1 and then process the child fields as required. I'm sorta just learning the lxml libarary, so I'm curious if there is a way to pull out the main items while leaving any sub items alone if encountered?

1 Answer 1

4

The entire xml is parsed anyway by the core implementation. The etree.iterparse is just a view in generator style, that provides a simple filtering by tag name (see docstring http://lxml.de/api/lxml.etree.iterparse-class.html). If you want a complex filtering you should do by it's own.

A solution: registering for start event also:

iterparse(self, source, events=("start", "end",), tag="item")

and have a bool to know when you are at the "item" end, when you are the "item/url/item" end.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.