Python LXML iterparse with nested elements

Question

I would like to retrieve the content of a specific element in an XML file. However, that element contains other XML elements, which prevents the parent element's content from being extracted properly. An example:

from io import BytesIO
from lxml import etree

# iterparse reads bytes, so the sample document is a bytes literal
xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
  print(element.text)

which results in:

a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
None

However, the text of the outer claim (e.g., 'A protective uniform for use ..') is missing. It seems that every 'claim-text' element that contains inner elements is skipped. How should I change the parsing of the XML in order to fetch all claims?

Thanks

I've just solved it with an 'ordinary' SAX parser approach:

class SimpleXMLHandler(object):

  def __init__(self):
    self.buffer = ''
    self.claim = 0

  def start(self, tag, attributes):
    if tag == 'claim-text':
      if self.claim == 0:
        self.buffer = ''
      self.claim = 1

  def data(self, data):
    if self.claim == 1:
      self.buffer += data

  def end(self, tag):
    if tag == 'claim-text':
      print(self.buffer)
      self.claim = 0

  def close(self):
    pass
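
For reference, a handler like this is attached to a parser through the target parser interface. The sketch below is a variant, not the exact code above: it assumes a depth counter in place of the 0/1 flag, because with a flag the text that follows a nested closing claim-text tag appears not to be collected for the outer claim. The class name ClaimTextCollector and the shortened sample document are made up for illustration, and the standard-library ElementTree (which offers the same target interface) is used only as a fallback when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback only: the stdlib parser accepts the same target interface.
    import xml.etree.ElementTree as etree

# Shortened version of the sample document from the question (assumption).
XML = (b"<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform: "
       b"<claim-text>a. an upper body garment</claim-text> "
       b"<claim-text>b. a plurality of panels</claim-text></claim-text></test>")


class ClaimTextCollector(object):
    """Parser target that gathers the full text of each outermost <claim-text>."""

    def __init__(self):
        self.depth = 0    # number of <claim-text> elements currently open
        self.parts = []   # text fragments of the claim being collected
        self.claims = []  # finished claims, in document order

    def start(self, tag, attrib):
        if tag == 'claim-text':
            if self.depth == 0:
                self.parts = []  # a new outermost claim begins
            self.depth += 1

    def data(self, text):
        # Text may arrive in several chunks, so fragments are collected.
        if self.depth > 0:
            self.parts.append(text)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                self.claims.append(''.join(self.parts))

    def close(self):
        # The return value of close() is what the parse call returns.
        return self.claims


parser = etree.XMLParser(target=ClaimTextCollector())
claims = etree.fromstring(XML, parser)
for claim in claims:
    print(claim)
```

Here each outermost claim is emitted once, including the text of its nested sub-claims. Whether nested claims should also be reported separately depends on what the extraction is for.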

jsw · Accepted Answer · 2011-04-21 01:00:28Z

You could use an XPath expression to find and concatenate all the text nodes directly under each <claim-text> node, like this:

from io import BytesIO
from lxml import etree
xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('start',), tag='claim-text')
for event, element in context:
  print(''.join(element.xpath('text()')))

which outputs:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising:  
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
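
Why element.text returns None for the outer claim: in the ElementTree data model, .text holds only the text before the first child, and any text after a child is stored in that child's .tail. The sketch below rebuilds the "direct text" by hand from .text and the children's .tail values, and also shows itertext() for the case where the nested text (including the "2" inside <b>) is wanted. It is a minimal illustration under a few assumptions: the helper name direct_text and the shortened sample are made up, the tag is filtered manually instead of via the tag= argument so the stdlib fallback also works, and it listens for 'end' events because lxml only guarantees that an element's text, tail, and children are complete at that point. As a result the inner claims come out before the outer one.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Fallback only: stdlib iterparse supports the calls used below.
    import xml.etree.ElementTree as etree

# Shortened version of the sample document from the question (assumption).
XML = (b"<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform: "
       b"<claim-text>a. an upper body garment</claim-text> "
       b"<claim-text>b. a plurality of panels</claim-text></claim-text></test>")


def direct_text(element):
    """Text directly under `element`: its .text plus the .tail of each child."""
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    return ''.join(parts)


direct = []  # text directly under each claim (what the XPath text() selects)
full = []    # text of each claim including all nested elements

for event, element in etree.iterparse(BytesIO(XML), events=('end',)):
    if element.tag == 'claim-text':
        direct.append(direct_text(element))
        full.append(''.join(element.itertext()))

for line in full:
    print(line)
```

The last entry of full is the complete outer claim, while the last entry of direct contains only the fragments that sit between its child elements.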
