I would like to retrieve the content of a specific element in an XML file. However, that element contains other XML elements, which prevents the parent tag's content from being extracted properly. An example:

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
  # element.text is only the text that precedes the first child element
  print(element.text)

which results in:

a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
None

However, text such as 'A protective uniform for use ..' is missing. It seems that every 'claim-text' element that contains other elements is skipped. How should I change the parsing of the XML in order to fetch all claims?

Thanks

I've just solved it with an 'ordinary' SAX parser approach:

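class SimpleXMLHandler(object):
  """Parser target that buffers the text seen inside <claim-text> elements."""

  def __init__(self):
    self.buffer = ''
    self.claim = 0  # 1 while inside a <claim-text>, otherwise 0

  def start(self, tag, attributes):
    if tag == 'claim-text':
      # Reset the buffer only when no claim is currently open
      if self.claim == 0:
        self.buffer = ''
      self.claim = 1

  def data(self, data):
    # Collect character data, including text of nested elements such as <b>
    if self.claim == 1:
      self.buffer += data

  def end(self, tag):
    if tag == 'claim-text':
      print(self.buffer)
      self.claim = 0  # note: a nested </claim-text> clears the flag as well

  def close(self):
    pass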
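
For reference, a handler of this kind is a parser "target": the parser calls its start, data, end and close methods as it reads the input. The following is only a minimal sketch of how such a target can be wired up, not the code from the post. It assumes the standard library's xml.etree.ElementTree.XMLParser in place of lxml, a shortened version of the claim XML, and a hypothetical ClaimCollector class that counts nesting depth so that each top-level claim is collected once, nested text included.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the claim XML (an assumption for illustration)
XML = (
    "<test><claim-text><b>2</b>. A uniform comprising: "
    "<claim-text>a. an upper garment</claim-text> "
    "<claim-text>b. a plurality of panels</claim-text>"
    "</claim-text></test>"
)


class ClaimCollector(object):
    """Hypothetical parser target: gathers the full text of each top-level claim."""

    def __init__(self):
        self.depth = 0    # number of <claim-text> elements currently open
        self.parts = []   # text fragments of the claim being read
        self.claims = []  # finished top-level claims

    def start(self, tag, attrib):
        if tag == 'claim-text':
            self.depth += 1

    def data(self, text):
        # Keep every piece of character data seen inside a claim
        if self.depth > 0:
            self.parts.append(text)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            # Only the outermost closing tag finishes a claim
            if self.depth == 0:
                self.claims.append(''.join(self.parts))
                self.parts = []

    def close(self):
        return self.claims


parser = ET.XMLParser(target=ClaimCollector())
parser.feed(XML)
claims = parser.close()  # returns whatever the target's close() returns
print(claims)
```

Counting depth instead of using a 0/1 flag is a design choice made here so that the closing tag of a nested claim does not end collection for the enclosing claim.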

1 Answer

You could use an XPath expression to find and concatenate all the text nodes directly under each <claim-text> node, like this:

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Caveat: lxml only guarantees that an element's text and children are
# complete at its 'end' event; 'start' appears to work here because this
# short document has already been read in full when the events are delivered.
context = etree.iterparse(BytesIO(xml), events=('start',), tag='claim-text')
for event, element in context:
  # text() selects only the text nodes that are direct children of the element
  print(''.join(element.xpath('text()')))

which outputs:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising:  
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
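
The leading "2" is absent from the first line because it sits inside the <b> child rather than directly under <claim-text>. As background, here is a minimal sketch of where the pieces of text live in the tree. It is not part of the answer above and makes several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (both share the text/tail model), a shortened version of the claim XML, 'end' events, and a hypothetical helper named own_text.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the claim XML (an assumption for illustration)
XML = (
    b"<test><claim-text><b>2</b>. A uniform comprising: "
    b"<claim-text>a. an upper garment</claim-text> "
    b"<claim-text>b. panels</claim-text></claim-text></test>"
)


def own_text(element):
    """Text directly under element: its .text plus the .tail of each child."""
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    return ''.join(parts)


results = []
# 'end' events fire once an element and all of its children are complete
for event, element in ET.iterparse(io.BytesIO(XML), events=('end',)):
    if element.tag == 'claim-text':
        direct = own_text(element)            # comparable to xpath('text()')
        full = ''.join(element.itertext())    # also includes nested text
        results.append((direct, full))

for direct, full in results:
    print(repr(direct), '|', repr(full))
```

With 'end' events the nested claims are reported before the enclosing one. If the complete claim is wanted, including the "2" and the nested sub-claims, itertext() is the variant to look at; own_text mirrors the direct-children-only behaviour.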