Python: Extract HTML from an XML file

Question

My XML file looks like this:

 <strings>
      <string>Bla <b>One &amp; Two</b> Foo</string>
 </strings>

I want to extract the content of each <string> while maintaining the inner tags. That is, I would like to see the following Python string: u"Bla <b>One & Two</b> Foo". Alternatively, I guess I could settle for u"Bla <b>One &amp; Two</b> Foo" and then try to replace the entities myself.

I am currently using lxml, which allows me to iterate over the nested tags, missing out on the text not inside a tag, or alternatively over all text content (itertext), losing the tag information. I'm probably missing something.

If possible I'd prefer to keep lxml, though I can switch to another library if necessary.

Robert Rossney · Accepted Answer · 2009-11-29 18:28:44Z

3

There may be a better way of conditionally handling objects returned by the xpath() function, but I'm not sufficiently conversant with lxml to know what it is, so I had to write a function to return the text value of a node. But that said, this shows a general approach to the problem:

>>> from io import StringIO
>>> from lxml import etree
>>> def node_text(n):
...     # Elements serialize to markup; text nodes raise TypeError and are returned as plain text.
...     try:
...         return etree.tostring(n, method='html', with_tail=False, encoding='unicode')
...     except TypeError:
...         return str(n)
...
>>> f = StringIO('<strings><string>This is <b>not</b> how I plan to escape.</string></strings>')
>>> x = etree.parse(f)
>>> ''.join(node_text(n) for n in x.xpath('/strings/string/node()'))
'This is <b>not</b> how I plan to escape.'
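For comparison, here is a minimal sketch of the same idea that avoids XPath: take the element's leading text, then serialize each child together with its tail. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; lxml.etree offers the same .text attribute, child iteration, and tostring(..., encoding="unicode") call. The helper name inner_markup is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_markup(elem):
    """Return the content of elem as markup, without elem's own tags."""
    # Text before the first child lives on elem.text (None if absent).
    # It is stored decoded, so re-escape it to match the serialized children.
    parts = [escape(elem.text or "")]
    for child in elem:
        # tostring() emits the child, its descendants, and its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = ET.fromstring(
    "<strings><string>Bla <b>One &amp; Two</b> Foo</string></strings>"
)
results = [inner_markup(s) for s in doc.iter("string")]
print(results)
```

This yields the escaped form (with &amp; kept), which is the fallback mentioned in the question. Decoding the entities afterwards is a separate step and needs care, since decoding &lt; would turn text into markup.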

answered Nov 29, 2009 at 18:28

Robert Rossney

1 Comment

miracle2k Over a year ago

It turns out that instead of using node(), one could also use child.iterdescendants(), but thanks for pointing me in the right direction.

cobbal · Accepted Answer · 2009-11-29 15:15:57Z

2

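Try etree.tostring, then strip the surrounding tag:

import re

outer = etree.tostring(string_elem, method='html')
# re.DOTALL lets '.' match newlines, so multi-line content is captured too
inner = re.match(r"^[^>]+>(.*)<[^<]+$", outer, re.DOTALL).group(1)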
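A self-contained sketch of this tag-stripping approach follows. It serializes with the standard library's xml.etree.ElementTree instead of lxml so it runs as-is, and it returns an empty string when the pattern does not match, which is an assumption about what a self-closing element should produce. The helper name strip_outer_tag is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def strip_outer_tag(markup):
    """Drop the first start tag and the last end tag from serialized markup."""
    # DOTALL lets '.' cross newlines, so multi-line content is kept whole.
    match = re.match(r"^[^>]+>(.*)<[^<]+$", markup, re.DOTALL)
    # No match means there was no end tag, e.g. a self-closing element.
    return match.group(1) if match else ""


root = ET.fromstring(
    "<strings><string>Bla <b>One &amp; Two</b>\nFoo</string></strings>"
)
outer = ET.tostring(root.find("string"), encoding="unicode")
inner = strip_outer_tag(outer)
print(inner)
```

The regex is only applied to the serializer's output for a single element, not to arbitrary XML, which is why it is safer here than regex-based parsing usually is.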

edited Nov 29, 2009 at 15:15

answered Nov 29, 2009 at 7:42

cobbal

4 Comments

miracle2k Over a year ago

I know about tostring actually, but that includes the string-tag itself.

cobbal Over a year ago

wouldn't be that hard to trim out manually, a simple regex would work

Robert Rossney Over a year ago

+1 for finding one of the rare cases where using a regular expression to process XML isn't a terrible, terrible idea.

miracle2k Over a year ago

Thanks guys. I guess stripping out the surrounding tag would work well enough also.

wizzard0 · Accepted Answer · 2009-11-29 08:04:38Z

0

Regardless of the language, a relatively simple XSLT template would do the trick.

Something like defining patterns for the tags you want to keep and converting the others to text.

You can of course use a recursive function with a compliant DOM implementation (minidom maybe?) and process tags by hand.

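(sketch for xml.dom.minidom nodes; the allowed_tags whitelist and the name render are assumptions)

from xml.dom import Node

def render(node, allowed_tags=("b",)):
    # Text and CDATA nodes contribute their already-decoded character data.
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    # Render the children first, then wrap them only if this tag is kept.
    text = "".join(render(child, allowed_tags) for child in node.childNodes)
    if node.nodeType == Node.ELEMENT_NODE and node.tagName in allowed_tags:
        return "<%s>%s</%s>" % (node.tagName, text, node.tagName)
    return text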

answered Nov 29, 2009 at 8:04

wizzard0

ghostdog74 · Accepted Answer · 2009-11-29 08:39:39Z

-1

Not using a parser, just pure string manipulation:

mystring="""
 <strings>
      <string>Bla <b>One &amp; Two</b> Foo</string>
 </strings>
"""
for s in mystring.split("</string>"):
    if "<string>" in s:
        i = s.index("<string>")
        # Decode the one entity used here; print() works on Python 2 and 3
        print(s[i+len("<string>"):].replace("&amp;", "&"))

answered Nov 29, 2009 at 8:39

ghostdog74

2 Comments

Robert Rossney Over a year ago

The list of things wrong with this approach is not short: among others, it fails if there's an empty <string/> element, if any <string> element contains attributes or has whitespace in its opening or closing tag, or if any text node contains character entities or CDATA.

ghostdog74 Over a year ago

You are assuming too much! And that's a bad habit.
