Parse text of element with empty element inside

Question

I'm trying to convert an XHTML document that uses lots of tables into a semantic XML document in Python using xml.etree. However, I'm having some trouble converting this XHTML

<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>

into something like this

<lines>
  <line>Textline1</line>
  <line>Textline2</line>
  <line>Textline3</line>
</lines>

The problem is that I don't know how to get the text after the BR elements.

EnigmaCurry · Accepted Answer · 2010-06-02 23:57:54Z

1

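You need to use the .tail attribute of the <BR> elements. ElementTree stores the text that follows an element's closing tag on that element as .tail, while .text only holds the text before the first child.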

import xml.etree.ElementTree as et

doc = """<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>
"""

e = et.fromstring(doc)

items = []
for x in e.iter():  # getiterator() was deprecated and removed in Python 3.9
    if x.text is not None:
        items.append(x.text.strip())
    if x.tail is not None:
        items.append(x.tail.strip())

doc2 = et.Element("lines")
for i in items:
    l=et.SubElement(doc2, "line")
    l.text = i

print(et.tostring(doc2, encoding="unicode"))  # str rather than bytes on Python 3
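
As a variation on the same idea, here is a minimal sketch that only looks at the direct children of the cell and skips whitespace-only chunks. The helper name lines_from_cell is made up for illustration and is not part of xml.etree; the input is the TD snippet from the question.

```python
import xml.etree.ElementTree as ET


def lines_from_cell(cell):
    """Return the text chunks of `cell` that are separated by child elements.

    The text before the first child is stored in cell.text; the text that
    follows each child is stored on that child as child.tail.
    """
    chunks = [cell.text] + [child.tail for child in cell]
    # Drop missing (None) and whitespace-only chunks, trim the rest.
    return [c.strip() for c in chunks if c and c.strip()]


cell = ET.fromstring("<TD>\n  Textline1<BR/>\n  Textline2<BR/>\n  Textline3\n</TD>")

lines = ET.Element("lines")
for text in lines_from_cell(cell):
    ET.SubElement(lines, "line").text = text

print(ET.tostring(lines, encoding="unicode"))
# <lines><line>Textline1</line><line>Textline2</line><line>Textline3</line></lines>
```

Unlike a full iter() walk, this does not descend into nested elements, so text inside a child such as <B>...</B> would be ignored. Whether that is desirable depends on the tables being converted.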

edited Jun 2, 2010 at 23:57

answered Jun 2, 2010 at 18:35


3 Comments

John Machin Over a year ago

aarrgghh use if foo is not None: not if foo != None

EnigmaCurry Over a year ago

Of course you're right John, I normally would. I've just spent the last 9 hours coding Java though so I slipped :(

John Machin Over a year ago

You must have committed a really serious offence to merit such a sentence as 9 hours Java coding.

Drew Sears · Accepted Answer · 2010-06-02 18:19:51Z

0

I don't think the tags being empty is your problem. xml.etree may not expect you to have child elements and bare text nodes mixed together.

BeautifulSoup is great for parsing XML or HTML that isn't well formed:

from bs4 import BeautifulSoup, NavigableString  # BeautifulSoup 4
with open('in.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Emit one <line> per bare text node directly inside the first <td>.
texts = [node.strip() for node in soup.find('td').contents if isinstance(node, NavigableString)]
print("\n".join("<line>%s</line>" % t for t in texts))

answered Jun 2, 2010 at 18:19
