1

I'm trying to convert an XHTML document that uses lots of tables into a semantic XML document in Python using xml.etree. However, I'm having some trouble converting this XHTML

<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>

into something like this

<lines>
  <line>Textline1</line>
  <line>Textline2</line>
  <line>Textline3</line>
</lines>

The problem is that I don't know how to get the text after the BR elements.

2 Answers 2

1

You need to use the .tail property of the <br> elements.

import xml.etree.ElementTree as et

doc = """<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>
"""

e = et.fromstring(doc)

items = []
for x in e.getiterator():
    if x.text is not None:
        items.append(x.text.strip())
    if x.tail is not None:
        items.append(x.tail.strip())

doc2 = et.Element("lines")
for i in items:
    l=et.SubElement(doc2, "line")
    l.text = i

print(et.tostring(doc2))
Sign up to request clarification or add additional context in comments.

3 Comments

aarrgghh use if foo is not None: not if foo != None
Of course you're right John, I normally would. I've just spent the last 9 hours coding Java though so I slipped :(
You must have committed a really serious offence to merit such a sentence as 9 hours Java coding.
0

I don't think the tags being empty is your problem. xml.etree may not expect you to have child elements and bare text nodes mixed together.

BeautifulSoup is great for parsing XML or HTML that isn't well formatted:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(open('in.html').read())
print "\n".join(["<line>%s</line>" % node.strip() for node in soup.find('td').contents if isinstance(node, BeautifulSoup.NavigableString)])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.