I want to parse an XML file like the following:
<book attr='1'>
<page number='1'>
<text> sss </text>
<text> <b>bb<i>sss</i></b></text>
<text> <i><b>sss</b></i></text>
<text><a herf='a'> sss</a></text>
</page>
<page number='2'>
<text> sss2 </text>
<text> <b>bb<i>sss2</i></b></text>
<text> <i><b>sss2</b></i></text>
<text><a herf='a'> sss2</a></text>
</page>
.......
</book>
I want to extract all the text inside each 'text' element. However, the 'text' elements contain nested elements such as 'b', 'i' and 'a'. I have tried the following code:
import xml.etree.ElementTree as ET

tree = ET.parse('book.xml')
root = tree.getroot()
for p in root.findall('page'):
    print(p.get('number'))
    for t in p.findall('text'):
        print(t.text)
But the result is:
1
sss
None
None
None
2
sss2
None
None
None
Actually, I want to extract all the text between <text> and </text>, including the text inside nested elements, and join it into one line per 'text' element, like the following:
1
bb sss
sss
sss
sss
2
bb sss2
sss2
sss2
sss2
How can I get the text of the sub-elements inside each 'text' element? Thanks!
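
One possible approach, as a sketch rather than a definitive answer: `t.text` only holds the text that comes before the first child element, which is why it is empty whenever the content is wrapped in 'b', 'i' or 'a'. `Element.itertext()` walks the element and all of its descendants in document order, so joining its pieces gives the flattened text. The `SAMPLE` string and the `page_texts` helper below are hypothetical stand-ins for book.xml and are not from the original code; the sample assumes the real file is well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed stand-in for book.xml (one page only).
SAMPLE = """<book attr='1'>
<page number='1'>
<text> sss </text>
<text> <b>bb<i>sss</i></b></text>
<text> <i><b>sss</b></i></text>
<text><a herf='a'> sss</a></text>
</page>
</book>"""


def page_texts(root):
    """Map each page number to the flattened text of its <text> elements."""
    result = {}
    for page in root.findall('page'):
        lines = []
        for t in page.findall('text'):
            # itertext() yields the element's own text plus the text and
            # tail of every descendant, in document order.
            pieces = t.itertext()
            # Join with spaces, then collapse runs of whitespace.
            lines.append(' '.join(' '.join(pieces).split()))
        result[page.get('number')] = lines
    return result


root = ET.fromstring(SAMPLE)
for number, lines in page_texts(root).items():
    print(number)
    for line in lines:
        print(line)
```

Joining with a space matches the desired "bb sss" output, but it would also insert a space where an inline tag splits a single word; `''.join(t.itertext())` avoids that at the cost of printing "bbsss". For a file on disk, `ET.parse('book.xml').getroot()` would replace `ET.fromstring(SAMPLE)`.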