How to parse the text between the elements of an XML file in Python

Question

I want to parse an XML file like the following:

  <book attr='1'>
  <page number='1'>
   <text> sss  </text>
   <text> <b>bb<i>sss</i></b></text>
   <text> <i><b>sss</b></i></text>
   <text><a herf='a'> sss</a></text>
  </page>
  <page number='2'>
   <text> sss2  </text>
   <text> <b>bb<i>sss2</i></b></text>
   <text> <i><b>sss2</b></i></text>
   <text><a herf='a'> sss2</a></text>
  </page>
   .......
  </book>

I want to extract all the text inside each 'text' element, but the 'text' elements contain nested 'b', 'i', 'a', and other elements. I tried the following code:

import xml.etree.ElementTree as ET

tree = ET.parse('book.xml')
root = tree.getroot()
for p in root.findall('page'):
    print(p.get('number'))
    for t in p.findall('text'):
        print(t.text)

But the result is:

 1
 sss
 None
 None
 None
  2
 sss2
 None
 None
 None

Actually, I want to extract all the text between <text> and </text>, and join it into a sentence like the following:

  1
 bb sss
 sss
 sss
 sss
  2
 bb sss2
 sss2
 sss2
 sss2

How can I parse the subelements inside each 'text' element? Thanks!

Andrej Kesely · Accepted Answer · 2019-07-29 04:27:00Z

To parse the XML you can use BeautifulSoup. The text inside an element, including the text of its nested elements, can be obtained with the get_text() method:

data = '''<book attr='1'>
  <page number='1'>
   <text> sss  </text>
   <text> <b>bb<i>sss</i></b></text>
   <text> <i><b>sss</b></i></text>
   <text><a herf='a'> sss</a></text>
  </page>
  <page number='2'>
   <text> sss2  </text>
   <text> <b>bb<i>sss2</i></b></text>
   <text> <i><b>sss2</b></i></text>
   <text><a herf='a'> sss2</a></text>
  </page>
</book>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for page in soup.select('page[number]'):
    print(page['number'])
    for text in page.select('text'):
        print(text.get_text(strip=True, separator=' '))

Prints:

1
sss
bb sss
sss
sss
2
sss2
bb sss2
sss2
sss2
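
If you would rather stay with the standard library, roughly the same result can be sketched with xml.etree.ElementTree. The attribute t.text only holds the text that comes before the first child element, so nested text is missed; itertext() walks the text of all descendants instead. This is a minimal sketch, assuming the input is well-formed XML (the stray <b> tags from the sample are closed here). The helper name text_of is hypothetical, and the data is inlined rather than read from book.xml:

```python
import xml.etree.ElementTree as ET

# Sample data from the question, with the stray <b> tags closed
# so that the document is well-formed (an assumption of this sketch).
data = """<book attr='1'>
  <page number='1'>
   <text> sss  </text>
   <text> <b>bb<i>sss</i></b></text>
   <text> <i><b>sss</b></i></text>
   <text><a herf='a'> sss</a></text>
  </page>
  <page number='2'>
   <text> sss2  </text>
   <text> <b>bb<i>sss2</i></b></text>
   <text> <i><b>sss2</b></i></text>
   <text><a herf='a'> sss2</a></text>
  </page>
</book>"""


def text_of(element):
    """Join all text inside element, nested children included, with single spaces."""
    pieces = (piece.strip() for piece in element.itertext())
    return " ".join(piece for piece in pieces if piece)


root = ET.fromstring(data)  # for a file: ET.parse('book.xml').getroot()
for page in root.findall('page'):
    print(page.get('number'))
    for t in page.findall('text'):
        print(text_of(t))
```

Stripping each piece and dropping the empty ones mirrors what get_text(strip=True, separator=' ') does in the BeautifulSoup version, so whitespace-only fragments between tags do not end up in the joined line.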

2 Comments

tktktk0711 Over a year ago

Could you avoid using BeautifulSoup? I get an error when using it: FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Andrej Kesely Over a year ago

@tktktk0711 Make sure you are using the latest version of BeautifulSoup. You can replace the lxml parser with 'html.parser' or 'html5lib'.
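
The FeatureNotFound error usually means that the lxml package is not installed in that environment, since it is a separate install from BeautifulSoup. A minimal sketch of choosing the parser name at runtime, assuming a fallback to Python's built-in html.parser is acceptable; pick_parser is a hypothetical helper, not part of BeautifulSoup:

```python
import importlib.util


def pick_parser():
    """Return 'lxml' if the lxml package is importable, else the built-in 'html.parser'."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print(parser_name)

# Usage with the answer's code (requires bs4 to be installed):
# soup = BeautifulSoup(data, parser_name)
```

Different parsers can build slightly different trees from malformed markup, so it is worth checking the printed text after switching.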
