I want to parse an XML file like the following:

  <book attr='1'>
  <page number='1'>
   <text> sss  </text>
   <text> <b>bb<i>sss<b></i></b></text>
   <text> <i><b>sss</b></i></text>
   <text><a herf='a'> sss</a></text>
  </page>
  <page number='2'>
   <text> sss2  </text>
   <text> <b>bb<i>sss2</i><b></text>
   <text> <i><b>sss2</b></i></text>
   <text><a herf='a'> sss2</a></text>
  </page>
   .......
  </book>

I want to extract all the text inside each 'text' element, but there are 'b', 'i', 'a' and other elements nested inside 'text'. I tried the following code:

import xml.etree.ElementTree as ET

tree = ET.parse('book.xml')
root = tree.getroot()
for p in root.findall('page'):
    print(p.get('number'))
    for t in p.findall('text'):
        # .text only holds the text before the first child element
        print(t.text)
But the result is:

 1
 sss
 None
 None
 None
  2
 sss2
 None
 None
 None

Actually, I want to extract all the text between <text> and </text>, and join it into a sentence like the following:

  1
 bb sss
 sss
 sss
 sss
  2
 bb sss2
 sss2
 sss2
 sss2

How can I parse the subelements inside 'text'? Thanks!

1 Answer

To parse the XML you can use BeautifulSoup. The text inside an element, including the text of its nested elements, can be obtained with the get_text() method:

data = '''<book attr='1'>
  <page number='1'>
   <text> sss  </text>
   <text> <b>bb<i>sss<b></i></b></text>
   <text> <i><b>sss</b></i></text>
   <text><a herf='a'> sss</a></text>
  </page>
  <page number='2'>
   <text> sss2  </text>
   <text> <b>bb<i>sss2</i><b></text>
   <text> <i><b>sss2</b></i></text>
   <text><a herf='a'> sss2</a></text>
  </page>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for page in soup.select('page[number]'):
    print(page['number'])
    for text in page.select('text'):
        print(text.get_text(strip=True, separator=' '))

Prints:

1
sss
bb sss
sss
sss
2
sss2
bb sss2
sss2
sss2
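
If you would rather stay with the standard library, the same idea can be sketched with xml.etree.ElementTree and Element.itertext(), which walks an element and all of its descendants and yields each piece of text. The original loop printed None because t.text only holds the text that comes before the first child element. This is a minimal sketch under one important assumption: the XML must be well-formed. The sample in the question contains unclosed <b> tags, which ElementTree would reject with a ParseError, so the data below is a hypothetical cleaned-up one-page version. The helper name text_lines is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed variant of the sample (stray <b> tags removed).
data = """<book attr='1'>
  <page number='1'>
   <text> sss  </text>
   <text> <b>bb<i>sss</i></b></text>
   <text> <i><b>sss</b></i></text>
   <text><a herf='a'> sss</a></text>
  </page>
</book>"""


def text_lines(xml_string):
    """Return the page numbers and the joined text of each 'text' element."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.findall('page'):
        lines.append(page.get('number'))
        for t in page.findall('text'):
            # itertext() yields the text of t and of every nested element
            pieces = (s.strip() for s in t.itertext())
            # drop whitespace-only pieces and join the rest with spaces
            lines.append(' '.join(p for p in pieces if p))
    return lines


for line in text_lines(data):
    print(line)
```

For a file on disk, ET.parse('book.xml').getroot() could replace ET.fromstring(...). If the real file is as loosely tagged as the sample, a lenient parser such as the one BeautifulSoup uses above is probably the safer choice.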

2 Comments

Could you avoid using BeautifulSoup? I get an error when using it: FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
@tktktk0711 Make sure you are using the latest version of BeautifulSoup. You can replace the lxml parser with 'html.parser' or 'html5lib'.
