2

I have a document with the following data:

<div class="ds-list">
    <b>1. </b> 
    A domesticated carnivorous mammal 
    <i>(Canis familiaris)</i> 
    related to the foxes and wolves and raised in a wide variety of breeds.
</div>

And I want to get everything within the class ds-list (without <b> and <i> tags). Currently my code is doc.cssselect('div.ds-list'), but all this picks up is the newline before the <b>. How can I get this to do what I want it to?

2 Answers 2

10

Perhaps you are looking for the text_content method?:

import lxml.html as lh
content='''\
<div class="ds-list">
    <b>1. </b> 
    A domesticated carnivorous mammal 
    <i>(Canis familiaris)</i> 
    related to the foxes and wolves and raised in a wide variety of breeds.
</div>'''
doc=lh.fromstring(content)
for div in doc.cssselect('div.ds-list'):
    print(div.text_content())

yields

1.  
A domesticated carnivorous mammal 
(Canis familiaris) 
related to the foxes and wolves and raised in a wide variety of breeds.
Sign up to request clarification or add additional context in comments.

Comments

1
doc.cssselect("div.ds-list").text_content()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.