I am trying to extract text from forum posts, but the content of the bold element is ignored.

How can I extract the full text, i.e. Some text to extract bold content? Currently I am only getting Some text to extract ?

<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
Some text to extract <b>bold content</b>?
</blockquote>

def parse_page(self, response):
    for quote in response.css('article'):
        yield {
            'text': quote.css('blockquote::text').extract()
        }

2 Answers

1

You need a space in your CSS selector:

'blockquote ::text'
           ^

With the space, ::text applies to every descendant node under blockquote. Without the space, it matches only the text nodes that sit directly inside the blockquote node, so the text inside <b> is skipped.

5 Comments

Will the :not selector stop working with the space? blockquote:not(.bbCodeBlock) ::text Apparently yes.
@anvd I just tested it; it should work, and it does. Tested: 'blockquote:not(.foo) ::text'
The markup is a bit more complicated, and it does not work as expected: jsfiddle.net/dwfmLcaj
@anvd This is not JavaScript. Scrapy converts all CSS selectors to XPath, so the only CSS selector implementation that matters here is the cssselect package; see github.com/scrapy/cssselect.
Thanks for the link, but currently the problem is the CSS. I don't even know how to select the part of the text that doesn't have any element associated with it.
1

Use the * selector to select the text of all nodes nested inside an element.

''.join([ a.strip() for a in quote.css('blockquote *::text').extract() ])
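One thing to watch with the join above: stripping each fragment before joining also removes the space that separates the plain text from the bold text. A small sketch of the effect follows; the fragment list is hard-coded as an assumption of what the selector returns for the question's markup, so that it runs without Scrapy.

```python
# Assumed output of quote.css('blockquote *::text').extract() for the
# question's markup; hard-coded here for illustration only.
fragments = ["\nSome text to extract ", "bold content", "?\n"]

# Stripping every fragment first glues neighbouring words together.
stripped_join = ''.join(part.strip() for part in fragments)

# Joining first and then collapsing whitespace keeps the word boundary.
normalized = ' '.join(''.join(fragments).split())

print(stripped_join)  # Some text to extractbold content?
print(normalized)     # Some text to extract bold content?
```

Whether the separating space matters depends on what the extracted text is used for; for display or search, the normalized form is usually the safer choice.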
