Scrapy: scraping nested text using CSS selectors

Question

I have the following HTML:

<div class='article'>
<p>Lorem <strong>ipsum</strong> si ammet</p>
</div>

I want to get the text as "Lorem ipsum si ammet", so I tried:

response.css('div.article >p::text ').extract()

But I only get "Lorem" and "si ammet"; "ipsum" is missing.

How can I get both <p> and <strong> texts using CSS selectors?

Doesn't look like a duplicate to me. This question asks for a way specifically using CSS selectors, while the other one only mentions XPath selectors.
– Attila, Commented Apr 25, 2018 at 22:51

Umair Ayub · Accepted Answer · 2018-03-27 15:21:16Z

4

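One-liner solution:

"".join(response.css("div.article *::text").extract()).strip()

div.article * matches every element nested inside div.article, so *::text returns the text nodes of <p> and <strong> in document order. Strip after joining rather than per fragment, otherwise the spaces around "ipsum" are lost.

Or, written out as a loop:

text = ""
for a in response.css("div.article *::text").extract():
    text += a
text = text.strip()

Both approaches give the same result.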

i-naeem · Accepted Answer · 2022-11-14 10:58:13Z

1

In Scrapy 2.7+ you can do it like this:

text = response.css('div.article *::text').getall()
text = "".join(text).strip()

getall() returns a list of strings, one per matched text node.
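
Real pages often carry newlines and indentation inside text nodes. As a rough sketch, a small helper can collapse that whitespace after joining. join_text is a hypothetical name, not part of Scrapy, and the fragments list below is made-up sample input standing in for what getall() would return.

```python
def join_text(fragments):
    """Join text fragments and collapse whitespace runs to single spaces.

    `fragments` is a list of strings such as the one returned by
    response.css("div.article *::text").getall().
    """
    joined = "".join(fragments)      # keep the original spacing between nodes
    return " ".join(joined.split())  # split() drops leading, trailing and repeated whitespace


# Sample fragments with stray newlines and indentation (illustrative only).
fragments = ["\n  Lorem ", "ipsum", " si\n ammet  "]
print(join_text(fragments))
```

Joining before normalizing keeps words from adjacent inline elements from running together, while still removing layout whitespace.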
