How to use multiple and nested span CSS selectors in Scrapy?

Question

I'm dealing with a tricky CSS selector problem that involves multiple nested spans.

(A) Normally the HTML/CSS looks like this:

<div class="pricing">
    <strong>1 200 €</strong> 
</div>

(B) But there are also parts that look like this:

<div class="pricing">
    <strong>
        <span class="promotion">
            <span class="promo-price">1 100 €</span>
        </span>
        <span class="strike">
            <span>1 200€</span>
        </span>
    </strong>
    <div class="new">New supplier</div>
</div>

(C) and like this:

<div class="pricing">
    <strong>3 400 €</strong> 
    <span>/ best:  4500.00 €</span>
</div>

(D) and like this:

<div class="pricing">
    <strong>4 900 €</strong> 
    <span class="netto">+ taxes</span> 
    <span>/ best:  4900.00 €</span>
</div>

Using a Scrapy CSS selector of the type:

response.css("div.pricing strong ::text").extract()
# ['2 500 €', '\n    ', '\n    ', '1 100 €', '\n    ', '\n    ', '1 200€', '3 999 €',...]

This shows that the nested <span ...> elements in (B) add whitespace-only text nodes to the result. So I tried to exclude both the strike and promotion classes with several variations of :not(), like this:

response.css("div.pricing strong:not([class*='promotion']):not([class*='strike'])::text").extract()
# <same result as above>

I can also get the promo-price only, with:

response.css("div.pricing  .promo-price::text").extract()
# ['1 100 €']

At this point I'm at a loss as to how to:

get all the (A) prices
get all the (B) promo-prices (only)
result without the introduced white space (as shown above)
all of the above in (preferably) one CSS selector or line

Q: How can I do this in the simplest possible manner?

Note: I have already seen similar questions, but they did not offer much help in my case.

UPDATE:

I was not able to complete the task according to @boltclock's instructions and ended up with an ugly hack, like this:

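# Try the plain price first; default='' avoids calling .strip() on None
adPrice = aditem.css("div.pricing strong::text").extract_first(default='').strip()
if adPrice == '':
    # In case (B) the strong holds only whitespace, so fall back to the promo price
    adPrice = aditem.css("div.pricing span.promo-price::text").extract_first()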

So if someone has a better or more elegant solution...
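
One possible tidy-up of the fallback above, as a sketch only: take every descendant text node of the strong and keep the first one that is not just whitespace. The helper name first_nonblank is hypothetical, and the approach assumes the promo price comes before the struck-through price in the markup, as it does in (B). The sample lists stand in for what a descendant selector such as "div.pricing strong ::text" might return.

```python
from typing import Iterable, Optional


def first_nonblank(texts: Iterable[str]) -> Optional[str]:
    """Return the first string that is not only whitespace (stripped), or None."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return None


# Stand-ins for the text nodes of one listing, variant (A) and variant (B).
variant_a = ["1 200 €"]
variant_b = ["\n    ", "\n    ", "1 100 €", "\n    ", "\n    ", "1 200€", "\n"]

price_a = first_nonblank(variant_a)
price_b = first_nonblank(variant_b)
price_none = first_nonblank([])  # no text nodes at all

print(price_a, price_b, price_none)
```

In a spider this would presumably be called per item, for example as first_nonblank(aditem.css("div.pricing strong ::text").extract()).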

BoltClock · Accepted Answer · 2018-09-25 11:26:07Z


Hmm.

Does that div.new only appear after a strong that contains all that complexity (B), and never after a strong that contains just a single price (A)?

If so:

get all the (A) prices

result without the introduced white space (as shown above)

response.css("div.pricing strong:only-child::text").extract()

Notice the omission of the space before ::text, which ensures you only get the text that's directly in the strong — see the end of my answer to this question for usage guidelines.

:only-child matches the strong only when it has no sibling elements, so it does not match when a div.new is present. If the absence of div.new implies (A), you never have to worry about (B).

get all the (B) promo-prices (only)

response.css("div.pricing .promo-price::text").extract()

all of the above in (preferably) one CSS selector or line

At this point, it should be a simple matter of grouping the above two selectors:

response.css("div.pricing strong:only-child::text, div.pricing .promo-price::text").extract()

If the div.new is unrelated, it's going to be difficult to do this with CSS selectors since there's no other way to distinguish (A) from (B). XPath on the other hand makes short work of it:

response.xpath("//div[@class='pricing']/(strong[not(./span)]|descendant::span[@class='promo-price'])/text()").extract()

edited Sep 25, 2018 at 11:26

answered Sep 25, 2018 at 11:13


3 Comments

not2qubit Over a year ago

Correct. The div.new seems to appear only in case (B).

not2qubit Over a year ago

I tried your combo above, and it was missing quite a few items. Investigation showed that there is yet another price version (C), with a span in it, and who knows how many more. (I've updated my post)

not2qubit Over a year ago

I now see why the best way to extract this is the XPath method. However, I tried your line above, and only the first part (before the |) works; it gets all items except the promo ones. The second part generates an error message.
