I'm dealing with a tricky CSS selector problem that involves multiple nested spans.

(A) Normally the HTML looks like this:

<div class="pricing">
    <strong>1 200 €</strong> 
</div>

(B) But there are also parts that look like this:

<div class="pricing">
    <strong>
        <span class="promotion">
            <span class="promo-price">1 100 €</span>
        </span>
        <span class="strike">
            <span>1 200€</span>
        </span>
    </strong>
    <div class="new">New supplier</div>
</div>

(C) and like this:

<div class="pricing">
    <strong>3 400 €</strong> 
    <span>/ best:  4500.00 €</span>
</div>

(D) and like this:

<div class="pricing">
    <strong>4 900 €</strong> 
    <span class="netto">+ taxes</span> 
    <span>/ best:  4900.00 €</span>
</div>

Using a Scrapy CSS selector of the type:

response.css("div.pricing strong ::text").extract()
# ['2 500 €', '\n    ', '\n    ', '1 100 €', '\n    ', '\n    ', '1 200€', '3 999 €',...]

This shows that the problematic <span ...> elements in (B) add whitespace-only text nodes to the result. So I tried to exclude both the strike and promotion classes with several variations of :not(), like this:

response.css("div.pricing strong:not([class*='promotion']):not([class*='strike'])::text").extract()
# <same result as above>

I can also get the promo-price only, with:

response.css("div.pricing  .promo-price::text").extract()
# ['1 100 €']

At this point I'm at a loss as to how to:

  • get all the (A) prices
  • get all the (B) promo-prices (only)
  • result without the introduced white space (as shown above)
  • all of the above in (preferably) one CSS selector or line

Q: How can I do this in the simplest possible manner?


Note: I have already seen the similar questions:

But they did not offer much help in my case.


UPDATE:

I was not able to complete the task according to @boltclock's instructions and ended up with an ugly hack, like this:

# default='' avoids calling .strip() on None when <strong> has no direct text at all
adPrice = aditem.css("div.pricing strong::text").extract_first(default='').strip()
if adPrice == '':
    adPrice = aditem.css("div.pricing span.promo-price::text").extract_first()

So if someone has a better or more elegant solution...
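
The fallback in the hack above can also be written as a small helper that works on the lists returned by .extract(). This is only a sketch: first_non_blank is a hypothetical name, and the sample lists are made up to resemble what the two selectors return for case (B).

```python
from typing import Iterable, Optional


def first_non_blank(*candidates: Iterable[str]) -> Optional[str]:
    """Return the first non-blank string (stripped) across the candidate lists, or None."""
    for texts in candidates:
        for text in texts:
            cleaned = text.strip()
            if cleaned:
                return cleaned
    return None


# Hypothetical lists shaped like the output of .extract() for case (B):
strong_texts = ['\n        ', '\n        ', '\n    ']  # only whitespace directly inside <strong>
promo_texts = ['1 100 €']                              # text of span.promo-price

print(first_non_blank(strong_texts, promo_texts))
```

Passing the strong texts first and the promo texts second keeps the same priority as the if/else version, and the helper returns None instead of raising when neither selector matched anything.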

1 Answer

Hmm.

Does that div.new only appear after a strong that contains all that complexity (B), and never after a strong that contains just a single price (A)?

If so:

  • get all the (A) prices
  • result without the introduced white space (as shown above)
response.css("div.pricing strong:only-child::text").extract()

Notice the omission of the space before ::text, which ensures you only get the text that's directly in the strong — see the end of my answer to this question for usage guidelines.

:only-child ensures that the strong doesn't match when a div.new is present. If the absence of div.new implies (A), this selector never picks up (B).

  • get all the (B) promo-prices (only)
response.css("div.pricing .promo-price::text").extract()
  • all of the above in (preferably) one CSS selector or line

At this point, it should be a simple matter of grouping the above two selectors:

response.css("div.pricing strong:only-child::text, div.pricing .promo-price::text").extract()

If the div.new is unrelated, it's going to be difficult to do this with CSS selectors since there's no other way to distinguish (A) from (B). XPath on the other hand makes short work of it:

response.xpath("//div[@class='pricing']/(strong[not(./span)]|descendant::span[@class='promo-price'])/text()").extract()
3 Comments

Correct. The div.new seems to appear only in case (B).
I tried your combo above, and it was missing quite a few items. Investigation showed that there is yet another price version (C), with a span in it, and who knows how many more. (I've updated my post.)
I now see why the XPath method is the best way to extract this. However, I tried your line above, and only the first part (before the |) works; it gets all items except the promo ones. The second part generates an error message.
