I'm dealing with a tricky CSS selector problem that involves multiple nested spans.
(A) Normally the HTML looks like this:
<div class="pricing">
<strong>1 200 €</strong>
</div>
(B) But there are also parts that look like this:
<div class="pricing">
<strong>
<span class="promotion">
<span class="promo-price">1 100 €</span>
</span>
<span class="strike">
<span>1 200€</span>
</span>
</strong>
<div class="new">New supplier</div>
</div>
(C) and like this:
<div class="pricing">
<strong>3 400 €</strong>
<span>/ best: 4500.00 €</span>
</div>
(D) and like this:
<div class="pricing">
<strong>4 900 €</strong>
<span class="netto">+ taxes</span>
<span>/ best: 4900.00 €</span>
</div>
Using a Scrapy CSS selector like this, I get:
response.css("div.pricing strong ::text").extract()
# ['2 500 €', '\n ', '\n ', '1 100 €', '\n ', '\n ', '1 200€', '3 999 €',...]
This shows that the nested <span ...> elements in (B) add whitespace-only text nodes to the selector result. So I tried to exclude both the strike and promotion classes with several variations of :not(), like this:
response.css("div.pricing strong:not([class*='promotion']):not([class*='strike'])::text").extract()
# <same result as above>
I can also get the promo-price only, with:
response.css("div.pricing .promo-price::text").extract()
# ['1 100 €']
At this point I'm at a loss on how to:
- get all the (A) prices
- get all the (B) promo-prices (only)
- get the result without the introduced whitespace (as shown above)
- do all of the above in (preferably) one CSS selector or line
Q: How can I do this in the simplest possible manner?
Note: I have already seen these similar questions:
- Scrapy grab div with multiple classes?
- Using multiple CSS selectors for the same ArticleItem in Scrapy
But they did not offer much help in my case.
UPDATE:
I was not able to complete the task according to @boltclock's instructions and ended up with an ugly hack, like this:
# default='' keeps .strip() from failing when no matching text node exists
adPrice = aditem.css("div.pricing strong::text").extract_first(default='').strip()
if adPrice == '':
    adPrice = aditem.css("div.pricing span.promo-price::text").extract_first()
So if someone has a better or more elegant solution...