I am writing a simple scraper to pull flight prices from Kayak. I scrape several data items (duration, airline, price, etc.) using XPath and store each in a list of 15 values (the number of results on a Kayak page).
My problem is that the "price" scrape returns more than 15 values because, in addition to the "best" price for each result, it also pulls the additional offers displayed with it (see screenshot: large font on RHS vs. two offers at bottom LHS).
I've narrowed down the problem to the following:
1) The overall (working) XPath, which pulls both kinds of price, is:
'//a[@class="booking-link "]/span[@class="price option-text"]/span[@class = "price-text"]'
2) The key to distinguishing the main price from the additional price lies in the @id string. For both types of price, the @id:
- (i) is partly randomly generated,
- (ii) contains "-price-text", and
- (iii) contains "extra-info" only for the additional price,
e.g.:
- Main price: //*[@id="pck6-mb-aE-1d84916e1b2-price-text"]
- Additional price: //*[@id="NB5A-extra-info-hmb-tE-15ae5bd2e33-price-text"]
How do I write an XPath that pulls only the main prices, i.e. one that filters out any element whose @id contains the "extra-info" string? I've tried several ways (examples below) but can't seem to get the syntax right. Any help appreciated, thanks!
Examples tried:
'//a[@class="booking-link "]/span[@class="price option-text"]/span[@class = "price-text" and not[contains(@id,"extra-info")]]'
'//a[@class="booking-link "]//span[@class="price option-text"]//span[[not[contains(@id,"extra-info")]//span[contains(@id,"-price-text")]]'
'//a[@class="booking-link "]/span[@class="price option-text"]/span[len(@id==33)]'
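
For comparison, the filter described in point 2 might be sketched as follows. In XPath 1.0, `not()` and `contains()` are functions called with parentheses rather than square brackets, so the predicate would read `not(contains(@id, "extra-info"))`. The markup, the prices and the `main_prices` helper below are made up for illustration. Python's standard-library ElementTree does not implement `contains()`, so this sketch selects with the working XPath from point 1 and applies the same test in Python; a full XPath 1.0 engine (lxml is an assumption here) should accept the `FULL_XPATH` string directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the two @id values quoted above.
SAMPLE = """
<div>
  <a class="booking-link "><span class="price option-text">
    <span class="price-text" id="pck6-mb-aE-1d84916e1b2-price-text">$310</span>
  </span></a>
  <a class="booking-link "><span class="price option-text">
    <span class="price-text" id="NB5A-extra-info-hmb-tE-15ae5bd2e33-price-text">$325</span>
  </span></a>
</div>
"""

# The filter as a single XPath 1.0 expression (for engines that support contains()).
FULL_XPATH = (
    '//a[@class="booking-link "]/span[@class="price option-text"]'
    '/span[@class="price-text" and not(contains(@id, "extra-info"))]'
)

# The subset of XPath that ElementTree understands: same path, no function calls.
BASE_PATH = (
    ".//a[@class='booking-link ']/span[@class='price option-text']"
    "/span[@class='price-text']"
)


def main_prices(root):
    """Return the text of price spans whose id does not contain 'extra-info'."""
    return [
        el.text.strip()
        for el in root.findall(BASE_PATH)
        if "extra-info" not in el.get("id", "")
    ]


root = ET.fromstring(SAMPLE.strip())
print(main_prices(root))
```

Filtering on the presence of "extra-info" rather than on the length of the @id avoids depending on the randomly generated part of the string.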