I'm trying to build a web scraper to get prices from http://fetch.co.uk/dogs/dog-food?per-page=20
This is the code I have so far:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html,"html.parser")
wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())
In every wrap there are sometimes two different prices, and I am trying to exclude the cut price and get only the price below it (the promo price).
I cannot figure out how to exclude the price with "cut"; the expression above does not work. The two class values are:
"shelf-product__price shelf-product__price--cut [ v2 ]"
"shelf-product__price shelf-product__price--promo [ v2 ]"
I have used the workaround below, but I'd like to understand what I am getting wrong in the regular expression. Sorry if the code is not pretty, I'm learning.
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html,"html.parser")
wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("price.*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())
The page has no itemprop="shelf-product__price shelf-product__price--cut [ v2 ]"; the value for itemprop is either title or price. That's why the second regex with price.* is working.

So you need to test the class attribute and not the itemprop attribute, but that isn't the only problem. When an attribute has several values separated by spaces, the condition is tested on each value separately until one succeeds (and not on the whole attribute).

In any case, the regex is wrong, and using a regex isn't a good approach here; it's easier to use a function as the condition. As an aside, putting a pattern compilation in a loop will slow down the code.
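
To make these points concrete, here is a minimal sketch. It first uses plain re to show why .*(?!cut).* cannot exclude anything, and why even a correctly anchored lookahead does not help once each class value is tested on its own (simulated here with str.split). It then uses a function as the condition instead. The HTML in SAMPLE_HTML is an assumption built only from the class names quoted in the question, not the real page, and the names is_current_price and extract_products are hypothetical.

```python
import re
from bs4 import BeautifulSoup

CUT = "shelf-product__price shelf-product__price--cut [ v2 ]"

# Pattern from the question. The first .* may consume the "cut" text itself,
# so the lookahead ends up being checked at a position (here the end of the
# string) that is not followed by "cut", and the match succeeds anyway.
question_pattern = re.compile(r"shelf-product__price.*(?!cut).*")
matches_cut_string = question_pattern.search(CUT) is not None

# A lookahead anchored at the start does reject the whole attribute string...
anchored_pattern = re.compile(r"^(?!.*cut)shelf-product__price")
anchored_rejects_whole = anchored_pattern.search(CUT) is None
# ...but tested value by value, the first class alone still matches.
anchored_matches_a_value = any(anchored_pattern.search(v) for v in CUT.split())

# Assumed markup, for illustration only.
SAMPLE_HTML = """
<div class="shelf-product__self [ v2 ]">
  <h3 class="shelf-product__title [ v2 ]">Example dog food A</h3>
  <span class="shelf-product__price shelf-product__price--cut [ v2 ]">12.00</span>
  <span class="shelf-product__price shelf-product__price--promo [ v2 ]">9.00</span>
</div>
<div class="shelf-product__self [ v2 ]">
  <h3 class="shelf-product__title [ v2 ]">Example dog food B</h3>
  <span class="shelf-product__price [ v2 ]">5.50</span>
</div>
"""


def is_current_price(tag):
    """True for a price element that does not carry the cut-price class."""
    classes = tag.get("class") or []  # the class attribute is parsed as a list
    return ("shelf-product__price" in classes
            and "shelf-product__price--cut" not in classes)


def extract_products(html):
    """Return (title, price) pairs, skipping the cut price in each wrap."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for wrap in soup.find_all(class_="shelf-product__self"):
        title = wrap.find(class_="shelf-product__title")
        price = wrap.find(is_current_price)  # the function receives each tag
        if title is None or price is None:
            continue  # skip wraps that lack a title or a usable price
        products.append((title.get_text(strip=True), price.get_text(strip=True)))
    return products
```

Because the function sees the whole tag, it can check all class values at once, which a per-value regex cannot do. No pattern is compiled inside the loop either.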