
I'm trying to build a web scraper to get prices from http://fetch.co.uk/dogs/dog-food?per-page=20

Here is the code I have so far:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

In each wrap there are sometimes two different prices, and I am trying to exclude the cut price and get only the one below it (the promo price).

I cannot figure out how to exclude the price with "cut"; the expression above does not work. The two class values look like this:

"shelf-product__price shelf-product__price--cut [ v2 ]"
"shelf-product__price shelf-product__price--promo [ v2 ]"

I have used the workaround below, but I'd like to understand what I am getting wrong in the regular expression. Sorry if the code is not pretty; I'm learning.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("price.*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())
  • The mentioned URL doesn't seem to have any element with itemprop="shelf-product__price shelf-product__price--cut [ v2 ]"; the value for itemprop is either title or price. That's why the second regex with price.* works. Commented Jan 24, 2016 at 14:13
  • @mchackam: it's indeed the class attribute and not the itemprop attribute, but that isn't the only problem. When an attribute has several values separated by spaces, the condition is tested on each value separately until one succeeds (not on the whole attribute). In any case, the regex is wrong, and regex isn't the right approach here; it's easier to use a function as the condition. As an aside, putting a pattern compilation inside a loop will slow down the code. Commented Jan 24, 2016 at 14:57
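A quick sketch of that point about multi-valued attributes, using made-up markup that reuses the class names from the question (not fetched from the actual page):

import re
from bs4 import BeautifulSoup

# Made-up snippet with the same class names as the question.
snippet = '<p class="shelf-product__price shelf-product__price--cut [ v2 ]">£10.00</p>'
soup = BeautifulSoup(snippet, "html.parser")

# The regex is tried against each class token separately, so the plain
# "shelf-product__price" token satisfies it and the tag is still returned,
# even though the tag also carries the --cut class.
print(soup.find_all("p", {"class": re.compile("shelf-product__price(?!--cut)")}))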

2 Answers


There are a few problems. The first is that .*(?!cut).* is equivalent to .*. This is because the first .* consumes all of the remaining characters. Then of course the (?!cut) check passes since it's at the end of the string. Finally .* consumes 0 characters. So it's always a match. This regex would give you false positives. The only reason it gives you nothing is that you are looking in itemprop when the text you're looking for is in class.
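For example, testing the pattern directly against the two class strings from the question shows that it matches both:

import re

# The lookahead runs only after the first .* has consumed the rest of the
# string, so it never fails and both strings match.
pattern = re.compile(r"shelf-product__price.*(?!cut).*")
print(bool(pattern.search("shelf-product__price shelf-product__price--cut [ v2 ]")))    # True
print(bool(pattern.search("shelf-product__price shelf-product__price--promo [ v2 ]")))  # True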

Your workaround looks good to me. But if you wanted to base your search on classes, I would do it like this:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": "shelf-product__self"})

def is_price(tag):
    return tag.has_attr('class') and \
           'shelf-product__price' in tag['class'] and \
           'shelf-product__price--cut' not in tag['class']

for wrap in wrapList:
    print(wrap.find(is_price).text)
    print(wrap.find("",{"class": "shelf-product__title"}).get_text())

Regular expressions are fine but I think it's easier to do boolean logic with booleans.


1 Comment

You can avoid the first regex too.

Why use such complex code? You may try the code below. span[itemprop=price] means: select all span elements whose itemprop attribute is price.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

#get possible list of urls
urls = ['http://fetch.co.uk/dogs/dog-food?per-page=%s'%n for n in range(1,100)]

for url in urls:
  html = urlopen(url)
  bsObj = BeautifulSoup(html,"html.parser")
  for y in [i.text for i in bsObj.select("span[itemprop=price]")]:
    print y.encode('utf-8')

1 Comment

It seems reasonable to use select, but there are a few issues with the code. It uses Python 2 where the question uses Python 3. It tries different per-page values, and I'm not sure why (it's not a page number). response.content should be html. [t for t in ..] does nothing. Also, the price should be associated with the product name; that last point might prevent you from using select.
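A rough Python 3 sketch of that idea, assuming the class names from the question (selectors not checked against the live page), which keeps select but scopes it to each product wrapper so each price stays paired with its title:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html, "html.parser")

# Search inside each product wrapper so a price is never mixed up with
# another product's title.
for wrap in bsObj.select(".shelf-product__self"):
    price = wrap.select_one("span[itemprop=price]")
    title = wrap.select_one(".shelf-product__title")
    if price and title:
        print(title.get_text(strip=True), price.get_text(strip=True))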
