
I'm trying to build a web scraper to get prices from http://fetch.co.uk/dogs/dog-food?per-page=20

Here is the code I have so far:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

In each wrap there are sometimes two different prices, and I am trying to exclude the cut price and get only the one below it (the promo price).

I cannot figure out how to exclude the price with "cut"; the expression above does not work. The two class values look like this:

"shelf-product__price shelf-product__price--cut [ v2 ]"
"shelf-product__price shelf-product__price--promo [ v2 ]"

I have used the workaround below, but I'd like to understand what I am getting wrong in the regular expression. Sorry if the code is not pretty; I'm learning.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("price.*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())
  • The mentioned URL doesn't seem to have any element with itemprop="shelf-product__price shelf-product__price--cut [ v2 ]"; the value for itemprop is either title or price. That's why the second regex with price.* works. Commented Jan 24, 2016 at 14:13
  • @mchackam: it's indeed the class attribute and not the itemprop attribute, but that isn't the only problem. When an attribute has several values separated by spaces, the condition is tested on each value separately until one succeeds (not on the whole attribute). In any case, the regex is wrong, and regex isn't the right approach here; it's easier to use a function as the condition. As an aside, putting a pattern compilation inside a loop will slow down the code. Commented Jan 24, 2016 at 14:57
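A quick sketch of that point about multi-valued attributes, using made-up markup that reuses the class names from the question (not fetched from the actual page):

import re
from bs4 import BeautifulSoup

# Made-up snippet with the same class names as the question.
snippet = '<p class="shelf-product__price shelf-product__price--cut [ v2 ]">£10.00</p>'
soup = BeautifulSoup(snippet, "html.parser")

# The regex is tried against each class token separately, so the plain
# "shelf-product__price" token satisfies it and the tag is still returned,
# even though the tag also carries the --cut class.
print(soup.find_all("p", {"class": re.compile("shelf-product__price(?!--cut)")}))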

2 Answers


There are a few problems. The first is that .*(?!cut).* is equivalent to .*. This is because the first .* consumes all of the remaining characters. Then of course the (?!cut) check passes since it's at the end of the string. Finally .* consumes 0 characters. So it's always a match. This regex would give you false positives. The only reason it gives you nothing is that you are looking in itemprop when the text you're looking for is in class.
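For example, testing the pattern directly against the two class strings from the question shows that it matches both:

import re

# The lookahead runs only after the first .* has consumed the rest of the
# string, so it never fails and both strings match.
pattern = re.compile(r"shelf-product__price.*(?!cut).*")
print(bool(pattern.search("shelf-product__price shelf-product__price--cut [ v2 ]")))    # True
print(bool(pattern.search("shelf-product__price shelf-product__price--promo [ v2 ]")))  # True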

Your workaround looks good to me. But if you wanted to base your search on classes, I would do it like this:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": "shelf-product__self"})

def is_price(tag):
    return tag.has_attr('class') and \
           'shelf-product__price' in tag['class'] and \
           'shelf-product__price--cut' not in tag['class']

for wrap in wrapList:
    print(wrap.find(is_price).text)
    print(wrap.find("",{"class": "shelf-product__title"}).get_text())

Regular expressions are fine but I think it's easier to do boolean logic with booleans.


1 Comment

You can avoid the first regex too.

Why use such complex code? You may try the code below. span[itemprop=price] means: select all span elements whose itemprop attribute is price.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

#get possible list of urls
urls = ['http://fetch.co.uk/dogs/dog-food?per-page=%s'%n for n in range(1,100)]

for url in urls:
  html = urlopen(url)
  bsObj = BeautifulSoup(html,"html.parser")
  for y in [i.text for i in bsObj.select("span[itemprop=price]")]:
    print y.encode('utf-8')

1 Comment

It seems reasonable to use select, but there are a few issues with the code. It uses Python 2 where the question uses Python 3. It tries different per-page values, and I'm not sure why (it's not a page number). response.content should be html. [t for t in ..] does nothing. Also, the price should be associated with the product name; that last point might prevent you from using select.
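A rough Python 3 sketch of that idea, assuming the class names from the question (selectors not checked against the live page), which keeps select but scopes it to each product wrapper so each price stays paired with its title:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html, "html.parser")

# Search inside each product wrapper so a price is never mixed up with
# another product's title.
for wrap in bsObj.select(".shelf-product__self"):
    price = wrap.select_one("span[itemprop=price]")
    title = wrap.select_one(".shelf-product__title")
    if price and title:
        print(title.get_text(strip=True), price.get_text(strip=True))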
