I'm trying to request the following URL:

https://www.sainsburys.co.uk/shop/gb/groceries/shiraz/barossa-valley-estate-grenache-shiraz-mourv%C3%A8dre-75cl

Decoding it with urllib and printing it reveals it to be:

In [36]: print urllib.unquote(url)
https://www.sainsburys.co.uk/shop/gb/groceries/shiraz/barossa-valley-estate-grenache-shiraz-mourvèdre-75cl

i.e. an accented "e".

But no matter what I pass to requests.get(...) (after import requests), I get a 404.

What is the proper input to give to the get method?

1 Answer

You should decode the URL as 'latin-1' after passing it to urllib.unquote:

>>> import urllib
>>> import requests
>>> k = "https://www.sainsburys.co.uk/shop/gb/groceries/shiraz/barossa-valley-estate-grenache-shiraz-mourv%C3%A8dre-75cl"
>>> r = requests.get(urllib.unquote(k).decode("latin-1"))
>>> r.status_code
200
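
To see what the latin-1 decode does to the accented character, here is a minimal sketch in Python 3 (urllib.parse instead of the Python 2 urllib used above), with no network access. It assumes the HTTP client percent-encodes non-ASCII characters as UTF-8 before sending, which is my understanding of what requests does; the variable names are made up for the illustration. It only shows which bytes would end up in the request path, not why the server answers 200.

```python
from urllib.parse import quote, unquote_to_bytes

# The only non-ASCII part of the path in the question.
segment = "mourv%C3%A8dre"

# Percent-decoding yields raw bytes: 0xC3 0xA8 is the UTF-8 encoding of "è".
raw = unquote_to_bytes(segment)

as_utf8 = raw.decode("utf-8")      # one character: è
as_latin1 = raw.decode("latin-1")  # two characters: Ã and ¨

# Assumption: the client re-encodes text as UTF-8 and percent-encodes it.
sent_if_utf8 = quote(as_utf8)
sent_if_latin1 = quote(as_latin1)

print(sent_if_utf8)    # mourv%C3%A8dre
print(sent_if_latin1)  # mourv%C3%83%C2%A8dre
```

Under that assumption, the latin-1 route sends a doubly encoded path (%C3%83%C2%A8), while the UTF-8 route round-trips back to the original URL.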

4 Comments

The decode is unnecessary and incorrect. It happens to work in this case because è is in Latin-1, but it will fail for any codepoints above U+00FF, like あ. Just do requests.get(urllib.unquote(k)).
Please read the OP's question: he did exactly what you proposed and ended up with a 404 Not Found. The decode is necessary and correct to open the link with requests.get.
Hmmm, you're right in this case... the OP's website seems to expect URLs to be Latin-1 encoded, but that's not the case for other sites.
I also noticed that requests.get(urllib.unquote(u'.......')) works, as do requests.get(urllib.unquote(u.decode('utf8'))) and requests.get(urllib.unquote(u.decode('latin-1'))), but requests.get(urllib.unquote(u).decode('utf8')) does not.
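
The first comment's point about codepoints above U+00FF can be checked offline. This is a rough Python 3 sketch (again urllib.parse, not the Python 2 calls from the thread); the percent-encoded string is simply the UTF-8 encoding of あ (U+3042), chosen for illustration. Decoding as latin-1 does not raise an error, it just produces different text.

```python
from urllib.parse import unquote_to_bytes

# %E3%81%82 is the UTF-8 percent-encoding of あ (U+3042).
raw = unquote_to_bytes("%E3%81%82")

utf8_text = raw.decode("utf-8")      # the intended single character
latin1_text = raw.decode("latin-1")  # three unrelated characters, no exception

print(utf8_text == "\u3042")  # True
print(len(latin1_text))       # 3
```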
