
I'm trying to scrape this site and I get a 403 response code. It's the first time I've had this status code when web scraping, and I don't really understand what I have to do to solve it. I think maybe I can use Selenium to scrape the page, but I wonder if it's possible to get the AJAX response directly and get the JSON as a return. If it's not possible to get a return, could I get an explanation of why? Thanks.

Here is my code:

import requests

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

# Spoof a desktop Chrome user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}

r = requests.get(url, headers=headers)
print(r.status_code)

Code generated from cURL via Insomnia:

import requests

url = "https://public-api.pricempire.com/api/item/loadGraph/14/875"

headers = {
    "authority": "public-api.pricempire.com",
    "pragma": "no-cache",
    "cache-control": "no-cache",
    "sec-ch-ua": "^\^",  # note: this value was mangled by the shell's caret escaping when the cURL command was copied
}

response = requests.get(url, headers=headers)

print(response.text)

The first two times I ran it, it returned status 200, but afterwards it gives me 403. I'm trying to figure out why, and I just don't know.

  • The website decided it didn't want to talk to you. There might not be any way to get an explanation why. Commented Dec 20, 2021 at 20:52
  • Do you know why, when I type the link in the browser, it returns JSON, but requesting it via Python doesn't? I'm confused about this part. Commented Dec 20, 2021 at 21:08
  • Probably it didn't like your user agent or your IP address; a sketch of rotating user agents follows these comments. Commented Dec 20, 2021 at 21:12
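
On the user-agent theory: below is a minimal sketch of rotating User-Agent strings between requests. The pool of strings is purely illustrative, and this alone won't help if the block is based on your IP address.

import random
import requests

# A small pool of realistic desktop User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0',
]

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

# Pick a random user agent for each request
r = requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)})
print(r.status_code)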

2 Answers


This page looks like it isn't public, so some sort of authentication is needed first. In that case you need to see what authentication mechanism is used and try to reproduce it with the requests library.

So open the web inspector in your browser, go to the Network tab, right-click the request to the page, and copy it as cURL. You will probably see a bearer token in the headers (or maybe a cookie with a session_id); append it to your program's headers/cookies and it should work.
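
As a rough sketch of what that looks like once you have the values from the browser (the token and cookie below are placeholders, and whether this site uses a bearer token, a session cookie, or both is an assumption you should verify in the network tab):

import requests

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

# Placeholder values -- copy the real ones from the request you captured in the browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Authorization': 'Bearer <token-from-browser>',
}
cookies = {
    'session_id': '<cookie-from-browser>',  # only needed if the site uses a session cookie
}

r = requests.get(url, headers=headers, cookies=cookies)
print(r.status_code)
if r.ok:
    print(r.json())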


5 Comments

If you struggle, paste the cURL command here and I will be able to transform it into Python code.
Thanks, I did just that; it worked the first two times I ran the code from my edit above. But afterwards it just returns 403. I'm trying to figure out why and don't know where to go.
It's quite simple: all such authorization tokens have some expiry time. So what most likely happened is that you visited the page in the past, authorized yourself in some way (e.g. via login and password, or via Facebook login with OAuth), and got a token which expired recently. If you want to fully automate the process, you need to send that authorization request with Python (e.g. send the login and password in the body) and use the token you get in the response; see the sketch after these comments. If you get stuck I will try to prepare an example today/tomorrow.
Thanks for the explanation, I sort of understand it; I will look into it after work tonight.
@kosciej16 Hi. I'm using the httpx library for web scraping and I have to crawl over 40,000 links. It worked well before, but now I'm blocked and get status code 403. I added a cookie and that solved it, but as you said there is an expiry time for that. How can I automate it?
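
To illustrate the token-refresh flow described in the comments above, here is a minimal sketch. The login URL, the credential field names, and the token field are all hypothetical; the real values depend on what the site's login request looks like in the network tab.

import requests

LOGIN_URL = 'https://example.com/api/login'  # hypothetical login endpoint
GRAPH_URL = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

session = requests.Session()

def get_token():
    # Send credentials in the body and read the token from the JSON response.
    # The field names ('email', 'password', 'token') are assumptions.
    resp = session.post(LOGIN_URL, json={'email': 'you@example.com', 'password': 'secret'})
    resp.raise_for_status()
    return resp.json()['token']

token = get_token()
r = session.get(GRAPH_URL, headers={'Authorization': f'Bearer {token}'})

if r.status_code == 403:
    # The token has likely expired -- fetch a fresh one and retry once
    token = get_token()
    r = session.get(GRAPH_URL, headers={'Authorization': f'Bearer {token}'})

print(r.status_code)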

Sometimes a lot of techniques don't work. A last resort is to fetch the content from the Google Cache.

import requests

# The headers
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}

# The URL you want to scrape
url_to_scrape = 'https://www.my_url.com'

# Full URL to get the cached content
url_full = 'https://webcache.googleusercontent.com/search?q=cache:' + url_to_scrape

# Response of the request
response = requests.get(url_full, headers=headers)

# If the status is good,
if response.status_code == 200:
    print("OK! It works fine! ;-)")
# If it's not good,
else:
    print("It doesn't work :-(")

