
I'm trying to scrape this site and I get a 403 response code. It's the first time I've had this status code when web scraping, and I don't really understand what I have to do to solve it. I think maybe I can use Selenium to scrape the page, but I wonder if it's possible to get the AJAX response directly and get the JSON as a return. If it's not possible to get a return, could I get an explanation of why? Thanks.

Here is my code:

import requests

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

# Spoof a desktop Chrome user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}

r = requests.get(url, headers=headers)
print(r.status_code)

Code generated from cURL via Insomnia:

import requests

url = "https://public-api.pricempire.com/api/item/loadGraph/14/875"

headers = {
    "authority": "public-api.pricempire.com",
    "pragma": "no-cache",
    "cache-control": "no-cache",
    "sec-ch-ua": "^\^",  # note: this value was mangled by the shell's caret escaping when the cURL command was copied
}

response = requests.get(url, headers=headers)

print(response.text)

The first two times I ran it, it returned status 200, but afterwards it gives me 403. I'm trying to figure out why, and I just don't know.

  • The website decided it didn't want to talk to you. There might not be any way to get an explanation why. Commented Dec 20, 2021 at 20:52
  • Do you know why, when I type the link in the browser, it returns JSON, but requesting it via Python doesn't? I'm confused about this part. Commented Dec 20, 2021 at 21:08
  • Probably it didn't like your user agent or your IP address; a sketch of rotating user agents follows these comments. Commented Dec 20, 2021 at 21:12
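
On the user-agent theory: below is a minimal sketch of rotating User-Agent strings between requests. The pool of strings is purely illustrative, and this alone won't help if the block is based on your IP address.

import random
import requests

# A small pool of realistic desktop User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0',
]

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

# Pick a random user agent for each request
r = requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)})
print(r.status_code)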

2 Answers


This page looks like it isn't public, so some sort of authentication is needed first. In that case you need to see what authentication mechanism is used and try to reproduce it with the requests library.

So open the web inspector in your browser, go to the Network tab, right-click the request to the page, and copy it as cURL. You will probably see a bearer token in the headers (or maybe a cookie with a session_id); append it to your program's headers/cookies and it should work.
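
As a rough sketch of what that looks like once you have the values from the browser (the token and cookie below are placeholders, and whether this site uses a bearer token, a session cookie, or both is an assumption you should verify in the network tab):

import requests

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

# Placeholder values -- copy the real ones from the request you captured in the browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Authorization': 'Bearer <token-from-browser>',
}
cookies = {
    'session_id': '<cookie-from-browser>',  # only needed if the site uses a session cookie
}

r = requests.get(url, headers=headers, cookies=cookies)
print(r.status_code)
if r.ok:
    print(r.json())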


5 Comments

If you struggle, paste the cURL command here and I will be able to transform it into Python code.
Thanks, I did just that; it worked the first two times I ran the code from my edit above. But afterwards it just returns 403. I'm trying to figure out why and don't know where to go.
It's quite simple: all such authorization tokens have some expiry time. So what most likely happened is that you visited the page in the past, authorized yourself in some way (e.g. via login and password, or via Facebook login with OAuth), and got a token which expired recently. If you want to fully automate the process, you need to send that authorization request with Python (e.g. send the login and password in the body) and use the token you get in the response; see the sketch after these comments. If you get stuck I will try to prepare an example today/tomorrow.
Thanks for the explanation, I sort of understand it; I will look into it after work tonight.
@kosciej16 Hi. I'm using the httpx library for web scraping and I have to crawl over 40,000 links. It worked well before, but now I'm blocked and get status code 403. I added a cookie and that solved it, but as you said there is an expiry time for that. How can I automate it?
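
To illustrate the token-refresh flow described in the comments above, here is a minimal sketch. The login URL, the credential field names, and the token field are all hypothetical; the real values depend on what the site's login request looks like in the network tab.

import requests

LOGIN_URL = 'https://example.com/api/login'  # hypothetical login endpoint
GRAPH_URL = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

session = requests.Session()

def get_token():
    # Send credentials in the body and read the token from the JSON response.
    # The field names ('email', 'password', 'token') are assumptions.
    resp = session.post(LOGIN_URL, json={'email': 'you@example.com', 'password': 'secret'})
    resp.raise_for_status()
    return resp.json()['token']

token = get_token()
r = session.get(GRAPH_URL, headers={'Authorization': f'Bearer {token}'})

if r.status_code == 403:
    # The token has likely expired -- fetch a fresh one and retry once
    token = get_token()
    r = session.get(GRAPH_URL, headers={'Authorization': f'Bearer {token}'})

print(r.status_code)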

Sometimes a lot of techniques don't work. A last resort is to fetch the content from the Google Cache.

import requests

# The headers
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}

# The URL you want to scrape
url_to_scrape = 'https://www.my_url.com'

# Full URL to get the cached content
url_full = 'https://webcache.googleusercontent.com/search?q=cache:' + url_to_scrape

# Response of the request
response = requests.get(url_full, headers=headers)

# If the status is good,
if response.status_code == 200:
    print("OK! It works fine! ;-)")
# If it's not good,
else:
    print("It doesn't work :-(")

