
I want to scrape a website (for player statistics from a football match) but I get a 403 error. This is my first attempt at scraping.

import requests

url = 'https://www.whoscored.com/Matches/1375928/LiveStatistics/England-Premier-League-2019-2020-West-Ham-Manchester-City'

# Headers copied from a browser request so the call looks less like a script
headers = {
    'Sec-Fetch-Mode': 'no-cors',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}

result = requests.get(url, headers=headers)

print(result.status_code)

Edit: I can open the webpage in my browser (Chrome).

Edit2: If I run

print(result.status_code)
print(result.headers)
print(result.content)

then I get the following

403
{'Content-Type': 'text/html', 'Cache-Control': 'no-cache', 'Connection': 'close', 'Content-Length': '736', 'X-Iinfo': '9-168604272-0 0NNN RT(1566297863307 56) q(0 -1 -1 -1) r(0 -1) B15(4,200,0) U18', 'X-Iejgwucgyu': '1', 'Set-Cookie': 'visid_incap_774904=wSb3+5UxQeC+slK3rAhjswfPW10AAAAAQUIPAAAAAADmqJS6Gs0uzOV2Z5XomjoU; expires=Wed, 19 Aug 2020 06:56:00 GMT; path=/; Domain=.whoscored.com, incap_ses_198_774904=2GHrGcAd9C8niMLwwnK/AgfPW10AAAAAttp7+XadyowHY5iqiWs/Yg==; path=/; Domain=.whoscored.com'}
b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?CWUDNSAI=21&xinfo=9-168604272-0%200NNN%20RT%281566297863307%2056%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%284%2c200%2c0%29%20U18&incident_id=198003090216026722-548063901729035097&edet=15&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 198003090216026722-548063901729035097</iframe></body></html>'
  • Your code gives me status 200, even without a User-Agent. Maybe the server had a problem that day. Can you open the page in a web browser? Or maybe you made so many requests that the server blocked you. Commented Aug 20, 2019 at 10:11
  • @furas I can open it with my web browser Commented Aug 20, 2019 at 10:39
  • I have added some more details. result.content contains the word ROBOTS which makes me think my request has been handled as a bot. Commented Aug 20, 2019 at 10:45
  • Yeah, they don't allow robots (programmatic access). There are other ways to mimic a web browser (a bit more complicated than just requests.get), but not with requests alone. See this question for details: stackoverflow.com/q/22966787/9321755 Commented Aug 20, 2019 at 10:49

1 Answer

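You need to add cookies to your session. With the cookies copied from my browser, the request below works. The values shown are from my own browser session, so you will probably need to replace them with the ones from yours.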

import requests

session = requests.Session()

session.headers.update({'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0"})

session.cookies["visid_incap_774904"]="SRvZ2F36RzuA5U8jaUC8yq3fXF0AAAAAQUIPAAAAAAC/7mBuVWtbzccGROHlxPzv"
session.cookies["incap_ses_964_774904"]="hJHbakasVSAoo8+/rNFgDa7fXF0AAAAA0e9groglmml+odd4mLW2zg=="
session.cookies["_cmpQcif3pcsupported"]="0"
session.cookies["googlepersonalization"]="OloL0IOloL0IgA"
session.cookies["eupubconsent"]="BOloL0IOloL0IAKAYAENAAAA6AAAAA"
session.cookies["euconsent"]="BOloL0IOloL0IAKAYBENCh-AAAAp57v______9______9uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4u_1vf99yfm1-7etr3tp_87ues2_Xur__79__3z3_9phP78k89r7337Ew-v83oA"

resp = session.get("https://www.whoscored.com/Matches/1375928/LiveStatistics/England-Premier-League-2019-2020-West-Ham-Manchester-City")

print(resp.status_code)

print(resp.text)
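
If you would rather not type each cookie by hand, here is a minimal sketch of a helper that splits a cookie string copied from the browser's developer tools (the value of the Cookie request header) into a dict. The function names parse_cookie_header and looks_like_incapsula_block are my own, not part of requests, and the second one is only a heuristic based on the 403 page shown in the question. The cookie values in the example are placeholders.

```python
def parse_cookie_header(header):
    """Turn 'name1=value1; name2=value2' into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only: values such as the incap_ses_ cookie
        # above end in "==" and must be kept intact.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def looks_like_incapsula_block(status_code, body):
    """Heuristic: a 403 whose body contains the _Incapsula_Resource iframe."""
    return status_code == 403 and "_Incapsula_Resource" in body


# Placeholder values for illustration, not real cookies.
example = parse_cookie_header(
    "visid_incap_774904=abc+def/ghi; incap_ses_964_774904=xyz=="
)
print(example)
```

The resulting dict can be loaded with session.cookies.update(example) before calling session.get(...), and looks_like_incapsula_block(resp.status_code, resp.text) gives a quick check of whether the request was still blocked.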
