403 Error when scraping despite setting User-Agent in header

Question

I want to scrape a website (for player statistics from a football match) but I get a 403 error. This is my first attempt at scraping.

import requests

url = 'https://www.whoscored.com/Matches/1375928/LiveStatistics/England-Premier-League-2019-2020-West-Ham-Manchester-City'

headers = {'Sec-Fetch-Mode': 'no-cors',
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'} 

result = requests.get(url, headers=headers)

print(result.status_code)

Edit: I can open the webpage in my browser (Chrome).

Edit2: If I run

print(result.status_code)
print(result.headers)
print(result.content)

then I get the following

403
{'Content-Type': 'text/html', 'Cache-Control': 'no-cache', 'Connection': 'close', 'Content-Length': '736', 'X-Iinfo': '9-168604272-0 0NNN RT(1566297863307 56) q(0 -1 -1 -1) r(0 -1) B15(4,200,0) U18', 'X-Iejgwucgyu': '1', 'Set-Cookie': 'visid_incap_774904=wSb3+5UxQeC+slK3rAhjswfPW10AAAAAQUIPAAAAAADmqJS6Gs0uzOV2Z5XomjoU; expires=Wed, 19 Aug 2020 06:56:00 GMT; path=/; Domain=.whoscored.com, incap_ses_198_774904=2GHrGcAd9C8niMLwwnK/AgfPW10AAAAAttp7+XadyowHY5iqiWs/Yg==; path=/; Domain=.whoscored.com'}
b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?CWUDNSAI=21&xinfo=9-168604272-0%200NNN%20RT%281566297863307%2056%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%284%2c200%2c0%29%20U18&incident_id=198003090216026722-548063901729035097&edet=15&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 198003090216026722-548063901729035097</iframe></body></html>'
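The dump above shows several markers of an Incapsula block page rather than the real site content: the `X-Iinfo` header, the `_Incapsula_Resource` iframe, and the "Incapsula incident ID" text. A minimal sketch of checking for them, assuming these markers stay stable (the function name `looks_like_incapsula_block` is hypothetical and not part of requests):

```python
def looks_like_incapsula_block(status_code, headers, body):
    """Heuristic check for an Incapsula block page.

    status_code is an int, headers is dict-like, body is bytes or str.
    The markers are taken from the response shown above and are an
    assumption, not a documented contract.
    """
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    # Header names are compared case-insensitively.
    header_names = {name.lower() for name in headers}
    has_marker_header = "x-iinfo" in header_names
    has_marker_body = (
        "_Incapsula_Resource" in body or "Incapsula incident ID" in body
    )
    return status_code == 403 and (has_marker_header or has_marker_body)
```

With the request from the question this would be called as `looks_like_incapsula_block(result.status_code, result.headers, result.content)`.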

Your code gives me status 200, even without a User-Agent. Maybe the server had a problem that day. Can you open the page in a web browser? Or maybe you made so many requests that the server blocked you.
– furas, Commented Aug 20, 2019 at 10:11
I have added some more details. result.content contains the word ROBOTS, which makes me think my request has been handled as a bot.
– Chris Russell, Commented Aug 20, 2019 at 10:45
Yes, they don't allow robots (programmatic access). There are other ways to mimic a web browser, somewhat more complicated than a plain requests.get and not based on requests. See this question for details: stackoverflow.com/q/22966787/9321755
– Vaibhav Vishal, Commented Aug 20, 2019 at 10:49

Vim · Accepted Answer · 2019-08-21 06:28:48Z

You need to add cookies to your session. I copied the cookies below from my browser, and with them the request works. Copy the values from your own browser rather than reusing mine.

import requests

session = requests.Session()

session.headers.update({'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0"})

session.cookies["visid_incap_774904"]="SRvZ2F36RzuA5U8jaUC8yq3fXF0AAAAAQUIPAAAAAAC/7mBuVWtbzccGROHlxPzv"
session.cookies["incap_ses_964_774904"]="hJHbakasVSAoo8+/rNFgDa7fXF0AAAAA0e9groglmml+odd4mLW2zg=="
session.cookies["_cmpQcif3pcsupported"]="0"
session.cookies["googlepersonalization"]="OloL0IOloL0IgA"
session.cookies["eupubconsent"]="BOloL0IOloL0IAKAYAENAAAA6AAAAA"
session.cookies["euconsent"]="BOloL0IOloL0IAKAYBENCh-AAAAp57v______9______9uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4u_1vf99yfm1-7etr3tp_87ues2_Xur__79__3z3_9phP78k89r7337Ew-v83oA"

resp = session.get("https://www.whoscored.com/Matches/1375928/LiveStatistics/England-Premier-League-2019-2020-West-Ham-Manchester-City")

print(resp.status_code)

print(resp.text)
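Setting each cookie by hand gets tedious. As a sketch, the whole `Cookie` request header can be copied from the browser's developer tools and split into a dict. `parse_cookie_header` is a hypothetical helper, and the sample values are placeholders, not working cookies:

```python
def parse_cookie_header(raw):
    """Split a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only: base64-like values may end in '='.
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder header; paste the one from your own browser instead.
raw_header = "visid_incap_774904=PLACEHOLDER; incap_ses_964_774904=abc=="
cookies = parse_cookie_header(raw_header)
```

The resulting dict can then be loaded with `session.cookies.update(cookies)` before calling `session.get(...)`. The `incap_ses_...` cookie is a session cookie, so the values will likely need to be refreshed from the browser from time to time.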
