I am trying to download PDFs with Python using the requests library. The following code works for some PDF documents, but it fails for a few others.

from pathlib import Path
import requests

filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url,verify=False)
filename.write_bytes(response.content)

The following is the exact response (response.content). However, I can download the same document in the Chrome browser without any error.

b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http&#58;&#47;&#47;www&#46;rolls&#45;royce&#46;com&#47;&#37;7e&#47;media&#47;Files&#47;R&#47;Rolls&#45;Royce&#47;documents&#47;investors&#47;annual&#45;reports&#47;rr&#45;full&#37;20annual&#37;20report&#45;&#45;tcm92&#45;55530&#46;pdf" on this server.<P>\nReference&#32;&#35;18&#46;36ad4d68&#46;1562842755&#46;6294c42\n</BODY>\n</HTML>\n'

Is there any way to get around this?

  • Have you tried setting the User-Agent header? Commented Jul 11, 2019 at 12:31
  • No, I have not tried that. Can you show me exactly which argument should be passed? Commented Jul 12, 2019 at 8:48
  • See the answer. Commented Jul 12, 2019 at 10:44

1 Answer

You get 403 Forbidden because requests, by default, sends a header such as User-Agent: python-requests/2.19.1 (the number is the installed requests version), and the server denies requests that identify themselves this way.
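
To see which User-Agent your own installation sends, you can inspect the library default. This is a small sketch that uses only requests.utils.default_user_agent() and the default headers of a requests.Session; the exact version string printed depends on the release you have installed.

```python
import requests

# The User-Agent that requests sends when you do not supply one.
default_agent = requests.utils.default_user_agent()
print(default_agent)  # e.g. python-requests/<installed version>

# A new Session carries the same value in its default headers.
session = requests.Session()
print(session.headers["User-Agent"])
```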

Copy the User-Agent value that your browser sends and pass it in the headers argument; the server should then accept the request.

For example:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'

r = requests.get(url, headers=headers)
print(r.status_code)  # 200
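
A 200 status code does not by itself prove that the body is a PDF, so it can help to check the content before writing it to disk. The following is a minimal sketch, not a definitive implementation: the helper names looks_like_pdf and download_pdf and the BROWSER_USER_AGENT value are illustrative assumptions, and the check assumes the file begins with the usual %PDF- header.

```python
from pathlib import Path

import requests

# Assumption: a generic browser-like value; copy the real one from your browser.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes begin with the PDF header."""
    return content.startswith(b"%PDF-")


def download_pdf(url: str, destination: Path,
                 user_agent: str = BROWSER_USER_AGENT) -> None:
    """Download url to destination, raising if the body is not a PDF."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    # Raises requests.HTTPError for 4xx/5xx responses such as 403 Forbidden.
    response.raise_for_status()
    if not looks_like_pdf(response.content):
        # Show the first bytes so an HTML error page is easy to recognise.
        raise ValueError(f"Expected a PDF but got: {response.content[:60]!r}")
    destination.write_bytes(response.content)
```

With these helpers, a call such as download_pdf(url, Path('c:/temp.pdf')) would write the file only when the response passes both checks, so an "Access Denied" HTML page is never saved under a .pdf name.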

2 Comments

I received status code 200, but the PDF does not download successfully. When I use headers={'User-Agent': 'My Browser'} instead, the PDF downloads successfully.
However, I have another use case: I am not able to download the PDF at this link (eisai.com/ir/library/annual/pdf/epdf2017ir.pdf), even with the headers mentioned earlier.
