
I am using Python to fetch HTML from multiple pages of a site. I found that urllib raises an exception when a URL does not exist. How do I retrieve the HTML of the site's custom 404 error page (the page that says something like "Page is not found.")?

Current code:

from urllib.request import Request, urlopen

try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)

    # download the HTML data
    page_html = client.read()

    # close the connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break
    See HTTPError. It has a .read() method which returns the response content. Commented Nov 4, 2018 at 10:07

2 Answers


Have you tried the requests library?

Install the library with pip:

pip install requests

requests does not raise an exception on a 404 by default, so the error page is available on the response object:

import requests

response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response

To preserve the comment that also answers the question, and because it is what I was looking for (a way to do this without adding a third-party dependency):

By t.m.adam at Nov 4, 2018 at 10:07

See HTTPError. It has a .read() method which returns the response content.
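
A minimal sketch of what that comment suggests, using only urllib. The function name fetch_html and its opener parameter are hypothetical (opener exists only so the function can be exercised without a network connection); the User-Agent header is taken from the question's code.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen):
    """Return (status_code, body_bytes), including for 4xx/5xx responses.

    `opener` is a hypothetical hook that defaults to urllib's urlopen.
    """
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with opener(req) as client:
            return client.status, client.read()
    except HTTPError as err:
        # HTTPError also behaves like a response object:
        # .code is the HTTP status and .read() returns the body
        # of the error page the server sent back.
        return err.code, err.read()


if __name__ == "__main__":
    status, body = fetch_html('https://stackoverflow.com/nonexistent_path')
    print(status)
    print(body.decode('utf-8', errors='replace'))
```

Only HTTPError is caught here, so failures that produce no HTTP response at all (for example a DNS error, raised as URLError) still propagate to the caller.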
