
I'm using html2text in Python to get the raw text (tags included) of an HTML page fetched from a URL, but I'm getting an error.

My code -

import html2text
import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://<proxy>:<pass>@<ip>:<port>'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
print html2text.html2text(html)

The error -

Traceback (most recent call last):
  File "t.py", line 8, in <module>
    html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>

Can anyone explain what I'm doing wrong?

  • This doesn't have anything to do with html2text; it's an error in the URL fetch. Can you load that URL through a browser? Can you just try it again? Network errors like this are often intermittent. Commented Feb 19, 2015 at 16:01
  • Yep, it's working fine in the browser. Any other suggestions? Commented Feb 19, 2015 at 16:05
  • urllib2.urlopen already gives you the page text; I don't know about that error. Commented Feb 19, 2015 at 17:50
  • The error means that your script waited a long time but the server didn't say anything; see the timeout/retry sketch after these comments. Commented Feb 19, 2015 at 17:51
  • You need to improve your spelling and capitalization. I got banned for it once. Commented Feb 20, 2015 at 13:30
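
As the comments note, the failure is a network timeout inside urlopen, not anything in html2text. A rough sketch of one way to handle that, passing an explicit timeout and retrying on URLError (the timeout value and retry count here are arbitrary assumptions):

import time
import urllib2

def fetch(url, attempts=3, timeout=30):
    # Retry a few times, since timeouts like this are often intermittent
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except urllib2.URLError as e:
            print "attempt %d failed: %s" % (attempt + 1, e)
            time.sleep(2)  # brief pause before retrying
    raise RuntimeError("could not fetch %s" % url)

If every attempt times out, the problem is most likely the proxy or the network path rather than the code.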

1 Answer


If you don't require SSL, this script in Python 2.7.x should work:

import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()

In Python 3.x, use urllib.request instead of urllib.

That's because urllib2 is Python 2 only; in Python 3 its functionality was merged into urllib (urlopen lives in urllib.request).

http:// is required.
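
For reference, a minimal Python 3 equivalent of the snippet above, using only the standard library (the URL and timeout are just illustrative):

import urllib.request

url = "http://stackoverflow.com"
f = urllib.request.urlopen(url, timeout=10)  # timeout so it doesn't hang forever
print(f.read().decode("utf-8", errors="replace"))  # the response body is bytes in Python 3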

EDIT: In 2020, you should use the third-party requests module, which can be installed with pip.

import requests
print(requests.get("http://stackoverflow.com").text)
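
Since the original script goes through an authenticated proxy, note that requests accepts a proxies mapping as well; a sketch using the same placeholder credentials as the question (swap in real values):

import requests

# Placeholder proxy URL, mirroring the urllib2.ProxyHandler setup in the question
proxies = {"http": "http://<proxy>:<pass>@<ip>:<port>"}

resp = requests.get("http://stackoverflow.com", proxies=proxies, timeout=10)
resp.raise_for_status()  # raise an exception on HTTP error status codes
print(resp.text)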

1 Comment

Sorry, but it didn't help; it gave the same error. Do you have any other solution?
