2

I simply want to download .html files in Python. Code:

import urllib2

hdr = {'User-Agent': 'Mozilla/5.0'}
urls = ['http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211']
path = 'C:/Users/sony/Desktop/Python'
for i, site in enumerate(urls):
    print(site)
    req = urllib2.Request(site, headers=hdr)
    page = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(req)
    page_content = page.read()
    with open(path+'/'+str(i)+'.html', 'w') as fid:
        fid.write(page_content)

But this sometimes gives the following output, which I don't understand at all: https://drive.google.com/file/d/0B16PrXUjs69zWFJvWmJ6aFhyN0k/view?usp=sharing. When I read such a file using goose in Python, it shows nothing.

When it doesn't work: http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211

1
  • @LutzHorn Please check now. Sry for not mentioning it before. Commented May 21, 2015 at 13:35

2 Answers

1

Use requests to do all the work for you. It decompresses gzip-encoded responses automatically, and .content gives you the decoded body as bytes:

import requests

urls = ['http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211']

path = 'C:/Users/sony/Desktop/Python'

for i, site in enumerate(urls):
    print(site)
    req = requests.get(site)
    # .content is the response body as bytes, already decompressed
    page_content = req.content
    # write in binary mode; note the '/' between the folder and the file name
    with open('{}/{}.html'.format(path, i), 'wb') as fid:
        fid.write(page_content)

Output:

 <!DOCTYPE html> <!--NEW--> <!--- www pageHead.vm ---> <!--- mode=www ---> <!--- URI=/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211 ---> <!--- Host=www.nydailynews.com ---> <!--[if IE 8]><html class="ie8" lang="en" itemscope itemtype="http://schema.org/"><![endif]--> <!--[if IE 9]><html class="ie9" lang="en" itemscope itemtype="http://schema.org/"><![endif]--> <!--[if IE 10]><html class="ie10" lang="en" itemscope itemtype="http://schema.org/"><![endif]--> <!--[if IE 11]><html class="ie11" lang="en" itemscope itemtype="http://schema.org/"><![endif]--> <!--[if !IE]><!--> <html lang="en" itemscope itemtype="http://schema.org/"> <!--<![endif]-->       <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <meta http-equiv="X-UA-Compatible" content="IE=edge"/>          <title>Fans stampede outside South African stadium - NY Daily News</title>     <meta name="nydn_section" content="Sports"/>   <meta name="viewport" content="width=1070, maximum-scale=1.0"/>  <meta property="fb:app_id" content="107464888913"/> <meta property="fb:admins" content="1594068001"/> <meta property="og:site_name" content="NY Daily News"/> <meta property="article:publisher" content="https://www.facebook.com/thenewyorkdailynews"/> <meta name="msvalidate.01" content="02916AAC0DA8B068EFE01D721E03ED7E"/>    <meta name="twitter:card" content="summary"> <meta name="twitter:site" content="@nydailynews"> <meta property="twitter:url" content="http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211"/> <meta property="twitter:title" content="Fans stampede outside South African stadium"/> <meta property="twitter:description" content="Thousands of fans stampeded outside the stadium gates of a World Cup warmup game Sunday, five days before the start of soccer's showcase event. 
Several fans could be seen falling under the crush of people, many wearing Nigeria jerseys."/> <meta id="og_title" property="og:title" content="Fans stampede outside South African stadium"/> <meta property="og:type" content="article"/> <meta id="og_url" property="og:url" content="http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211"/>   <meta id="og_image" property="og:image" content="http://assets.nydailynews.com/polopoly_fs/1.179213!/img/httpImage/image.jpg_gen/derivatives/landscape_1200/alg-stampede-johannesburg-jpg.jpg"/>   <meta id="og_description" property="og:description" content="Thousands of fans stampeded outside the stadium gates of a World Cup warmup game Sunday, five days before the start of soccer's showcase event. Several fans could be seen falling under the crush of people, many wearing Nigeria jerseys."/> <meta name="description" content="Thousands of fans stampeded outside the stadium gates of a World Cup warmup game Sunday, five days before the start of soccer's showcase event. 
Several fans could be seen falling under the crush of people, many wearing Nigeria jerseys."/>   <meta name="nydn_byline" content="MICHAEL LEWIS"/>   <link rel="stylesheet" type="text/css" href="http://assets.nydailynews.com/nydn/c/nydn.css?r=20120405mk1Bh">  <meta name="parsely-title" content="Fans stampede outside South African stadium"/> <meta name="parsely-link" content="http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211"/> <meta name="parsely-type" content="article"/> <meta name="parsely-image-url" content="http://assets.nydailynews.com/polopoly_fs/1.179213!/img/httpImage/image.jpg_gen/derivatives/landscape_1200/alg-stampede-johannesburg-jpg.jpg"/>    <meta name="parsely-pub-date" content="2010-06-06T15:01:04"/>   <meta name="parsely-section" content="Sports"/>   <meta name="parsely-author" content="Michael Lewis"/>       <link rel="stylesheet" type="text/css" href="http://assets.nydailynews.com/nydn/c/article.css?r=20120405mk1Bh">       <meta name="robots" content="NOARCHIVE"/>         <link rel="canonical" href="http://www.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211">  <link rel="alternate" media="handheld" href="http://m.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211"> <link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.nydailynews.com/sports/soccer-fans-stampede-south-african-stadium-nigeria-north-korea-world-cup-warmup-article-1.179211"/>     <script type="text/javascript" src="http://assets.nydailynews.com/nydn/js/nydn-pack-20140101.js?r=20120405mk1Bh"></script>   <script type="text/javascript" src="http://assets.nydailynews.com/nydn/js/article2014.js?r=20120405mk1Bh"></script>          <!--[if lt IE 9]><script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->      


                              <link rel="alternate" type="application/rss+xml" title="NYDN Rss" href="http://feeds.nydailynews.com/nydnrss">              <link rel="alternate" type="application/rss+xml" title="Sports Rss" href="http://feeds.feedburner.com/nydnrss/sports">       

..........................

If you want to try the same URL a few times, you can use a try/except that catches requests.exceptions.ConnectionError:

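def tries(path, url, i, max_tries=1):
    # attempt the download up to max_tries times, stopping at the first success
    for ty in range(1, max_tries+1):
        try:
            req = requests.get(url)
            page_content = req.content
            # binary mode, with a '/' between the folder and the file name
            with open('{}/{}.html'.format(path, i), 'wb') as fid:
                fid.write(page_content)
            break
        except requests.exceptions.ConnectionError as e:
            print("Error {} for try {}".format(e, ty))


# urls and path are the same as defined above
for ind, url in enumerate(urls):
    tries(path, url, ind, 4)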
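
For the slow-connection case, the same retry idea can be combined with an explicit timeout. The following is only a sketch in Python 3: fetch_with_retries, its delay parameter and its retry_on parameter are hypothetical names, not part of requests. The download itself is passed in as a function, so the retry logic can be tried without a network connection.

```python
import time


def fetch_with_retries(fetch, url, max_tries=4, delay=0.0, retry_on=(IOError,)):
    """Call fetch(url) up to max_tries times; return its result, or None if every try fails."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(url)
        except retry_on as e:
            print("Error {} for try {}".format(e, attempt))
            if attempt < max_tries:
                # wait a little before the next attempt
                time.sleep(delay)
    return None
```

With requests, one way to use it would be to pass `lambda u: requests.get(u, timeout=10).content` as fetch and `(requests.exceptions.ConnectionError, requests.exceptions.Timeout)` as retry_on. The 10-second timeout is an arbitrary value chosen for illustration.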

17 Comments

Since I work on a slow internet connection, I face some timeout issues: stackoverflow.com/questions/30373301/… . Do you think this will solve that as well? Or should I explicitly specify a timeout in the above?
Cunninghan, there is an error in your code: with open(path {}.html'.format(i), 'w') as fid: ^ SyntaxError: invalid syntax
Yep. Missed an opening quote. Fixed. What do you want to happen if it times out?
Try it now. Working from phone so hard to see properly
Do you want to keep trying if that happens?
1

From looking at the response header:

>> print page.info()
Cache-Control: public, max-age=300, s-maxage=300
Content-Type: text/html;charset=utf-8
Server: fs3
Age: 103
Expires: Thu, 21 May 2015 13:36:40 GMT
Content-Encoding: gzip
Transfer-Encoding: chunked
Connection: close
Vary: Accept-encoding, Accept-Encoding

The Content-Encoding: gzip header shows that the content is gzipped. Try using the zlib module to decompress the data.

To check if the data is gzipped add the following line:

if page.info().get('Content-Encoding', '') == 'gzip':
    ... # decompress data

Please read this for an example of how to decompress the body.
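
The check and the decompression step can be sketched roughly as follows. This is a Python 3 illustration rather than the exact code for the urllib2 session above: decode_body is a hypothetical helper name, and the sample data is made up so the snippet runs without a network connection.

```python
import gzip
import zlib


def decode_body(raw_body, content_encoding):
    """Return the decompressed body if it was gzip-encoded, otherwise return it unchanged."""
    if (content_encoding or '').strip().lower() == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(raw_body, 16 + zlib.MAX_WBITS)
    return raw_body


# Local demonstration with made-up data instead of a real HTTP response
sample_html = b'<html><body>hello</body></html>'
compressed = gzip.compress(sample_html)

print(decode_body(compressed, 'gzip') == sample_html)
print(decode_body(sample_html, '') == sample_html)
```

In the question's loop this would presumably be called as decode_body(page.read(), page.info().get('Content-Encoding', '')), so pages that are not gzipped pass through untouched.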

6 Comments

@schelzz15 What solution do you suggest?
@schelzz15 Two points: 1. I do this for many URLs, and this case does not occur for all of them. Is there a way to check for it? 2. Can you provide an example, please?
@AbhishekBhatia I have added a URL to an example of how to do that.
Since not all responses are gzipped, I think there should also be a check. Can you specify how one can do that?
I don't think http servers are supposed to send gzip-encoded data unless the client lists that they support it (and I don't think urllib2 does out-of-the-box).