
I have a CSV file with 67,000 rows of URLs. Each URL downloads another dataset in CSV, HWP, ZIP, or some other format.

This is the code I have written:

import cgi
import requests


SAVE_DIR = 'C:/dataset'

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName


with open('C:/urlcsv.csv') as f:
    print(list(map(downloadURLResource, f.readlines())))

This code worked fine until it reached the 203rd URL.

When I checked in the shell, this URL had no Content-Disposition header, which caused an error.

Moreover, it reported downloading the 201st and 202nd URLs, but when I checked SAVE_DIR there were only 200 files in total, which means 2 files were missing.

My questions are:

(1) How do I know which files were not downloaded, without manually comparing the names of the downloaded files against the URLs? (No error was shown in the Python shell; they were just skipped.)

(2) How can I fix my code to print the names of the files or URLs that were not downloaded? (Both files that were skipped silently, with no error shown in the shell, and ones that stopped the script with an error.)


This is the error that stopped me from downloading:

Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 30, in downloadURLResource
    targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
  File "C:\Python34\lib\site-packages\requests\structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-disposition'

url http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e had no content-disposition header

Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 46, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in downloadURLResource
    return targetFileName
UnboundLocalError: local variable 'targetFileName' referenced before assignment

1 Answer

You are filtering out anything that isn't a 200 response with if r.status_code == 200:

Exactly what happens depends on the response you get when a file isn't there, but assuming it would be a 404, you could try something like:

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is not None:
            targetFileName = requests.utils.unquote(cgi.parse_header(content_disposition)[1]['filename'])
            with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
            return targetFileName
        else:
            print('url {} had no content-disposition header'.format(url))
    elif r.status_code == 404:
        print('{} returned a 404, no file was downloaded'.format(url))
    else:
        print('something else went wrong with {}'.format(url))
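If the run should also keep going past unexpected errors and report every URL that produced no file at the end, one approach is to wrap each download in a try/except and collect the failures. A minimal sketch, keeping the question's cgi.parse_header approach; the timeout value, the save_dir parameter, and the downloadAll helper are illustrative additions, not part of the original code:

```python
import cgi
import os
import requests

SAVE_DIR = 'C:/dataset'


def downloadURLResource(url, save_dir=SAVE_DIR):
    """Download one URL; return the saved filename, or None on any failure."""
    url = url.rstrip()
    try:
        r = requests.get(url, stream=True, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is None:
            print('url {} had no content-disposition header'.format(url))
            return None
        targetFileName = requests.utils.unquote(
            cgi.parse_header(content_disposition)[1]['filename'])
        with open(os.path.join(save_dir, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName
    except requests.RequestException as exc:
        # Connection errors, timeouts and bad status codes all land here,
        # so one bad URL no longer stops the whole run.
        print('request for {} failed: {}'.format(url, exc))
        return None


def downloadAll(csv_path, save_dir=SAVE_DIR):
    """Try every URL in csv_path; return the list of URLs that failed."""
    failed = []
    with open(csv_path) as f:
        for url in f:
            if downloadURLResource(url, save_dir) is None:
                failed.append(url.rstrip())
    return failed
```

Calling downloadAll('C:/urlcsv.csv') and printing the returned list afterwards covers both follow-up questions: the loop continues past any single failure, and every URL that produced no file is reported at the end.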

Your question is not easily reproducible, so it is hard for others to test. Consider adding some of the URLs that caused problems to the question.


7 Comments

Thank you, I will try the code you wrote. I wanted to add the URLs that caused errors, but as I mentioned above, for some of them I do not know which files were not downloaded unless I type each URL into the browser and compare the names with the downloaded files. Here is the URL that had no Content-Disposition header and thus, I think, caused the error in the shell: data.go.kr/dataset/…
Can you add the stacktrace of the error to your question?
Are you talking about the error printed in the shell? If so, I have added it to my question.
When I tried it, it returned an UnboundLocalError. Thanks to your code, it did spot an error and print the URL that had no Content-Disposition. I still have questions: (1) Is there any way I could edit it so that it keeps going until all URLs have been tried? (2) It doesn't spot silently skipped URLs. In this case, do I just have to find them manually, since I do not know what caused the error? I have edited my question and added the error I got.
There was an error in my code above. The return statement was in the wrong place. I have amended it.
