
I have a CSV file with 67,000 rows of URLs. Each URL downloads another dataset in CSV, HWP, ZIP, or some other format.

This is the code I have written:

import cgi
import requests


SAVE_DIR = 'C:/dataset'

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName


with open('C:/urlcsv.csv') as f:
    print(list(map(downloadURLResource, f.readlines())))

This code worked fine until it reached the 203rd URL.

When I checked in the shell, this URL had no Content-Disposition header, which caused an error.

Moreover, it reported downloading the 201st and 202nd URLs, but when I checked SAVE_DIR there were only 200 files in total, which means 2 files were missing.

My questions are:

(1) How do I know which files were not downloaded, without manually comparing the names of the downloaded files against the URLs? (No error was shown in the Python shell; they were just skipped.)

(2) How can I fix my code to print the names of the files or URLs that were not downloaded? (Both files that were skipped silently, with no error shown in the shell, and ones that stopped the script with an error.)


This is the error that stopped me from downloading:

Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 30, in downloadURLResource
    targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
  File "C:\Python34\lib\site-packages\requests\structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-disposition'

url http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e had no content-disposition header

Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 46, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in downloadURLResource
    return targetFileName
UnboundLocalError: local variable 'targetFileName' referenced before assignment

1 Answer

You are filtering out anything that isn't a 200 response with if r.status_code == 200:

Exactly what happens depends on the response you get when a file isn't there, but assuming it would be a 404, you could try something like:

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is not None:
            targetFileName = requests.utils.unquote(cgi.parse_header(content_disposition)[1]['filename'])
            with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
            return targetFileName
        else:
            print('url {} had no content-disposition header'.format(url))
    elif r.status_code == 404:
        print('{} returned a 404, no file was downloaded'.format(url))
    else:
        print('something else went wrong with {}'.format(url))
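If the run should also keep going past unexpected errors and report every URL that produced no file at the end, one approach is to wrap each download in a try/except and collect the failures. A minimal sketch, keeping the question's cgi.parse_header approach; the timeout value, the save_dir parameter, and the downloadAll helper are illustrative additions, not part of the original code:

```python
import cgi
import os
import requests

SAVE_DIR = 'C:/dataset'


def downloadURLResource(url, save_dir=SAVE_DIR):
    """Download one URL; return the saved filename, or None on any failure."""
    url = url.rstrip()
    try:
        r = requests.get(url, stream=True, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is None:
            print('url {} had no content-disposition header'.format(url))
            return None
        targetFileName = requests.utils.unquote(
            cgi.parse_header(content_disposition)[1]['filename'])
        with open(os.path.join(save_dir, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName
    except requests.RequestException as exc:
        # Connection errors, timeouts and bad status codes all land here,
        # so one bad URL no longer stops the whole run.
        print('request for {} failed: {}'.format(url, exc))
        return None


def downloadAll(csv_path, save_dir=SAVE_DIR):
    """Try every URL in csv_path; return the list of URLs that failed."""
    failed = []
    with open(csv_path) as f:
        for url in f:
            if downloadURLResource(url, save_dir) is None:
                failed.append(url.rstrip())
    return failed
```

Calling downloadAll('C:/urlcsv.csv') and printing the returned list afterwards covers both follow-up questions: the loop continues past any single failure, and every URL that produced no file is reported at the end.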

Your question is not easily reproducible, so it is hard for others to test. Consider adding some of the URLs that caused problems to the question.


7 Comments

Thank you, I will try the code you wrote. I wanted to add the URLs that caused errors, but as I mentioned above, for some of them I do not know which files were not downloaded unless I type each URL into the browser and compare the names with the downloaded files. Here is the URL that had no Content-Disposition header and thus, I think, caused the error in the shell: data.go.kr/dataset/…
Can you add the stacktrace of the error to your question?
Are you talking about the error printed in the shell? If so, I have added it to my question.
When I tried it, it returned an UnboundLocalError. Thanks to your code, it did spot an error and print the URL that had no Content-Disposition. I still have questions: (1) Is there any way I could edit it so that it keeps going until all URLs have been tried? (2) It doesn't spot silently skipped URLs. In this case, do I just have to find them manually, since I do not know what caused the error? I have edited my question and added the error I got.
There was an error in my code above. The return statement was in the wrong place. I have amended it.
