I have a CSV file with 67,000 rows, each containing a URL. Each URL downloads another dataset in a format such as CSV, HWP, or ZIP.
This is the code I have written:
import cgi
import requests

SAVE_DIR = 'C:/dataset'

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName

with open('C:/urlcsv.csv') as f:
    print(list(map(downloadURLResource, f.readlines())))
This code worked fine until it reached the 203rd URL.
When I checked in the shell, the response for this URL had no Content-Disposition header, which caused an error.
Moreover, the script got through URLs 201 and 202 without an error, but when I checked SAVE_DIR there were only 200 files in total, which means 2 files are missing.
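Two things in the code above could plausibly explain a gap like this, assuming the return statement sits inside the if block: any response whose status is not 200 makes the function return None without saving anything, and two URLs that report the same filename overwrite each other on disk. The sketch below checks for both. The helper name find_gaps is made up for illustration, and it assumes the return values were kept in a list alongside the URLs.

```python
from collections import Counter

def find_gaps(urls, results):
    """Compare the URLs with what the download function returned for each.

    urls    -- the lines read from the CSV file
    results -- the matching return values (a filename, or None)
    """
    # URLs for which nothing was saved (the function returned None).
    skipped = [
        (i, url.strip())
        for i, (url, name) in enumerate(zip(urls, results), start=1)
        if name is None
    ]
    # Filenames returned more than once: later downloads replaced earlier ones.
    counts = Counter(name for name in results if name is not None)
    overwritten = sorted(name for name, n in counts.items() if n > 1)
    return skipped, overwritten
```

If skipped and overwritten are both empty, the missing files would have to come from somewhere else, for example from files removed after the run.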
My questions are:
(1) How can I find out which files were not downloaded without manually comparing the names of the downloaded files against the URLs? (No error was shown in the Python shell; those URLs were simply skipped.)
(2) How can I change my code to print the names of the files or URLs that were not downloaded? (This should cover both the ones that were skipped silently, with no error in the shell, and the ones that stopped the script with an error.)
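To make the two questions concrete, here is a rough sketch of the kind of loop being asked for. It is not a definitive fix: download_all is a hypothetical name, and the download argument stands for any function shaped like downloadURLResource (it takes a URL and returns a filename or None).

```python
def download_all(urls, download):
    """Call download(url) for every URL and never stop on a single bad one.

    Returns two lists of (line_number, url, detail) tuples:
    saved  -- detail is the filename that was written
    failed -- detail describes why nothing was saved
    """
    saved, failed = [], []
    for line_no, raw in enumerate(urls, start=1):
        url = raw.strip()
        if not url:
            continue  # ignore blank lines in the CSV
        try:
            name = download(url)
        except Exception as exc:
            # Case 2: the download raised (e.g. KeyError for a missing header).
            failed.append((line_no, url, repr(exc)))
            continue
        if name is None:
            # Case 1: skipped silently, e.g. the status code was not 200.
            failed.append((line_no, url, "no file saved"))
        else:
            saved.append((line_no, url, name))
    return saved, failed
```

With the script above, this could be called as saved, failed = download_all(f.readlines(), downloadURLResource), followed by a loop that prints every entry in failed.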
This is the error that stopped me from downloading:
Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 30, in downloadURLResource
    targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
  File "C:\Python34\lib\site-packages\requests\structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-disposition'
url http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e had no content-disposition header
Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 46, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in downloadURLResource
    return targetFileName
UnboundLocalError: local variable 'targetFileName' referenced before assignment
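Both tracebacks come down to the same thing: targetFileName only gets a value when the response carries a Content-Disposition header. One possible way around that, sketched under assumptions, is to look the header up with .get() and fall back to a generated name. The function name pick_filename and the "unnamed_00203" naming pattern are invented for this example, and the header is parsed with the standard library's email.message.Message rather than cgi.

```python
from email.message import Message
from urllib.parse import unquote

def pick_filename(headers, index):
    """Return a filename for a response, even without Content-Disposition.

    headers -- a mapping such as r.headers from requests
    index   -- position of the URL in the CSV, used for the fallback name
    """
    disposition = headers.get('content-disposition')
    if disposition:
        # Let the email package parse "attachment; filename=..." for us.
        msg = Message()
        msg['content-disposition'] = disposition
        name = msg.get_filename()
        if name:
            return unquote(name)
    # Assumption: a placeholder name is acceptable when the server sends none.
    return "unnamed_{:05d}".format(index)
```

Because the fallback name contains the row number, a file saved under it can still be traced back to the URL that produced it.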