I have a CSV file that may contain invalid UTF-8 byte sequences on some rows. The file is sometimes hundreds of thousands of rows long, so I just want to skip the rows with invalid bytes (and log that I did) and keep the 99.9% of rows that are valid. For this application, it's not essential that every input row makes it into the database.
My Python code looks like this:
    import csv
    import logging

    log = logging.getLogger(__name__)

    # Iterate through the CSV file
    with open(fileName, "rt", encoding="utf8") as csvFile:
        try:
            reader = csv.DictReader(csvFile)
            # lineNo counts data rows, starting at 1
            for lineNo, csvDataRow in enumerate(reader, start=1):
                try:
                    log.debug('Row ' + str(lineNo))
                    #
                    # .. row handling code here ..
                    #
                except Exception as e:
                    log.error('Exception at the for loop level\n' + str(e))
        except Exception as e:
            log.error('Exception at the reader level\n' + str(e))
I expected the invalid data to raise the exception at the for loop level, so that I could catch just UnicodeDecodeError there, skip the row, and continue the loop.
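The problem is that the exception isn't raised there. It hits the except clause at the reader level, i.e. outside the loop, presumably because the decoding happens while the for statement fetches the next row from the reader rather than inside the loop body. So I can no longer use continue to move on to the next row.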
The net result is that a single invalid row at line 674,398 of a CSV file with 2,966,480 rows causes every row after it to be skipped. In this case, it turns out that line contains an invalid continuation byte that breaks the UTF-8 decoder.
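One possible workaround, sketched here rather than tested on the real file, is to open the file in binary mode, decode each physical line separately, and hand only the lines that decode cleanly to csv.DictReader (which accepts any iterable of strings). The helper names decoded_lines and read_valid_rows are made up for this illustration, and the sketch assumes the header line is valid and that no quoted field contains an embedded newline, since a skipped line in the middle of such a field would corrupt the row.

```python
import csv
import io
import logging

log = logging.getLogger(__name__)


def decoded_lines(binary_file, encoding="utf-8"):
    """Yield each physical line as str, skipping lines that fail to decode.

    Hypothetical helper: binary_file is any iterable of bytes lines,
    such as a file opened with mode "rb".
    """
    for line_no, raw_line in enumerate(binary_file, start=1):
        try:
            yield raw_line.decode(encoding)
        except UnicodeDecodeError as e:
            # Note the bad line and carry on with the next one.
            log.warning("Skipping line %d: %s", line_no, e)


def read_valid_rows(binary_file):
    """Yield a dict for every CSV row whose line decoded cleanly."""
    yield from csv.DictReader(decoded_lines(binary_file))


# Small in-memory demonstration; the byte 0xff never occurs in valid UTF-8.
sample = io.BytesIO(b"id,name\n1,alpha\n2,br\xffken\n3,gamma\n")
rows = list(read_valid_rows(sample))
```

With a real file this would be used as `with open(fileName, "rb") as f:` followed by `for row in read_valid_rows(f):`, so the rows are still streamed one at a time rather than loaded into memory.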