
I have a CSV file that may have invalid UTF-8 encodings on some rows. The file is sometimes hundreds of thousands of rows long, so I want to just skip the rows with invalid characters and keep the 99.9% of rows that are valid (for this application, it's not essential that every row in the input makes it into the database).

My Python code looks like this:

# Iterate through the CSV file
with open(fileName, "rt", encoding="utf8") as csvFile:
    try:
        reader = csv.DictReader(csvFile)
        for csvDataRow in reader:
            try:
                log.debug('Row ' + str(reader.line_num))
                #
                # .. row handling code here ..
                #
            except Exception as e:
                log.error('Exception at the for loop level\n' + str(e))
    except Exception as e:
        log.error('Exception at the reader level\n' + str(e))

What I would expect is that the invalid data would trigger the exception at the for loop level, so I could catch just UnicodeDecodeError there, skip the line, then continue the loop.

The problem is that the exception doesn't trigger there - it hits the except clause at the reader level, i.e. outside the loop context. So I can no longer continue the for loop iterating over the rows.

The net result is that if I hit a single invalid row at line 674,398 in a CSV file with a total of 2,966,480 rows, the exception causes all the rows after row 674,398 to be skipped. In this case, that line in the input has an invalid continuation byte that breaks the UTF-8 decoder.
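The behaviour can be reproduced with a small in-memory sketch (the field names and byte values here are invented for illustration - the point is only where the exception surfaces):

```python
import csv
import io

# Stand-in for the real file: the third line contains a lone
# continuation byte (0x80), which is invalid UTF-8.
raw = b"name,value\ngood,1\nbad\x80row,2\nalso_good,3\n"

csvFile = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8")
caught = None
try:
    for row in csv.DictReader(csvFile):
        pass            # the decode error never reaches the loop body
except UnicodeDecodeError as e:
    caught = e

print(type(caught).__name__)   # UnicodeDecodeError
```

The error is raised by the decoder inside the iterator's next() call, so a try/except inside the loop body never sees it - exactly the situation described above.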

1 Answer


I spent a fair bit of time reading the Python CSV documentation and searching around to find a solution to this. The key seems to be that the exception is coming from this line:

       for csvDataRow in reader:

i.e. it is being triggered in the call to the DictReader iterator to get the next row. The csv module documentation doesn't mention how to handle errors like this.

The trick is that the encoding transformation isn't happening in the csv module - it's happening underneath it, in the file object, so the change that's needed is in the open call.

Adding errors="replace" to the open call causes the codec to substitute the Unicode replacement character U+FFFD ('�') for any invalid byte sequences in the input, so the decoder never raises. (A '?' is only used as the replacement when encoding, not decoding.)

with open(fileName, "rt", encoding="utf8", errors="replace") as csvFile:
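As a self-contained sketch of the fix (the sample bytes and field names are invented; the real code would use open() on the actual file), with an optional filter that drops any row that contained invalid bytes - which matches the original goal of skipping bad rows rather than keeping mangled ones:

```python
import csv
import io

# Same invented sample as before: one field holds an invalid 0x80 byte.
raw = b"name,value\ngood,1\nbad\x80row,2\nalso_good,3\n"

# errors="replace" makes the decoder substitute U+FFFD for bad bytes,
# so the CSV iterator never raises UnicodeDecodeError.
csvFile = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8", errors="replace")

kept = []
for row in csv.DictReader(csvFile):
    # Rows that had invalid bytes now contain U+FFFD; skip them entirely.
    if any("\ufffd" in v for v in row.values()):
        continue
    kept.append(row)

print([r["name"] for r in kept])   # ['good', 'also_good']
```

Checking for "\ufffd" in the decoded fields is a simple way to detect which rows were damaged, at the small risk of also dropping rows that legitimately contained U+FFFD in the source data.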