
I have a CSV file that may have invalid UTF-8 encodings on some rows. The file is sometimes hundreds of thousands of rows long, so I want to just skip the rows with invalid characters and keep the 99.9% of rows that are valid (for this application, it's not essential that every row in the input makes it into the database).

My Python code looks like this:

# Iterate through the CSV file
with open(fileName, "rt", encoding="utf8") as csvFile:
    try:
        reader = csv.DictReader(csvFile)
        for csvDataRow in reader:
            try:
                log.debug('Row ' + str(reader.line_num))
                #
                # .. row handling code here ..
                #
            except Exception as e:
                log.error('Exception at the for loop level\n' + str(e))
    except Exception as e:
        log.error('Exception at the reader level\n' + str(e))

What I would expect is that the invalid data would trigger the exception at the for loop level, so I could catch just UnicodeDecodeError there, skip the line, then continue the loop.

The problem is that the exception doesn't trigger there - it hits the except clause at the reader level, i.e. outside the loop context. So I can no longer continue the for loop iterating over the rows.

The net result is that if I hit a single invalid row at line 674,398 in a CSV file with a total of 2,966,480 rows, the exception causes all the rows after row 674,398 to be skipped. In this case, that line in the input has an invalid continuation byte that breaks the UTF-8 decoder.
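The behaviour can be reproduced with a small in-memory sketch (the field names and byte values here are invented for illustration - the point is only where the exception surfaces):

```python
import csv
import io

# Stand-in for the real file: the third line contains a lone
# continuation byte (0x80), which is invalid UTF-8.
raw = b"name,value\ngood,1\nbad\x80row,2\nalso_good,3\n"

csvFile = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8")
caught = None
try:
    for row in csv.DictReader(csvFile):
        pass            # the decode error never reaches the loop body
except UnicodeDecodeError as e:
    caught = e

print(type(caught).__name__)   # UnicodeDecodeError
```

The error is raised by the decoder inside the iterator's next() call, so a try/except inside the loop body never sees it - exactly the situation described above.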

1 Answer


I spent a fair bit of time reading the Python CSV documentation and searching around to find a solution to this. The key seems to be that the exception is coming from this line:

       for csvDataRow in reader:

i.e. it is being triggered in the call to the DictReader iterator to get the next row. The csv module documentation doesn't mention how to handle errors like this.

The trick is that the encoding transformation isn't happening in the csv module - it's happening underneath it, in the file object, so the change that's needed is in the open call.

Adding errors="replace" to the open call causes the codec to substitute the Unicode replacement character U+FFFD ('�') for any invalid byte sequences in the input, so the decoder never raises. (A '?' is only used as the replacement when encoding, not decoding.)

with open(fileName, "rt", encoding="utf8", errors="replace") as csvFile:
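As a self-contained sketch of the fix (the sample bytes and field names are invented; the real code would use open() on the actual file), with an optional filter that drops any row that contained invalid bytes - which matches the original goal of skipping bad rows rather than keeping mangled ones:

```python
import csv
import io

# Same invented sample as before: one field holds an invalid 0x80 byte.
raw = b"name,value\ngood,1\nbad\x80row,2\nalso_good,3\n"

# errors="replace" makes the decoder substitute U+FFFD for bad bytes,
# so the CSV iterator never raises UnicodeDecodeError.
csvFile = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8", errors="replace")

kept = []
for row in csv.DictReader(csvFile):
    # Rows that had invalid bytes now contain U+FFFD; skip them entirely.
    if any("\ufffd" in v for v in row.values()):
        continue
    kept.append(row)

print([r["name"] for r in kept])   # ['good', 'also_good']
```

Checking for "\ufffd" in the decoded fields is a simple way to detect which rows were damaged, at the small risk of also dropping rows that legitimately contained U+FFFD in the source data.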