Python Reading File and Identifying Source of UnicodeDecodeError

Question

I am trying to read a text file using the following statement:

with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip())

The problem is that I get the following error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>

My question is: how can I identify exactly where in the file the error is encountered? The position Python gives appears to be tied to the location in the record being read at the time, not the absolute position in the file. So is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.

Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?

(And yes I know that 0x9d is a tab character and that I can do a search for that but that is not what I am after.)

Thanks.

Update: the post "UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function" has nothing to do with the question I am asking, which is: how can I get Python to tell me what record of the input file it was reading when it encountered the Unicode error?

Possible duplicate of "UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function"
– razimbres, Commented Mar 5, 2019 at 22:38
Why don't you try to read the file in binary mode and print the chars at position 6875:6885 so you can see the bad char at position 6880 (from the output)?
– patito, Commented Mar 5, 2019 at 22:49
I don't need to read the file in binary mode to find out what the character is as Python provides that information. My objective is to find out what record has the bad character. Without that information, the byte offset information that Python provides is utterly worthless.
– Jim, Commented Mar 6, 2019 at 4:05

Mark Ransom · Accepted Answer · 2019-03-05 23:38:18Z

2

I think the only way is to track the line number separately and output it yourself.

with open(inputFile) as fp:
    num = 0
    try:
        for num, line in enumerate(fp):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        print('Line ', num, e)


6 Comments

Jim Over a year ago

Hi Mark, thanks for this code. I was sure that it would work, but the output I get back is puzzling. On run 1 the output is: Line 9 'charmap' codec can't decode byte 0x9d in position 3649. So I deleted the lines before line 9 and ran the program again, and the output became: Line 0 'charmap' codec can't decode byte 0x9d in position 4490. I expected to see line 0, but I did not expect the position value to change. Note that line 0 has only 955 characters in it. It looks like 'for line in fp' has nothing to do with reading records.

Mark Ransom Over a year ago

@Jim that's unfortunate. The only other thing I could suggest is to read the file in binary mode and decode it yourself, but that doesn't let you read it line-by-line.
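Decoding the raw bytes yourself does not have to lose the line information. The following is a minimal sketch, not something posted in the thread: it assumes the file fits in memory, that the encoding is ASCII-compatible so lines end in b"\n", and that the encoding is cp1252 (a guess based on the 'charmap' message). The helper name locate_first_decode_error is hypothetical. Because the whole byte string is decoded in one call, the start attribute of the UnicodeDecodeError is an offset into the entire file, and counting newlines before it gives the line.

```python
def locate_first_decode_error(data, encoding):
    """Return (line_number, offset_in_line, offset_in_file) of the first
    undecodable byte in `data`, or None if everything decodes.

    Line numbers are 1-based; offsets are 0-based byte offsets.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is relative to `data`, i.e. to the whole file.
        line_number = data.count(b"\n", 0, exc.start) + 1
        # rfind returns -1 when there is no earlier newline, so +1 gives 0.
        line_start = data.rfind(b"\n", 0, exc.start) + 1
        return line_number, exc.start - line_start, exc.start
    return None


# Small in-memory example; 0x9d is undefined in Python's cp1252 table.
sample = b"good\nalso good\nbad \x9d\n"
print(locate_first_decode_error(sample, "cp1252"))  # (3, 4, 19)
```

For a real file, the bytes could be obtained with open(inputFile, "rb").read() and passed to the same function.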

Mark Ransom Over a year ago

@Jim this is a perfect example of the law of leaky abstractions.

Jim Over a year ago

Hi Mark, I am still puzzling over this. I tried using readlines() but it behaves just the same as the enumerate() loop, which is reassuring in one sense: that's two different techniques producing the same output. I've got to figure that I'm missing something.

Mark Ransom Over a year ago

@Jim I think reading a text file in Python occurs in multiple pieces. First an internal buffer is filled from the file, then that buffer is decoded, and finally the decoded buffer is split into lines. The offset reported in the exception is relative to the start of an internal buffer that you can't see.
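One way to sidestep the hidden buffer is to let Python split the lines as raw bytes and decode each line separately, so that any error offset is relative to the line that produced it. This is a hedged sketch under stated assumptions: the encoding is ASCII-compatible (so splitting on b"\n" in binary mode matches the text lines), cp1252 is only a guess at the codec behind the 'charmap' message, and the names find_decode_errors and report are made up for illustration.

```python
def find_decode_errors(binary_lines, encoding):
    """Yield (line_number, offset_in_line, bad_bytes) for each line that
    fails to decode. Only the first bad byte sequence per line is reported.

    `binary_lines` is any iterable of bytes objects, for example a file
    opened in "rb" mode, which yields one line per iteration.
    """
    for line_number, raw in enumerate(binary_lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start and exc.end index into `raw`, i.e. into this line only.
            yield line_number, exc.start, raw[exc.start:exc.end]


def report(path, encoding="cp1252"):
    """Print one message per undecodable line of the file at `path`."""
    with open(path, "rb") as fp:
        for line_number, offset, bad in find_decode_errors(fp, encoding):
            print("line {}: cannot decode {!r} at byte offset {}".format(
                line_number, bad, offset))
```

Calling report(inputFile) would then print the 1-based line number together with an offset counted from the start of that line rather than from the start of an internal buffer.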


blhsing · Accepted Answer · 2019-03-05 22:51:30Z

0

You can use the read method of the file object to obtain the first 6880 characters, encode it, and the length of the resulting bytes object will be the index of the starting byte of the offending character:

with open(inputFile) as fp:
    print(len(fp.read(6880).encode()))


1 Comment

Jim Over a year ago

But this solution offers no information as to what record contains the bad character. And the 6880 reference is only valid for that single record and not any other.

Majd Msahel · Accepted Answer · 2019-03-06 04:10:04Z

0

I have faced this issue before, and the easiest fix is to open the file with UTF-8 encoding:

with open(inputFile, encoding="utf8") as fp:


4 Comments

Mark Ransom Over a year ago

That's not going to help if the file isn't actually UTF-8 encoded. The question was about identifying where the offending character is in the file, not trying to fix it blindly.

Jim Over a year ago

Hi Majd, I actually did that as well. In that case I wound up getting a different set of encoding errors. So whether using the encoding option or not, what I'd like to get to is an error message of the sort "on record N the character at position P is invalid."
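A message of that shape can be produced in text mode by opening the file with errors="surrogateescape", which turns each undecodable byte 0xNN into the lone surrogate U+DCNN instead of raising. The sketch below is an illustration under assumptions rather than a method from the thread: the function names are hypothetical, cp1252 is a guessed encoding, and the reported position is a character index within the record (equal to the byte index only for single-byte encodings).

```python
def find_escaped_bytes(text_lines):
    """Yield (record_number, position, byte_value) for every byte that the
    decoder could not handle and that surrogateescape smuggled through.

    `text_lines` is an iterable of str lines read with
    errors="surrogateescape".
    """
    for record_number, line in enumerate(text_lines, start=1):
        for position, ch in enumerate(line):
            # surrogateescape maps an undecodable byte 0xNN to U+DCNN.
            if "\udc80" <= ch <= "\udcff":
                yield record_number, position, ord(ch) - 0xDC00


def report_bad_records(path, encoding="cp1252"):
    """Print one message per undecodable byte in the file at `path`."""
    with open(path, encoding=encoding, errors="surrogateescape") as fp:
        for record, position, byte in find_escaped_bytes(fp):
            print("on record {} the byte 0x{:02x} at position {} is invalid"
                  .format(record, byte, position))
```

The str.format calls are used instead of f-strings so the sketch also runs on Python 3.5.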

Majd Msahel Over a year ago

What Python version are you using?

Jim Over a year ago

Python v3.5.2 though I doubt that it is a relevant factor in this situation.
