1

I am trying to read a text file using the following statement:

with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip())

The problem is that I get the following error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>

My question is: how can I identify exactly where in the file the error is encountered? The position Python gives appears to be tied to the location in the record being read at the time, not the absolute position in the file. So is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.

Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?

(And yes I know that 0x9d is a tab character and that I can do a search for that but that is not what I am after.)

Thanks.

Update: the post at UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function has nothing to do with the question I am asking - which is how can I get Python to tell me what record of the input file it was reading when it encountered the unicode error.

4
  • Possible duplicate of UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function Commented Mar 5, 2019 at 22:38
  • Use try except? Commented Mar 5, 2019 at 22:40
  • Why don't you try to read the file in binary mode and print the chars at position 6875:6885 so you can see the bad char at position 6880 (from the output)? Commented Mar 5, 2019 at 22:49
  • I don't need to read the file in binary mode to find out what the character is as Python provides that information. My objective is to find out what record has the bad character. Without that information, the byte offset information that Python provides is utterly worthless. Commented Mar 6, 2019 at 4:05

3 Answers

2

I think the only way is to track the line number separately and output it yourself.

with open(inputFile) as fp:
    num = 0
    try:
        for num, line in enumerate(fp):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        print('Line ', num, e)

6 Comments

Hi Mark, thanks for this code. I was sure that it would work, but the output I get back is puzzling. On run 1 the output is: Line 9 'charmap' codec can't decode byte 0x9d in position 3649. So I deleted the lines before line 9 and ran the program again, and the output became: Line 0 'charmap' codec can't decode byte 0x9d in position 4490. I expected to see line 0, but I did not expect the position value to change. Note also that line 0 only has 955 characters in it. It looks like "for line in fp" has nothing to do with reading records.
@Jim that's unfortunate. The only other thing I could suggest is to read the file in binary mode and decode it yourself, but that doesn't let you read it line-by-line.
@Jim this is a perfect example of the law of leaky abstractions.
Hi Mark, I am still puzzling over this. I tried using readlines(), but it behaves just the same as the for/enumerate() code, which is reassuring in one sense: that's two different techniques producing the same output. I've got to figure that I'm missing something.
@Jim I think reading a text file in Python occurs in multiple pieces. First an internal buffer is filled from the file, then that buffer is decoded, and finally the decoded buffer is split into lines. The offset reported in the exception is relative to the start of an internal buffer that you can't see.
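
One way to sidestep that hidden buffer is to open the file in binary mode, where iteration still yields one line at a time, and decode each line yourself. The following is only a sketch under stated assumptions: records are terminated by a newline byte, and the file is in an ASCII-compatible encoding. The helper name find_bad_records and the cp1252 default are illustrative and do not come from the original post; substitute whatever encoding the file is supposed to use.

```python
def find_bad_records(path, encoding="cp1252"):
    # Hypothetical helper: returns a list of (line_number, byte_offset, error)
    # tuples, one per line that fails to decode.
    # Assumption: lines end with a newline byte and the encoding is
    # ASCII-compatible, so splitting on that byte is safe.
    problems = []
    with open(path, "rb") as fp:  # binary mode: Python does no decoding here
        for line_number, raw in enumerate(fp, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the bad byte within this line,
                # not within the file or an internal buffer.
                problems.append((line_number, exc.start, exc))
    return problems
```

Calling find_bad_records(inputFile) and printing each tuple should give a report of the form "record N, byte P", because every decode call sees exactly one record. Only the first undecodable byte in each line is reported.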
0

You can use the read method of the file object to obtain the first 6880 characters and encode them; the length of the resulting bytes object will be the index of the starting byte of the offending character:

with open(inputFile) as fp:
    print(len(fp.read(6880).encode()))

1 Comment

But this solution offers no information as to what record contains the bad character. And the 6880 reference is only valid for that single record and not any other.
0

I have faced this issue before, and the easiest fix is to open the file in UTF-8 mode:

with open(inputFile, encoding="utf8") as fp:

4 Comments

That's not going to help if the file isn't actually UTF-8 encoded. The question was about identifying where the offending character is in the file, not trying to fix it blindly.
Hi Majd, I actually did that as well. In that case I wound up getting a different set of encoding errors. So whether using the encoding option or not, what I'd like to get to is an error message of the sort "on record N the character at position P is invalid."
What Python version are you using?
Python v3.5.2 though I doubt that it is a relevant factor in this situation.
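
For context on why the choice of encoding changes which errors appear: when open() is called without an encoding argument, Python 3 uses the locale's preferred encoding, and the 'charmap' wording in the traceback suggests a single-byte code page. The check below assumes that code page was cp1252, which the question does not actually state. It shows that byte 0x9d has no mapping there and is also invalid as a lone byte in UTF-8. The helper name decodes_in is illustrative only.

```python
import locale

# The encoding open() falls back to when no encoding argument is given.
print(locale.getpreferredencoding(False))


def decodes_in(raw, encoding):
    # Hypothetical helper: True if the bytes decode cleanly in this encoding.
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Assumption: cp1252 was the default encoding on the asker's machine.
for name in ("cp1252", "utf-8", "latin-1"):
    print(name, decodes_in(b"\x9d", name))
```

latin-1 accepts every byte value, so a clean decode under it says nothing about whether the text is right; it only means no exception is raised.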
