Python unable to decode byte string

Question

I am having a problem decoding a byte string that I have to send from one computer to another. The file is a PDF. I get this error:

fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte

Any ideas on how to remove the b' ' marking? I need to reassemble the file on the other side, but I also need to know its size in bytes before sending it, and I figured I could get that by decoding each byte string. (This works for txt files but not for PDF ones.)

Code is:

    with open(inputne, "rb") as file:
        while 1:
            readBytes= file.read(dataMaxSize)
            fileStrings.append(readBytes)
            if not readBytes:
                break
            readBytes= ''
    
    filesize=0
    for i in range(0, len(fileStrings)):
        fileStrings[i] = fileStrings[i].decode()
        filesize += len(fileStrings[i])

Edit: For anyone having the same issue, calling len() on the byte string gives its size in bytes; the b'' prefix is not counted.

"size in bytes" - decoding would translate bytes to characters, and the number of characters is not the same as the number of bytes. ∞ is one symbol but 3 bytes in UTF-8: b'\xe2\x88\x9e', or 8 bytes with Python's utf-32 codec (4 bytes for the character plus a 4-byte BOM).
– ForceBru, Commented Nov 30, 2020 at 15:23

Aplet123 · Accepted Answer · 2020-11-30 15:22:20Z

1

In Python, bytestrings (bytes) are for raw binary data, and strings (str) are for textual data. decode() with no arguments tries to decode the data as UTF-8, which works for txt files that are UTF-8 encoded, but not for PDF files, since they can contain arbitrary bytes that are not valid UTF-8. You should not try to get a string at all, since bytestrings are designed for exactly this purpose. You can get the length of a bytestring as usual, with len(data), which gives the number of bytes. Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
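To make this concrete, here is a minimal sketch of reading a file in chunks, getting its size, and putting it back together without ever calling decode(). The helper name read_chunks, the chunk size of 8, and the throwaway demo file are all made up for illustration; they are not from the question's code.

```python
import os
import tempfile


def read_chunks(path, chunk_size):
    """Read a file in binary mode and return its contents as a list of bytes chunks."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            chunks.append(chunk)
    return chunks


# Hypothetical demo file: a few bytes that are NOT valid UTF-8, like a PDF might contain.
payload = b"%PDF-1.4\n\xda\xff\x00 binary payload"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

chunks = read_chunks(demo_path, chunk_size=8)
file_size = sum(len(chunk) for chunk in chunks)  # size in bytes, no decoding needed
rebuilt = b"".join(chunks)                       # reassemble, e.g. on the receiving side

print(file_size == os.path.getsize(demo_path))
print(rebuilt == payload)
os.remove(demo_path)
```

Breaking out of the loop before appending also avoids storing the empty chunk that read() returns at end of file.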

As a side note, the b'' you see when you print a bytestring is only there because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
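A small sketch of the difference between what gets displayed and what is actually stored, using the three UTF-8 bytes of the infinity sign as an arbitrary example value:

```python
data = b"\xe2\x88\x9e"       # three raw bytes (the UTF-8 encoding of the infinity sign)

shown = repr(data)           # the text that print(data) displays, including b'...'
print(shown)
print(len(shown))            # length of the displayed text, not of the data
print(len(data))             # number of bytes actually stored

text = data.decode("utf-8")  # decoding works here only because the bytes are valid UTF-8
print(len(text))             # number of characters, which differs from the byte count
```

The b and the quotes belong only to the displayed representation, so len(data) never includes them.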


1 Comment

Dudo Over a year ago

Doesn't len() count the b' ' in the size? EDIT: No, it doesn't count the b' ' in the length, as a side note for anyone having the same issue as me. Thanks for your answer @Aplet123, it helped.
