Cannot properly decode string

Question

I have a string from reading a .txt file that looks something like this:

str='\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n\x00'

The contents of the file are in Portuguese, and I am unable to encode the string as UTF-8.

When I do print(str), it comes out properly, but when I try and do stuff with the characters, I get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte.... What do I need to do to get the contents of the string so I can work with it? Thank you.

Edit: actually, the print statement is NOT working correctly either: certain accented characters come out as ?.

Is this supposed to be a unicode string with a leading u? Is this Python 2 or 3? What are you doing with the characters? Just trying to access individual ones?
– Ray Toal, Commented Aug 14, 2011 at 6:15
This looks like it is a Unicode string. Have you tried converting it from Unicode to UTF-8?
– cdhowie, Commented Aug 14, 2011 at 6:15
@cdhowie: It's not Unicode, since it's bytes. It's UTF-16BE.
– Ignacio Vazquez-Abrams, Commented Aug 14, 2011 at 6:17

Ignacio Vazquez-Abrams · Accepted Answer · 2011-08-14 06:16:52Z

4

You need to decode it to a unicode object first. The bytes are UTF-16BE:

>>> '\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'.decode('utf-16be')
u'ISTA\r\n[/BODY]\r\n'

If it's from a file then use codecs.open() instead of open(), passing the appropriate encoding.
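
As a rough sketch of the codecs.open() suggestion: the snippet below writes the sample bytes to a temporary file (a stand-in for the real .txt file, which is not shown in the question) and reads them back as text. The 'utf-16be' encoding is the one assumed in this answer; substitute whatever encoding the file actually uses.

```python
import codecs
import os
import tempfile

# Sample bytes from the question (assumed to be UTF-16BE).
raw = b'\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'

# Write them to a temporary file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

# codecs.open() decodes while reading, so read() returns text, not raw bytes.
with codecs.open(path, 'r', encoding='utf-16be') as f:
    text = f.read()

os.remove(path)
print(repr(text))
```

On Python 3 the built-in open() takes an encoding argument directly, so open(path, encoding='utf-16be') should serve the same purpose there.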


3 Comments

David542 Over a year ago

That gives me kanji. Is there a way for me to find out what encoding the string is in?

Ignacio Vazquez-Abrams Over a year ago

No. You can only try various encodings to see if the decoded byte sequence falls within the intended alphabet.
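
A minimal sketch of that trial-and-error approach, under stated assumptions: the candidate list and the looks_latin() check are hypothetical choices made for illustration, not a reliable detector. The check accepts a decoding only if it contains no NUL characters and every code point is below U+0250, which covers the accented letters used in Portuguese.

```python
def looks_latin(text):
    # Assumed heuristic: no NUL characters, and every code point below
    # U+0250 (Basic Latin through Latin Extended-B).
    return all(ch != u'\x00' and ord(ch) < 0x250 for ch in text)


def plausible_encodings(raw, candidates):
    # Try each candidate encoding; keep those that decode without error
    # and whose result passes the plausibility check.
    matches = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue
        if looks_latin(text):
            matches.append(name)
    return matches


raw = b'\x00I\x00S\x00T\x00A\x00\r\x00\n'
candidates = ['utf-8', 'utf-16le', 'utf-16be', 'latin-1']
matches = plausible_encodings(raw, candidates)
print(matches)
```

With this sample, decoding as UTF-16LE produces CJK code points (the "kanji" effect), while UTF-8 and Latin-1 leave NUL characters in the text, so only one candidate passes the check. Real files can be ambiguous, so the result still needs a visual check.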

bluish Over a year ago

@David If it's Portuguese I'd try latin9 (ISO-8859-15).
