I have a string from reading a .txt file that looks something like this:

str='\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n\x00'

The contents of the file are in Portuguese, and it won't let me encode the string into utf-8.

When I do print(str), it comes out properly, but when I try and do stuff with the characters, I get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte.... What do I need to do to get the contents of the string so I can work with it? Thank you.

Edit: actually, the print statement is NOT working correctly: certain accented characters are replaced with ? in the output.

  • Is this supposed to be a unicode string with a leading u? Is this Python 2 or 3? What are you doing with the characters? Just trying to access individual ones? Commented Aug 14, 2011 at 6:15
  • This looks like it is a Unicode string. Have you tried converting it from Unicode to UTF-8? Commented Aug 14, 2011 at 6:15
  • @cdhowie: It's not Unicode, since it's bytes. It's UTF-16BE. Commented Aug 14, 2011 at 6:17

1 Answer

You need to decode it to a unicode object first.

>>> '\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'.decode('utf-16be')
u'ISTA\r\n[/BODY]\r\n'

If it's from a file then use codecs.open() instead of open(), passing the appropriate encoding.
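As a rough sketch of reading the file with an explicit encoding: the example below uses io.open(), which takes an encoding argument in the same way codecs.open() does and is available in Python 2.6+ and Python 3. The helper name read_text, the temporary demo file, and the choice of 'utf-16-be' are assumptions for illustration; the real file may well need a different encoding.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-16-be"):
    """Return the decoded text of a file (hypothetical helper).

    The default encoding is only a guess based on the sample bytes.
    """
    # newline="" keeps the original "\r\n" line endings untouched.
    with io.open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# Demo only: write bytes like the sample to a temporary file, then read them back.
sample = u"ISTA\r\n[/BODY]\r\n".encode("utf-16-be")
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample)

text = read_text(demo_path)
os.remove(demo_path)
print(repr(text))
```

Once the text has been decoded this way, the accented characters are ordinary unicode characters and can be sliced, compared, or re-encoded to utf-8 as needed.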

3 Comments

That gives me kanji. Is there a way for me to find out what encoding the string is in?
No. You can only try various encodings to see if the decoded byte sequence falls within the intended alphabet.
@David If it's Portuguese I'd try latin9 (ISO-8859-15).
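The "try various encodings" suggestion from the comments could be sketched roughly as follows. The candidate list, the expected alphabet, and the scoring rule are all assumptions made for illustration, not a reliable detector. Getting CJK-looking characters from 'utf-16-be' is often a sign that the byte order is reversed, so 'utf-16-le' is included as a candidate.

```python
import string

# Assumed alphabet: ASCII plus accented letters common in Portuguese
# (written as escapes so the source file itself stays plain ASCII).
ACCENTED = u"\u00e1\u00e0\u00e2\u00e3\u00e9\u00ea\u00ed\u00f3\u00f4\u00f5\u00fa\u00fc\u00e7"
EXPECTED = set(
    string.ascii_letters + string.digits + string.punctuation
    + u" \t\r\n" + ACCENTED + ACCENTED.upper()
)

# Assumed candidates; extend this list as needed.
CANDIDATES = ["utf-16-be", "utf-16-le", "utf-8", "iso8859-15"]


def score(text):
    """Fraction of characters that fall inside the expected alphabet."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch in EXPECTED) / float(len(text))


def guess_encoding(data, candidates=CANDIDATES):
    """Return (score, encoding) pairs, best score first.

    Encodings that cannot decode the bytes at all are skipped.
    """
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        results.append((score(text), enc))
    # Stable sort: ties keep the order of the candidate list.
    return sorted(results, key=lambda pair: -pair[0])


# Demo with bytes known to be UTF-16LE.
demo = u"ISTA\r\n[/BODY]\r\n".encode("utf-16-le")
for value, enc in guess_encoding(demo):
    print(enc, value)
```

A high score only means the decoded text looks plausible. Single-byte encodings such as ISO-8859-15 decode almost any byte sequence without error, so the result still needs a visual check.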
