
I am creating a dictionary that requires each letter of a string to be separated by whitespace, and I am using join for this. The problem arises when the string contains non-ASCII characters: join splits each of them into two bytes, and the result is garbage.

Example:

>>> word = 'məsjø'
>>> ' '.join(word)

Gives me:

'm \xc9 \x99 s j \xc3 \xb8'

When what I want is:

'm ə s j ø'

Or even:

'm \xc9\x99 s j \xc3\xb8'
  • If this is Python 2.x, you need to define that as a Unicode string literal. Commented Jan 26, 2012 at 17:44
  • On my machine, the ' '.join() works flawlessly with Python 3.x. Can you specify which OS/version of Python you're using? Commented Jan 26, 2012 at 17:54
  • Was using 2.7. Just installed 3.2 and ' '.join() works with no problems! Thx. Commented Jan 26, 2012 at 18:14

1 Answer

You should use unicode strings, i.e.

word = u'məsjø'

And don't forget to set the encoding of your Python source file at the beginning with

# -*- coding: UTF-8 -*-

(Don't even think about using something other than UTF-8. ;))

Update: This only applies to Python < 3. If you're using Python >= 3, you would probably not have run into these problems in the first place. So if upgrading to 3.x is an option, it's the way to go -- it might not be in some cases because of library dependencies etc., unfortunately.

As mentioned in the comments, encoding issues might also result from a differently configured terminal, although that was not the problem here, apparently.
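To illustrate the difference, here is a minimal sketch written for Python 3, where a plain string literal is already text. The byte string `raw` stands in for what a plain (non-unicode) literal holds in Python 2 when the source is UTF-8; the variable names are only for illustration.

```python
# -*- coding: UTF-8 -*-
word = 'məsjø'               # text string (what u'məsjø' gives you in Python 2)
raw = word.encode('utf-8')   # byte string, comparable to a plain Python 2 str

# Joining the text string puts a space between characters.
spaced = ' '.join(word)

# Joining the individual bytes splits every multi-byte character in two,
# which reproduces the garbage from the question.
split_bytes = b' '.join(raw[i:i + 1] for i in range(len(raw)))

print(spaced)
print(split_bytes)
```

The first result is the spaced-out word the question asks for; the second shows each non-ASCII character broken into its two UTF-8 bytes.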


8 Comments

Or if the word is read from somewhere else, use word.decode('utf-8') to turn it into unicode.
In Python 3, this restriction is removed. Also, it doesn't expressly answer the question.
I was assuming the OP does not use Python 3 because then this error would be unlikely... But you're right, would be nice to know for sure.
@Makoto: If the asker has run that code and got that result, he/she must be using Python 2. And in that situation, using a unicode literal is a perfectly good answer.
decode / encode worked for my 2.7 installation. Installed 3.2 and didn't need any decoding/encoding lines. Thx.
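The decode / encode approach from these comments might look roughly like the sketch below. `space_out` is a hypothetical helper name, and UTF-8 is assumed as the encoding of the incoming bytes.

```python
def space_out(data, encoding='utf-8'):
    """Hypothetical helper: decode bytes to text if needed, then join per character."""
    text = data.decode(encoding) if isinstance(data, bytes) else data
    return ' '.join(text)

# Bytes as they might arrive from a file or socket (assumed to be UTF-8).
raw = b'm\xc9\x99sj\xc3\xb8'

result = space_out(raw)             # text with one space between characters
encoded = result.encode('utf-8')    # back to bytes, multi-byte characters intact

print(result)
print(encoded)
```

Encoding the joined text again gives the second form the question mentions, with each multi-byte sequence kept together.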
