I want to split the string 'I have £300.', but it seems that split() first converts it to ASCII, and afterwards I can't convert it back to Unicode the way it was before.

Is there any other way to split such a Unicode string without breaking it, as happens in the snippet below?

# -*- coding: utf-8 -*-
mystring = 'I have £300.'
alist = mystring.split()
alist = [item.decode("utf-8") for item in alist]
print "alist",alist
print "mystring.split()",mystring.split()

# I want to get: ['I', 'have', '£300']
# I get: ['I', 'have', '\xc2\xa3300.']
  • Strings are ASCII in Python 2. Commented Aug 29, 2016 at 23:03
  • Ok, but how do I split it the way I want? Commented Aug 29, 2016 at 23:04

2 Answers

You are looking at a limitation of the way Python 2 displays data: when a list is displayed, each element is shown using its repr, which escapes non-ASCII bytes.

Using Python 2:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '\xc2\xa3300.']

But observe that an individual item prints as you want:

>>> print(mystring.split()[2])
£300.
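As a rough illustration of the same effect, here is a minimal sketch written for Python 3, where the bytes type plays the role of Python 2's str. The variable names raw, parts and last are only illustrative. The list display escapes the two UTF-8 bytes of £, while the decoded item prints normally:

```python
# Python 3 sketch: bytes stands in for the Python 2 byte-string type.
raw = 'I have £300.'.encode('utf-8')

# Splitting bytes works fine; the data is not damaged.
parts = raw.split()
print(parts)   # [b'I', b'have', b'\xc2\xa3300.']

# Decoding one item back to text recovers the pound sign.
last = parts[2].decode('utf-8')
print(last)    # £300.
```

The split itself never changes the bytes; only the way the list is shown differs.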

Using Python 3, by contrast, the list displays as you would like:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '£300.']

A major reason to use Python 3 is its superior handling of Unicode.

16 Comments

Is there any workaround? I have Python 2.6.6 on my server.
@Brana If you print the string itself, as opposed to a list which contains it, then it will display as you want.
I got that, but there is a problem with other things, such as processing the string. Is it maybe possible to set the default encoding to UTF-8?
@Brana As an aside, figuring out if two unicode strings are "equal" is a non-trivial problem by itself.
@Brana tchrist wrote one of the best posts ever about Unicode on SO. It was about a Perl question, but most of the answer applies to any program using Unicode. → 🌴 🐪🐫🐪🐫🐪 🌞 𝕲𝖔 𝕿𝖍𝖔𝖚 𝖆𝖓𝖉 𝕯𝖔 𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊 🌞 🐪🐫🐪 🐁 (scroll down to the laundry list, and to "𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤")

The problem is not with split(). The real problem is that the handling of Unicode in Python 2 is confusing.

The first line in your code produces a str, i.e. a sequence of bytes, which contains the UTF-8 encoding of the symbol £. You can confirm this by displaying the repr of your original string:

>>> mystring
'I have \xc2\xa3300.'

The rest of the statements just do what you would expect them to with such input. If you want to work with Unicode, create a unicode string to start with:

>>> mystring = u'I have £300.'
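Splitting the unicode string then gives text items rather than byte strings. The following is a minimal sketch, assuming the source file is saved as UTF-8; the names words and cleaned are only illustrative, and the rstrip step is an assumption about how the trailing period might be dropped. The u'' prefix is accepted by Python 2 and by Python 3.3 and later:

```python
# -*- coding: utf-8 -*-
# The u prefix makes this a text (unicode) string rather than bytes.
mystring = u'I have £300.'

# split() on a unicode string returns unicode items.
words = mystring.split()

# Print the items themselves instead of the list to avoid repr escapes.
print(u' | '.join(words))

# Optionally strip the trailing period from each word.
cleaned = [w.rstrip(u'.') for w in words]
print(u' | '.join(cleaned))
```

On Python 2, whether the printed text shows up correctly also depends on the encoding of the terminal or output stream.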

A far better solution, however, is to switch to Python 3. Wrapping your head around the semantics of Unicode in Python 2 is not worth the effort when there is such a superior alternative.
