Cannot split a unicode string without converting to ascii - python 2.7

Question

I want to split the string 'I have £300', but it seems that the split function first converts it to ASCII, and afterwards I can't convert it back to the same unicode it was before.

Is there any other way to split such a unicode string without breaking it, as happens in the snippet below?

# -*- coding: utf-8 -*-
mystring = 'I have £300.'
alist = mystring.split()
alist = [item.decode("utf-8") for item in alist]
print "alist",alist
print "mystring.split()",mystring.split()

#I want to get [I,have,£300]
#I get: ['I', 'have', '\xc2\xa3300.']

Strings are ASCII in Python 2.

– juanpa.arrivillaga, Commented Aug 29, 2016 at 23:03
Ok, but how do i split in the way I want?

– Brana, Commented Aug 29, 2016 at 23:04

John1024 · Accepted Answer · 2016-08-29 23:04:54Z

3

You are looking at a limitation of the way python 2 displays data.

Using python 2:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '\xc2\xa3300.']

But, observe that it will print as you want:

>>> print(mystring.split()[2])
£300.

Using python 3, by contrast, it displays as you would like:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '£300.']

A major reason to use python 3 is its superior handling of unicode.
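The escaped form in the list output comes from repr(): displaying a list shows the repr of each element, and in Python 2 that escapes non-ASCII bytes. If the goal is only a readable one-line display on Python 2, one option is to join the items yourself rather than printing the list. The following is a minimal sketch, not part of the original answer; the helper name format_items is made up for illustration, and it assumes the items are already unicode text.

```python
# -*- coding: utf-8 -*-

def format_items(items):
    """Join text items into a bracketed, comma-separated line.

    Hypothetical helper: it uses the items themselves, not their
    repr(), so non-ASCII characters are not turned into escapes.
    """
    return u"[" + u", ".join(items) + u"]"

words = [u"I", u"have", u"\xa3300."]  # \xa3 is the pound sign
line = format_items(words)
```

Printing line shows the pound sign directly, provided the terminal encoding can represent it.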


16 Comments

Brana Over a year ago

Any workaround - as I have python 2.6.6 on my server?

John1024 Over a year ago

@Brana If you print the string itself, as opposed to a list which contains it, then it will display as you want.

Brana Over a year ago

I got that, but there is a problem with other things such as processing the string. Is it maybe possible to set the default encoding to utf-8?

roeland Over a year ago

@Brana As an aside, figuring out if two unicode strings are "equal" is a non-trivial problem by itself.

roeland Over a year ago

@Brana tchrist wrote one of the best posts about Unicode ever on SO. It was about a Perl question, but most of the answer generally applies to any program using Unicode. → 🌴 🐪🐫🐪🐫🐪 🌞 𝕲𝖔 𝕿𝖍𝖔𝖚 𝖆𝖓𝖉 𝕯𝖔 𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊 🌞 🐪🐫🐪 🐁 (scroll down to the laundry list, and to "𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤")


alexis · Accepted Answer · 2016-08-29 23:24:57Z

1

The problem is not with split(). The real problem is that the handling of unicode in python 2 is confusing.

The first line in your code produces a string, i.e. a sequence of bytes, which contains the utf-8 encoding of the symbol £. You can confirm this by displaying the repr of your original string:

>>> mystring
'I have \xc2\xa3300.'

The rest of the statements just do what you would expect them to with such input. If you want to work with unicode, create a unicode string to start with:

>>> mystring = u'I have £300.'
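
When the data arrives as bytes (from a file or a socket, say) rather than as a literal, the same idea applies: decode once, work with unicode, and encode again only at the boundary. Here is a rough sketch of that approach, assuming the input is UTF-8. The function name split_utf8 is made up for illustration. It runs on Python 2.6+ and on Python 3.

```python
# -*- coding: utf-8 -*-

def split_utf8(raw_bytes, encoding="utf-8"):
    """Decode a byte string to text, then split it on whitespace.

    Hypothetical helper: expects bytes (a plain str in Python 2),
    not an already-decoded unicode string.
    """
    text = raw_bytes.decode(encoding)
    return text.split()

# These are the bytes a plain 'I have £300.' literal holds in
# Python 2 when the source file is saved as UTF-8.
raw = b"I have \xc2\xa3300."
words = split_utf8(raw)

# Encode again only when bytes are needed, e.g. to write to a file.
round_trip = u" ".join(words).encode("utf-8")
```

Each element of words is a unicode string, so len(words[2]) counts the pound sign as one character instead of two bytes.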

A far better solution, however, is to switch to Python 3. Wrapping your head around the semantics of unicode in python 2 is not worth the effort when there's such a superior alternative.
