How to compare unicode and string in Python?

Question

I have two variables (let's say x and y) that have the following values:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

They presumably encode the same name, but in different ways. The first variable is a unicode object and the second one is a (byte) string.

Is there a way to transform the string into unicode (or the unicode into a string) and check whether they are really the same?

I tried to use encode:

x.encode('utf-8')

It returns something new (the third version):

'Ko\xc5\xa1ick\xc3\xbd'

And using the following:

print x.encode('utf-8')

prints yet another version:

Ko┼íick├¢

So, I am totally confused. Is there a way to keep everything in the same format?

Your y is missing something... I've checked it in my Python IDLE debugger, and x is Kosicky and y is Koicky (missing the s).
– Neoares, Commented Dec 9, 2015 at 16:08
@Neoares Your IDLE seems to lack the proper glyphs. x is "Košický" here.
– tripleee, Commented Dec 9, 2015 at 16:11
I don't use the simple IDLE which ships with Python at all. This was with the basic Python REPL on the OSX command line, but I would expect the same behavior on any modern platform (which oddly still seems to exclude Windows, or at least some popular versions).
– tripleee, Commented Dec 9, 2015 at 17:16

tripleee · Accepted Answer · 2015-12-09 16:10:29Z

2

You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.

# Decode the byte string with a guessed encoding, then compare as Unicode.
if y.decode('iso-8859-1') == x:
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))

With your given example, this is not true; but perhaps y is in a different encoding.

In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).
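
As a rough illustration of the comparison above, here is a minimal sketch that decodes y with two candidate encodings and compares each result with x. It is written in Python 3 syntax (byte strings carry a b prefix), whereas the question uses Python 2; the helper name matches is made up for this example.

```python
# Python 3 sketch; in Python 2 the literals would be u'...' and '...'.
x = 'Ko\u0161ick\xfd'   # text (Unicode)
y = b'Ko\x9aick\xfd'    # encoded bytes, encoding not known in advance


def matches(raw, text, encoding):
    """Return True if raw, decoded with the given encoding, equals text."""
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding at all.
        return False


# In ISO-8859-1, byte 0x9a is a C1 control character, not U+0161.
latin1_match = matches(y, x, 'iso-8859-1')
# In windows-1252, byte 0x9a maps to U+0161.
cp1252_match = matches(y, x, 'windows-1252')

print(latin1_match, cp1252_match)
```

Decoding the bytes and comparing text with text follows the all-Unicode-internally advice; comparing in the other direction would mean picking an encoding for x first.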

edited Dec 9, 2015 at 16:10

answered Dec 9, 2015 at 15:59

tripleee

Mark Tolonen · Accepted Answer · 2015-12-09 17:50:49Z

1

You need to know the encoding of the byte string. It looks like windows-1252:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

print x == y.decode('windows-1252')
print x.encode('windows-1252') == y

Output:

True
True

Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
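
A minimal sketch of that decode-early, encode-late pattern, in Python 3 syntax rather than the Python 2 of the question. The in-memory streams stand in for a real file or socket, and the helper names read_text and write_text are hypothetical, chosen only for this example.

```python
import io


def read_text(stream, encoding):
    """Input boundary: read raw bytes and decode them to text."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding):
    """Output boundary: encode text to bytes and write them."""
    stream.write(text.encode(encoding))


# Pretend these bytes arrived from a windows-1252 encoded source.
source = io.BytesIO(b'Ko\x9aick\xfd')
name = read_text(source, 'windows-1252')

# All processing happens on text, independent of any encoding.
shouted = name.upper()

# Encode only when the data leaves the program (here: as UTF-8).
sink = io.BytesIO()
write_text(sink, shouted, 'utf-8')
result = sink.getvalue()
```

The input and output encodings are independent of each other: only the two boundary functions need to know them.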

answered Dec 9, 2015 at 17:50

Mark Tolonen

Serge Ballesta · Accepted Answer · 2015-12-09 16:08:42Z

0

Well, utf-8 is now the de facto standard for interchange and in the Linux world, but there are plenty of other encodings.

Common examples are latin1, latin9 (largely the same, but with the € symbol), and cp1252, a Windows variant of them.

In your case:

>>> x.encode('cp1252')
'Ko\x9aick\xfd'

So the y string seems to be cp1252 encoded.
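
The same check can be run over several candidate encodings at once. The following is only a sketch in Python 3 syntax (the question uses Python 2); the candidate list and the function name encodings_producing are assumptions made for illustration, not an exhaustive detection method.

```python
x = 'Ko\u0161ick\xfd'   # known text
y = b'Ko\x9aick\xfd'    # bytes whose encoding we want to guess

# Hypothetical shortlist of encodings to try.
candidates = ['utf-8', 'iso-8859-1', 'iso-8859-15', 'cp1250', 'cp1252']


def encodings_producing(text, raw, names):
    """Return the encodings from names under which text encodes to raw."""
    found = []
    for name in names:
        try:
            if text.encode(name) == raw:
                found.append(name)
        except UnicodeEncodeError:
            # This encoding cannot represent every character in text.
            pass
    return found


matching = encodings_producing(x, y, candidates)
print(matching)
```

A short sample can be consistent with more than one codec (single-byte Windows code pages often share positions for some characters), so a match is evidence for an encoding rather than proof of it.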

answered Dec 9, 2015 at 16:08

Serge Ballesta