0

I have two variables (let's say x and y) that have the following values:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

They presumably encode the same name, but in different ways. The first variable is a unicode object and the second one is a byte string.

Is there a way to transform the string into unicode (or the unicode into a string) and check whether they are really the same?

I tried using encode:

x.encode('utf-8')

It returns something new (the third version):

'Ko\xc5\xa1ick\xc3\xbd'

And using the following:

print x.encode('utf-8')

returns yet another version:

KošickÛ

So, I am totally confused. Is there a way to keep everything in the same format?

6
  • your y is missing something... I've checked it on my python IDLE debugger, and x is Kosicky and y is Koicky (missing the s). Commented Dec 9, 2015 at 16:08
  • @Neoares Your IDLE seems to lack the proper glyphs. x is "Košický" here. Commented Dec 9, 2015 at 16:11
  • @tripleee then it's fine :) Anyway, what IDLE do you use? Commented Dec 9, 2015 at 16:13
  • I don't use the simple IDLE which ships with Python at all. This was with the basic Python REPL on the OSX command line, but I would expect the same behavior on any modern platform (which oddly still seems to exclude Windows, or at least some popular versions). Commented Dec 9, 2015 at 17:16
  • I believe the right encoding is cp1252 ... Commented Dec 9, 2015 at 18:27

3 Answers

2

You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.

if y.decode('iso-8859-1') == x:
    # Report the match only when the decoded bytes equal the Unicode string
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))

With your given example, the comparison is false, so y is presumably in some encoding other than ISO-8859-1.

In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).
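To see why the ISO-8859-1 comparison fails, it can help to look at what the byte 0x9a decodes to under each candidate encoding. The following is only a minimal sketch: it uses the `b''` prefix so that it behaves the same on Python 2 and Python 3, and cp1252 is included purely as an assumed alternative candidate.

```python
x = u'Ko\u0161ick\xfd'
y = b'Ko\x9aick\xfd'  # b'' prefix: a byte string on both Python 2 and 3

latin1 = y.decode('iso-8859-1')
cp1252 = y.decode('cp1252')

# In ISO-8859-1, byte 0x9a maps to the control character U+009A,
# so the decoded text differs from x at index 2.
print(repr(latin1[2]), latin1 == x)

# In cp1252, the same byte maps to U+0161 (s with caron), which is what x holds.
print(repr(cp1252[2]), cp1252 == x)
```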

1

You need to know the encoding of the byte string. It looks like windows-1252:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

print x == y.decode('windows-1252')
print x.encode('windows-1252') == y

Output:

True
True

Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
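The decode-early, encode-late pattern described above could look roughly like this. It is a sketch under assumptions: the input is taken to be cp1252 (as guessed for this question), the output encoding is chosen as UTF-8, and the function name `process_name` is hypothetical.

```python
def process_name(raw_in, in_encoding='cp1252', out_encoding='utf-8'):
    """Decode at the input boundary, work in Unicode, encode at the output."""
    text = raw_in.decode(in_encoding)      # bytes -> Unicode on input
    processed = text.upper()               # all processing happens on Unicode
    return processed.encode(out_encoding)  # Unicode -> bytes on output

raw_out = process_name(b'Ko\x9aick\xfd')
print(repr(raw_out))
```

Keeping the two encodings as parameters means the Unicode processing in the middle never needs to know how the bytes were stored.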

0

Well, utf-8 is now the de facto standard for interchange and in the Linux world, but there are plenty of other encodings.

Common examples are latin1, latin9 (similar to latin1, but with the € symbol), and cp1252, a Windows variant of latin1.

In your case:

>>> x.encode('cp1252')
'Ko\x9aick\xfd'

So the y string seems to be cp1252-encoded.
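When the encoding is unknown, one option is to try a handful of candidates and see which ones reproduce the known Unicode string. This is only a sketch: the helper name `matching_encodings` and the candidate list are assumptions made for illustration, and more than one encoding can match a short string, so a match is a hint rather than proof.

```python
def matching_encodings(text, raw, candidates):
    """Return the candidate encodings under which `raw` decodes to `text`."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == text:
                matches.append(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; skip it.
            pass
    return matches

x = u'Ko\u0161ick\xfd'
y = b'Ko\x9aick\xfd'
candidates = ['utf-8', 'latin1', 'iso-8859-15', 'cp1250', 'cp1252']
result = matching_encodings(x, y, candidates)
print(result)
```

For this particular byte string, cp1250 (the Windows Central European code page) round-trips as well as cp1252, because both place š at 0x9a and ý at 0xfd.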
