I am trying to create a binary converter with Python, but I encounter some strange codes:

>>> print '\x97'
—
>>> print '\x96'
–
>>> print '\x94'
”
>>> print '\x95'
•

What is that encoding called?

2 Answers

2

That encoding could be ANY of the nine Windows single-byte "ANSI" encodings, cp1250 to cp1258 inclusive:

>>> guff = "\x97\x96\x94\x95"
>>> uguff0 = guff.decode('1250')
>>> all(guff.decode(str(e)) == uguff0 for e in xrange(1251, 1259))
True
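
For readers on Python 3, the same check might look like the sketch below. It assumes the standard codec names cp1250 to cp1258; the variable names `reference` and `same_everywhere` are only illustrative. The main differences are that byte strings need a `b` prefix and `xrange` has become `range`.

```python
# Python 3 sketch: byte strings need the b prefix, and xrange is now range.
guff = b"\x97\x96\x94\x95"

# Decode once with cp1250 to get a reference result.
reference = guff.decode("cp1250")

# Compare against the remaining Windows code pages, cp1251 to cp1258.
same_everywhere = all(
    guff.decode("cp%d" % page) == reference for page in range(1251, 1259)
)
print(same_everywhere)
```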

Usage:

1250: Central/Eastern Europe languages with Latin-based alphabets e.g. Polish, Czech, Slovak, Hungarian
1251: Cyrillic alphabet e.g. Russian
1252: Western European languages with Latin-based alphabets
The others cover Greek (1253), Turkish (1254), Hebrew (1255), Arabic (1256), the Baltic languages (1257), and Vietnamese (1258).

To find out what is in use on your computer:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

Here's what the codes mean:

>>> from unicodedata import name
>>> for c in uguff0:
...     print repr(c), name(c)
...
u'\u2014' EM DASH
u'\u2013' EN DASH
u'\u201d' RIGHT DOUBLE QUOTATION MARK
u'\u2022' BULLET
>>>
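
A rough Python 3 version of the lookup above, in case it helps. The helper `describe_bytes` is a hypothetical name, not part of any library, and cp1252 is assumed as the default only because it is the encoding reported in this answer.

```python
from unicodedata import name


def describe_bytes(data, encoding="cp1252"):
    """Return (byte value, code point, Unicode name) for each byte in data.

    Hypothetical helper for illustration; assumes a single-byte encoding.
    """
    rows = []
    for byte in data:  # iterating over bytes yields ints in Python 3
        char = bytes([byte]).decode(encoding)
        rows.append((byte, "U+%04X" % ord(char), name(char)))
    return rows


for row in describe_bytes(b"\x97\x96\x94\x95"):
    print(row)
```

Each row pairs the raw byte value with the Unicode code point it maps to, which makes the difference between the two numbers visible.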

1

'\x97' is a hex escape inside a string literal. It means: take the hex value 97, which is 151 in decimal, and put the byte with that value in the string.

In the Windows code pages, byte 151 is the em dash, 150 is the en dash, 148 is the closing double quotation mark, and 149 is the bullet. Keep in mind that these are Windows code page byte values, not Unicode code points.
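
To make the distinction concrete, here is a small Python 3 sketch. It assumes cp1252 as the code page; the variable names are illustrative only.

```python
# The escape \x97 is just the byte value 0x97, which is 151 in decimal.
raw = b"\x97"
assert raw[0] == 0x97 == 151

# Read as a cp1252 byte, it is the em dash, whose code point is U+2014.
as_cp1252 = raw.decode("cp1252")
print(hex(ord(as_cp1252)))

# The Unicode code point U+0097 is a different character (a C1 control code).
as_code_point = chr(0x97)
print(as_cp1252 == as_code_point)
```

The byte value 151 and the code point U+2014 only line up once a code page is chosen to decode the byte.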
