I've got a list of strings, along the lines of list=[a,b,c,d,e].

When I call list[2], the string c is displayed as ASCII; when I call print list[2], however, it's displayed as unicode. Why does this discrepancy exist?

  • For similar reasons to why "123" displays differently than print "123". Commented Feb 9, 2016 at 17:47
  • Could you show an unedited transcript of the phenomenon, please? We don't know what you mean by "calling" - neither strings nor print statements are "callable" in Python jargon - and we also don't know what you mean by "ascii" and "unicode". Commented Feb 9, 2016 at 17:47

2 Answers

This is mainly because plain strings (str) in Python 2 are byte strings, not text strings.

I suppose you are in a REPL environment (a Python console). When you evaluate an expression in the console, it displays the expression's representation, which is the same as calling print repr() on the expression:

l = ['ñ']
l[0] # should output '\xc3\xb1'
print repr(l[0]) # should output the same

This is because your console is in UTF-8 mode (if you get a different representation for ñ, your console uses some other encoding), so when you type ñ you are actually entering two bytes: 0xc3 and 0xb1.
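To make the point about console encodings concrete, here is a minimal sketch that encodes ñ under two different encodings and shows the resulting bytes. The variable names are illustrative, and latin-1 is only an assumed example of "some other encoding"; the code is written to run on both Python 2 and Python 3.

```python
# u'\xf1' is the escape for ñ, used so this snippet stays pure ASCII.
n_tilde = u'\xf1'

utf8_bytes = n_tilde.encode('utf-8')      # two bytes: 0xc3 0xb1
latin1_bytes = n_tilde.encode('latin-1')  # one byte: 0xf1

# bytearray yields integers on both Python 2 and Python 3.
utf8_hex = [hex(b) for b in bytearray(utf8_bytes)]
latin1_hex = [hex(b) for b in bytearray(latin1_bytes)]

print(utf8_hex)    # ['0xc3', '0xb1']
print(latin1_hex)  # ['0xf1']
```

A console configured for latin-1 would therefore hand Python a single byte for ñ, and the echoed representation would be '\xf1' rather than '\xc3\xb1'.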

repr() is a built-in Python function that always returns a string. For built-in types such as strings and numbers, this string is valid source code that rebuilds the value passed as the argument. In this case it returns a string containing escape sequences that, when evaluated, recreate the original string with ñ encoded as UTF-8. To see this:

repr(l[0]) # should print a string within a string: "'\\xc3\\xb1'"

So when you print that result (which is what the console does when you evaluate l[0]), you get the same string without the outer quotes and with the doubled backslashes collapsed. That is:

print repr(l[0]) # should output '\xc3\xb1'

But when you print the value itself, i.e. print l[0], you send those two bytes to the console. As the console is in UTF-8 mode, it decodes the sequence and renders it as a single character: ñ. So:

print l[0] # should output ñ
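The difference between the echoed form and the printed form can be checked without relying on a particular terminal. The sketch below uses a b'' literal so that it behaves the same on Python 2 and Python 3 (on Python 3 the repr additionally carries a leading b); the variable names are illustrative.

```python
raw = b'\xc3\xb1'   # the two UTF-8 bytes that encode ñ

# What the console echoes for a bare expression: the repr().
echoed = repr(raw)

# Every character of the repr is plain ASCII, because the non-ASCII
# bytes are written out as backslash escapes.
echo_is_ascii = all(ord(ch) < 128 for ch in echoed)

# What a UTF-8 terminal effectively does with the raw bytes sent by print.
rendered = raw.decode('utf-8')

print(echoed)
print(echo_is_ascii)   # True
print(len(rendered))   # 1
```

This is why the bare expression looks like "ASCII" while print appears to produce "unicode": the first shows escapes, the second lets the terminal decode the bytes.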

If you want to store text strings, you must prefix the string literal with u, like this:

text = u'ñ'

Now, when evaluating text you will see its Unicode codepoint:

text # should output u'\xf1'

And printing it should recreate the ñ glyph:

print text # should output ñ

If you want to convert text into a byte string representation, you need an encoding scheme (such as UTF-8):

text.encode('utf-8') == l[0] # should output True

Similarly, if you want the Unicode representation for l[0], you'll need to decode those bytes:

l[0].decode('utf-8') == text # should output True
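As a rough illustration of why the choice of encoding matters in both directions, the following sketch round-trips ñ through UTF-8 and then shows two ways it can go wrong: encoding to a codec that cannot represent the character, and decoding with the wrong codec. Names are illustrative; the code should run on both Python 2 and Python 3.

```python
text = u'\xf1'  # ñ as a text (Unicode) string

# Round trip: encode to bytes, decode back to the same text.
data = text.encode('utf-8')
round_trip_ok = data.decode('utf-8') == text

# ASCII cannot represent ñ, so a strict encode raises an error.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# With errors='replace' the character is substituted by '?'.
replaced = text.encode('ascii', 'replace')

# Decoding UTF-8 bytes as latin-1 yields two wrong characters (mojibake).
mojibake = data.decode('latin-1')

print(round_trip_ok)   # True
print(ascii_failed)    # True
print(len(mojibake))   # 2
```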

All this said, note that in Python 3 default strings are Unicode strings, and you need to prefix the literal with b to produce byte strings.
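A small sketch of the Python 3 behaviour described above. Unlike the rest of this answer, it assumes a Python 3 interpreter; on Python 2 the results would differ, because there both literals are byte strings.

```python
# Assumes Python 3.
text_literal = 'abc'    # str: a sequence of Unicode code points
byte_literal = b'abc'   # bytes: a sequence of integers 0..255

text_type = type(text_literal).__name__
byte_type = type(byte_literal).__name__

# The two types never compare equal, even with identical ASCII content.
same = text_literal == byte_literal

# Indexing bytes gives an integer, not a one-character string.
first_byte = byte_literal[0]

# An explicit decode is needed to get text out of bytes.
decoded_matches = byte_literal.decode('ascii') == text_literal

print(text_type, byte_type)  # str bytes
print(same)                  # False
print(first_byte)            # 97
print(decoded_matches)       # True
```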

It's because those two ways of displaying a string take different routes to the final result. x by itself in the REPL invokes repr(x) and displays the result, whereas print x displays the str(x) form instead. Classes are allowed to define __repr__ and __str__ separately, so the two don't always return the same value.

>>> x = u"a"
>>> x
u'a'
>>> print x
a
>>> repr(x)
"u'a'"
>>> str(x)
'a'
>>>
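To illustrate how a class can define the two methods separately, here is a minimal sketch with a hypothetical Temperature class (not from the question). It also shows that a list uses repr() for its elements, even when the list itself is converted with str().

```python
class Temperature(object):
    """Hypothetical example class with distinct repr and str forms."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __repr__(self):
        # Unambiguous, source-like form: what the REPL echoes.
        return 'Temperature(%r)' % (self.degrees,)

    def __str__(self):
        # Human-friendly form: what print displays.
        return '%s degrees' % (self.degrees,)


t = Temperature(21.5)

print(repr(t))   # Temperature(21.5)
print(str(t))    # 21.5 degrees
print(str([t]))  # [Temperature(21.5)]
```

The last line is worth noting for lists of strings: printing the whole list shows each element in its repr() form, while printing a single element shows its str() form.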
