
I am trying to understand how strings work in Python and am having a tough time deciphering the various functionalities. Here's what I understand so far; hoping to get corrections and new perspectives on how to remember these nuances.

  • Firstly, I know that Unicode evolved to accommodate multiple languages and accented characters across the world. But how does Python store strings? If I define s = 'hello', what is the encoding in which the string s is stored? Is it Unicode? Or does it store plain bytes? On doing type(s) I got the answer as <type 'str'>. However, when I did us = unicode(s), us was of the type <type 'unicode'>. Is us a str type, or is there actually a unicode type in Python?

  • Also, I know that to save space we encode strings as bytes using the encode() function. So suppose bs = s.encode('utf-8', errors='ignore') returns a bytes object. Now, when I am writing bs to a file, should I open the file in wb mode? I have seen that if it is opened in w mode, the string is stored in the file as b"<content in s>".

  • What does the decode() function do? (I know, the question is too open-ended.) Is it that we apply it to a bytes object and it transforms the string into our chosen encoding? Or does it always convert back to a Unicode sequence? Can any other insights be drawn from the following lines?

>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>
  • How does str(object) work? I read that it will try to execute the __str__() function defined on the object's class. But how differently does this function act on, say, Unicode strings and regular byte strings?
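As a quick sketch of that last point (the Point class is made up purely for illustration): in Python 3, str(object) dispatches to the object's __str__() method. For a str it returns the string itself, while for bytes it returns the repr, which is exactly where stray b'...' text in files comes from.

```python
class Point:
    def __str__(self):               # str(obj) dispatches here
        return 'Point(1, 2)'

print(str(Point()))                  # Point(1, 2)
print(str('hello'))                  # hello -- a str comes back unchanged
print(str(b'hello'))                 # b'hello' -- the repr, not a decode
print(b'hello'.decode('ascii'))      # hello -- an explicit decode is needed
```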

Thanks in advance.

  • python 2 or python 3? Commented Mar 23, 2019 at 9:56
  • @Jean-FrançoisFabre python 3. I know that there is some revamping in the str() function from python 2 to python 3. Commented Mar 23, 2019 at 10:18
  • "However, when I did us = unicode(s)": you mean in python 2, since unicode has been removed in python 3... Commented Mar 23, 2019 at 10:26
  • Now it's a mix of Python 2 and 3 because in Python 3 type(us) gives <class 'str'> and there's no unicode type. Commented Mar 23, 2019 at 10:35
  • There's a big misconception in your question. You don't encode a string into bytes "to save space": you encode a string into bytes so you can have bytes. It's best to ignore the fact that strings are represented as bytes in memory: that's an implementation detail of strings. Any time you read or write strings from any file or device, you're converting to and from bytes. You can do this explicitly by opening a file in binary mode and calling encode/decode yourself, or you can pass an encoding and have the library do it for you. But either way some conversion is necessary. Commented Mar 23, 2019 at 10:57
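To illustrate that last comment, both routes below end up doing the same str-to-bytes conversion (the file name and temporary directory are made up for the example):

```python
import os
import tempfile

s = 'héllo'
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # throwaway file

# Explicit: binary mode, we call encode/decode ourselves.
with open(path, 'wb') as f:
    f.write(s.encode('utf-8'))
with open(path, 'rb') as f:
    assert f.read().decode('utf-8') == s

# Implicit: text mode with an encoding; the library converts for us.
with open(path, 'w', encoding='utf-8') as f:
    f.write(s)
with open(path, 'r', encoding='utf-8') as f:
    assert f.read() == s
```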

1 Answer


Important: the behavior described below is that of Python 3. Python 2 has some conceptual similarities, but the observable behavior differs.

In a nutshell: due to Unicode support, the string object in Python 3 is a higher-level abstraction, and it's up to the interpreter how to represent it in memory. So when it comes to serialization (e.g. writing a string's textual representation to a file), one needs to explicitly encode it to a byte sequence first, using a specified encoding (e.g. UTF-8). The same is true for the bytes-to-string conversion, i.e. decoding. In Python 2 the same behavior can be achieved using the unicode class, while str is essentially a synonym for bytes.
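To make the encode/decode pairing concrete, a minimal Python 3 round trip (the sample word is arbitrary):

```python
s = 'naïve'                    # str: a sequence of Unicode code points
b = s.encode('utf-8')          # bytes: b'na\xc3\xafve'
print(len(s), len(b))          # 5 6 -- 'ï' takes two bytes in UTF-8
assert b.decode('utf-8') == s  # decoding restores the original str
```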

While it's not a direct answer to your question, have a look at these examples:

import sys

e = ''
print(len(e))            # 0
print(sys.getsizeof(e))  # 49

a = 'hello'
print(len(a))            # 5
print(sys.getsizeof(a))  # 54

u = 'hello平仮名'
print(len(u))                 # 8
print(sys.getsizeof(u))       # 90
print(len(u[1:]))             # 7
print(sys.getsizeof(u[1:]))   # 88
print(len(u[:-1]))            # 7
print(sys.getsizeof(u[:-1]))  # 88
print(len(u[:-2]))            # 6
print(sys.getsizeof(u[:-2]))  # 86
print(len(u[:-3]))            # 5
print(sys.getsizeof(u[:-3]))  # 54
print(len(u[:-4]))            # 4
print(sys.getsizeof(u[:-4]))  # 53

j = 'hello😋😋😋'
print(len(j))                 # 8
print(sys.getsizeof(j))       # 108
print(len(j[:-1]))            # 7
print(sys.getsizeof(j[:-1]))  # 104
print(len(j[:-2]))            # 6
print(sys.getsizeof(j[:-2]))  # 100

Strings are immutable in Python, which lets the interpreter decide, at creation time, how the string will be encoded. Let's review the numbers from above:

  • An empty string object has an overhead of 49 bytes.
  • A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
  • A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though its length is still 8.
  • The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the prefix of the string could be encoded with 1 byte per symbol. This gives us constant-time indexing into strings, but we pay for it with extra memory overhead.
  • The memory footprint of string j is even higher: not all of its symbols can be encoded in 2 bytes per symbol, so the interpreter uses 4 bytes per symbol.

OK, let's keep checking the behavior. We already know that the interpreter stores a string using a fixed number of bytes per symbol, so as to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's prove it:

j = 'hello😋😋😋'
b = j.encode('utf8')  # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'    
print(len(b))  # 17

So we can see that the first 5 characters are encoded using 1 byte per symbol, while the remaining 3 symbols are encoded using (17 - 5) / 3 = 4 bytes per symbol. This also explains why Python uses the 4-bytes-per-symbol representation under the hood for this string.
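The variable-width behavior can also be checked character by character (the sample characters are chosen to hit each UTF-8 width):

```python
for ch in 'h', 'é', '€', '😋':
    print(repr(ch), len(ch.encode('utf-8')))
# 'h' -> 1, 'é' -> 2, '€' -> 3, '😋' -> 4: UTF-8 spends 1 to 4 bytes per code point
```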

And the other way around: when we have a sequence of bytes and decode it to a string, the interpreter decides on the internal string representation (1, 2, or 4 bytes per symbol), and that choice is completely opaque to the programmer. The only thing that must be made explicit is the encoding of the byte sequence: we must tell the interpreter how to treat the bytes, while letting it decide on the internal representation of the string object.
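Since the encoding of a byte sequence is the one thing the interpreter cannot guess, decoding with the wrong codec either fails outright or silently produces mojibake; a quick sketch:

```python
data = 'naïve'.encode('utf-8')       # b'na\xc3\xafve'
print(data.decode('utf-8'))          # naïve  -- correct codec round-trips
print(data.decode('latin-1'))        # naÃ¯ve -- wrong guess: mojibake
try:
    data.decode('ascii')             # ASCII has no byte 0xc3
except UnicodeDecodeError as e:
    print('ascii failed at byte offset', e.start)   # 2
```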


6 Comments

Thanks. This is a very informative answer. So, Python decides how many bytes to use per symbol dynamically?
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
@rajiv_ sys.getsizeof shows the memory footprint of a python object, which almost all the time is bigger than an underlying payload. For bytes it makes sense to check the size of the data by using len(bytes) since every element in a sequence is a single byte.
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
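For what it's worth, the gap between sys.getsizeof and len for a bytes object is a fixed per-object overhead. As the comment above stresses, the exact number is a CPython implementation detail, so the sketch below compares against an empty bytes object instead of hard-coding it:

```python
import sys

b = 'hello😋😋😋'.encode('utf8')
print(len(b))                        # 17 -- the actual payload
# The object overhead is the same as that of b'' (CPython-specific detail).
print(sys.getsizeof(b) - len(b) == sys.getsizeof(b''))   # True
```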
