Based on my own reading (including this article), it seems that Python assumes UTF-8 by default: strings are read in under the assumption that they are UTF-8 encoded (another source).
Those strings are then stored internally as plain Unicode, using Latin-1, UCS-2, or UCS-4 for the entire string depending on the highest code point it contains. This seems to match what I've seen in the terminal: the character Ǧ has Unicode code point 486, which is too large for Latin-1, so the string needs at least UCS-2.
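For reference, this is just how I verified the 486 figure (nothing beyond the standard ord() builtin):

ord("Ǧ")   # 486, i.e. 0x01E6 -- larger than 255, so it can't be stored in one byte per character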
import sys

string1 = "Ǧ"
sys.getsizeof(string1)   # 76
string1 = "Ǧa"
sys.getsizeof(string1)   # 78 -- the 'a' appears to take two bytes here
string2 = "a"
sys.getsizeof(string2)   # 50
string2 = "aa"
sys.getsizeof(string2)   # 51 -- here the 'a' takes only one byte
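To make the base sizes I mention below explicit: I inferred the fixed overhead by subtracting the per-character bytes from the totals above (these numbers are from my 64-bit CPython 3 build, so they may differ on other builds):

import sys

sys.getsizeof("a") - 1 * 1   # 49: apparent base overhead of the 1-byte (Latin-1) form
sys.getsizeof("Ǧ") - 1 * 2   # 74: apparent base overhead of the 2-byte (UCS-2) form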
I have two questions. First, when printing to the terminal, what is the process by which strings are encoded and decoded? If we call print(), is the string first encoded to UTF-8 (from UCS-2 or Latin-1 in these examples), and then decoded by the terminal in order to draw it on screen? Second, what explains the large fixed overhead in the sizes? Why do strings stored as Latin-1 have a base size of 49 bytes, while strings stored as UCS-2 have a base size of 74 bytes?
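In case it matters for the first question, this is what my terminal's output stream reports as its encoding (I'm assuming print() writes through sys.stdout here):

import sys

sys.stdout.encoding   # 'utf-8' on my machine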
Thanks!