What is Encoding
A file consists of bytes. You can represent each byte with a number between 0 and 255 (or 0x00 and 0xFF in hexadecimal).
Text is also written as bytes. There is an agreement on the way text is written. That is an encoding. The most basic encoding is ASCII and other encodings are usually based on it. For example, ASCII defines that number 65 (0x41) represents 'A', 66 (0x42) represents 'B' etc.
How are Strings Represented
In python, you can define a string using numeric values:
>>> '\x41\x42\x43'
'ABC'
'\x41\x42\x43' is exactly the same thing as 'ABC'. Python will always represent the string using the more readable textual representation ('ABC').
However, some numeric values are not printable characters, so they will be represented in numeric form:
>>> '\x00\x01\x02\x03\x04'
'\x00\x01\x02\x03\x04'
Others characters have aliases to make your job easier:
>>> '\x0a\x0d\x09'
'\n\r\t'
Different Encodings
ASCII table defines meaning of numbers 0-127 and includes only the english alphabet. Numbers 128-255 are undefined. So, other encodings define a meaning for 128-255. Yet others change the meaning of the whole range 0-255.
There are many encodings and they define 128-255 differently.
For example, character 185 (0xB9) is ą in windows-1250 encoding, but it is š in iso-8859-2 encoding.
So, what happens if you print \xb9? It depends on the encoding used in the console. In my case (my console uses cp852 encoding) it is:
>>> print '\xb9'
╣
Because of that ambiguity, string '\xb9' will never be represented as '╣' (nor 'ą'...). That would hide the true value. It will be represented as the numeric value:
>>> '\xb9'
'\xb9'
Also:
>>> '╣'
'\xb9'
See also the string from the question in my console:
>>> content = '\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> print content
ńŻáňąŻ ńŞşňŤŻ Hello China 1 2 3
But what happens if variable is just entered in the console?
When a variable is enteren in cosole without print, its representation is printed. It is the same as the following:
>>> print repr(content)
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
What is Unicode?
Unicode table aims to define a numeric representation of all characters in the world and more. It can actually do that, because it is not limited to 256 values (or to any other limit actually). This is not an encoding, but a universal mapping of numbers to characters.
For example, unicode defines that number 353 (0x0161) is character š. That is allways true regardless of your locale and encodings you use. That character can be stored in files (or memory) in any encoding which supports š.
What is UTF-8?
When encoding a unicode character, one can use any encoding, but not all of them will support all characters.
For example, š (unicode 0x0161) can be encoded in iso-8869-2 as 0xB9, but it cannot be encoded in iso-8869-1 at all.
So, to be able to encode anything, you need an encoding which supports every unicode character. UTF-8 is one of those encodings, but there are others:
>>> u'\u0161'.encode('utf-7')
'+AWE-'
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> u'\u0161'.encode('utf-16le')
'a\x01'
>>> u'\u0161'.encode('utf-16be')
'\x01a'
>>> u'\u0161'.encode('utf-32le')
'a\x01\x00\x00'
>>> u'\u0161'.encode('utf-32be')
'\x00\x00\x01a'
The good thing about utf-8 is that the whole ASCII range is unchanged and as long as only ASCII is used, only one byte is used per character:
>>> u'abcdefg'.encode('utf-8')
'abcdefg'
Unicode in Python 2
Important: This is really specific to Python 2. Python 3 is different.
Unlike str objects, which are strings of bytes, unicode objects are strings of unicode characters.
They can be encoded into a str in chosen encoding, or decoded from str in chosen encoding.
A unicode string is specified using u before the opening quote. The characters inside are interpreted using current encoding, or they can be specified in numeric format \uHEX:
>>> u'ABCD'
u'ABCD'
>>>
>>> u'\u0041\u0042\u0043'
u'ABC'
>>> u'šâů'
u'\u0161\xe2\u016f'
And Now the Answers
First Question
contents prints repr(contents)
print contents prints contents
Second Question
UTF-8 strings are byte strings (str). You get them by encoding the unicode:
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> '\xc5\xa1'.decode('utf-8')
u'\u0161'
So yes, encode converts unicode to str. The str can be utf-8, but it does not have to be.
Third Question
A) "Why the chinese character turn into utf-8 code when I do .split()?"
They were utf-8 all the time.
B) "I thought fw.write('{0}'.format(content_list).decode('utf-8')) will work"
content_list is not a string. It is a list. When a list is converted to a string, it is done using its repr, which also does repr of all of the contents.
For example:
>>> 'a \n a \n a'
'a \n a \n a'
>>> print 'a \n a \n a'
a
a
a
>>> print ['a \n a \n a']
['a \n a \n a']
The last print printed repr(list) which contains repr(str).