Python convert and save unicode string to a list

Question

I need to insert a series of names (like 'Alam\xc3\xa9') into a list, and then save them into a SQLite database.

I know that I can render these names correctly by typing:

print eval(repr(NAME)).decode("utf-8")

But I have to insert them into a list, so I can't use print.

Is there another way to do this without print?

Are you trying to store bytes or characters in the database?
– wberry, Commented Oct 14, 2011 at 15:31

Daniel Roseman · Accepted Answer · 2011-10-14 19:51:57Z

6

Lots and lots of misconceptions here.

The string you quote is not Unicode. It is a byte string, encoded in UTF-8.

You can convert it to Unicode by decoding it:

unicode_name = name.decode('utf-8')

When you print the value of unicode_name to the console, you will see one of two things:

>>> unicode_name
u'Alam\xe9'
>>> print unicode_name
Alamé

Here, you can see that just typing the name and pressing Enter shows a representation of the Unicode code points. This is the same as typing print repr(unicode_name). However, print unicode_name prints the actual characters: behind the scenes, it encodes the string to the correct encoding for your terminal and prints the result.
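The same distinction can be reproduced on Python 3, where the byte string becomes a bytes literal and decoding yields a str. This is only a sketch under that assumption; ascii() is used because it escapes non-ASCII characters the way repr did in Python 2.

```python
# Python 3 sketch: bytes play the role of the Python 2 byte string.
name = b'Alam\xc3\xa9'               # UTF-8 encoded data, 6 bytes
unicode_name = name.decode('utf-8')  # text made of code points, 5 characters

# ascii() escapes non-ASCII characters, as repr did in Python 2.
escaped = ascii(unicode_name)

print(len(name), len(unicode_name))
print(escaped)
```

Printing unicode_name itself would show the accented character, provided the terminal encoding can represent it.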

But this is all irrelevant, because Unicode strings can only be represented internally. As soon as you want to store it in a database, or a file, or anywhere, you need to encode it. And the most likely encoding to choose is UTF-8 - which is what it was in originally.

>>> name
'Alam\xc3\xa9'
>>> print name
Alamé

As you can see, using the original non-decoded version of the name, repr and print once again show the codes and the characters. So it's not that converting it to Unicode actually makes it any more "really" the correct character.

So, what to do if you want to store it in a database? Nothing. Nothing at all. SQLite accepts UTF-8 input, and stores its data in UTF-8 format on the disk. So there is absolutely no conversion needed to store the original value of name in the database.
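The answer above is written for Python 2. As a rough sketch of the same list-to-database flow on Python 3, the example below decodes the names once and binds them as text, since Python 3's sqlite3 module stores bytes values as BLOBs rather than text. The table name people, the second sample name, and the in-memory database are assumptions made for illustration.

```python
import sqlite3

# Names as UTF-8 byte strings, as in the question (Python 3 bytes literals).
raw_names = [b'Alam\xc3\xa9', b'Jos\xc3\xa9']

# Decode once at the boundary so the list holds text, not bytes.
names = [raw.decode('utf-8') for raw in raw_names]

conn = sqlite3.connect(':memory:')               # throwaway in-memory database
conn.execute('CREATE TABLE people (name TEXT)')  # hypothetical table
# Parameter binding passes the text to SQLite; no manual escaping is needed.
conn.executemany('INSERT INTO people (name) VALUES (?)',
                 [(n,) for n in names])
conn.commit()

# Read the rows back in insertion order.
stored = [row[0] for row in
          conn.execute('SELECT name FROM people ORDER BY rowid')]
conn.close()
```

The values read back compare equal to the decoded list, and encoding any of them with UTF-8 gives back the original bytes.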


1 Comment

Marco L. Over a year ago

Thank you very much, now I understand a little more. One last thing: everything is OK now, with one exception: \u00f2 is printed as is, instead of ò. Do you know why?

retracile · Answer · 2011-10-14 15:22:18Z

0

Are you looking for something like this?

[n.decode("utf-8") for n in ['Alam\xc3\xa9', 'Alam\xc3\xa9', 'Alam\xc3\xa9']]


3 Comments

retracile Over a year ago

Which is the same thing eval(repr('Alam\xc3\xa9')).decode("utf-8") will produce. What are you trying to do?

Marco L. Over a year ago

Exactly. In fact, eval(repr('Alam\xc3\xa9')).decode("utf-8") on its own is also incorrect; the trick is done by the print before it.

Thomas K Over a year ago

The print statement is just attempting to display the Unicode characters, whereas repr() doesn't (in Python 2). u'\xe9' is just how the character é appears in a repr. So that is what you want to save.
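This is also why a list of names can look wrong when the whole list is displayed: a container shows the escaped form of each element, even though the elements are ordinary text. A small sketch, assuming Python 3, with ascii() standing in for Python 2's repr:

```python
# Three copies of the decoded name, as in the list comprehension above.
names = [b'Alam\xc3\xa9'.decode('utf-8')] * 3

# Displaying the container escapes each element...
as_container = ascii(names)
# ...while the elements themselves are plain text that can be joined or stored.
joined = ', '.join(names)

print(as_container)
```

The escapes belong to the display only; the strings in the list are already the values to save.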
