
What's the best way to convert every string in a nested list (a list containing other lists) to unicode in Python?

For example:

[['a','b'], ['c','d']]

to

[[u'a', u'b'], [u'c', u'd']]
  • Is your list specifically always a list of lists of strings, or is the nesting arbitrary? Commented Aug 9, 2013 at 21:34
  • There are two halves of this question. First, there's "how do I convert a string to Unicode". And if you don't think that's a real question, you definitely need to read Horner6's answer. Second, there's, "assuming I know how I want to convert each string to Unicode, how do I map it across this data structure". If that's the only part you're asking, it would be clearer to show how you want to convert each string. Commented Aug 9, 2013 at 21:35
  • @Peter DeGlopper: yes, it's always a list of lists of strings Commented Aug 9, 2013 at 21:36
  • @user2635863 In Python 2, plain strings (str) are sequences of bytes. You need to know which encoding they use before you "decode" them into a unicode string. Commented Aug 9, 2013 at 21:40
  • I want to use non-English characters, that's why I'd like to convert everything to unicode. Commented Aug 9, 2013 at 21:46

3 Answers

>>> li = [['a','b'], ['c','d']]

>>> [[v.decode("UTF-8") for v in elem] for elem in li]
[[u'a', u'b'], [u'c', u'd']]

9 Comments

I think it was better before you added the "with unicode() function" part. That's almost always the wrong thing to do, and your original answer was dead-simple and near-perfect.
@abarnert. Can you explain a little further? Is there any difference between the two?
My understanding was that the unicode constructor implicitly decoded. I'm curious to see if there's a difference too.
Your first edit did unicode(v). The difference is obvious: that's using the default encoding, which is usually 'ascii', and almost always wrong. Your second edit changed it to unicode(v, "UTF-8"), which is functionally equivalent to the decode call—a bit less clear, and not future-compatible to 3.x, but not actually bad. But I was responding to the first edit.
@abarnert. Yeah, I read that unicode() is not there in 3.x, so I removed it. But maybe I'll add it back and make a note of it.
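
The difference discussed in these comments can be sketched as follows. This is a minimal illustration, not part of the original answer: the sample data `raw` is hypothetical, and `decode("ascii")` stands in for the implicit default that the comments attribute to `unicode(v)` on Python 2. The `b""` literals are used so the snippet behaves the same on Python 2.6+ and Python 3.

```python
# Hypothetical sample data: byte strings as they might arrive from a file.
# b"caf\xc3\xa9" is the UTF-8 encoding of "cafe" with an acute accent on the e.
raw = [[b"caf\xc3\xa9", b"b"], [b"c", b"d"]]

# Explicit UTF-8 decoding handles the non-ASCII bytes.
decoded = [[v.decode("utf-8") for v in row] for row in raw]

# Decoding as ASCII (assumed here to mirror the usual default encoding)
# fails as soon as a byte above 0x7F appears.
try:
    [[v.decode("ascii") for v in row] for row in raw]
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

With pure-ASCII input such as the question's example, both calls give the same result, which is why the problem only shows up once non-English characters are involved.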

Unfortunately, there isn't an easy answer with unicode. Fortunately, once you understand it, the knowledge carries over to other programming languages.

This is, by far, the best resource that I've seen for Python unicode:

http://nedbatchelder.com/text/unipain/unipain.html

Use the arrow keys (on your keyboard) to navigate to the next and previous slides.

Also, please take a look at this (and the other links from the end of that slideshow).

http://www.joelonsoftware.com/articles/Unicode.html

8 Comments

What if the strings are "Windows-1252" encoded byte strings? Decoding by guessing that they're UTF-8 is not going to help him. The only thing that will help is a fundamental understanding of text encoding so that he can manage the input and produce an expected result.
@PeterDeGlopper: The two of you have just made different guesses on which half of this problem is the hard part that the OP was (or should have been) asking about. Until we get some clarification from the OP, there's probably not much point arguing about it.
Well, basically I just want to add the "u" letter before each string so that I can use non-English characters. As far as I'm aware, UTF-8 is the character set I should use.
@user2635863: The way you're saying that strongly implies that you should go read that set of slides. (Sorry I couldn't think of a way to fit an Arrested Development/Pop-Pop joke…)
Also, you need to understand the difference between string values and string literals/displays. Adding a u to the start of the string "abc" just gives you the string "uabc"; it doesn't give you the unicode string u"abc". The u isn't part of the string any more than the quotes are.
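
The Windows-1252 concern raised in the comments above can be sketched like this. It is only an illustration under assumed sample data (the variable names and the sample text are hypothetical): the same text produces different bytes under different encodings, so guessing the wrong codec either raises an error or silently yields garbage.

```python
# Hypothetical sample text: "cafe" with an acute accent on the e.
text = u"caf\xe9"
as_utf8 = text.encode("utf-8")      # b"caf\xc3\xa9"
as_cp1252 = text.encode("cp1252")   # b"caf\xe9"

# Decoding with the matching codec round-trips cleanly.
round_trip = as_cp1252.decode("cp1252")

# Guessing UTF-8 for Windows-1252 bytes raises, because the lone byte
# 0xE9 is not a complete UTF-8 sequence.
try:
    as_cp1252.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# Guessing Windows-1252 for UTF-8 bytes does not raise, but the accented
# letter turns into two unrelated characters.
mojibake = as_utf8.decode("cp1252")
```

The second failure mode is the more dangerous one, since nothing signals that the result is wrong.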
>>> l = [['a','b'], ['c','d']]
>>> map(lambda x: map(unicode, x), l)
[[u'a', u'b'], [u'c', u'd']]

2 Comments

This is a bad idea, unless the OP really wants to decode with sys.getdefaultencoding(). And fixing it to take the encoding means either a lambda inside the lambda, or a partial inside the lambda; either way, I think it's much simpler to use a comprehension here.
Thank you for the clarification. I'll leave it here in case the OP wants to go with sys.getdefaultencoding(), since it looks nice and clear.
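
One way to pass an explicit encoding, along the lines the first comment suggests, is sketched below. `decode_nested` is a hypothetical helper name, and the `map` variant uses `functools.partial` over `bytes.decode` rather than the `unicode` built-in so that it also runs on Python 3; treat it as one possible shape, not the answerer's method.

```python
from functools import partial


def decode_nested(rows, encoding="utf-8"):
    """Decode every byte string in a list of lists (hypothetical helper)."""
    return [[v.decode(encoding) for v in row] for row in rows]


rows = [[b"a", b"b"], [b"c", b"d"]]

# Comprehension-based version with an explicit encoding.
result = decode_nested(rows)

# map-based version: partial fixes the encoding argument up front.
decode_utf8 = partial(bytes.decode, encoding="utf-8")
mapped = [list(map(decode_utf8, row)) for row in rows]
```

Both produce the same nested list; the comprehension avoids the extra `partial` object and reads closer to the accepted answer.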
