convert strings in separate lists to unicode - python

Question

What's the best way to convert every string in a list of lists to unicode in Python?

For example:

[['a','b'], ['c','d']]

to

[[u'a', u'b'], [u'c', u'd']]

Is your list specifically always a list of lists of strings, or is the nesting arbitrary? — Peter DeGlopper
– Peter DeGlopper, Commented Aug 9, 2013 at 21:34
There are two halves to this question. First, there's "how do I convert a string to Unicode". And if you don't think that's a real question, you definitely need to read Homer6's answer. Second, there's "assuming I know how I want to convert each string to Unicode, how do I map it across this data structure". If that's the only part you're asking, it would be clearer to show how you want to convert each string. — abarnert
– abarnert, Commented Aug 9, 2013 at 21:35
@Peter DeGlopper: yes, it's always a list of lists of strings — HappyPy
– HappyPy, Commented Aug 9, 2013 at 21:36
@user2635863 In Python 2, strings are arrays of bytes. You need to know which encoding they are in before you "decode" them into a Unicode structure. — Homer6
– Homer6, Commented Aug 9, 2013 at 21:40
I want to use non-english characters, that's why I'd like to convert everything to unicode. — HappyPy
– HappyPy, Commented Aug 9, 2013 at 21:46

Rohit Jain · Accepted Answer · 2013-08-09 21:38:07Z

3

>>> li = [['a','b'], ['c','d']]

>>> [[v.decode("UTF-8") for v in elem] for elem in li]
[[u'a', u'b'], [u'c', u'd']]
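
A minimal sketch of the same comprehension wrapped in a helper, assuming the input is always a list of lists of byte strings and that their encoding is known. The name `decode_nested` and its `encoding` parameter are hypothetical, not part of the answer. The `b'...'` literals are used so the snippet behaves the same way on Python 2.6+ and 3.x.

```python
def decode_nested(rows, encoding="utf-8"):
    """Decode every byte string in a list of lists (hypothetical helper)."""
    return [[value.decode(encoding) for value in row] for row in rows]


li = [[b'a', b'b'], [b'c', b'd']]
decoded = decode_nested(li)
print(decoded)
```

Passing the encoding as a parameter keeps the assumption about the input visible at the call site instead of burying it in the comprehension.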

edited Aug 9, 2013 at 21:38

answered Aug 9, 2013 at 21:32

Rohit Jain


9 Comments

abarnert Over a year ago

I think it was better before you added the "with unicode() function" part. That's almost always the wrong thing to do, and your original answer was dead-simple and near-perfect.

Rohit Jain Over a year ago

@abarnert. Can you explain a little further? Is there any difference between the two?

Homer6 Over a year ago

My understanding was that the unicode constructor implicitly decoded. I'm curious to see if there's a difference too.

abarnert Over a year ago

Your first edit did unicode(v). The difference is obvious: that uses the default encoding, which is usually 'ascii', and almost always wrong. Your second edit changed it to unicode(v, "UTF-8"), which is functionally equivalent to the decode call: a bit less clear, and not forward-compatible with 3.x, but not actually bad. But I was responding to the first edit.

Rohit Jain Over a year ago

@abarnert. Yeah, I read that unicode() is not there in 3.x, so I removed it. But maybe I'll add it back and make a note of it.
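
To illustrate the point in the comments above about the default encoding: in Python 2, `unicode(v)` with no encoding argument decodes using `sys.getdefaultencoding()`, which is normally `'ascii'`. The sketch below imitates that by decoding explicitly as ASCII, so it also runs on Python 3; the sample bytes are an assumption chosen for illustration.

```python
# UTF-8 bytes for a single non-ASCII character, e with acute accent (U+00E9).
raw = b'\xc3\xa9'

# Decoding with an explicit, correct encoding works.
text = raw.decode("utf-8")

# Decoding as ASCII (what a bare unicode(v) would normally do in Python 2)
# fails, because bytes >= 0x80 are not valid ASCII.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(repr(text), ascii_failed)
```

With pure-ASCII input such as `'a'` the two approaches give the same result, which is why the problem only shows up once non-English characters appear.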


Homer6 · Accepted Answer · 2013-08-09 21:30:22Z

0

Unfortunately, there isn't an easy answer with unicode. But fortunately, once you understand it, it'll carry with you to other programming languages.

This is, by far, the best resource that I've seen for python unicode:

http://nedbatchelder.com/text/unipain/unipain.html

Use the arrow keys (on your keyboard) to navigate to the next and previous slides.

Also, please take a look at this (and the other links from the end of that slideshow).

http://www.joelonsoftware.com/articles/Unicode.html
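
As a small, hedged illustration of why the encoding has to be known rather than guessed, the sketch below stores the same text in two encodings and decodes each one both ways. The sample word and the choice of Windows-1252 (`cp1252`) as the second encoding are assumptions made for the example.

```python
# The same text ("cafe" with an acute accent on the e) as bytes
# in two different encodings.
utf8_bytes = u"caf\xe9".encode("utf-8")      # four letters, five bytes
cp1252_bytes = u"caf\xe9".encode("cp1252")   # four letters, four bytes

# Decoding with the matching encoding recovers the same text.
same_text = utf8_bytes.decode("utf-8") == cp1252_bytes.decode("cp1252")

# Decoding UTF-8 bytes as Windows-1252 does not raise, but the accented
# letter turns into two unrelated characters (mojibake).
mojibake = utf8_bytes.decode("cp1252")

# Decoding Windows-1252 bytes as UTF-8 raises, because the byte 0xE9
# announces a multi-byte sequence that never arrives.
try:
    cp1252_bytes.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(same_text, repr(mojibake), wrong_guess_failed)
```

The silent mojibake case is the more dangerous of the two, since nothing signals that the guess was wrong.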

answered Aug 9, 2013 at 21:30

Homer6


8 Comments

Homer6 Over a year ago

What if the strings are "Windows-1252" encoded byte strings? Decoding by guessing that they're UTF-8 is not going to help him. The only thing that will help is a fundamental understanding of text encoding so that he can manage the input and produce an expected result.

abarnert Over a year ago

@PeterDeGlopper: The two of you have just made different guesses on which half of this problem is the hard part that the OP was (or should have been) asking about. Until we get some clarification from the OP, there's probably not much point arguing about it.

HappyPy Over a year ago

Well, basically I just want to add the "u" letter before each string so that I can use non-english characters. As far as I'm aware the UTF-8 is the character set I should use.

abarnert Over a year ago

@user2635863: The way you're saying that strongly implies that you should go read that set of slides. (Sorry I couldn't think of a way to fit an Arrested Development/Pop-Pop joke…)

abarnert Over a year ago

Also, you need to understand the difference between string values and string literals/displays. Adding a u to the start of the string "abc" just gives you the string "uabc"; it doesn't give you the unicode string u"abc". The u isn't part of the string any more than the quotes are.
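
The distinction between a string's value and its literal can be checked directly. This is a minimal sketch; the variable names are illustrative only.

```python
# The u prefix is part of the literal syntax, not part of the string's value.
plain = "abc"

# Concatenating the letter "u" just produces a longer string.
prefixed = "u" + plain
print(prefixed, len(prefixed))

# A unicode literal still has the three-character value abc;
# neither the u nor the quotes are stored in it.
literal = u"abc"
print(len(literal))
```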


alecxe · Accepted Answer · 2013-08-09 21:33:17Z

0

>>> l = [['a','b'], ['c','d']]
>>> map(lambda x: map(unicode, x), l)
[[u'a', u'b'], [u'c', u'd']]

answered Aug 9, 2013 at 21:33

alecxe


2 Comments

abarnert Over a year ago

This is a bad idea, unless the OP really wants to decode with sys.getdefaultencoding(). And fixing it to take the encoding means either a lambda inside the lambda, or a partial inside the lambda; either way, I think it's much simpler to use a comprehension here.

alecxe Over a year ago

Thank you for the clarification. I'll leave it here in case the OP wants to go with sys.getdefaultencoding(); it also looks nice and clear.
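
For comparison, here is a sketch of what the `map` version looks like once an explicit encoding is passed, next to the equivalent comprehension. It assumes UTF-8 input and uses `b'...'` literals and `list()` so that it also runs on Python 3, where `map` returns an iterator; the variable names are illustrative.

```python
l = [[b'a', b'b'], [b'c', b'd']]

# map-based version with an explicit encoding: a lambda inside a lambda.
mapped = list(map(lambda row: list(map(lambda v: v.decode("utf-8"), row)), l))

# Equivalent nested comprehension.
comprehended = [[v.decode("utf-8") for v in row] for row in l]

print(mapped == comprehended)
```

Both produce the same nested list; the comprehension simply states the encoding once without the extra function wrappers.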
