Python nested lists replace unicode characters in strings

Question

I am trying to replace or strip certain Unicode characters from the strings in this nested list so that I can insert them into a database that does not allow those characters.

info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]

I used this code

info = [[x.replace(u'\xa0', u'') for x in l] for l in info]
info = [[y.replace('\u2019s', '') for y in o] for o in info]

The first line worked, but the second one did not. Any suggestions?

I would also try to figure out why you are getting such a weird string mixing raw bytes and Unicode code points.
– Paulo Bu, Commented Mar 6, 2014 at 15:08
What you should do is byte the bullet and learn how to handle Unicode: decode the string when you read it, then encode it when you are ready to send it to your database. One place to start is here: stackoverflow.com/questions/2365411/…
– PyNEwbie, Commented Mar 6, 2014 at 15:09

Aaron Hall · Accepted Answer · 2016-08-14 14:33:02Z

5

Drop the second line and do:

info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]

and see if the results are acceptable. This will attempt to convert all the Unicode to ASCII and drop any characters that fail to convert. Just be sure that losing a non-ASCII character (such as the apostrophe in "Buffalo’s") is not a problem.

>>> info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]
>>> info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]
>>> info
[['Buffalos League of legends ...', '2012-09-05'], [' RCKIN 0 - 1 WITHACK.nq  ', 'Buffalos League of legends ...', '2012-09-05']]
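
If silently dropping characters is too lossy (the apostrophe in "Buffalo’s" disappears above), one option is to map known characters to ASCII stand-ins before encoding. This is only a minimal sketch: the names `REPLACEMENTS` and `to_ascii` are hypothetical, the choice of stand-ins is an assumption about what your data can tolerate, and the final `decode` is there so the result is text on both Python 2 and Python 3.

```python
# Hypothetical mapping: which stand-ins are acceptable depends on your data.
REPLACEMENTS = {
    u'\xa0': u' ',     # no-break space -> ordinary space
    u'\u2019': u"'",   # right single quotation mark -> ASCII apostrophe
}

def to_ascii(text):
    """Swap known characters for ASCII stand-ins, then drop anything left over."""
    for char, stand_in in REPLACEMENTS.items():
        text = text.replace(char, stand_in)
    # Encode with 'ignore' to drop remaining non-ASCII characters, then decode
    # so the result is a text string rather than bytes.
    return text.encode('ascii', 'ignore').decode('ascii')

info = [[u'\xa0Buffalo\u2019s League of legends ...', u'2012-09-05']]
# strip() removes the space left behind by the leading no-break space
cleaned = [[to_ascii(x).strip() for x in row] for row in info]
print(cleaned)
```

Characters that are not in the mapping are still dropped, so the mapping needs an entry for every character you care about keeping.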

What's going on:

You have data in your Python program that's Unicode (and that's good.)

>>> u = u'\u2019'

Best practice, for interoperability, is to write Unicode strings out to utf-8. These are the bytes you should be storing in your database:

>>> u.encode('utf-8')
'\xe2\x80\x99'
>>> utf8 = u.encode('utf-8')
>>> print utf8
’

And then when you read those bytes back into your program, you should then decode them:

>>> utf8.decode('utf8')
u'\u2019'
>>> print utf8.decode('utf8')
’

If your database can't handle utf-8 then I would consider getting a new database.
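
The encode-on-write, decode-on-read pattern above can be applied to the whole nested list. A rough sketch, assuming hypothetical helper names `to_db_bytes` and `from_db_bytes`; many database drivers accept Unicode strings directly and do this encoding themselves, so treat it as an illustration of the round trip rather than something every setup needs.

```python
def to_db_bytes(text):
    """Encode Unicode text to UTF-8 bytes just before writing (hypothetical helper)."""
    return text.encode('utf-8')

def from_db_bytes(raw):
    """Decode UTF-8 bytes back to Unicode text right after reading (hypothetical helper)."""
    return raw.decode('utf-8')

info = [[u'\xa0Buffalo\u2019s League of legends ...', u'2012-09-05']]

# Encode every string on the way out, decode every string on the way back in.
stored = [[to_db_bytes(x) for x in row] for row in info]
restored = [[from_db_bytes(b) for b in row] for row in stored]

print(restored == info)  # the round trip loses nothing
```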

edited Aug 14, 2016 at 14:33

answered Mar 6, 2014 at 15:05

Aaron Hall♦


1 Comment

Paulo Bu Over a year ago

I like this better than replacing.

thefourtheye · Accepted Answer · 2014-03-06 15:05:06Z

4

Because in the second line, '\u2019s' is not a Unicode string: without the u prefix, Python 2 does not interpret the \u2019 escape, so the pattern never matches. Just prepend u to the pattern in replace, like this:

print [[y.replace(u'\u2019s', '') for y in o] for o in info]

Output

[[u'Buffalo League of legends ...', u'2012-09-05'],
 [u' RCKIN 0 - 1 WITHACK.nq  ',
  u'Buffalo League of legends ...',
  u'2012-09-05']]

In fact, you can chain the replace calls, like this:

[[x.replace(u'\xa0', '').replace(u'\u2019s', '') for x in l] for l in info]
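
To see why the unprefixed pattern never matched, it may help to compare the two patterns directly. In this sketch the Python 2 byte-string literal '\u2019s' is spelled with an escaped backslash (`u'\\u2019s'`), which is an assumption made so the demo behaves the same on Python 2 and Python 3; the variable names are illustrative only.

```python
text = u'Buffalo\u2019s League'

# What Python 2 sees for the unprefixed literal '\u2019s':
# a backslash followed by 'u', '2', '0', '1', '9', 's'.
literal_pattern = u'\\u2019s'
# With the u prefix the escape is interpreted: one code point plus 's'.
unicode_pattern = u'\u2019s'

print(len(literal_pattern))   # 7
print(len(unicode_pattern))   # 2

unchanged = text.replace(literal_pattern, u'')  # pattern is not in text
stripped = text.replace(unicode_pattern, u'')   # removes the quote mark and the 's'
print(unchanged == text)
print(stripped == u'Buffalo League')
```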

answered Mar 6, 2014 at 15:05

thefourtheye
