I am trying to replace or strip certain Unicode characters from the strings in this list so that I can insert them into a database that does not allow those characters:

info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]

I used this code:

info = [[x.replace(u'\xa0', u'') for x in l] for l in info]
info = [[y.replace('\u2019s', '') for y in o] for o in info]

The first line worked, but the second did not. Any suggestions?

  • I would also try to figure out why you are getting such a weird string mixing raw bytes and Unicode code points. Commented Mar 6, 2014 at 15:08
  • What you should do is bite the bullet and learn how to handle Unicode: decode when you read the string, and encode when you are ready to send it to your database. One place to start is stackoverflow.com/questions/2365411/… Commented Mar 6, 2014 at 15:09

2 Answers

5

Drop the second line and do:

info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]

and see if the results are acceptable. This converts every string to ASCII and silently drops any character that has no ASCII equivalent. Just be sure that losing such a character (here, the apostrophe in "Buffalo’s") is not a problem.

>>> info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]
>>> info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]
>>> info
[['Buffalos League of legends ...', '2012-09-05'], [' RCKIN 0 - 1 WITHACK.nq  ', 'Buffalos League of legends ...', '2012-09-05']]
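
If dropping the apostrophe is not acceptable, one option is to map the characters you already know about to ASCII equivalents first and only then encode with 'ignore'. This is a minimal sketch, not part of the original answer: the REPLACEMENTS table and the to_ascii helper are hypothetical names, and the choice of replacements is an assumption. It is written to run on both Python 2 and Python 3.

```python
# Hypothetical mapping; which replacements are appropriate is an assumption.
REPLACEMENTS = {
    u'\xa0': u' ',     # no-break space -> plain space
    u'\u2019': u"'",   # right single quotation mark -> ASCII apostrophe
}


def to_ascii(text):
    """Replace known characters, then drop anything still non-ASCII."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # encode/decode round trip drops the remaining non-ASCII characters;
    # strip() also trims leading and trailing whitespace (an extra choice).
    return text.encode('ascii', 'ignore').decode('ascii').strip()


info = [
    [u'\xa0Buffalo\u2019s League of legends ...', u'2012-09-05'],
    [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ',
     u'\xa0Buffalo\u2019s League of legends ...', u'2012-09-05'],
]
cleaned = [[to_ascii(x) for x in row] for row in info]
```

With this mapping the title keeps a plain apostrophe ("Buffalo's") instead of becoming "Buffalos".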

What's going on:

You have data in your Python program that's Unicode (and that's good).

>>> u = u'\u2019'

Best practice, for interoperability, is to encode Unicode strings as UTF-8 when you write them out. These are the bytes you should be storing in your database:

>>> u.encode('utf-8')
'\xe2\x80\x99'
>>> utf8 = u.encode('utf-8')
>>> print utf8
’

And then when you read those bytes back into your program, you should then decode them:

>>> utf8.decode('utf8')
u'\u2019'
>>> print utf8.decode('utf8')
’

If your database can't handle utf-8 then I would consider getting a new database.
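
As a rough illustration of a database that does handle it, here is a sketch using the standard-library sqlite3 module. The question does not say which database is in use, so SQLite is purely an assumption, and the table name matches and its columns are hypothetical. The driver accepts Unicode strings through query parameters and returns them unchanged, so no characters need to be stripped.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE matches (title TEXT, played_on TEXT)')

row = (u'Buffalo\u2019s League of legends ...', u'2012-09-05')

# Parameter placeholders let the driver handle the text encoding.
conn.execute('INSERT INTO matches VALUES (?, ?)', row)

# Reading the row back yields the same Unicode strings.
stored = conn.execute('SELECT title, played_on FROM matches').fetchone()
conn.close()
```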

1 Comment

I like this better than replacing.
4

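Because in the second line '\u2019s' is a plain byte string, not a unicode string. In Python 2, \u is not an escape sequence inside a byte-string literal, so the search string is a literal backslash followed by u2019s and never matches. Just prepend u to the first argument of replace, like this: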

print [[y.replace(u'\u2019s', '') for y in o] for o in info]

Output (with \xa0 already removed by your first line):

[[u'Buffalo League of legends ...', u'2012-09-05'],
 [u' RCKIN 0 - 1 WITHACK.nq  ',
  u'Buffalo League of legends ...',
  u'2012-09-05']]

In fact, you can chain the replace calls, like this:

[[x.replace(u'\xa0', '').replace(u'\u2019s', '') for x in l] for l in info]
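
To see why the unprefixed literal never matched, the small sketch below spells out both search strings explicitly. The variable names are only illustrative. The doubled backslash reproduces what a Python 2 byte-string literal '\u2019s' contains, so the snippet behaves the same on Python 2 and Python 3 (on Python 3, a plain '\u2019s' is already a Unicode string, so the original problem does not occur there).

```python
text = u'Buffalo\u2019s League of legends ...'

# What the Python 2 literal '\u2019s' (no u prefix) actually contains:
# a backslash, 'u', '2', '0', '1', '9', 's' (seven characters).
unprefixed = u'\\u2019s'

# What u'\u2019s' contains: the quotation mark and 's' (two characters).
prefixed = u'\u2019s'

no_match = text.replace(unprefixed, u'')  # nothing found, text unchanged
matched = text.replace(prefixed, u'')     # removes the curly apostrophe and the s
```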
