string decode method error in python

Question

I have a function like this:

def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)

I feed it a result set (obtained using MySQLdb.cursors.DictCursor) row by row to convert all the string values to unicode (for example, {'column_1':'XXX'} becomes {'column_1':u'XXX'}).

The problem is that one of the rows has a value like {'column_1':'Gabriel García Márquez'} and it does not get converted. It throws this error:

'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte

When I searched for this, it seemed to have something to do with ASCII encoding.

The solutions I tried are:

Adding # -*- coding: utf-8 -*- at the beginning of my file. This does not help.
Changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')). As expected, it drops the non-ASCII characters and returns {'column_1':u'Gabriel Garca Mrquez'}.
Changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')). This does the job, but I am afraid it will support only Western European characters (as per here).

Can anybody point me in the right direction, please?

Mark Amery · Accepted Answer · 2012-12-15 10:31:41Z

Firstly:

The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
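
To make both points concrete, here is a minimal sketch that should run on Python 2.7 and Python 3 alike. The byte string is rebuilt by hand on the assumption that the driver returned the name encoded as latin-1; the variable names are illustrative only.

```python
# Assumption: the driver handed back the name as latin-1 bytes.
# They are rebuilt here by hand purely for illustration.
raw = u'Gabriel Garc\xeda M\xe1rquez'.encode('latin-1')

# Decoding as UTF-8 fails at the first accented character.
try:
    raw.decode('utf8')
    error = None
except UnicodeDecodeError as exc:
    error = exc  # exc.start holds the offset of the offending byte

# The 'ignore' handler silently drops the bytes it cannot decode.
lossy = raw.decode('utf8', 'ignore')

# Decoding with the encoding the bytes were written in recovers the text.
text = raw.decode('latin-1')

# .decode() already returns a unicode string (str on Python 3), so wrapping
# the result in unicode() would add nothing.
```

The byte at offset 12 is 0xed, which is "í" in latin-1. In UTF-8 that byte would start a multi-byte sequence, and the plain "a" that follows cannot continue one, hence "invalid continuation byte".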

Secondly:

Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one: http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
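
For reference, a sketch of what asking MySQLdb for a UTF-8 connection might look like. charset and use_unicode are keyword arguments that MySQLdb.connect() accepts; the host, user, password and database values are placeholders, and open_connection is a hypothetical helper name. Whether this matches the linked post in every detail is an assumption.

```python
# Keyword arguments for MySQLdb.connect(). `charset` and `use_unicode` are the
# relevant ones; every other value here is a placeholder, not a real setting.
CONNECT_KWARGS = {
    'host': 'localhost',
    'user': 'app_user',
    'passwd': 'secret',
    'db': 'app_db',
    'charset': 'utf8',     # ask the server to send text as UTF-8
    'use_unicode': True,   # have the driver return unicode objects
}


def open_connection(kwargs=CONNECT_KWARGS):
    """Hypothetical helper: open a DictCursor connection with the settings above."""
    # Imported lazily so the sketch can be read without the driver installed.
    import MySQLdb
    import MySQLdb.cursors
    return MySQLdb.connect(cursorclass=MySQLdb.cursors.DictCursor, **kwargs)
```

With a connection opened this way, text columns should already arrive as unicode objects, so a manual decoding pass like convert_to_unicode may no longer be needed.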

I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.

Zero Piraeus · Accepted Answer · 2012-12-15 08:32:45Z

Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
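
A sketch of the function from the question with that change applied. The encoding parameter is an addition for illustration and is not in the original code; the try/except logging is left out for brevity, and the bytes check is used so that the same code also runs on Python 3.

```python
def convert_to_unicode(data, encoding='latin-1'):
    """Return a copy of a row dict with byte-string values decoded to text."""
    if data is None:
        return None
    row = {}
    for key, val in data.items():
        # bytes is an alias of str on Python 2, so this test works on both.
        row[key] = val.decode(encoding) if isinstance(val, bytes) else val
    return row


# latin-1 assigns a character to every byte value, so decoding never raises.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode('latin-1')
```

One caveat follows from the last two lines: because latin-1 can decode any byte sequence, a successful decode does not by itself prove that the data really was latin-1.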

1 Comment

SRC Over a year ago

Thanks everybody for your help :)
