UnicodeDecodeError when concatenating strings

Question

I've got the following little Python 2.7 script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower()

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

As you can see, I first look up the country for the IP "8.8.8.8" (this returns "us", see below) and then append a short string containing Russian characters.

Result:

# ./script.py
us
Traceback (most recent call last):
   File "./script.py", line 12, in <module>
    result += "Роман"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Now, if I try the following instead:

#!/usr/bin/python
# -*- coding: utf-8 -*-

result = "us"
print result
result += "Роман"
print result

then everything is OK:

./script.py 
us
usРоман

Obviously, then, the 'ret_country_iso()' function returns something other than the literal "us" string, but my Python isn't good enough to say what.

How can I fix this?

EDIT: Following snakecharmerb's advice, this works:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower().encode('utf-8')

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

Comments

result might be a unicode object; does result += u"Роман" work?
– snakecharmerb, Commented Jan 10, 2022 at 9:52
Sadly no, but it does change the error I get to "UnicodeEncodeError: 'latin-1' codec can't encode characters..."
– Leszek, Commented Jan 10, 2022 at 9:55
Is that on the result += line or the second print result line?
– snakecharmerb, Commented Jan 10, 2022 at 9:56
Yes, you're right. Now the error is on the second 'print'; I didn't notice that...
– Leszek, Commented Jan 10, 2022 at 9:57
You probably want str_result = result.encode('utf-8') (you can use other encodings, but they must be able to handle Cyrillic characters)
– snakecharmerb, Commented Jan 10, 2022 at 10:17

snakecharmerb · Accepted Answer · 2022-01-10 10:42:06Z

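Python 2 does not strictly distinguish between unicode and bytes: when a unicode object and a str (bytes) are combined, it implicitly decodes the str using the ASCII codec. The results of concatenating the two types are therefore inconsistent: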

u'abc' + 'def'

succeeds, but

u'US' + 'Роман'

raises a UnicodeDecodeError, because the bytes of 'Роман' fall outside the ASCII range. The usual approach, the "Unicode Sandwich" pattern, is to decode incoming string data and encode outgoing string data at the edges of an application, and to work only with unicode inside it (applications that deal primarily with bytes adopt the reverse pattern).
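
To make the difference between the two concatenations concrete, here is a small sketch that spells out the ASCII decode which Python 2 performs behind the scenes. It is written so that it should behave the same on Python 2 and 3; the variable names are purely illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Byte strings, built explicitly so the snippet is valid on Python 2 and 3.
ascii_bytes = 'def'.encode('utf-8')
cyrillic_bytes = 'Роман'.encode('utf-8')

# Equivalent of u'abc' + 'def': every byte is below 128, so decoding succeeds.
ok = 'abc' + ascii_bytes.decode('ascii')

# Equivalent of u'US' + 'Роман': the first byte is 0xd0, which ASCII rejects.
try:
    'US' + cyrillic_bytes.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte
```

The failing byte and position match the traceback in the question (byte 0xd0 in position 0).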

So, when combining str and unicode instances you can take either of these options:

# unicode result
u'US ' + 'Роман'.decode('utf-8')

# str result
u'US '.encode('utf-8') + 'Роман'

but the key is to be consistent throughout your code, otherwise you will end up with a lot of errors.
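
As a rough sketch of what being consistent can look like, the snippet below normalises everything to unicode on the way in and encodes once on the way out. The helpers `to_text` and `to_bytes` are hypothetical names made up for this example, not part of any library, and UTF-8 is assumed as the encoding throughout.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode bytes at the input edge, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: encode text at the output edge, pass bytes through."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Inputs of mixed type, similar to the situation in the question.
iso_code = 'us'                      # text (unicode)
raw_name = 'Роман'.encode('utf-8')   # UTF-8 encoded bytes

# Inside the application: text only, so concatenation is always safe.
combined = to_text(iso_code) + to_text(raw_name)

# At the output edge: encode once, just before writing or printing.
output = to_bytes(combined)
```

With this arrangement the type of each value is known at every point, so the implicit ASCII conversion never gets a chance to run.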

Python 3 is stricter about separating the two types; if possible you should consider using it both for better unicode handling and because Python 2 is no longer supported.
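
For comparison, a minimal Python 3 sketch (this one is not valid for Python 2, and the variable names are illustrative): mixing the two types fails immediately with a TypeError, whatever the content, so the mistake cannot hide until non-ASCII data shows up.

```python
# Python 3 only: str is text, bytes is binary data.
text = "us"
data = "Роман".encode("utf-8")

try:
    text + data          # no implicit conversion is attempted
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode makes the intent clear and works as expected.
combined = text + data.decode("utf-8")
```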


tripleee Over a year ago

Perhaps see also nedbatchelder.com/text/unipain.html
