
I've got the following little Python 2.7 script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower()

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

As you can see, I first look up the country in which the IP "8.8.8.8" is located (this returns "us", see below) and then append a short string containing Russian characters.

Result:

# ./script.py
us
Traceback (most recent call last):
   File "./script.py", line 12, in <module>
    result += "Роман"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Now, if I try the following instead

#!/usr/bin/python
# -*- coding: utf-8 -*-

result = "us"
print result
result += "Роман"
print result

Then everything's ok:

./script.py 
us
usРоман

Evidently the 'ret_country_iso()' function returns something different from the literal "us" string, but my Python isn't good enough to say what.

How can I correct this?

EDIT: Following snakecharmerb's advice, this version works:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower().encode('utf-8')

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result
  • result might be a unicode object; does result += u"Роман" work? Commented Jan 10, 2022 at 9:52
  • sadly no, but it does change the error I get to "UnicodeEncodeError: 'latin-1' codec can't encode characters..." Commented Jan 10, 2022 at 9:55
  • Is that on the result += line or the second print result line? Commented Jan 10, 2022 at 9:56
  • yes, you're right - now the error is on the second 'print' - didn't notice that... Commented Jan 10, 2022 at 9:57
  • You probably want str_result = result.encode('utf-8') (you can use other encodings, but they must be able to handle Cyrillic characters) Commented Jan 10, 2022 at 10:17

1 Answer

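Your function appears to return a unicode object, while the literal "Роман" is a str (bytes). Python 2 does not strictly distinguish between unicode and bytes: when the two are concatenated, it implicitly decodes the str using the ASCII codec. The result of mixing the two types therefore depends on the data: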

u'abc' + 'def'

succeeds, but

u'US' + 'Роман'

results in an exception. The usual approach - the "Unicode Sandwich" pattern - is to decode and encode string-type data at the edges of an application, and work only with unicode within the application (for applications which deal primarily with bytes the reverse pattern is adopted).
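To see why the second concatenation fails while the first succeeds, it may help to perform the implicit ASCII decode by hand. The following is only a sketch (the variable names are illustrative, and the bytes are built with .encode() so that it runs the same way on Python 2 and Python 3):

```python
# -*- coding: utf-8 -*-

# The bytes a plain 'def' / 'Роман' literal holds in a UTF-8 encoded Python 2 source file.
ascii_bytes = u'def'.encode('utf-8')
cyrillic_bytes = u'Роман'.encode('utf-8')

# ASCII-only bytes survive the decode, so the concatenation succeeds.
joined = u'abc' + ascii_bytes.decode('ascii')

# Bytes outside the ASCII range do not, which is the error from the question.
try:
    u'US' + cyrillic_bytes.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
```

Here failure refers to byte 0xd0 at position 0, the same byte and position reported in the traceback from the question.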

So, when combining str and unicode instances you can take either of these options:

# unicode result
u'US ' + 'Роман'.decode('utf-8')

# str result
u'US '.encode('utf-8') + 'Роман'

but the key is to be consistent throughout your code, otherwise you will end up with a lot of errors.
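As a rough illustration of staying consistent, here is a minimal sketch of the "Unicode Sandwich" applied to a case like the one in the question. The helpers to_text and to_bytes and the stand-in lookup_country_iso are hypothetical names introduced for this example, not part of geoip2; the stand-in simply returns u'us' so the sketch runs without a database file.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Input edge: decode bytes to text; pass text through unchanged."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Output edge: encode text to bytes; pass bytes through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def lookup_country_iso(ip):
    """Hypothetical stand-in for the GeoIP lookup; assumed to return text."""
    return u'us'


# Bytes as they would arrive from a UTF-8 source (a plain 'Роман' literal on Python 2).
raw_name = u'Роман'.encode('utf-8')

# Inside the application everything is text.
label = to_text(lookup_country_iso('8.8.8.8')) + to_text(raw_name)

# Encode once, when the data leaves the program.
encoded = to_bytes(label)
```

Because both helpers pass through values that are already of the target type, they can be applied at every boundary without tracking which type a given value currently has.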

Python 3 is stricter about separating the two types; if possible you should consider using it both for better unicode handling and because Python 2 is no longer supported.
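For comparison, a short sketch of how the same mix-up looks on Python 3 (this block is Python 3 only, and the variable names are illustrative). Instead of attempting an implicit ASCII decode, the interpreter refuses to concatenate the two types, so the mistake surfaces immediately regardless of the data:

```python
# Python 3 only: str is text, bytes is binary data.
text = 'us'
data = 'Роман'.encode('utf-8')

# Mixing the two types is rejected outright, even for ASCII-only data.
try:
    text + data
    error = None
except TypeError as exc:
    error = exc

# An explicit decode is required before combining them.
combined = text + data.decode('utf-8')
```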
