
I have a set of Python scripts (https://github.com/hvdwolf/wikiscripts) that parse Wikipedia dumps and convert them into gpx/osm/csv/sql/sqlite files to be used as POI files in navigation apps. I only parse the articles that have coordinates. For this I use the externallinks dumps, which contain SQL INSERT statements; the statements containing the "geohack.php" substring hold the coordinates. I import these into an SQLite database to use as a reference for the article dumps. They are all UTF-8 dumps, and parsing all "western type" files works fine, but languages like Arabic, Farsi, Russian, Japanese, Greek, Chinese and the others don't work. Obviously I'm doing something wrong.

The strings I get for the titles are:

%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9
%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7
Battle_of_Nicopolis
Qingdao

So some titles come out as normal characters; the rest is gibberish (to me). I already did a test where I simply read the dump and write it to a UTF-8 encoded text file (line in => line out), and then it works fine, but somewhere in the string-handling functions and "re." functions my Unicode text seems to get changed.
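For what it's worth, a quick check (using the first sample title above) shows those "gibberish" strings contain only ASCII characters, so UTF-8 decoding and the "re." functions cannot be mangling them; whatever they are, they survived the pipeline byte-for-byte:

```python
# One of the sample titles, exactly as it comes out of the parsing pipeline
title = '%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9'

# If every code point is below 128, the string is pure ASCII and no
# UTF-8 decode step could have altered it.
print(all(ord(c) < 128 for c in title))  # True
```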

Edit: My Python script starts with: # -*- coding: utf-8 -*-
My code (the relevant part, including Python 2 and Python 3 statements, and some comments showing what I already tried):

with gzip.open(externallinks_file, 'r') as single_externallinksfile:
    #reader = codecs.getreader("utf-8")
    #single_externallinksfile = reader(single_externallinksfile)
    #with codecs.getreader('utf-8')(gzip.open(externallinks_file, 'r')) as single_externallinksfile:
    linecounter = 0
    totlinecounter = 0
    filelinecounter = 0
    # We need to read line by line as we have massive files, sometimes multiple GBs
    for line in single_externallinksfile:
        if sys.version_info < (3, 0, 0):
            line = unicode(line, 'utf-8')
        else:
            line = line.decode("utf-8")
        if "INSERT INTO" in line:
            insert_statements = line.split("),(")
            for statement in insert_statements:
                #statement = statement.decode("utf-8")
                filelinecounter += 1
                #if ("geohack.php?" in statement) and (("pagename" in statement) or ("src=" in statement)):
                # src can also be in the line, but is different and we leave it out for now
                if ("geohack.php?" in statement) and ("pagename" in statement) and ("params" in statement):
                    language = ""
                    region = ""
                    poitype = ""
                    content = re.findall(r".*?pagename=(.*?)','", statement, flags=re.IGNORECASE)
                    if len(content) > 0:  # We even need this check due to corrupted lines
                        splitcontent = content[0].split("&")
                        title = splitcontent[0]
                        #title = title.decode('utf8')
                        for subcontent in splitcontent:
                            if "language=" in subcontent:
                                language = subcontent.replace("language=", "")
                                #print('language is: ' + language)
                            if "params=" in subcontent:
                                params_string = subcontent.replace("params=", "").split("_")
                                latitude, longitude, poitype, region = get_coordinates_type_region(params_string)
                        if (str(latitude) != "" and str(longitude) != "") and (str(latitude) != "0" or str(longitude) != "0"):
                            if GENERATE_SQL == "YES":
                                sql_file.write('insert into ' + file_prefix + '_externallinks values ("' + title + '","' + str(latitude) + '","' + str(longitude) + '","' + language + '","' + poitype + '","' + region + '");\n')
                            if CREATE_SQLITE == "YES":
                                sqlcommand = 'insert into ' + file_prefix + '_externallinks values ("' + title + '","' + str(latitude) + '","' + str(longitude) + '","' + language + '","' + poitype + '","' + region + '");'
                                #print(sqlcommand)
                                cursor.execute(sqlcommand)
                            linecounter += 1
                            if linecounter == 10000:
                                if CREATE_SQLITE == "YES":
                                    # Do a database commit every 10000 rows
                                    wikidb.commit()
                                totlinecounter += linecounter
                                linecounter = 0
                                print('\nProcessed ' + str(totlinecounter) + ' lines out of ' + str(filelinecounter) + ' sql line statements. Elapsed time: ' + str(datetime.datetime.now().replace(microsecond=0) - start_time))
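As a side note on the SQLite part of the code above: building INSERT statements by string concatenation will break on titles that contain quote characters. A minimal sketch of the same insert using sqlite3 parameterized queries instead (the table name and column layout here are assumptions based on the snippet above, not the actual schema):

```python
import sqlite3

# In-memory database for demonstration; the script uses a file-backed one
wikidb = sqlite3.connect(':memory:')
cursor = wikidb.cursor()
# Hypothetical table mirroring the six values inserted in the snippet
cursor.execute('CREATE TABLE en_externallinks '
               '(title TEXT, lat TEXT, lon TEXT, language TEXT, poitype TEXT, region TEXT)')

row = ('Battle_of_Nicopolis', '43.7', '24.9', 'en', 'landmark', 'BG')
# "?" placeholders let sqlite3 quote each value safely,
# even titles containing ' or " characters
cursor.execute('INSERT INTO en_externallinks VALUES (?, ?, ?, ?, ?, ?)', row)
wikidb.commit()

print(cursor.execute('SELECT title FROM en_externallinks').fetchone()[0])
# Battle_of_Nicopolis
```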

1 Answer

It looks like the titles are percent-encoded (also known as URL-encoded).

try:
    # Python 3
    from urllib.parse import unquote
except ImportError:
    # Python 2
    from urllib import unquote

percent_encoded = '''
%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9
%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7
Battle_of_Nicopolis
Qingdao
'''
print(unquote(percent_encoded))

yields

مافظة_البدائع
أوريويلا
Battle_of_Nicopolis
Qingdao
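One Python 2 vs. Python 3 caveat worth noting for the script above: in Python 3, unquote decodes the percent-escapes as UTF-8 by default, whereas the Python 2 unquote returns a byte string that still needs an explicit .decode('utf-8'). A quick sketch of the Python 3 behaviour, using the second sample title:

```python
from urllib.parse import unquote

encoded = '%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7'
print(unquote(encoded))                     # أوريويلا (UTF-8 is the default)
print(unquote(encoded, encoding='utf-8'))   # same result, encoding made explicit
```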

3 Comments

Thank you very much. That did it! I tried many decode/encode options, but I had never heard of percent encoding.
@HarryvanderWolf: percent-encoding is often used in URLs. The very similar application/x-www-form-urlencoded content type (which additionally maps spaces to "+") was often used in the past to submit content via a web form over HTTP.
I know %20 and the other codings in URLs. I simply never made the connection between "some of those characters" and sentences consisting only of those characters.
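For the form-encoded variant mentioned in the comment above, urllib also provides unquote_plus, which additionally turns "+" back into a space, while plain unquote leaves "+" untouched (the sample string here is made up for illustration):

```python
from urllib.parse import unquote, unquote_plus

form_encoded = 'Battle+of+Nicopolis%20%28battle%29'
print(unquote(form_encoded))       # Battle+of+Nicopolis (battle)  -- '+' kept
print(unquote_plus(form_encoded))  # Battle of Nicopolis (battle)  -- '+' -> space
```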
