Handling Unicode characters in HTTP User-Agents in Python

Question

I am completely new to Python, but I found a package that I need to use and am testing it. The package in question is pywurfl.

I have written a simple script, based on the package's example, that reads the User-Agent (UA) strings from a column in a tab-separated text file. There are a very large number of UAs, and some may contain foreign characters. The file containing the UAs was produced by a Perl script with its output redirected using the shell's ">" operator, for example: perl somescript.pl > outfile.txt.

However, when I run the following code on that file, I get an error.

#!/usr/bin/python

import fileinput
import sys

from wurfl import devices
from pywurfl.algorithms import LevenshteinDistance


for line in fileinput.input():
    line = line.rstrip("\r\n")    # equiv of chomp
    H = line.split('\t')

    if H[27]=='Mobile':

        user_agent = H[23].decode('utf8')           
        search_algorithm = LevenshteinDistance()
        device = devices.select_ua(user_agent, search=search_algorithm)

        sys.stdout.write( "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (user_agent, device.devid, device.devua, device.fall_back, device.actual_device_root, device.brand_name, device.marketing_name, device.model_name, device.device_os, device.device_os_version, device.mobile_browser, device.mobile_browser_version, device.model_extra_info, device.pointing_method, device.has_qwerty_keyboard, device.is_tablet, device.has_cellular_radio, device.max_data_rate, device.wifi, device.dual_orientation, device.physical_screen_height, device.physical_screen_width,device.resolution_height, device.resolution_width, device.full_flash_support, device.built_in_camera, device.built_in_recorder, device.receiver, device.sender, device.can_assign_phone_number, device.is_wireless_device, device.sms_enabled) + "\n")

    else:
        # do something else
        pass

Here H[23] is the column that holds the UA string, but I get an error that looks like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

When I replaced 'utf8' with 'latin1', I got the following error instead:

 sys.stdout.write(................) # with the .... as in the code
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128).

Am I doing anything wrong here? I need to convert the UA string to Unicode because the package requires it. I am not well versed in Unicode, especially in Python. How should I handle this error? For instance, how can I find the UA string that triggers it, so that I can ask a more informed question?

Community · Accepted Answer · 2017-05-23 11:51:26Z

Looks like you have two separate problems.

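The first is that you're assuming the input file is UTF-8 when it isn't. The byte 0xa9 cannot begin a UTF-8 sequence, whereas in latin-1 it is the copyright sign (©), which matches the u'\xa9' in your second error. Changing the input encoding to latin-1 addresses that issue.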
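To illustrate the decoding problem, and to find which UA strings trigger it, here is a minimal sketch. It assumes the lines are available as raw bytes; the helper names (decode_ua, undecodable_lines) and the sample UA strings are made up for this example and are not part of pywurfl.

```python
def decode_ua(raw_bytes):
    """Try UTF-8 first; fall back to Latin-1, which accepts any byte."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("latin-1")


def undecodable_lines(lines):
    """Yield (line_number, raw_line) for every line that is not valid UTF-8."""
    for number, raw_line in enumerate(lines, 1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError:
            yield number, raw_line


# Hypothetical sample data: the second entry starts with the byte 0xA9.
sample = [b"Mozilla/5.0 (Linux)", b"\xa9 ExampleBrowser/1.0"]

bad = list(undecodable_lines(sample))   # which lines are not UTF-8
text = decode_ua(sample[1])             # decoded via the Latin-1 fallback
```

Printing the entries of bad with repr() shows the offending bytes without triggering another encoding error. Note that the Latin-1 fallback never raises, so it will also silently accept data that is really in some other encoding.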

The second issue is that your stdout appears to be set up for ASCII output, so writing a unicode string that contains a non-ASCII character fails. For that, this question may help.
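One way around the output problem is to encode the text yourself before writing it, rather than relying on whatever encoding stdout picked. The following is only a sketch: write_row is a hypothetical helper, UTF-8 is an assumed choice of output encoding, and an in-memory buffer stands in for the real output stream.

```python
import io


def write_row(fields, out):
    """Join the fields with tabs and write them to a binary stream as UTF-8."""
    line = u"\t".join(u"%s" % (field,) for field in fields) + u"\n"
    out.write(line.encode("utf-8"))


# An in-memory binary buffer stands in for stdout in this example.
buf = io.BytesIO()
write_row([u"\xa9 ExampleBrowser/1.0", u"generic", 42], buf)
```

To write to the real standard output, pass sys.stdout in Python 2 (where it accepts byte strings) or sys.stdout.buffer in Python 3. Building the row from a list of fields also avoids repeating %s once per column in a single long format string.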

edited May 23, 2017 at 11:51

answered Jan 7, 2011 at 17:43

David Gelhar
