
It seems I've run into an encoding problem when I need to pass the Bing translation output through a socket.

def _unicode_urlencode(params):
    if isinstance(params, dict):
        params = params.items()
    return urllib.urlencode([(k, isinstance(v, unicode) and v.encode('utf-8') or v) for k, v in params])

def _run_query(args):
        data = _unicode_urlencode(args)
        sock = urllib.urlopen(api_url + '?' + data)
        result = sock.read()
        if result.startswith(codecs.BOM_UTF8):
                result = result.lstrip(codecs.BOM_UTF8).decode('utf-8')
        elif result.startswith(codecs.BOM_UTF16_LE):
                result = result.lstrip(codecs.BOM_UTF16_LE).decode('utf-16-le')
        elif result.startswith(codecs.BOM_UTF16_BE):
                result = result.lstrip(codecs.BOM_UTF16_BE).decode('utf-16-be')
        return json.loads(result)

def set_app_id(new_app_id):
        global app_id
        app_id = new_app_id

def translate(text, source, target, html=False):
        """
        action=opensearch
        """
        if not app_id:
                raise ValueError("AppId needs to be set by set_app_id")
        query_args = {
                'appId': app_id,
                'text': text,
                'from': source,
                'to': target,
                'contentType': 'text/plain' if not html else 'text/html',
                'category': 'general'
        }
        return _run_query(query_args)
...
text = translate(sys.argv[2], 'en', 'tr')
HOST = '127.0.0.1'
PORT = 894
s = socket.socket()
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.connect((HOST, PORT))
s.send("Bing translation: " + text.encode('utf8') + "\r");
s.close()

As you can see, if the translated text contains Turkish characters, the script fails to send the text to the socket.

Do you have any idea how to fix this?

Regards.

5
  • Did you try encoding the whole string? text = "Bing translation: " + text + "\r"; s.send(text.encode('utf8')). Also, did you use text.decode('utf8') on the receiving end? Commented Jul 4, 2013 at 23:29
  • Do you get an error message? If so, what? Commented Jul 4, 2013 at 23:29
  • It didn't work. By the way, the receiving end is a server written in C. Commented Jul 4, 2013 at 23:35
  • Also, which version of Python are you using? Commented Jul 4, 2013 at 23:37
  • Python 2.5.2. The error is UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128), or something like that. Commented Jul 4, 2013 at 23:41

2 Answers

2

Your problem is entirely unrelated to the socket. text is already a bytestring, and you're trying to encode it. Since only unicode objects can be encoded, Python first tries to convert the bytestring to a unicode object using the safe ASCII codec so that it can then encode it as UTF-8, and that conversion fails because the bytestring contains non-ASCII bytes.
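To make the failure mode concrete, here is a minimal sketch of the hidden step. It spells out explicitly what Python 2 does implicitly when .encode('utf8') is called on a bytestring, so it also runs on Python 3 (the `except ... as` syntax needs Python 2.6 or later). The sample word and the helper name encode_like_python2 are illustrative assumptions, not part of the original script.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: UTF-8 bytes of a Turkish word, standing in for
# whatever bytestring translate() handed back.
raw = u"ba\u015far\u0131".encode("utf-8")


def encode_like_python2(byte_string):
    """Mimic Python 2's str.encode('utf8') on a bytestring."""
    # Hidden step: the bytes are first decoded with the ASCII codec...
    as_text = byte_string.decode("ascii")
    # ...and only then encoded as requested.
    return as_text.encode("utf-8")


try:
    encode_like_python2(raw)
    error = None
except UnicodeDecodeError as exc:
    # Raised by the decode step, which is why the traceback mentions
    # 'ascii' even though the code only asked for UTF-8.
    error = exc

print(error)
```

The exception comes from the decode step, not from the UTF-8 encode and not from the socket call.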

You should fix translate to return a unicode object, by using a JSON library that returns unicode objects.

Alternatively, if text is already a bytestring encoded as UTF-8, you can simply send it without encoding it again:

s.send("Bing translation: " + text + "\r")
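Both options come down to the same rule: decode once when data comes in, and encode exactly once right before it goes out on the socket. The following is a rough sketch of that idea, not the original script. The helpers to_text and build_message are hypothetical names, the sample word is made up, and a local socket.socketpair() stands in for the real server so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import socket


def to_text(value, encoding="utf-8"):
    """Return text: decode bytestrings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_message(translated):
    """Join everything as text, then encode exactly once for the wire."""
    message = u"Bing translation: " + to_text(translated) + u"\r"
    return message.encode("utf-8")


# Demonstration over a connected local socket pair (an assumption made
# for this sketch; the original script connects to a server instead).
left, right = socket.socketpair()
try:
    payload = build_message(u"ba\u015far\u0131")
    left.sendall(payload)
    received = right.recv(1024)
finally:
    left.close()
    right.close()
```

Because to_text accepts either type, build_message produces the same bytes whether translate returns a unicode object or a UTF-8 bytestring, which avoids the double encoding.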

3 Comments

I added the translate code to the question. I am not sure how to fix that. Can you explain it a bit more, since I am a newbie? Thanks.
@jamall55 The code you posted shows that most likely the JSON library is at fault. Since it is not in the standard library in 2.5 (you should really use a newer Python version, but I digress), which json library are you using here? And what don't you understand in this answer, i.e. what should I elaborate on?
I was able to get it done. It was all about encoding the string twice, with the wrong one coming second. Thanks.
-1
# -*- coding:utf-8 -*-

 text = u"text in you language"
 s.send(u"Bing translation: " + text.encode('utf8') + u"\r");

This must work. The text must be written in UTF-8 encoding.

5 Comments

It didn't work out. s.send(u"CNN Bing translation: " + text.encode('utf8') + u"\r"); gives UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
1. what is your source file encoding?
-1 Apart from broken indentation in this answer, there is absolutely no reason why you would ever want to send unicode over a socket.
It's UTF-8. All I want to do is send the translation output that I got from Bing through the socket.
It can't be. u'your language string' is equal to unicode('your language string', 'encoding of your source file'). You might then want to convert it to some encoding with ustr.encode('utf-8').
