25

Version: Python 2.7.3

Other libraries: Python-Requests 1.2.3, jinja2 (2.6)

I have a script that submits data to a forum, and the problem is that non-ASCII characters appear as garbage. For instance, a name like André Téchiné comes out as AndrÃ© TÃ©chinÃ©.

Here's how the data is submitted:

1) Data is initially loaded from a UTF-8 encoded CSV file like so:

entries = []
with codecs.open(filename, 'r', 'utf-8') as f:
    for row in unicode_csv_reader(f.readlines()[1:]):
        entries.append(dict(zip(csv_header, row)))

unicode_csv_reader is taken from the bottom of the Python csv documentation page: http://docs.python.org/2/library/csv.html

When I inspect an entry's name in the interpreter, I see u'Andr\xe9 T\xe9chin\xe9'.

2) Next I render the data through jinja2:

tpl = tpl_env.get_template(u'forumpost.html')
rendered = tpl.render(entries=entries)

When I inspect the name inside rendered in the interpreter, I again see the same: u'Andr\xe9 T\xe9chin\xe9'

Now, if I write the rendered variable to a file like this, it displays correctly:

with codecs.open('out.txt', 'a', 'utf-8') as f:
    f.write(rendered)

But I must send it to the forum:

3) In the POST request code I have:

params = {u'post': rendered}
headers = {u'content-type': u'application/x-www-form-urlencoded'}
session.post(posturl, data=params, headers=headers, cookies=session.cookies)

session is a Requests session.

And the name is displayed broken in the forum post. I have tried the following:

  • Leave out headers
  • Encode rendered as rendered.encode('utf-8') (same result)
  • rendered = urllib.quote_plus(rendered) (comes out as all %XY)

If I type rendered.encode('utf-8') I see the following:

'Andr\xc3\xa9 T\xc3\xa9chin\xc3\xa9'

How could I fix the issue? Thanks.

2 Answers

32

Your client behaves as it should. For example, running nc -l 8888 as a server and making a request:

import requests

requests.post('http://localhost:8888', data={u'post': u'Andr\xe9 T\xe9chin\xe9'})

shows:

POST / HTTP/1.1
Host: localhost:8888
Content-Length: 33
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/1.2.3 CPython/2.7.3

post=Andr%C3%A9+T%C3%A9chin%C3%A9

You can check that it is correct:

>>> import urllib
>>> urllib.unquote_plus(b"Andr%C3%A9+T%C3%A9chin%C3%A9").decode('utf-8')
u'Andr\xe9 T\xe9chin\xe9'
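
The same request body can also be reproduced without a server or the network. This is only a sketch using the standard library's urlencode, not a view into what Requests does internally; the variable names are made up for illustration. It encodes the text to UTF-8 bytes first and then percent-encodes those bytes, which gives the 33-byte body shown above:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# The Unicode text from the question.
name = u'Andr\xe9 T\xe9chin\xe9'

# Encode to UTF-8 bytes explicitly, then percent-encode the bytes.
# Passing bytes keeps the example working on both Python 2.7 and 3.
body = urlencode({'post': name.encode('utf-8')})

print(body)       # post=Andr%C3%A9+T%C3%A9chin%C3%A9
print(len(body))  # 33, the Content-Length in the captured request
```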

  • Check that the server decodes the request correctly. You could try specifying the charset:

    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}

    The body contains only ASCII characters, so it shouldn't hurt, and a correct server would ignore any parameters for the x-www-form-urlencoded type anyway. Look for the gory details under "URL-encoded form data".

  • Check that the issue is not a display artefact, i.e. the value is correct but it displays incorrectly.
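
To see what such a mismatch looks like, here is a small sketch that assumes the receiving side interprets the UTF-8 bytes as ISO-8859-1 (an assumption for illustration; the actual server may use a different codec). Each é becomes two characters, and the damage is reversible as long as the bytes themselves were not altered:

```python
name = u'Andr\xe9 T\xe9chin\xe9'

# What the client sends: UTF-8 bytes, where each e-acute is 0xC3 0xA9.
utf8_bytes = name.encode('utf-8')

# Decoding those bytes with the wrong codec turns every byte into one character.
garbled = utf8_bytes.decode('iso-8859-1')
print(repr(garbled))
print(garbled == u'Andr\xc3\xa9 T\xc3\xa9chin\xc3\xa9')  # True

# Undoing the wrong decode step recovers the original text.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired == name)  # True
```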

6 Comments

"check the issue is not a display artefact i.e., the value is correct but it displays incorrectly" - Thank you. That's the problem! Unfortunately it's a public forum and I can't change the default encoding. It responds with iso-8859-1 encoding. Can I use rendered.encode('iso-8859-1') or will that break things? Thanks.
try to set charset in the headers
Sending it as rendered.encode('iso-8859-1') seemed to work so I'll use that. I marked your answer as correct as it pointed to the right direction. Thanks.
To anyone else who finds this, you can use urllib.parse.quote_from_bytes and urllib.parse.unquote_to_bytes to send a bytes-type over a network without worrying as much about encoding.
@MicahSmith: the question has the python-2.7 tag. There is no urllib.parse there. Anyway, the input is Unicode (as it should be: use Unicode to represent text inside your programs). Side note: unquote_plus() is used here to convince the OP that requests.post() works correctly; you do not use it in your actual code.
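
The ISO-8859-1 workaround mentioned in the comments above could look roughly like the sketch below. The helper name encode_for_latin1_forum and the choice of 'xmlcharrefreplace' for characters that ISO-8859-1 cannot represent are assumptions made for this example, not something the thread confirms; whether the forum renders such character references correctly depends on how it escapes posts.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


def encode_for_latin1_forum(text):
    """Encode Unicode text for a server that expects ISO-8859-1 (hypothetical helper).

    Characters outside ISO-8859-1 are replaced with numeric character
    references instead of raising UnicodeEncodeError. That fallback is an
    assumption of this sketch.
    """
    return text.encode('iso-8859-1', 'xmlcharrefreplace')


rendered = u'Andr\xe9 T\xe9chin\xe9'
payload = {'post': encode_for_latin1_forum(rendered)}

# Each e-acute is now the single byte 0xE9 instead of the UTF-8 pair 0xC3 0xA9.
body = urlencode(payload)
print(body)  # post=Andr%E9+T%E9chin%E9
```

With the session from the question, the same payload dictionary would be passed as session.post(posturl, data=payload), on the assumption that Requests form-encodes byte strings without re-encoding them.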
2

Try decoding the byte string from UTF-8 into unicode:

unicode(my_string_variable, "utf8")

or decode and encode:

sometext = gettextfromsomewhere().decode('utf-8')
env = jinja2.Environment(loader=jinja2.PackageLoader('jinjaapplication', 'templates'))
template = env.get_template('mypage.html')
print template.render( sometext = sometext ).encode('utf-8')
