Convert a text file from UTF-8 to ASCII to avoid python UnicodeEncodeError?

Question

I'm getting an encoding error from a script, as follows:

from django.template import loader, Context
t = loader.get_template(filename)
c = Context({'menus': menus})
print t.render(c)
  File "../django_to_html.py", line 45, in <module>
    print t.render(c)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 34935: ordinal not in range(128)

I don't own the script, so I don't have the ability to edit it. The only thing I can do is change the file I supply so that it doesn't contain the Unicode character to which the script is objecting.

This file is a text file that I'm editing in TextMate. What can I do to identify and get rid of the character that the script is barfing on?

Could I use something like iconv, and if so how?

Thanks!

John Machin · Accepted Answer · 2011-02-05 08:37:37Z

3

How to find ALL the nasties in your file:

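import unicodedata as ucd
import sys

# Python 2: lines come back as byte strings, so decode each one as UTF-8.
with open(sys.argv[1]) as f:
    for linex, line in enumerate(f):
        uline = line.decode('UTF-8')
        bad_line = False
        for charx, char in enumerate(uline):
            # The ascii codec only accepts U+0000..U+007F; skip those.
            if char <= u'\x7f': continue
            print "line %d, column %d: %s" % (
                linex+1, charx+1, ucd.name(char, '<unknown>'))
            bad_line = True
        if bad_line:
            # Show the whole line with escapes so the context is visible.
            print repr(uline)
            print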

Sample output:

line 1, column 6: RIGHT SINGLE QUOTATION MARK
line 1, column 10: SINGLE LOW-9 QUOTATION MARK
u'yadda\u2019foo\u201abar\r\n'

line 2, column 4: IDEOGRAPHIC SPACE
u'fat\u3000space\r\n'
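
For readers on Python 3, the same scan might look roughly like the sketch below. It is not part of the original answer: the function name `find_non_ascii` is made up for illustration, and the threshold is U+007F on the assumption that the goal is to find everything the `ascii` codec would reject.

```python
import unicodedata


def find_non_ascii(lines):
    """Yield (line_no, column, name) for every non-ASCII character."""
    for line_no, line in enumerate(lines, start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 0x7F:
                # Fall back to '<unknown>' for code points without a name.
                yield line_no, column, unicodedata.name(char, '<unknown>')


# In-memory stand-in for the file; the same lines as the sample output.
sample = ['yadda\u2019foo\u201abar\r\n', 'fat\u3000space\r\n']
for line_no, column, name in find_non_ascii(sample):
    print('line %d, column %d: %s' % (line_no, column, name))
```

To scan a real file, pass an open file object instead of the list, for example one opened with `open(path, encoding='utf-8', newline='')`.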

AndiDog · Accepted Answer · 2011-02-04 21:36:32Z

2

I don't know why you're using Django's template engine to create console output, but the Python wiki shows a way to work around this on Windows using a Python-specific environment variable:

set PYTHONIOENCODING=utf_8

This will set the stdout/stderr encoding to UTF-8, meaning you can print all Unicode characters. Because the Windows console encoding is usually not UTF-8, you'll see the UTF-8 bytes rendered in the console's own code page instead of the special characters themselves. For example:

>>> print u'\u2019'
ΓÇÖ
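
Where that three-character sequence comes from can be reproduced in a few lines. This is only an illustration in Python 3 syntax, and it assumes the console uses code page 437; other code pages would show different characters.

```python
# The character from the traceback: RIGHT SINGLE QUOTATION MARK.
char = '\u2019'

# The bytes Python writes to stdout when the output encoding is UTF-8.
utf8_bytes = char.encode('utf-8')

# How a console using code page 437 (an assumption) displays those bytes.
shown = utf8_bytes.decode('cp437')

print(utf8_bytes)        # b'\xe2\x80\x99'
print(shown == 'ΓÇÖ')    # True
```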

2 Comments

AP257 Over a year ago

I'm not on Windows unfortunately, I'm on OSX.

AndiDog Over a year ago

@AP257: I don't think that makes a difference. Your problem stays the same - and setting env variables should be possible in Mac OSX, too?!
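
The effect of the variable can be checked on any platform without touching the script itself (in a macOS shell it would be set with `export PYTHONIOENCODING=utf_8` before running it). The sketch below is in Python 3 syntax and launches a child interpreter twice; the one-line `CHILD` program and the helper `run_child` are made up for illustration and merely stand in for the script that cannot be edited.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the script that cannot be edited.
CHILD = r"print('\u2019')"


def run_child(io_encoding):
    """Run CHILD with PYTHONIOENCODING set; return (exit code, stdout bytes)."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.run([sys.executable, '-c', CHILD], env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, proc.stdout


# With an ASCII stdout the child dies with UnicodeEncodeError.
ascii_code, _ = run_child('ascii')
# With a UTF-8 stdout the same program succeeds.
utf8_code, utf8_out = run_child('utf_8')

print(ascii_code != 0)
print(utf8_code, utf8_out.strip())
```

The variable is read once at interpreter startup, so it has to be set in the environment before the script is launched.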

David Heffernan · Accepted Answer · 2011-02-04 21:25:27Z

1

The character is in position 34935 in the file. The helpful traceback tells you that.

1 Comment

AndiDog Over a year ago

Actually it's the position in the rendered output, not in the template file. But that should help, too.
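
If the rendered string is available (for instance in an interactive session), the exception object itself says where the problem is: `UnicodeEncodeError` carries `start` and `end` attributes that index into the string being encoded. A minimal sketch in Python 3 syntax, with a short literal standing in for the real `t.render(c)` output and a made-up helper name:

```python
def show_encode_failure(text, context=10):
    """Return (position, snippet) for the first character ASCII rejects."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        # exc.start is the index reported as "position N" in the message.
        lo = max(exc.start - context, 0)
        return exc.start, text[lo:exc.end + context]
    return None


rendered = 'Today\u2019s menu'   # stand-in for the rendered template
print(ascii(show_encode_failure(rendered, context=3)))
```

The surrounding text in the snippet is usually enough to search for in the template or in the data passed to it.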

Ulrich Schwarz · Accepted Answer · 2011-02-05 07:28:22Z

0

\u2019 is a right single quotation mark (http://www.unicode.org/charts/ has a helpful search box where you can enter the code); maybe that'll help track it down. If your file ends up in HTML again, you could maybe use the &#8217; notation for these characters. (As John points out, this also accepts hex notation: &#x2019;.)
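
If the text is headed for HTML anyway, Python can produce these numeric character references itself through the `xmlcharrefreplace` error handler. A small sketch in Python 3 syntax (the helper name `to_ascii_html` is made up for illustration):

```python
def to_ascii_html(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_ascii_html('It\u2019s here'))   # It&#8217;s here
```

This only rewrites non-ASCII characters; it does not escape `&`, `<` or `>`, so it is no substitute for proper HTML escaping.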

edited Feb 5, 2011 at 7:28

answered Feb 4, 2011 at 21:34

2 Comments

John Machin Over a year ago

No need to convert; use &#x2019;

Ulrich Schwarz Over a year ago

@John: Cheers, hadn't come across that one!
