Python Non-ASCII Characters with Encoding Declared

Question

I'm having an issue with Python 2.7 complaining that I do not have an encoding declared; however, it is in fact declared. I'm running this on OS X El Capitan (10.11.3) with Python 2.7.11.

I'm attempting to search a data set for specific Chinese and English terms. report.csv contains the data I want to search, and raw_terms.txt contains the Chinese and English terms, one per line. Both files were saved as UTF-8.

I've noticed this code works on different machines, but not mine. I'm assuming there is something I have changed in the year+ I've had this laptop which is causing this issue, but I'm unsure where to start my search.

Script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import csv

count = 0
with open('./data/report.csv', 'rb') as c:
    csv_data = csv.DictReader(c, delimiter=',', quoting=csv.QUOTE_ALL)
    for data in csv_data:
        with open('./terms/raw_terms.txt', 'r') as f:
            for term in f:
                term = term.strip()
                if term in data['Description']: #or term in '你好！你好吗':
                    # print 'Found \"%s\" in \"%s\"' % (term, data['Subject'])
                    count += 1
                else:
                    continue

print count

Error:

File "t.py", line 1
SyntaxError: Non-ASCII character '\xfe' in file t.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Appreciate any help/direction anyone can provide.

You have not declared an encoding. The coding: comment applies to your code, not to the file you are opening from the code. Anyway, the CSV module has well-documented trouble with Unicode -- look for the many, many duplicates here.
– tripleee, Commented Mar 16, 2016 at 15:21
I have tried numerous other techniques, including the codecs module (with codecs.open(file_location, 'rb', 'UTF-8') as f:) and .encode('unicode-escape'). Also, it's the non-CSV file that I'm getting the error for.
– Tom, Commented Mar 16, 2016 at 15:24
The error message suggests that your source file has a Unicode BOM, and that it is in fact not in UTF-8. If it were, the first byte would be \xef, not \xfe. Probably your file is in UTF-16. Try to save it as UTF-8 without a BOM.
– tripleee, Commented Mar 16, 2016 at 15:24
Thanks @tripleee, that's weird. I used Sublime 2 to save the file via "Save with Encoding" > UTF-8, and I did not pick the "with BOM" option. Any suggestions on how I could save this file properly?
– Tom, Commented Mar 16, 2016 at 15:27
I have no immediate solution, but a hex dump of the first few bytes of the file may help reveal what exactly you have.
– tripleee, Commented Mar 16, 2016 at 15:34
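
The hex dump suggested in the last comment can also be produced from Python itself. The following is only a minimal sketch: the helper name inspect_head and the label strings are made up for illustration, while the BOM constants come from the standard codecs module. It returns the first few bytes as hex together with the name of any BOM it recognises.

```python
import binascii
import codecs

# Checked in order. The UTF-32 entries come first because the UTF-32 LE BOM
# (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, 'UTF-32 BE'),
    (codecs.BOM_UTF32_LE, 'UTF-32 LE'),
    (codecs.BOM_UTF8, 'UTF-8'),
    (codecs.BOM_UTF16_BE, 'UTF-16 BE'),
    (codecs.BOM_UTF16_LE, 'UTF-16 LE'),
]


def inspect_head(path, n=4):
    """Return (hex string of the first n bytes, BOM label or None).

    inspect_head is a hypothetical helper, not part of any library.
    """
    with open(path, 'rb') as f:
        head = f.read(n)
    bom_name = None
    for bom, name in _BOMS:
        if head.startswith(bom):
            bom_name = name
            break
    return binascii.hexlify(head).decode('ascii'), bom_name
```

Calling inspect_head('t.py') on the failing script should show whether the file really starts with fe ff (UTF-16 BE) rather than ef bb bf (UTF-8 with BOM) or plain ASCII.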

Alastair McCormack · Accepted Answer · 2016-03-16 18:58:26Z

Your exception is raised because the source file itself (t.py) contains non-ASCII bytes. The '\xfe' reported on line 1 suggests that the file has been saved as UTF-16 BE with a BOM (bytes FE FF).

Unfortunately, the encoding declaration has to come before any non-ASCII byte, which is not possible here because the BOM has to sit at byte 0. It is a catch-22.
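
To see why the declaration cannot help, here is a small sketch (the variable names are only illustrative) that encodes the script's first two lines as UTF-16 BE with a BOM and looks at the resulting bytes. It assumes, per PEP 263, that the parser searches the first two lines for the declaration as ASCII-compatible bytes.

```python
import codecs

# The first two lines of the script from the question.
header = u"#!/usr/bin/python\n# -*- coding: utf-8 -*-\n"

# What ends up on disk if the editor saves the file as UTF-16 BE with a BOM.
raw_utf16 = codecs.BOM_UTF16_BE + header.encode('utf-16-be')
raw_utf8 = header.encode('utf-8')

# The very first byte is already non-ASCII, matching the '\xfe' in the error.
first_byte = raw_utf16[:1]

# In UTF-16 every ASCII character is padded with a NUL byte, so the
# declaration never appears as a contiguous ASCII byte sequence.
visible_in_utf16 = b'coding: utf-8' in raw_utf16
visible_in_utf8 = b'coding: utf-8' in raw_utf8
```

Here first_byte is the BOM's leading byte, and the declaration is only found in the UTF-8 version of the header.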

Your only choice is to change the encoding of your file to one that doesn't need a BOM, such as UTF-8. In Sublime, you can simply choose: File -> Save with Encoding -> UTF-8.

On the command line, you can re-encode the file and strip the BOM (iconv turns the UTF-16 BOM into a 3-byte UTF-8 BOM, which tail -c +4 then skips):

iconv -f UTF-16BE -t UTF-8 test42.py | tail -c +4 > test43.py
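
If iconv is not at hand, roughly the same conversion can be sketched in Python. The function name reencode_utf16_to_utf8 is hypothetical. The sketch relies on the standard 'utf-16' codec, which reads the BOM to pick the byte order and drops it while decoding, so no separate stripping step is needed.

```python
import io


def reencode_utf16_to_utf8(src_path, dst_path):
    """Rewrite a UTF-16 file (with BOM) as UTF-8 without a BOM.

    Hypothetical helper. If the input has no BOM, the 'utf-16' codec
    falls back to the platform's native byte order, which may be wrong.
    """
    # newline='' keeps the original line endings untouched.
    with io.open(src_path, 'r', encoding='utf-16', newline='') as src:
        text = src.read()
    # The plain 'utf-8' codec never writes a BOM.
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)
```

For example, reencode_utf16_to_utf8('test42.py', 'test43.py') mirrors the file names used in the iconv command.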

Also, heed @tripleee's comment about the CSV module in Python 2.x. Instead, use https://github.com/jdunck/python-unicodecsv, which is a Unicode-compatible drop-in replacement.
