I have a problem finding text that matches a regex. Everything worked perfectly fine, but when I added "\£" to my regex it started causing problems. I get a SyntaxError: "Non-ASCII character '\xc2' in file (...) but no encoding declared..."

I've tried to solve this problem using

import sys
reload(sys)  # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")

but it doesn't help. The re.UNICODE flag doesn't help, and saving the string as unicode (pat) doesn't help either. Is there any way to fix this regex? I just want to build a regular expression and use the pound sign in it. Thanks for your help.

k = text.encode('utf-8')
pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
pattern = re.compile(pat, flags=re.DOTALL | re.I | re.UNICODE)
salary = pattern.search(k).group(1)
print(salary)

The error is still there even if I comment out all of those lines (by putting "#" in front of them). Maybe it's not connected with the re library but with my settings?

  • What Python version are you using? Check if this answer works for you. Or this one. Or yet another one. Commented Nov 25, 2015 at 10:53
  • I am using Python 2.79 Commented Nov 25, 2015 at 10:55
  • I copied the examples from the first and second answers but it still doesn't work. I am working on Windows; they put a "# -*- coding: utf-8 -*-" line in their script. Could I translate it somehow to Windows? Commented Nov 25, 2015 at 11:00
  • Why are you encoding to bytes and then searching with a unicode pattern? Commented Nov 25, 2015 at 11:06
  • Python 3 assumes source code is in UTF-8 by default, and Py3 strings are unicode. The language is a lot cleaner than Py2 in many ways. I strongly suggest you upgrade if you can. Commented Nov 25, 2015 at 11:09

1 Answer

The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.

# coding: utf-8
string = "£"

or equivalently

string = u"\u00a3"

Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell it what they mean. This is codified in PEP 263.

(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)
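To make the ambiguity concrete, here is a small sketch (not part of the original answer) showing how the pound sign maps to different bytes in different encodings, and how the same bytes read back as different text. The choice of latin-1 and cp1252 as comparison encodings is an assumption for illustration; the snippet is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

pound = "\u00a3"  # the pound sign, written as a pure-ASCII escape

utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
latin1_bytes = pound.encode("latin-1")  # one byte: 0xA3

# The same two UTF-8 bytes, read back under a legacy Windows encoding,
# come out as two different characters instead of one pound sign.
misread = utf8_bytes.decode("cp1252")

# ASCII-only text produces identical bytes under all three encodings.
ascii_same = (
    "salary".encode("utf-8")
    == "salary".encode("latin-1")
    == "salary".encode("cp1252")
)
```

The 0xC2 byte here is the same "\xc2" that the SyntaxError in the question complains about: it is the first byte of the UTF-8 encoding of the pound sign.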

The encoding settings you were fiddling with affect how files and streams are read, and program I/O generally, but not how the program source is interpreted.
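Applied to the regex from the question, a minimal sketch might look like the following. The sample text and the shortened character class are hypothetical, chosen only for illustration. It keeps the source pure ASCII by writing the pound sign as \u00a3, and it searches the unicode text directly instead of encoding it to bytes first, as one of the comments on the question points out.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

# Hypothetical sample text; \u00a3 is the pound sign as an ASCII escape.
text = "Salary: \u00a325,000 per annum"

# Simplified stand-in for the pattern in the question. Inside a character
# class, "." and "$" are literal, so no backslashes are needed.
pattern = re.compile(
    "salar.{1,6}?([0-9,.\u00a3$]{2,})",
    flags=re.DOTALL | re.I | re.UNICODE,
)

# Search the unicode text itself rather than text.encode('utf-8').
match = pattern.search(text)
salary = match.group(1) if match else None
```

With the escape form the file contains no non-ASCII bytes, so it parses even without an encoding declaration. Writing a literal £ in the pattern works just as well once the declaration matches the encoding the editor actually saved the file in.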

3 Comments

Adding "# coding: utf-8" helps. Great quick help. Thanks!
Of course, I was just lucky to guess that you are using UTF-8 for your script as well. On Windows, many text editors will still save files in legacy encodings (and braindamagedly call it "ANSI", which is neither true nor helpful). You have to know which encoding the file actually uses in order to get it right.
... and in fact you should avoid setdefaultencoding here: stackoverflow.com/questions/28657010/…
