Python UTF-8 REGEX

Question

I have a problem finding text with a regex. Everything worked perfectly until I added "\£" to the regex; now I get a SyntaxError: "Non-ASCII character '\xc2' in file (...) but no encoding declared..."

I tried to solve this problem with:

import sys
reload(sys)  # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")

but it doesn't help. The re.UNICODE flag doesn't help, and neither does making the pattern (pat) a unicode string. Is there any way to fix this regex? I just want to build a regular expression that includes the pound sign. Thanks for the help.

                    k = text.encode('utf-8')
                    pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
                    pattern = re.compile(pat, flags = re.DOTALL|re.I|re.UNICODE)
                    salary =  pattern.search(k).group(1)
                    print (salary)

The error is still there even if I comment out all of those lines (by putting "#" in front of them). Maybe it's not connected with the re library but with my settings?

What Python version are you using? Check if this answer works for you. Or this one. Or yet another one.
– Wiktor Stribiżew, Commented Nov 25, 2015 at 10:53
I copied the examples from the first and second answers, but it still doesn't work. I am working on Windows; they put a "# -*- coding: utf-8 -*-" line in their scripts. Could I translate it somehow for Windows?
– jawjaw, Commented Nov 25, 2015 at 11:00
Why are you encoding to bytes and then searching with a unicode pattern?
– Ignacio Vazquez-Abrams, Commented Nov 25, 2015 at 11:06
Python 3 assumes source code is in UTF-8 by default, and Py3 strings are unicode. The language is a lot cleaner than Py2 in many ways. I strongly suggest you upgrade if you can.
– Tom Zych, Commented Nov 25, 2015 at 11:09

tripleee · Accepted Answer · 2015-11-25 11:59:39Z


The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.

# coding: utf-8
string = "£"

or equivalently

string = u"\u00a3"

Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell it what they mean. This is codified in PEP 263.
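To see why Python refuses to guess, here is a small sketch of how the two bytes that UTF-8 uses for the pound sign read under two different encodings. The first of those bytes is the "\xc2" named in the error message. The variable names are illustrative only, and Latin-1 is picked just as one example of a legacy encoding; neither comes from the question.

```python
# -*- coding: utf-8 -*-
# The pound sign as UTF-8 stores it: two bytes, 0xC2 0xA3.
raw = b"\xc2\xa3"

# Interpreted as UTF-8, the two bytes form one character: U+00A3 POUND SIGN.
as_utf8 = raw.decode("utf-8")

# Interpreted as Latin-1, each byte is its own character: U+00C2, then U+00A3.
as_latin1 = raw.decode("latin-1")

# Same bytes, different text, depending on the encoding assumed.
print(len(as_utf8), len(as_latin1))
```

The bytes on disk are identical in both cases; only the declared encoding decides which text they stand for.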

(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)

The encoding settings you were fiddling with affect how byte strings and Unicode strings are converted at run time, which shows up mostly in file, stream, and other program I/O, but not how the program source is interpreted.

edited Nov 25, 2015 at 11:59

answered Nov 25, 2015 at 11:06


3 Comments

jawjaw Over a year ago

Adding " # coding: utf-8" helps. Great quick help. THANKS!

tripleee Over a year ago

Of course, I was just lucky to guess that you are using UTF-8 for your script as well. On Windows, many text editors will still save files in legacy encodings (and braindamagedly call it "ANSI" which is not true or helpful at all). You have to know which encoding the file actually uses in order to get it right.

tripleee Over a year ago

... and in fact you should avoid setdefaultencoding here: stackoverflow.com/questions/28657010/…
