7

I'm aware that Python 3 fixes a lot of UTF issues, I am not however able to use Python 3, I am using 2.5.1

I'm trying to regex a document but the document has UTF hyphens in it – rather than -. Python can't match these and if I put them in the regex it throws a wobbly.

How can I force Python to use a UTF string or in some way match a character such as that?

Thanks for your help

3
  • I deleted the duplicates. Not sure why my browser submitted 3 at once... Commented Dec 16, 2008 at 17:50
  • I find that Python had a very good (not perfect) but very good support for strings in various encoding (e.g. utf-8 (it is an encoding)) as ell as Unicode (Unicode is not an encoding) strings long before Python 3 therefore don't blame language; just ask a question if you don't know how to do ... . Commented Dec 16, 2008 at 18:16
  • I wanted to pre-empt someone telling me about Python 3 or asking if I was using it. Python 2.5 is still a wonderful language and I prefer it over PHP Commented Dec 16, 2008 at 18:32

3 Answers 3

7

You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.

So, for example, this:

re.compile("–") 

becomes this:

re.compile(u"\u2013")
Sign up to request clarification or add additional context in comments.

2 Comments

I was putting an r before the string for raw string
You can also add 'ur' before the string so that it's raw and Unicode.
4

After a quick test and visit to PEP 0264: Defining Python Source Code Encodings, I see you may need to tell Python the whole file is UTF-8 encoded by adding adding a comment like this to the first line.

# encoding: utf-8

Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6

# encoding: utf-8
import re
x = re.compile("–") 
print x.search("xxx–x").start()

Comments

3

Don't use UTF-8 in a regular expression. UTF-8 is a multibyte encoding where some unicode code points are encoded by 2 or more bytes. You may match parts of your string that you didn't plan to match. Instead use unicode strings as suggested.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.